# CS 124

Course Notes 1

Spring 2002

An algorithm is a recipe or a well-defined procedure for performing a calculation, or in general, for transforming some input into a desired output. Perhaps the most familiar algorithms are those for adding and multiplying integers. Here is a multiplication algorithm that is different from the standard algorithm you learned in school: write the multiplier and multiplicand side by side. Repeat the following operations: divide the first number by 2 (throw out any fractions) and multiply the second by 2, until the first number is 1. This results in two columns of numbers. Now cross out all rows in which the first entry is even, and add all entries of the second column that haven't been crossed out. The result is the product of the two numbers.

    75      29
    37      58
    18     116   (crossed out)
     9     232
     4     464   (crossed out)
     2     928   (crossed out)
     1    1856
          ----
          2175

    29 × 1001011 (binary) = 29 + 58 + 232 + 1856 = 2175

Figure 1.1: A different multiplication algorithm.


In this course we will ask a number of basic questions about algorithms:

• Does it halt? The answer for the algorithm given above is clearly yes, provided we are multiplying positive integers. The reason is that for any integer greater than 1, when we divide it by 2 and throw out the fractional part, we always get a smaller integer which is greater than or equal to 1. Hence our first number is eventually reduced to 1 and the process halts.
• Is it correct? To see that the algorithm correctly computes the product of the integers, observe that if we write a 0 for each crossed out row, and 1 for each row that is not crossed out, then reading from bottom to top just gives us the first number in binary. Therefore, the algorithm is just doing standard multiplication, with the multiplier written in binary.
• Is it fast? It turns out that the above algorithm is about as fast as the standard algorithm you learned in school. Later in the course, we will study a faster algorithm for multiplying integers.
• How much memory does it use? The memory used by this algorithm is also about the same as that of the standard algorithm.
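The halving-and-doubling procedure can be sketched in a few lines of Python. This is only an illustration under the assumption that both inputs are positive integers; the function name `halving_multiply` is ours and does not appear in the notes.

```python
def halving_multiply(a: int, b: int) -> int:
    """Multiply two positive integers by repeated halving and doubling."""
    if a < 1 or b < 1:
        raise ValueError("both numbers must be positive integers")
    total = 0
    while True:
        if a % 2 == 1:   # rows whose first entry is odd are not crossed out
            total += b
        if a == 1:       # the first number has been reduced to 1: halt
            break
        a //= 2          # divide by 2, throwing out any fraction
        b *= 2           # multiply the second number by 2
    return total


print(halving_multiply(75, 29))
```

On the example of Figure 1.1 the kept rows are 29, 58, 232 and 1856, so the call prints 2175.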


The history of algorithms for simple arithmetic is quite fascinating. Although we take these algorithms for granted, their widespread use is surprisingly recent. The key to good algorithms for arithmetic was the positional number system (such as the decimal system). Roman numerals (I, II, III, IV, V, VI, etc.) are just the wrong data structure for performing arithmetic efficiently. The positional number system was first invented by the Mayan Indians in Central America about 2000 years ago. They used a base 20 system, and it is unknown whether they had invented algorithms for performing arithmetic, since the Spanish conquerors destroyed most of the Mayan books on science and astronomy. The decimal system that we use today was invented in India in roughly 600 AD. This positional number system, together with algorithms for performing arithmetic, was transmitted to Persia around 750 AD, when several important Indian works were translated into Arabic. Around this time the Persian mathematician Al-Khwarizmi wrote his Arabic textbook on the subject. The word "algorithm" comes from Al-Khwarizmi's name. Al-Khwarizmi's work was translated into Latin around 1200 AD, and the positional number system was propagated throughout Europe from 1200 to 1600 AD. The decimal point was not invented until the 10th century AD, by a Syrian mathematician al-Uqlidisi from Damascus. His work was soon forgotten, and five centuries passed before decimal fractions were re-invented by the Persian mathematician al-Kashi.

With the invention of computers in the twentieth century, the field of algorithms has seen explosive growth. There are a number of major successes in this field:

• Parsing algorithms - these form the basis of the field of programming languages.
• Fast Fourier transform - the field of digital signal processing is built upon this algorithm.
• Linear programming - this algorithm is extensively used in resource scheduling.
• Sorting algorithms - until recently, sorting used up the bulk of computer cycles.
• String matching algorithms - these are extensively used in computational biology.
• Number theoretic algorithms - these algorithms make it possible to implement cryptosystems such as the RSA public key cryptosystem.
• Compression algorithms - these algorithms allow us to transmit data more efficiently over, for example, phone lines.
• Geometric algorithms - displaying images quickly on a screen often makes use of sophisticated algorithmic techniques.

In designing an algorithm, it is often easier and more productive to think of a computer in abstract terms. Of course, we must carefully choose at what level of abstraction to think. For example, we could think of computer operations in terms of a high level computer language such as C or Java, or in terms of an assembly language. We could dip further down, and think of the computer at the level of AND and NOT gates. For most algorithm design we undertake in this course, it is generally convenient to work at a fairly high level. We will usually abstract away even the details of the high level programming language, and write our algorithms in "pseudo-code", without worrying about implementation details. (Unless, of course, we are dealing with a programming assignment!) Sometimes we have to be careful that we do not abstract away essential features of the problem. To illustrate this, let us consider a simple but enlightening example.


1.1 Computing the nth Fibonacci number
Remember the famous sequence of numbers invented in the 13th century by the Italian mathematician Leonardo Fibonacci? The sequence is represented as $F_0, F_1, F_2, \ldots$, where $F_0 = 0$, $F_1 = 1$, and for all $n \ge 2$, $F_n$ is defined as $F_{n-1} + F_{n-2}$. The first few Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . . The value of $F_{31}$ is greater than a million! It is easy to see that the Fibonacci numbers grow exponentially. As an exercise, try to show that $F_n \ge 2^{n/2}$ for sufficiently large n by a simple induction. Here is a simple program to compute Fibonacci numbers that slavishly follows the definition.

    function F(n: integer): integer
      if n = 0 then return 0
      else if n = 1 then return 1
      else return F(n − 1) + F(n − 2)

The program is obviously correct. However, it is woefully slow. As it is a recursive algorithm, we can naturally express its running time on input n with a recurrence equation. In fact, we will simply count the number of addition operations the program uses, which we denote by $T(n)$. To develop a recurrence equation, we express $T(n)$ in terms of smaller values of T. We shall see several such recurrence relations in this class. It is clear that $T(0) = 0$ and $T(1) = 0$. Otherwise, for $n \ge 2$, we have $T(n) = T(n-1) + T(n-2) + 1$, because to compute F(n) we compute F(n − 1) and F(n − 2) and do one other addition besides. This is (almost) the Fibonacci equation! Hence we can see that the number of addition operations is growing very large; it is at least $2^{n/2}$ for $n \ge 4$.
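To make the count $T(n)$ concrete, here is a small sketch that runs the naive recursion while tallying additions. The name `fib_with_count` and the choice to return a pair are our own illustration, not part of the notes.

```python
def fib_with_count(n):
    """Return (F(n), number of additions) for the naive recursive program."""
    if n == 0:
        return 0, 0
    if n == 1:
        return 1, 0
    a, count_a = fib_with_count(n - 1)
    b, count_b = fib_with_count(n - 2)
    # One extra addition combines the two recursive results,
    # mirroring T(n) = T(n - 1) + T(n - 2) + 1.
    return a + b, count_a + count_b + 1


for n in range(4, 16):
    value, additions = fib_with_count(n)
    assert additions >= 2 ** (n / 2)   # the lower bound claimed in the text
```

For small inputs the tally can be compared by hand against the recurrence, for example T(2) = 1, T(3) = 2 and T(4) = 4.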


Can we do better? This is the question we shall always ask of our algorithms. The trouble with the naive algorithm is the wasteful recursion: the function F is called with the same argument over and over again, exponentially many times (try to see how many times F(1) is called in the computation of F(5)). A simple trick for improving performance is to avoid repeated calculations. In this case, this can be easily done by avoiding recursion and just calculating successive values:

    function F(n: integer): integer
      array A[0 . . . n] of integer
      A[0] = 0; A[1] = 1
      for i = 2 to n do:
        A[i] = A[i − 1] + A[i − 2]
      return A[n]

This algorithm is of course correct. Now, however, we only do n − 1 additions.
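A direct Python transcription of the array-based program might look as follows. It is a sketch; the name `fib_iterative` is ours, and the guard for n = 0 is added only because a one-element array has no slot for A[1].

```python
def fib_iterative(n):
    """Compute F(n) by filling in successive values, using n - 1 additions."""
    if n == 0:
        return 0
    a = [0] * (n + 1)   # a[0] = 0 already
    a[1] = 1
    for i in range(2, n + 1):
        a[i] = a[i - 1] + a[i - 2]
    return a[n]


print([fib_iterative(i) for i in range(11)])
```

Only the last two entries are ever read, so the array could be replaced by two variables without changing the number of additions.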


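It seems that we have come so far, from exponentially to polynomially many operations, that we can stop here. But in the back of our heads, we should be wondering: can we do even better? Surprisingly, we can. We rewrite our equations in matrix notation. Then

$$\begin{pmatrix} F_1 \\ F_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.$$

Similarly,

$$\begin{pmatrix} F_2 \\ F_3 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{2} \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix},$$

and in general,

$$\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{n} \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.$$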

So, in order to compute $F_n$, it suffices to raise this 2 by 2 matrix to the nth power. Each matrix multiplication takes 12 arithmetic operations, so the question boils down to the following: how many multiplications does it take to raise a base (matrix, number, anything) to the nth power? The answer is $O(\log n)$. To see why, consider the case where n > 1 is a power of 2. To raise X to the nth power, we compute $X^{n/2}$ and then square it. Hence the number of multiplications $T(n)$ satisfies $T(n) = T(n/2) + 1$, from which we find $T(n) = \log_2 n$. As an exercise, consider what you have to do when n is not a power of 2. (Hint: consider the connection with the multiplication algorithm of the first section; there too we repeatedly halved a number...)

So we have reduced the computation time exponentially again, from n − 1 arithmetic operations to $O(\log n)$, a great achievement. Well, not really. We got a little too abstract in our model. In our accounting of the time requirements for all three methods, we have made a grave and common error: we have been too liberal about what constitutes an elementary step. In general, we often assume that each arithmetic step takes unit time, because the numbers involved will typically be small enough that we can reasonably expect them to fit within a computer's word. Remember, the number n is only $\log n$ bits in length. But in the present case, we are doing arithmetic on huge numbers, with about n bits, where n is pretty large. When dealing with such huge numbers, if exact computation is required we have to use sophisticated long integer packages. Such algorithms take $O(n)$ time to add two n-bit numbers. Hence the complexity of the first two methods was larger than we actually thought: not really $O(F_n)$ and $O(n)$, but instead $O(nF_n)$ and $O(n^2)$, respectively. The second algorithm is still exponentially faster. What is worse, the third algorithm involves multiplications of $O(n)$-bit integers. Let $M(n)$ be the time required to multiply two n-bit numbers. Then the running time of the third algorithm is in fact $O(M(n))$.
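The matrix method can be sketched as below. The helper names `mat_mul`, `mat_pow` and `fib_matrix` are ours, and the loop that handles an arbitrary exponent by inspecting its binary digits is one possible answer to the exercise, assumed here rather than taken from the notes. Python integers are arbitrary precision, so each arithmetic step on large values is not a unit-cost operation, exactly as the text warns.

```python
def mat_mul(x, y):
    """Multiply two 2x2 matrices given as nested tuples (8 products, 4 sums)."""
    (a, b), (c, d) = x
    (e, f), (g, h) = y
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def mat_pow(m, n):
    """Raise a 2x2 matrix to the nth power with O(log n) multiplications."""
    result = ((1, 0), (0, 1))          # identity matrix
    while n > 0:
        if n % 2 == 1:                 # this binary digit of n is 1
            result = mat_mul(result, m)
        m = mat_mul(m, m)              # square for the next binary digit
        n //= 2
    return result


def fib_matrix(n):
    # The nth power of [[0, 1], [1, 1]] is [[F(n-1), F(n)], [F(n), F(n+1)]],
    # so F(n) sits in the top-right entry.
    return mat_pow(((0, 1), (1, 1)), n)[0][1]


print(fib_matrix(10))
```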


The comparison between the running times of the second and third algorithms boils down to a most important and ancient issue: can we multiply two n-bit integers faster than $\Omega(n^2)$? This would be faster than the method we learn in elementary school or the clever halving method explained in the opening of these notes. As a final consideration, we might consider the mathematicians' solution to computing the Fibonacci numbers. A mathematician would quickly determine that

$$F_n = \frac{1}{\sqrt{5}} \left[ \left( \frac{1+\sqrt{5}}{2} \right)^{n} - \left( \frac{1-\sqrt{5}}{2} \right)^{n} \right].$$

Using this, how many operations does it take to compute Fn ? Note that this calculation would require ﬂoating point arithmetic. Whether in practice that would lead to a faster or slower algorithm than one using just integer arithmetic might depend on the computer system on which you run the algorithm.
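As a rough illustration of the floating point caveat, the closed form can be evaluated with ordinary double precision arithmetic. The function name `fib_binet` is ours. Rounding to the nearest integer is an assumption that is reasonable only while the rounding error stays below one half, which holds for small n but not in general.

```python
import math


def fib_binet(n):
    """Evaluate the closed form for F(n) in double precision (small n only)."""
    sqrt5 = math.sqrt(5.0)
    phi = (1 + sqrt5) / 2
    psi = (1 - sqrt5) / 2
    return round((phi ** n - psi ** n) / sqrt5)


print([fib_binet(i) for i in range(11)])
```

For large n a double cannot even represent $F_n$ exactly, so an exact answer needs either higher precision arithmetic or one of the integer methods above.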

CS 124

Lecture 2

In order to discuss algorithms effectively, we need to start with a basic set of tools. Here, we explain these tools and provide a few examples. Rather than spend time honing our use of these tools, we will learn how to use them by applying them in our studies of actual algorithms.

Induction
The standard form of the induction principle is the following: If a statement P(n) holds for n = 1, and if for every n ≥ 1 P(n) implies P(n + 1), then P holds for all n. Let us see an example of this:

Claim 2.1 Let $S(n) = \sum_{i=1}^{n} i$. Then $S(n) = \frac{n(n+1)}{2}$.

Proof: The proof is by induction.

Base Case: We show the statement is true for n = 1. As $S(1) = 1 = \frac{1(2)}{2}$, the statement holds.

Induction Hypothesis: We assume $S(n) = \frac{n(n+1)}{2}$.

Reduction Step: We show $S(n+1) = \frac{(n+1)(n+2)}{2}$. Note that $S(n+1) = S(n) + n + 1$. Hence

$$\begin{aligned} S(n+1) &= S(n) + n + 1 \\ &= \frac{n(n+1)}{2} + n + 1 \\ &= (n+1)\left(\frac{n}{2} + 1\right) \\ &= \frac{(n+1)(n+2)}{2}. \end{aligned}$$


The proof style is somewhat pedantic, but instructional and easy to read. We break things down to the base case, showing that the statement holds when n = 1; the induction hypothesis, the statement that P(n) is true; and the reduction step, showing that P(n) implies P(n + 1).

Induction is one of the most fundamental proof techniques. The idea behind induction is simple: take a large problem (P(n + 1)), and somehow reduce its proof to a proof of a smaller problem (such as P(n); P(n) is smaller in the sense that n < n + 1). If every problem can thereby be broken down to a small number of instances (we keep reducing down to P(1)), these can be checked easily. We will see this idea of reduction, whereby we reduce solving a problem to solving an easier problem, over and over again throughout the course.

As one might imagine, there are other forms of induction besides the specific standard form we gave above. Here's a different form of induction, called strong induction: If a statement P(n) holds for n = 1, and if for every n ≥ 1 the truth of P(i) for all i ≤ n implies P(n + 1), then P holds for all n.

Exercise: show that every integer greater than 1 has a unique prime factorization using strong induction.


O Notation
When measuring, for example, the number of steps an algorithm takes in the worst case, our result will generally be some function $T(n)$ of the input size, n. One might imagine that this function may have some complex form, such as $T(n) = 4n^2 - 3n \log n + n^{2/3} + \log^3 n - 4$. In very rare cases, one might wish to have such an exact form for the running time, but in general, we are more interested in the rate of growth of $T(n)$ rather than its exact form. The O notation was developed with this in mind. With the O notation, only the fastest growing term is important, and constant factors may be ignored. More formally:

Deﬁnition 2.2 We say for non-negative functions f (n) and g(n) that f (n) is O(g(n)) if there exist positive constants c and N such that for all n ≥ N, f (n) ≤ cg(n).


Let us try some examples. We claim that $2n^3 + 4n^2$ is $O(n^3)$. It suffices to show that $2n^3 + 4n^2 \le 6n^3$ for $n \ge 1$, by definition. But this is clearly true as $4n^3 \ge 4n^2$ for $n \ge 1$. (Exercise: show that $2n^3 + 4n^2$ is $O(n^4)$.) We claim $10 \log_2 n$ is $O(\ln n)$. This follows from the fact that $10 \log_2 n \le (10 \log_2 e) \ln n$. If $T(n)$ is as above, then $T(n)$ is $O(n^2)$. This is a bit harder to prove, because of all the extraneous terms. It is, however, easy to see; $4n^2$ is clearly the fastest growing term, and we can remove the constant with O notation. Note, though, that $T(n)$ is $O(n^3)$ as well! The O notation is not tight, but more like a ≤ comparison.


Similarly, there is notation for ≥ and = comparisons.

Definition 2.3 We say for non-negative functions f(n) and g(n) that f(n) is Ω(g(n)) if there exist positive constants c and N such that for all n ≥ N, f(n) ≥ cg(n). We say that f(n) is Θ(g(n)) if both f(n) is O(g(n)) and f(n) is Ω(g(n)).

The O notation has several useful properties that are easy to prove.

Lemma 2.4 If $f_1(n)$ is $O(g_1(n))$ and $f_2(n)$ is $O(g_2(n))$ then $f_1(n) + f_2(n)$ is $O(g_1(n) + g_2(n))$.

Proof: There exist positive constants $c_1$, $c_2$, $N_1$, and $N_2$ such that $f_1(n) \le c_1 g_1(n)$ for $n \ge N_1$ and $f_2(n) \le c_2 g_2(n)$ for $n \ge N_2$. Hence $f_1(n) + f_2(n) \le \max\{c_1, c_2\}(g_1(n) + g_2(n))$ for $n \ge \max\{N_1, N_2\}$.

Exercise: Prove similar lemmata for $f_1(n) f_2(n)$. Prove the lemmata when O is replaced by Ω or Θ.


Finally, there is a bit of notation corresponding to ≪, when one function is (in some sense) much less than another.

Definition 2.5 We say for non-negative functions f(n) and g(n) that f(n) is o(g(n)) if

$$\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0.$$

Also, f(n) is ω(g(n)) if g(n) is o(f(n)).

We emphasize that the O notation is a tool to help us analyze algorithms. It does not always accurately tell us how fast an algorithm will run in practice. For example, constant factors make a huge difference in practice (imagine increasing your bank account by a factor of 10), and they are ignored in the O notation. Like any other tool, the O notation is only useful if used properly and wisely. Use it as a guide, not as the last word, to judging an algorithm.


Recurrence Relations
A recurrence relation defines a function using an expression that includes the function itself. For example, the Fibonacci numbers are defined by: F(n) = F(n − 1) + F(n − 2) for n ≥ 3, F(1) = F(2) = 1. This function is well-defined, since we can compute a unique value of F(n) for every positive integer n. Note that recurrence relations are similar in spirit to the idea of induction. The relation defines a function value F(n) in terms of the function values at smaller arguments (in this case, n − 1 and n − 2), effectively reducing the problem of computing F(n) to that of computing F at smaller values. Base cases (the values of F(1) and F(2)) need to be provided.

Finding exact solutions for recurrence relations is not an extremely difficult process; however, we will not focus on solution methods for them here. Often a natural thing to do is to try to guess a solution, and then prove it by induction. Alternatively, one can use a symbolic computation program (such as Maple or Mathematica); these programs can often generate solutions.

We will occasionally use recurrence relations to describe the running times of algorithms. For our purposes, we often do not need to have an exact solution for the running time, but merely an idea of its asymptotic rate of growth. For example, the relation $T(n) = 2T(n/2) + 2n$, $T(1) = 1$ has the exact solution (for n a power of 2) of $T(n) = 2n \log_2 n + n$. (Exercise: Prove this by induction.) But for our purposes, it is generally enough to know that the solution is $\Theta(n \log n)$.
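Before attempting the induction proof, a guessed solution can be checked numerically. The following sketch evaluates the recurrence $T(n) = 2T(n/2) + 2n$, $T(1) = 1$ directly and compares it with $2n \log_2 n + n$ for powers of 2; the function name `T` simply mirrors the text.

```python
def T(n):
    """Evaluate T(n) = 2 T(n/2) + 2n with T(1) = 1, for n a power of 2."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + 2 * n


for k in range(11):
    n = 2 ** k                      # so that log2(n) = k exactly
    assert T(n) == 2 * n * k + n    # the claimed exact solution


print(T(8))
```

A check like this is evidence, not a proof, but it quickly exposes a wrong guess.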


The following theorem is extremely useful for such recurrence relations:

Theorem 2.6 The solution to the recurrence relation $T(n) = aT(n/b) + cn^k$, where $a \ge 1$ and $b \ge 2$ are integers and c and k are positive constants, satisfies:

$$T(n) \text{ is } \begin{cases} O\left(n^{\log_b a}\right) & \text{if } a > b^k, \\ O\left(n^k \log n\right) & \text{if } a = b^k, \\ O\left(n^k\right) & \text{if } a < b^k. \end{cases}$$
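The three cases of the theorem can be encoded as a small lookup, which is handy for sanity-checking a recurrence. The function `master_bound` and its return convention (an exponent of n plus a flag for the extra log factor) are our own illustration, and the equality test assumes k is an integer so that `b ** k` is exact.

```python
import math


def master_bound(a, b, k):
    """Return (exponent, has_log_factor) for T(n) = a T(n/b) + c n^k."""
    if a > b ** k:
        return math.log(a, b), False   # O(n^(log_b a))
    if a == b ** k:
        return k, True                 # O(n^k log n)
    return k, False                    # O(n^k)


print(master_bound(2, 2, 1))   # T(n) = 2T(n/2) + 2n
```

For $T(n) = 2T(n/2) + 2n$ we have a = b = 2 and k = 1, so the middle case applies and the bound is $O(n \log n)$, matching the exact solution above.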


Data Structures
We shall regard integers, real numbers, and bits, as well as more complicated objects such as lists and sets, as primitive data structures. Recall that a list is just an ordered sequence of arbitrary elements.

List $q := [x_1, x_2, \ldots, x_n]$. $x_1$ is called the head of the list. $x_n$ is called the tail of the list. $n = |q|$ is the size of the list.

We denote by ◦ the concatenation operation. Thus q ◦ r is the list that results from concatenating the list q with the list r. The operations on lists that are especially important for our purposes are:

    head(q)         return($x_1$)
    push(q, x)      q := [x] ◦ q
    pop(q)          q := [$x_2, \ldots, x_n$], return($x_1$)
    inject(q, x)    q := q ◦ [x]
    eject(q)        q := [$x_1, x_2, \ldots, x_{n-1}$], return($x_n$)
    size(q)         return(n)

The head, pop, and eject operations are not deﬁned for empty lists. Appropriate return values (either an error, or an empty symbol) can be designed depending on the implementation. A stack is a list that supports operations head, push, pop. A queue is a list that supports operations head, inject and pop. A deque supports all these operations. Note that we can implement lists either by arrays or using pointers as the usual linked lists. Arrays are often faster in practice, but they are often more complicated to program (especially if there is no implicit limit on the number of items). In either case, each of the above operations can be implemented in a constant number of steps.
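For readers who want to experiment, the list operations above map onto Python's standard `collections.deque`. The correspondence shown in the comments is our own reading of the definitions; the deque itself is a standard library type whose end operations run in constant time.

```python
from collections import deque

q = deque([1, 2, 3])

head = q[0]          # head(q): look at the first element without removing it
q.appendleft(0)      # push(q, 0): q becomes [0, 1, 2, 3]
x = q.popleft()      # pop(q): removes and returns the first element
q.append(4)          # inject(q, 4): q becomes [1, 2, 3, 4]
y = q.pop()          # eject(q): removes and returns the last element
n = len(q)           # size(q)

print(head, x, y, n, list(q))
```

As in the notes, removing from an empty list is undefined; with a deque, `popleft` and `pop` raise `IndexError` in that case.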


Application: Mergesort
For the rest of the lecture, we will review the procedure mergesort. The input is a list of n numbers, and the output is a list of the given numbers sorted in increasing order. The main data structure used by the algorithm will be a queue. We will assume that each queue operation takes 1 step, and that each comparison (is x > y?) takes 1 step. We will show that mergesort takes O(n log n) steps to sort a sequence of n numbers. The procedure mergesort relies on a function merge which takes as input two sorted (in increasing order) lists of numbers and outputs a single sorted list containing all the given numbers (with repetition).


    function merge (s, t)
      list s, t
      if s = [ ] then return t
      else if t = [ ] then return s
      else if s(1) ≤ t(1) then u := pop(s)
      else u := pop(t)
      return push(u, merge(s, t))
    end merge

    function mergesort (s)
      list s, q
      q = [ ]
      for x ∈ s
        inject(q, [x])
      rof
      while size(q) ≥ 2
        u := pop(q)
        v := pop(q)
        inject(q, merge(u, v))
      end
      if q = [ ] return [ ]
      else return q(1)
    end mergesort


The correctness of the function merge follows from the following fact: the smallest number in the input is either s(1) or t(1), and must be the first number in the output list. The rest of the output list is just the list obtained by merging s and t after deleting that smallest number. Each invocation of function merge takes O(1) steps, not counting the recursive call it makes. Since each recursive invocation of merge removes an element from either s or t, it follows that function merge halts in O(|s| + |t|) steps.

Question: Can you design an iterative (rather than recursive) version of merge? How much time does it take? Which version would be faster in practice, the recursive or the iterative?


    Q : [ [7, 9], [1, 4], [6, 16], [2, 10] ∗ [3, 11, 12, 14], [5, 8, 13, 15] ]
    Q : [ [6, 16], [2, 10] ∗ [3, 11, 12, 14], [5, 8, 13, 15], [1, 4, 7, 9] ]

Figure 2.1: One step of the mergesort algorithm.

The iterative algorithm mergesort uses q as a queue of lists. (Note that it is perfectly acceptable to have lists of lists!) It repeatedly merges together the two lists at the front of the queue, and puts the resulting list at the tail of the queue. The correctness of the algorithm follows easily from the fact that we start with sorted lists (of length 1 each), and merge them in pairs to get longer and longer sorted lists, until only one list remains.

To analyze the running time of this algorithm, let us place a special marker ∗ initially at the end of q. Whenever the marker ∗ reaches the front of q, and is either the first or the second element of q, we move it back to the end of q. Thus the presence of the marker ∗ makes no difference to the actual execution of the algorithm. Its only purpose is to partition the execution of the algorithm into phases, where a phase is the time between two successive visits of the marker ∗ to the end of q. Then we claim that the total time per phase is O(n). This is because each phase just consists of pairwise merges of disjoint lists in the queue. Each such merge takes time proportional to the sum of the lengths of the lists, and the sum of the lengths of all the lists in q is n. On the other hand, the number of lists is halved in each phase, and therefore the number of phases is at most log n. Therefore the total running time of mergesort is O(n log n).
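The queue-based mergesort can be tried out with the sketch below. It uses `collections.deque` as the queue of lists and, as one possible choice, an iterative merge over indices rather than the recursive merge of the pseudocode; the names `merge` and `mergesort_queue` are ours.

```python
from collections import deque


def merge(s, t):
    """Merge two sorted lists into one sorted list in O(len(s) + len(t)) steps."""
    out, i, j = [], 0, 0
    while i < len(s) and j < len(t):
        if s[i] <= t[j]:
            out.append(s[i])
            i += 1
        else:
            out.append(t[j])
            j += 1
    out.extend(s[i:])   # at most one of these two tails is non-empty
    out.extend(t[j:])
    return out


def mergesort_queue(items):
    """Sort by repeatedly merging the two lists at the front of a queue."""
    q = deque([x] for x in items)      # start with sorted lists of length 1
    while len(q) >= 2:
        u = q.popleft()
        v = q.popleft()
        q.append(merge(u, v))          # the merged list goes to the tail
    return q[0] if q else []


print(mergesort_queue([7, 9, 1, 4, 6, 16, 2, 10]))
```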


An alternative analysis of mergesort depends on a recursive, rather than iterative, description. Suppose we have an operation that takes a list and splits it into two equal-size parts. (We will assume our list size is a power of 2, so that all sublists we ever obtain have even size or are of length 1.) Then a recursive version of mergesort would do the following:

    function mergesort (s)
      list s, s1, s2
      if size(s) = 1 then return(s)
      split(s, s1, s2)
      s1 = mergesort(s1)
      s2 = mergesort(s2)
      return(merge(s1, s2))
    end mergesort

Here split splits the list s into two parts of equal length $s_1$ and $s_2$. The correctness follows easily from induction. Let $T(n)$ be the number of comparisons mergesort performs on lists of length n. Then $T(n)$ satisfies the recurrence relation $T(n) \le 2T(n/2) + n - 1$. This follows from the fact that to sort lists of length n we sort two sublists of length n/2 and then merge them using (at most) n − 1 comparisons. Using our general theorem on solutions of recurrence relations, we find that $T(n) = O(n \log n)$.

Question: The iterative version of mergesort uses a queue. Implicitly, the recursive version is using a stack. Explain the implicit stack in the recursive version of mergesort.

Question: Solve the recurrence relation $T(n) = 2T(n/2) + n - 1$ exactly to obtain an upper bound on the number of comparisons performed by the recursive mergesort variation.
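The recursive description can be instrumented to count comparisons, which makes the recurrence easy to observe experimentally. This is a sketch: `mergesort_count` is a hypothetical name, the split is done by slicing at the midpoint, and the merge is written iteratively inside the function.

```python
def mergesort_count(s):
    """Return (sorted copy of s, number of element comparisons performed)."""
    if len(s) <= 1:
        return list(s), 0
    mid = len(s) // 2                          # split into two halves
    left, count_left = mergesort_count(s[:mid])
    right, count_right = mergesort_count(s[mid:])
    merged, i, j, comparisons = [], 0, 0, 0
    while i < len(left) and j < len(right):
        comparisons += 1                       # one comparison per output element here
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])                    # leftovers need no comparisons
    merged.extend(right[j:])
    return merged, count_left + count_right + comparisons


result, comparisons = mergesort_count([5, 2, 7, 1, 8, 3, 6, 4])
print(result, comparisons)
```

Each merge of two lists with n elements in total uses at most n − 1 comparisons, because the loop stops as soon as one list is exhausted.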

CS124

Lecture 3

Spring 2002

Graphs and modeling
Formulating a simple, precise speciﬁcation of a computational problem is often a prerequisite to writing a computer program for solving the problem. Many computational problems are best stated in terms of graphs. A directed graph G(V, E) consists of a ﬁnite set of vertices V and a set of (directed) edges or arcs E. An arc is an ordered pair of vertices (v, w) and is usually indicated by drawing a line between v and w, with an arrow pointing towards w. Stated in mathematical terms, a directed graph G(V, E) is just a binary relation E ⊆ V ×V on a ﬁnite set V . Undirected graphs may be regarded as special kinds of directed graphs, such that (u, v) ∈ E ↔ (v, u) ∈ E. Thus, since the directions of the edges are unimportant, an undirected graph G(V, E) consists of a ﬁnite set of vertices V , and a set of edges E, each of which is an unordered pair of vertices {u, v}. Graphs model many situations. For example, the vertices of a graph can represent cities, with edges representing highways that connect them. In this case, each edge might also have an associated length. Alternatively, an edge might represent a ﬂight from one city to another, and each edge might have a weight which represents the cost of the ﬂight. A typical problem in this context is to compute shortest paths: given that you wish to travel from city X to city Y, what is the shortest path (or the cheapest ﬂight schedule). We will ﬁnd very efﬁcient algorithms for solving these problems. A seemingly similar problem is the traveling salesman problem. Supposing that a traveling salesman wishes to visit each city exactly once and return to his starting point, in what order should he visit the cities to minimize the total distance traveled? Unlike the shortest paths problem, however, this problem has no known efﬁcient algorithm. This is an example of an NP-complete problem, and one we will study towards the end of this course.


A different context in which graphs play a critical modeling role is in networks of pipes or communication links. These can, in general, be modeled by directed graphs with capacities on the edges. A directed edge from u to v with capacity c might represent a cable that can carry a ﬂow of at most c calls per unit time from u to v. A typical problem in this context is the max-ﬂow problem: given a communications network modeled by a directed graph with capacities on the edges, and two special vertices — a source s and a sink t — what is the maximum rate at which calls from s to t can be made? There are ingenious techniques for solving these types of ﬂow problems. In all the cases mentioned above, the vertices and edges of the graph represented something quite concrete such as cities and highways. Often, graphs will be used to represent more abstract relationships. For example, the vertices of a graph might represent tasks, and the edges might represent precedence constraints: a directed edge from u to v says that task u must be completed before v can be started. An important problem in this context is scheduling: in what order should the tasks be scheduled so that all the precedence constraints are satisﬁed. There are extremely fast algorithms for this problem that we will see shortly.


Representing graphs on the computer

Footnote 1: Generally, we use either n or |V| for the number of nodes in a graph, and m or |E| for the number of edges.


Depth ﬁrst search
There are two fundamental algorithms for searching a graph: depth first search and breadth first search. To better understand the need for these procedures, let us imagine the computer's view of a graph that has been input into it, in the adjacency list representation. The computer's view is fundamentally local to a specific vertex: it can examine each of the edges adjacent to a vertex in turn, by traversing its adjacency list; it can also mark vertices as visited. One way to think of these operations is to imagine exploring a dark maze with a flashlight and a piece of chalk. You are allowed to illuminate any corridor of the maze emanating from your current position, and you are also allowed to use the chalk to mark your current location in the maze as having been visited. The question is how to find your way around the maze.

We now show how the depth first search allows the computer to find its way around the input graph using just these primitives. (We will examine breadth first search shortly.)

Depth first search is a technique for exploring a graph using a stack as the basic data structure. We start by defining a recursive procedure search (the stack is implicit in the recursive calls of search): search is invoked on a vertex v, and explores all previously unexplored vertices reachable from v.

    Procedure search(v)
      vertex v
      explored(v) := 1
      previsit(v)
      for (v, w) ∈ E
        if explored(w) = 0 then search(w)
      rof
      postvisit(v)
    end search

    Procedure DFS (G(V, E))
      graph G(V, E)
      for each v ∈ V do
        explored(v) := 0
      rof
      for each v ∈ V do
        if explored(v) = 0 then search(v)
      rof
    end DFS


By modifying the procedures previsit and postvisit, we can use DFS to solve a number of important problems, as we shall see. It is easy to see that depth first search takes O(|V| + |E|) steps (assuming previsit and postvisit take O(1) time), since it explores from each vertex once, and the exploration involves a constant number of steps per outgoing edge.

The procedure search defines a tree in a natural way: each time that search discovers a new vertex, say w, we can incorporate w into the tree by connecting w to the vertex v it was discovered from via the edge (v, w). The remaining edges of the graph can be classified into three types:

• Forward edges - these go from a vertex to a descendant (other than child) in the DFS tree.
• Back edges - these go from a vertex to an ancestor in the DFS tree.
• Cross edges - these go from "right to left": there is no ancestral relation.

Question: Explain why if the graph is undirected, there can be no cross edges.

In one natural use, previsit and postvisit each increment a counter every time one of these routines is called; this corresponds naturally to a notion of time. Each routine could assign to each vertex a preorder number (time) and a postorder number (time) based on the counter. If we think of depth first search as using an explicit stack, then the preorder number is assigned when the vertex is first placed on the stack, and the postorder number is assigned when the vertex is removed from the stack. Note that this implies that the intervals [preorder(u), postorder(u)] and [preorder(v), postorder(v)] are either disjoint, or one contains the other.
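The numbering and the edge classification can be sketched together. Several things here are assumptions of ours: the graph is a dictionary mapping each vertex to its adjacency list, a single shared counter is used for both previsit and postvisit, and the sample edge set is merely one that is consistent with the description of Figure 3.1, since the figure itself is not reproduced in these notes.

```python
def dfs(graph):
    """Run DFS; return preorder times, postorder times, and tree parents."""
    pre, post, parent = {}, {}, {}
    clock = 0

    def search(v):
        nonlocal clock
        clock += 1
        pre[v] = clock                 # previsit
        for w in graph[v]:
            if w not in pre:           # w is unexplored
                parent[w] = v          # (v, w) becomes a tree edge
                search(w)
        clock += 1
        post[v] = clock                # postvisit

    for v in graph:
        if v not in pre:
            parent[v] = None
            search(v)
    return pre, post, parent


def classify(u, v, pre, post, parent):
    """Classify edge (u, v) using the interval nesting property."""
    if parent.get(v) == u:
        return "tree"
    if pre[u] < pre[v] and post[v] < post[u]:
        return "forward"               # v is a proper descendant of u
    if pre[v] <= pre[u] and post[u] <= post[v]:
        return "back"                  # v is an ancestor of u (or u itself)
    return "cross"


# Hypothetical edge set consistent with the text describing Figure 3.1.
graph = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": ["B"],
         "E": ["C", "F"], "F": []}
pre, post, parent = dfs(graph)
print(sorted(graph, key=pre.get), sorted(graph, key=post.get))
```

Python's default recursion limit is modest, so on graphs with very long paths an explicit stack would be needed instead of recursion.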


An important property of depth-first search is that the contents of the stack at any time yield a path from the root to some vertex in the depth first search tree. (Why?) This allows us to prove the following property of the postorder numbering:

Claim 3.1 If (u, v) ∈ E then postorder(u) < postorder(v) ⇐⇒ (u, v) is a back edge.

Proof: If postorder(u) < postorder(v) then v must be pushed on the stack before u. Otherwise, the existence of edge (u, v) ensures that v must be pushed onto the stack before u can be popped, resulting in postorder(v) < postorder(u), a contradiction. Furthermore, since v cannot be popped before u, it must still be on the stack when u is pushed on to it. It follows that v is on the path from the root to u in the depth first search tree, and therefore (u, v) is a back edge. The other direction is trivial.

Exercise: What conditions do the preorder and postorder numbers have to satisfy if (u, v) is a forward edge? A cross edge?

Claim 3.2 G(V, E) has a cycle iff the DFS of G(V, E) yields a back edge.

Proof: If (u, v) is a back edge, then (u, v) together with the path from v to u in the depth first tree form a cycle. Conversely, for any cycle in G(V, E), consider the vertex assigned the smallest postorder number. Then the edge leaving this vertex in the cycle must be a back edge by Claim 3.1, since it goes from a lower postorder number to a higher postorder number.
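Claims 3.1 and 3.2 together suggest a simple cycle test: compute postorder numbers, then look for an edge that goes from a lower postorder number to a higher one. The sketch below assumes the graph is a dictionary of adjacency lists; the name `has_cycle` is ours. The comparison uses ≤ rather than < so that a self-loop (u, u), which the strict inequality of Claim 3.1 does not cover, is also reported as a cycle.

```python
def has_cycle(graph):
    """Return True iff the directed graph contains a cycle."""
    post, seen = {}, set()
    clock = 0

    def search(v):
        nonlocal clock
        seen.add(v)
        for w in graph[v]:
            if w not in seen:
                search(w)
        clock += 1
        post[v] = clock            # postorder number of v

    for v in graph:
        if v not in seen:
            search(v)

    # An edge (u, v) with post[u] <= post[v] is a back edge (or a self-loop).
    return any(post[u] <= post[v] for u in graph for v in graph[u])


print(has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
```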


Figure 3.1: A sample depth-first search on a directed graph with vertices A through F. The graph is explored in preorder ABCDEF; the postorder is DCBAFE. DB is a back edge, AD is a forward edge, and EC is a cross edge.

Application of DFS: Topological sort
We now suggest an algorithm for the scheduling problem described previously. We are given a directed graph G(V, E) whose vertices V = {v1, . . . , vn} represent tasks, and whose edges represent precedence constraints: a directed edge from u to v says that task u must be completed before v can be started. The problem of topological sorting asks: in what order should the tasks be scheduled so that all the precedence constraints are satisfied? Note: The graph must be acyclic for this to be possible. (Why?) Directed acyclic graphs appear so frequently they are commonly referred to as DAGs.

Claim 3.3 If the tasks are scheduled by decreasing postorder number, then all precedence constraints are satisfied.

Proof: If G is acyclic then the DFS of G produces no back edges by Claim 3.2. Therefore by Claim 3.1, (u, v) ∈ G implies postorder(u) > postorder(v). So, if we process the tasks in decreasing order by postorder number, when task v is processed, all tasks with precedence constraints into v (and therefore higher postorder numbers) must already have been processed.

There’s another way to think about topologically sorting a DAG. Each DAG has a source, which is a vertex with no incoming edges. Similarly, each DAG has a sink, which is a vertex with no outgoing edges. (Proving this is an exercise.) Another way to topologically order the vertices of a DAG is to repeatedly output a source, remove it from the graph, and repeat until the graph is empty. Why does this work? Similarly, one could repeatedly output sinks, and this gives the reverse of a valid topological order. Again, why?
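Claim 3.3 translates almost directly into code. The following Python sketch records vertices as they finish (increasing postorder) and reverses the list. The function name, the dictionary-of-lists graph format, and the sample tasks are hypothetical, and the sketch assumes the input really is acyclic (it does not check).

```python
def topological_order(graph):
    """Return the vertices of a DAG in decreasing DFS postorder."""
    visited = set()
    finished = []                      # vertices in increasing postorder

    def search(v):
        visited.add(v)
        for w in graph.get(v, []):
            if w not in visited:
                search(w)
        finished.append(v)             # postvisit(v)

    for v in graph:
        if v not in visited:
            search(v)
    return finished[::-1]              # decreasing postorder


# Hypothetical tasks: an edge u -> v means u must be completed before v starts.
TASKS = {
    "shop": ["cook"],
    "cook": ["eat"],
    "set_table": ["eat"],
    "eat": ["wash_up"],
    "wash_up": [],
}
```

For every edge (u, v) of TASKS, u appears before v in topological_order(TASKS).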


Strongly Connected Components
Connectivity in undirected graphs is rather straightforward. A graph that is not connected can naturally be decomposed into several connected components (Figure 3.2). DFS does this handily: each restart of DFS marks a new connected component.

Figure 3.2: An undirected graph on vertices 1 through 14.


In directed graphs, what connectivity means is more subtle. In some primitive sense, the directed graph in Figure 3.3 appears connected, since if it were an undirected graph, it would be connected. But there is no path from vertex 12 to 6, or from 6 to 1, so saying the graph is connected would be misleading. We must begin with a meaningful definition of connectivity in directed graphs. Call two vertices u and v of a directed graph G = (V, E) connected if there is a path from u to v, and one from v to u. This relation between vertices is reflexive, symmetric, and transitive (check!), so it is an equivalence relation on the vertices. As such, it partitions V into disjoint sets, called the strongly connected components (SCC’s) of the graph (in Figure 3.3 there are four SCC’s). Within a strongly connected component, every pair of vertices is connected.
Figure 3.3: A directed graph on vertices 1 through 12 and its SCC’s: {1}, {2, 4, 5}, {3, 6}, and {7, 8, 9, 10, 11, 12}.


We now imagine shrinking each SCC into a vertex (a supervertex), and draw an edge (a superedge) from SCC X to SCC Y if there is at least one edge from a vertex in X to a vertex in Y. The resulting directed graph has to be a directed acyclic graph (DAG), that is to say, it can have no cycles (see Figure 3.3). The reason is simple: a cycle containing several SCC’s would merge into a single SCC, since there would be a path between every pair of vertices in the SCC’s of the cycle. Hence, every directed graph is a DAG of its SCC’s.

This important decomposition theorem allows one to think of connectivity information of a directed graph in two levels. At the top level we have a DAG, which has a useful, simple structure. For example, as we have mentioned before, a DAG is guaranteed to have at least one source (a vertex without incoming edges) and at least one sink (a vertex without outgoing edges). If we want more details, we could look inside a vertex of the DAG to see the full-fledged SCC (a strongly connected graph) that lies there.

This decomposition is extremely useful and informative; it is thus very fortunate that we have a very efficient algorithm, based on DFS, that finds the strongly connected components in linear time! We motivate this algorithm next. It is based on several interesting and slightly subtle properties of DFS:


Property 1: If DFS is started at a vertex v, then it will get stuck and restarted precisely when all vertices in the SCC of v, and in all the SCC’s that are reachable from the SCC of v, are visited. Consequently, if DFS is started at a vertex of a sink SCC (an SCC that has no edges leaving it in the DAG of SCC’s), then it will get stuck after it visits precisely the vertices of this SCC. For example, if DFS is started at vertex 11 in Figure 3.3 (a vertex in the only sink SCC in this graph), then it will visit the six vertices in the sink SCC before getting stuck: 11 itself, followed by vertices 12, 10, 9, 7, 8.

Property 1 suggests a way of starting a decomposition algorithm, by finding the first SCC: start DFS from a vertex in a sink SCC, and, when stuck, output the vertices that have been visited. They form an SCC! Of course, this leaves us with two problems: (A) How to guess a vertex in a sink SCC, and (B) how to continue our algorithm by outputting the next SCC, and so on.


Let us first face Problem (A). It turns out that it will be easier not to look for vertices in a sink SCC, but instead look for vertices in a source SCC. In particular:

Property 2: The vertex with the highest postorder number in DFS (that is, the vertex where the DFS ends) belongs to a source SCC.

The proof is by contradiction. If Property 2 were not true, and v is the vertex with the highest postorder number, then there would be an incoming edge (u, w) with u not in the SCC of v and w in the SCC of v. If u were searched before v, then u clearly has a higher postorder number. If u were searched after v, then since u does not lie in v’s SCC, it must not be searched until v is popped from the search stack, so again u must have a higher postorder number than v.

The reason behind Property 2 is thus not hard to see: if there is an SCC “above” the SCC of the vertex where the DFS ends, then the DFS should have ended in that SCC (reaching it either by restarting or by backtracking).

Property 2 provides an indirect solution to Problem (A). Consider a graph G and the reverse graph G^R, which is G with the directions of all edges reversed. G^R has precisely the same SCC’s as G (why?). So, if we make a DFS in G^R, then the vertex where we end (the one with the highest postorder number) belongs to a source SCC of G^R, that is to say, a sink SCC of G. We have solved Problem (A).


Onwards to Problem (B). How does the algorithm continue after the first sink component is output? The solution is clear: delete the SCC just output from G^R, and make another DFS in the remaining graph. The only problem is, this would be a quadratic, not linear, algorithm, since we would run an O(m) DFS algorithm up to O(n) times, once for each SCC. How can we avoid this extra work? The key observation here is that we do not have to make a new DFS in the remaining graph:

Property 3: If we make a DFS in a directed graph, and then delete a source SCC of this graph, what remains is a DFS in the remaining graph (the preorder and postorder numbers may now not be consecutive, but they will be of the right relative magnitude).

This is also easy to justify. We just imagine two runs of the DFS algorithm, one with and one without the source SCC. Consider a transcript recording the steps of the DFS algorithm. It is easy to see that the transcripts of both runs would be the same (assuming they both made the same choices of what edges to follow at what points), except where the first went through the source SCC.


Property 3 allows us to use induction to continue our SCC algorithm. After we output the first SCC, we can use the same DFS information from G^R to output the second SCC, the third SCC, and so on. The full algorithm can thus be described as follows:

Step 1: Perform DFS on G^R.
Step 2: Perform DFS on G, processing unsearched vertices in the order of decreasing postorder numbers from the DFS of Step 1. At the beginning and at every restart print “New SCC:”. When visiting vertex v, print v.

This algorithm is linear-time, since the total work is really just two depth-first searches, each of which is linear time. (Question: How does one construct G^R from G?)

If we run this algorithm on Figure 3.3, Step 1 yields the following order on the vertices (decreasing postorder in G^R’s DFS): 7, 9, 10, 12, 11, 8, 3, 6, 2, 5, 4, 1. Step 2 now produces the following output: New SCC: 7, 8, 10, 9, 11, 12, New SCC: 3, 6, New SCC: 2, 4, 5, New SCC: 1.
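The two steps can be sketched in Python as follows. This is a minimal sketch, not a tuned implementation: the function name, the dictionary-of-lists graph format, and the small sample graph are assumptions (the sample is not Figure 3.3, whose edges are not reproduced here), and the recursive DFS is only suitable for small graphs.

```python
def strongly_connected_components(graph):
    """Return the SCC's of a directed graph, sink components first."""
    # Collect every vertex, including those that only appear as edge targets.
    vertices = list(dict.fromkeys(
        list(graph) + [w for ws in graph.values() for w in ws]))

    # Build the reverse graph G^R in one pass over the edges.
    reverse = {v: [] for v in vertices}
    for v, ws in graph.items():
        for w in ws:
            reverse[w].append(v)

    def dfs(adj, v, visited, out):
        visited.add(v)
        for w in adj.get(v, []):
            if w not in visited:
                dfs(adj, w, visited, out)
        out.append(v)                  # appended in increasing postorder

    # Step 1: DFS on G^R, recording the postorder.
    visited, postorder = set(), []
    for v in vertices:
        if v not in visited:
            dfs(reverse, v, visited, postorder)

    # Step 2: DFS on G, restarting in order of decreasing postorder.
    visited, components = set(), []
    for v in reversed(postorder):
        if v not in visited:
            component = []             # "New SCC:"
            dfs(graph, v, visited, component)
            components.append(component)
    return components


# Hypothetical example: the cycle 1 -> 2 -> 3 -> 1 feeds into the cycle 4 <-> 5.
SAMPLE = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
```

On SAMPLE the sink component {4, 5} is reported first and {1, 2, 3} second, which is the order Property 1 predicts.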


Incidentally, there is more sophisticated connectivity information that one can derive from undirected graphs. An articulation point is a vertex whose deletion increases the number of connected components in the undirected graph. In Figure 3.2 there are 4 articulation points: 3, 6, 8, and 13. Articulation points divide the graph into biconnected components (the pieces of the graph between articulation points) and bridge edges. Biconnected components are maximal edge sets (of at least 2 edges) such that any two edges in the set lie on a common cycle. For example, the large connected component of the graph in Figure 3.2 contains the biconnected components on edges between vertices 1-2-3-4-5-7-8 and 6-9-10. The remaining edges, 3-6 and 8-11, are bridge edges; removing either of them disconnects the graph. Not coincidentally, this more sophisticated and subtle connectivity information can also be captured by DFS.


Putting It Into Practice
Suppose you are debugging your latest huge software program for a major industrial client. The program has hundreds of procedures, each of which must be carefully tested for bugs. You realize that, to save yourself some work, it would be best to analyze the procedures in a particular order. For instance, if procedure Write Check() calls Get Check Number(), you would probably want to test Get Check Number() first. That way, when you look for the bugs in Write Check(), you do not have to worry about checking (or re-checking) Get Check Number(). (Let’s ignore the specious argument that if there are no bugs, you might avoid testing and debugging Get Check Number() altogether by starting with Write Check().) You can easily generate a list of what procedures each procedure calls with a single pass through the code. So here’s the problem: given your program, determine what schedule you should give your testing and debugging team, so that a procedure will be debugged only after everything it calls has been debugged.

Go through the program, creating one vertex for each procedure. Introduce a directed edge from vertex B to vertex A if the procedure A calls B. This directed edge represents the fact that B must be debugged before A. We call this graph the procedure graph. If this graph is acyclic, then the topological sort will give you a valid ordering for the debugging.

What if the graph is not acyclic? Then your program uses mutual recursion; that is, there is some chain of procedures through which a procedure might end up calling itself. For example, this would be the case if procedure A calls procedure B, procedure B calls procedure C, and procedure C calls procedure A. A topological sort will detect these cycles, but what we really want is a list of them, since instances of mutual recursion are harder to test and debug. In this case, we should use the strongly connected components algorithm on the procedure graph.
The SCC algorithm will ﬁnd all the cycles, showing all instances of mutual recursion. Moreover, if we collapse the cycles in the graph, so that instances of mutual recursion are treated as one large super-procedure, then the SCC algorithm will provide a valid debugging ordering for all the procedures in this modiﬁed graph. That is, the SCC algorithm will topologically sort the underlying SCC DAG.

CS124

Lecture 4

Spring 2002

A searching technique with different properties than DFS is Breadth-First Search (BFS). While DFS used an implicit stack, BFS uses an explicit queue structure in determining the order in which vertices are searched. Also, generally one does not restart BFS, because BFS only makes sense in the context of exploring the part of the graph that is reachable from a particular vertex (s in the algorithm below).

Procedure BFS (G(V, E), s ∈ V)
    graph G(V, E)
    array[|V|] of integers dist
    queue q
    dist[s] := 0
    explored(s) := 1
    inject(q, s)
    while size(q) > 0
        v := pop(q)
        previsit(v)
        for (v, w) ∈ E
            if explored(w) = 0 then
                explored(w) := 1
                inject(q, w)
                dist[w] := dist[v] + 1
            fi
        rof
    end while
end BFS

Here a vertex is marked as explored when it is first placed on the queue, so that it is injected only once and its dist value is never overwritten by a later, longer path.

BFS runs, of course, in linear time O(|E|), under the assumption that |E| ≥ |V|. The reason is that BFS examines each edge at most once, and does a constant amount of work per edge.
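For concreteness, here is a small Python sketch of BFS that computes the dist labels. The function name, the dictionary-of-lists graph format, and the sample graph are assumptions for illustration. The sketch uses the dist dictionary itself as the explored marker and labels a vertex at the moment it is placed on the queue, so each vertex is queued at most once.

```python
from collections import deque


def bfs_distances(graph, s):
    """Return a dict mapping each vertex reachable from s to its distance."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()            # take the oldest vertex from the queue
        for w in graph.get(v, []):
            if w not in dist:          # w has not been seen before
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


# Hypothetical example graph; vertex "x" cannot be reached from "s".
SAMPLE = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "x": ["s"]}
```

Vertices that are not reachable from s (such as "x" in SAMPLE) simply do not appear in the returned dictionary.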


Figure 4.1: BFS of a directed graph. Each vertex is labeled with its distance from s; the labels range from 0 (at s) to 3.

Although BFS does not have the same subtle properties as DFS, it does provide useful information. BFS visits vertices in order of increasing distance from s. In fact, our BFS algorithm above labels each vertex with the distance from s, or the number of edges in the shortest path from s to the vertex. For example, applied to the graph in Figure 4.1, this algorithm labels the vertices (by the array dist) as shown.

Why are we sure that the array dist is the shortest-path distance from s? A simple induction proof suffices. It is certainly true if the distance is zero (this happens only at s). And, if it is true for dist(v) = d, then it can be easily shown to be true for values of dist equal to d + 1: any vertex that receives this value has an edge from a vertex with dist d, and from no vertex with a lower value of dist. Notice that vertices not reachable from s will not be visited or labeled.


Single-Source Shortest Paths —Nonnegative Lengths
What if each edge (v, w) of our graph has a length, a positive integer denoted length(v, w), and we wish to find the shortest paths from s to all vertices reachable from it? (What if we are interested only in the shortest path from s to a specific node t? As it turns out, all algorithms known for this problem have to compute the shortest path from s to all vertices reachable from it.) BFS offers a possible solution. We can subdivide each edge (u, v) into length(u, v) edges, by inserting length(u, v) − 1 “dummy” nodes, and then apply BFS to the new graph. This algorithm solves the shortest-path problem in time O(∑(u,v)∈E length(u, v)). Unfortunately, this can be very large: lengths could be in the thousands or millions. So we need to find a better way.

The problem is that this BFS-based algorithm will spend most of its time visiting “dummy” vertices; only occasionally will it do something truly interesting, like visit a vertex of the original graph. What we would like to do is run this algorithm, but only do work for the “interesting” steps.


To do this, we need to generalize BFS. Instead of using a queue, we will use a heap or priority queue of vertices. A heap is a data structure that keeps a set of objects, where each object has an associated value. The operations a heap H implements include the following:

    deletemin(H): remove and return the object with the smallest value
    insert(x, y, H): insert a new object x/value y pair in the structure
    change(x, y, H): if y is smaller than x’s current value, change the value of object x to y

We will not distinguish between insert and change, since for our purposes, they are essentially equivalent; changing the value of a vertex will be like re-inserting it. (In all heap implementations we assume that we have an array of pointers that gives, for each vertex, its position in the heap, if any. This allows us to always have at most one copy of each vertex in the heap. Furthermore, it makes changes and inserts essentially equivalent operations.)

Each entry in the heap will stand for a projected future “interesting event” of our extended BFS. Each entry will correspond to a vertex, and its value will be the current projected time at which we will reach the vertex. Another way to think of this is to imagine that, each time we reach a new vertex, we can send an explorer down each adjacent edge, and this explorer moves at a rate of 1 unit distance per second. With our heap, we will keep track of when each vertex is due to be reached for the first time by some explorer. Note that the projected time until we reach a vertex can decrease, because the new explorers that arise when we reach a newly explored vertex could reach a vertex first (see node b in Figure 4.2). But one thing is certain: the most imminent future scheduled arrival of an explorer must happen, because there is no other explorer who can reach any vertex faster. The heap conveniently delivers this most imminent event to us.


As in all shortest path algorithms we shall see, we maintain two arrays indexed by V. The first array, dist[v], will eventually contain the true distance of v from s. The other array, prev[v], will contain the last node before v in the shortest path from s to v. Our algorithm maintains a useful invariant property: at all times dist[v] will contain a conservative overestimate of the true shortest distance of v from s. Of course dist[s] is initialized to its true value 0, and all other dist’s are initialized to ∞, which is a remarkably conservative overestimate. The algorithm is known as Dijkstra’s algorithm, named after its inventor.

Algorithm Dijkstra (G = (V, E, length); s ∈ V)
    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        for (v, w) ∈ E
            if dist[w] > dist[v] + length(v, w)
                dist[w] := dist[v] + length(v, w), prev[w] := v, insert(w, dist[w], H)
            fi
        rof
    end while
end Dijkstra


Figure 4.2: Shortest paths. The vertices are labeled with their distances from s: s: 0, a: 2, b: 4, c: 3, d: 6, e: 6, f: 5.

The algorithm, run on the graph in Figure 4.2, will yield the following heap contents (node: dist/priority pairs) at the beginning of the while loop: {s : 0}, {a : 2, b : 6}, {b : 5, c : 3}, {b : 4, e : 7, f : 5}, {e : 7, f : 5, d : 6}, {e : 6, d : 6}, {e : 6}, {}. The distances from s are shown in Figure 4.2, together with the shortest path tree from s, the rooted tree defined by the pointers prev.
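A Python sketch of the algorithm using the standard heapq module is given below. Two things here are assumptions rather than part of the notes. First, heapq has no change operation, so the sketch re-inserts a vertex with its new value and skips stale entries when they are deleted; this is a common substitute, not the pointer-based heap described above. Second, the edge lengths in SAMPLE are inferred from the heap trace above, since the drawing of Figure 4.2 is not reproduced; the actual figure may contain further edges.

```python
import heapq


def dijkstra(graph, s):
    """Return (dist, prev) for all vertices reachable from s.

    graph maps a vertex to a list of (neighbor, length) pairs with
    nonnegative lengths (assumed format).
    """
    dist = {s: 0}
    prev = {s: None}
    heap = [(0, s)]
    done = set()
    while heap:
        d, v = heapq.heappop(heap)         # deletemin
        if v in done:
            continue                       # stale entry left over from a re-insert
        done.add(v)
        for w, length in graph.get(v, []):
            if w not in dist or dist[w] > d + length:
                dist[w] = d + length
                prev[w] = v
                heapq.heappush(heap, (dist[w], w))   # insert in place of change
    return dist, prev


# Edge lengths inferred from the heap trace for Figure 4.2 (an assumption).
SAMPLE = {
    "s": [("a", 2), ("b", 6)],
    "a": [("b", 3), ("c", 1)],
    "c": [("b", 1), ("e", 4), ("f", 2)],
    "b": [("d", 2)],
    "f": [("e", 1)],
}
```

Running dijkstra(SAMPLE, "s") gives the distances listed for Figure 4.2, and following prev from any vertex back to s traces out its shortest path.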


What is the running time of this algorithm? The algorithm involves |E| insert operations and |V| deletemin operations on H, and so the running time depends on the implementation of the heap H. There are many ways to implement a heap. Even an unsophisticated implementation as a linked list of node/priority pairs yields an interesting time bound, O(|V|²) (see first line of the table below). A binary heap would give O(|E| log |V|). Which of the two should we prefer? The answer depends on how dense or sparse our graphs are. In all graphs, |E| is between |V| and |V|². If it is Ω(|V|²), then we should use the linked list version. If it is anywhere below |V|² / log |V|, we should use binary heaps.

heap implementation    deletemin               insert                |V| × deletemin + |E| × insert
linked list            O(|V|)                  O(1)                  O(|V|²)
binary heap            O(log |V|)              O(log |V|)            O(|E| log |V|)
d-ary heap             O(d log |V| / log d)    O(log |V| / log d)    O((|V| · d + |E|) log |V| / log d)
Fibonacci heap         O(log |V|)              O(1) amortized        O(|V| log |V| + |E|)

A more sophisticated data structure, the d-ary heap, performs even better. A d-ary heap is just like a binary heap, except that the fan-out of the tree is d, instead of 2. (Here d should be at least 2, however!) Since the depth of any such tree with |V| nodes is log |V| / log d, it is easy to see that inserts take this amount of time. Deletemins take d times that, because deletemins go down the tree, and must look at the children of all vertices visited.

The complexity of this algorithm is a function of d. We must choose d to minimize it. A natural choice is d = |E| / |V|, which is the average degree! (Note that this is the natural choice because it equalizes the two terms of |E| + |V| · d. Alternatively, the “exact” value can be found using calculus.) This yields an algorithm that is good for both sparse and dense graphs. For dense graphs, its running time is O(|V|²). For graphs with |E| = O(|V|), it is O(|V| log |V|). Finally, for graphs with intermediate density, such as |E| = |V|^(1+δ) for a constant δ > 0, we have log d = δ log |V|, so the running time is O(|E| / δ), and the algorithm is linear!

The fastest known implementation of Dijkstra’s algorithm uses a data structure known as a Fibonacci heap, which we will not cover here. Note that the bounds for the insert operation for Fibonacci heaps are amortized bounds: certain operations may be expensive, but the average cost over a sequence of operations is constant.
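The three regimes can be checked by substituting the choice of d into the d-ary heap bound from the table. The derivation below is a worked sketch; T(d) is a name introduced here for the total running time, and δ is assumed to be a fixed positive constant.

```latex
\begin{align*}
T(d) &= |V| \cdot O\!\left(\frac{d \log |V|}{\log d}\right)
      + |E| \cdot O\!\left(\frac{\log |V|}{\log d}\right)
      = O\!\left(\frac{(|V| \cdot d + |E|)\,\log |V|}{\log d}\right), \\
d = \frac{|E|}{|V|}
     &\;\Longrightarrow\;
     T = O\!\left(\frac{|E| \log |V|}{\log (|E|/|V|)}\right), \\
|E| = \Theta(|V|^2)
     &\;\Longrightarrow\;
     \log \frac{|E|}{|V|} = \Theta(\log |V|)
     \;\Longrightarrow\; T = O(|V|^2), \\
|E| = |V|^{1+\delta}
     &\;\Longrightarrow\;
     \log \frac{|E|}{|V|} = \delta \log |V|
     \;\Longrightarrow\; T = O\!\left(\frac{|E|}{\delta}\right).
\end{align*}
```

For sparse graphs with |E| = O(|V|) the ratio |E|/|V| is a constant, so d is taken to be 2 (or any constant at least 2) and the bound becomes O(|V| log |V|).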


Single-Source Shortest Paths: General Lengths
Our argument for the correctness of our shortest path algorithm was based on the “time metaphor:” the most imminent prospective event (arrival of an explorer) must take place, exactly because it is the most imminent. This however would not work if we had negative edges. (Imagine explorers being able to arrive before they left!) If the length of edge (a, b) in Figure 4.2 were −1, the shortest path from s to b would have value 1, not 4, and our simple algorithm fails. Obviously, with negative lengths we need more involved algorithms, which repeatedly update the values of dist.

We can describe a general paradigm for constructing shortest path algorithms with arbitrary edge weights. The algorithms use arrays dist and prev, and again we maintain the invariant that dist is always a conservative overestimate of the true distance from s. (Again, dist is initialized to ∞ for all nodes, except for s, for which it is 0.) The algorithms only lower a value of dist when a suitable path is discovered to show that the overestimate can be lowered. That is, suppose we find a neighbor w of v, with dist[v] > dist[w] + length(w, v). Then we have found an actual path that shows the distance estimate is too conservative. We therefore repeatedly apply the following update rule.


procedure update ( (w, v) )
    edge (w, v)
    if dist[v] > dist[w] + length(w, v) then
        dist[v] := dist[w] + length(w, v), prev[v] := w

A crucial observation is that this procedure is safe, in that it never invalidates our “invariant” that dist is a conservative overestimate. The key idea is to consider how these updates along edges should occur. In Dijkstra’s algorithm, the edges are updated according to the time order of the imaginary explorers. But this only works with nonnegative edge lengths.

A second crucial observation concerns how many updates we have to do. Let a ≠ s be a node, and consider the shortest path from s to a, say s, v1, v2, . . . , vk = a for some k between 1 and n − 1. If we perform update first on (s, v1), later on (v1, v2), and so on, and finally on (vk−1, a), then we are sure that dist[a] contains the true distance from s to a, and that the true shortest path is encoded in prev. (Exercise: Prove this, by induction.) We must thus find a sequence of updates that guarantees that these edges are updated in this order. We don’t care if these or other edges are updated several times in between, since all we need is to have a sequence of updates that contains this particular subsequence. There is a very easy way to guarantee this: update all edges |V| − 1 times in a row!


Algorithm Shortest Paths 2 (G = (V, E, length); s ∈ V)
    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    i: integer
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    for i = 1 . . . n − 1
        for (w, v) ∈ E
            update(w, v)
end shortest paths 2

This algorithm solves the general single-source shortest path problem in O(|V | · |E|) time.
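The repeated-update scheme is short enough to write out in Python. In this sketch the function name and the edge-list format (triples (w, v, length)) are assumptions, and the sample graph is a hypothetical variant of Figure 4.2 with edge lengths inferred from the text and with the edge (a, b) given length −1.

```python
import math


def shortest_paths_general(vertices, edges, s):
    """Update every edge |V| - 1 times; return (dist, prev).

    Assumes the graph has no negative cycle reachable from s.
    """
    dist = {v: math.inf for v in vertices}
    prev = {v: None for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):
        for w, v, length in edges:
            # The update rule: lower dist[v] only when a real path shows we can.
            if dist[w] + length < dist[v]:
                dist[v] = dist[w] + length
                prev[v] = w
    return dist, prev


# Hypothetical variant of Figure 4.2 in which edge (a, b) has length -1.
VERTICES = ["s", "a", "b", "c", "d", "e", "f"]
EDGES = [
    ("s", "a", 2), ("s", "b", 6), ("a", "b", -1), ("a", "c", 1),
    ("c", "b", 1), ("c", "e", 4), ("c", "f", 2), ("b", "d", 2), ("f", "e", 1),
]
```

With these edges the distance to b comes out as 1 rather than 4, the value that Dijkstra’s algorithm would have fixed too early.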


Negative Cycles
In fact, there is a further problem that negative edges can cause. Suppose the length of edge (b, a) in Figure 4.2 were changed to −5. Then the graph would have a negative cycle (from a to b and back). On such graphs, it does not make sense to even ask the shortest path question. What is the shortest path from s to c in the modified graph? The one that goes directly from s to a to c (cost: 3), or the one that goes from s to a to b to a to c (cost: 1), or the one that takes the cycle twice (cost: −1)? And so on.

The shortest path problem is ill-posed in graphs with negative cycles. It makes no sense and deserves no answer. Our algorithm in the previous section works only in the absence of negative cycles. (Where did we assume no negative cycles in our correctness argument? Answer: When we asserted that a shortest path from s to a exists!) But it would be useful if our algorithm were able to detect whether there is a negative cycle in the graph, and thus to report reliably on the meaningfulness of the shortest path answers it provides.

This is easily done. After the |V| − 1 rounds of updates of all edges, do one last round of updates. If any changes occur during this last round, there is a negative cycle. This must be true, because if there were no negative cycles, |V| − 1 rounds of updates would have been sufficient to find the shortest paths.
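The extra round of updates can be sketched as follows. The function name and the edge-list format are assumptions, and the sample edges are a hypothetical fragment modeled on the modified Figure 4.2 (lengths inferred from the costs quoted above). Note that this check only finds negative cycles that are reachable from s.

```python
import math


def has_negative_cycle(vertices, edges, s):
    """Return True if a negative cycle is reachable from s."""
    dist = {v: math.inf for v in vertices}
    dist[s] = 0
    # The usual |V| - 1 rounds of updates over all edges.
    for _ in range(len(vertices) - 1):
        for w, v, length in edges:
            if dist[w] + length < dist[v]:
                dist[v] = dist[w] + length
    # One last round: any edge that can still be improved exposes a negative cycle.
    return any(dist[w] + length < dist[v] for w, v, length in edges)


VERTICES = ["s", "a", "b", "c"]
# Hypothetical fragment: the cycle a -> b -> a has total length 3 + (-5) = -2.
NEGATIVE = [("s", "a", 2), ("a", "b", 3), ("b", "a", -5), ("a", "c", 1)]
# The same fragment with (b, a) of length 5 has no negative cycle.
POSITIVE = [("s", "a", 2), ("a", "b", 3), ("b", "a", 5), ("a", "c", 1)]
```

Here has_negative_cycle(VERTICES, NEGATIVE, "s") is True and has_negative_cycle(VERTICES, POSITIVE, "s") is False.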


Shortest Paths on DAG’s
There are two subclasses of weighted graphs that automatically exclude the possibility of negative cycles: graphs with non-negative weights and DAG’s. We have already seen that there is a fast algorithm when the weights are non-negative. Here we will give a linear algorithm for single-source shortest paths in DAG’s.

Our algorithm is based on the same principle as our algorithm for negative weights. We are trying to find a sequence of updates, such that all shortest paths are its subsequences. But in a DAG we know that all shortest paths from s must go in the topological order of the DAG. All we have to do then is first topologically sort the DAG using a DFS, and then update all edges coming out of nodes in the topological order. This algorithm solves the general single-source shortest path problem for DAG’s in O(|V| + |E|) time.
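A Python sketch of this idea follows: sort the vertices by decreasing DFS postorder, then update the outgoing edges of each vertex in that order. The function name, the graph format (vertex mapped to a list of (neighbor, length) pairs), and the sample DAG with its negative lengths are hypothetical.

```python
import math


def dag_shortest_paths(graph, s):
    """Single-source shortest paths in a DAG, negative lengths allowed."""
    visited, finished = set(), []

    def search(v):
        visited.add(v)
        for w, _ in graph.get(v, []):
            if w not in visited:
                search(w)
        finished.append(v)             # increasing postorder

    for v in graph:
        if v not in visited:
            search(v)

    dist = {v: math.inf for v in finished}
    dist[s] = 0
    # Decreasing postorder is a topological order, so each edge is updated
    # only after the distance of its tail is final.
    for v in reversed(finished):
        for w, length in graph.get(v, []):
            if dist[v] + length < dist[w]:
                dist[w] = dist[v] + length
    return dist


# Hypothetical DAG with some negative edge lengths.
SAMPLE = {
    "s": [("a", 2), ("b", 6)],
    "a": [("b", -3), ("c", 1)],
    "b": [("c", -1)],
    "c": [],
}
```

Each edge is updated exactly once, which is where the linear running time comes from.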

CS124

Lecture 5

Spring 2002

Minimum Spanning Trees
A tree is an undirected graph which is connected and acyclic. It is easy to show that a graph G(V, E) that satisfies any two of the following properties also satisfies the third, and is therefore a tree:
• G(V, E) is connected
• G(V, E) is acyclic
• |E| = |V| − 1
(Exercise: Show that any two of the above properties imply the third (use induction).)

A spanning tree in an undirected graph G(V, E) is a subset of edges T ⊆ E that is acyclic and connects all the vertices in V. It follows from the above conditions that a spanning tree must consist of exactly n − 1 edges, where n = |V|. Now suppose that each edge has a weight associated with it: w : E → Z. Say that the weight of a tree T is the sum of the weights of its edges; w(T) = ∑e∈T w(e). A minimum spanning tree in a weighted graph G(V, E) is one which has the smallest weight among all spanning trees in G(V, E).

As an example of why one might want to find a minimum spanning tree, consider someone who has to install the wiring to network together a large computer system. The requirement is that all machines be able to reach each other via some sequence of intermediate connections. By representing each machine as a vertex and the cost of wiring two machines together by a weighted edge, the problem of finding the minimum cost wiring scheme reduces to the minimum spanning tree problem.

In general, the number of spanning trees in G(V, E) can grow exponentially in the number of vertices in G(V, E). (Exercise: Try to determine the number of different spanning trees for a complete graph on n vertices.) Therefore it is infeasible to search through all possible spanning trees to find the lightest one. Luckily it is not necessary to examine all possible spanning trees; minimum spanning trees satisfy a very important property which makes it possible to efficiently zoom in on the answer.


We shall construct the minimum spanning tree by successively selecting edges to include in the tree. We will guarantee after the inclusion of each new edge that the selected edges, X, form a subset of some minimum spanning tree, T. How can we guarantee this if we don’t yet know any minimum spanning tree in the graph? The following property provides this guarantee:

Cut property: Let X ⊆ T where T is a MST in G(V, E). Let S ⊂ V such that no edge in X crosses between S and V − S; i.e. no edge in X has one endpoint in S and one endpoint in V − S. Among edges crossing between S and V − S, let e be an edge of minimum weight. Then X ∪ {e} ⊆ T' where T' is a MST in G(V, E).

The cut property says that we can construct our tree greedily. Our greedy algorithms can simply take the minimum weight edge across two regions not yet connected. Eventually, if we keep acting in this greedy manner, we will arrive at the point where we have a minimum spanning tree. Although the idea of acting greedily at each point may seem quite intuitive, it is very unusual for such a strategy to actually lead to an optimal solution, as we will see when we examine other problems!

Proof: Suppose e ∉ T (if e ∈ T, we can simply take T' = T). Adding e into T creates a unique cycle. We will remove a single edge e' from this unique cycle, thus getting T' = T ∪ {e} − {e'}. It is easy to see that T' must be a tree: it is connected and has n − 1 edges. Furthermore, as we shall show below, it is always possible to select an edge e' in the cycle such that it crosses between S and V − S. Now, since e is a minimum weight edge crossing between S and V − S, w(e') ≥ w(e). Therefore w(T') = w(T) + w(e) − w(e') ≤ w(T). However since T is a MST, it follows that T' is also a MST and w(e) = w(e'). Furthermore, since X has no edge crossing between S and V − S, the removed edge e' is not in X; it follows that X ⊆ T' and thus X ∪ {e} ⊆ T'.

How do we know that there is an edge e' ≠ e in the unique cycle created by adding e into T, such that e' crosses between S and V − S? This is easy to see, because as we trace the cycle, e crosses between S and V − S, and we must cross back along some other edge to return to the starting point.


In light of this, the basic outline of our minimum spanning tree algorithms is going to be the following:

X := { }.
Repeat until |X| = n − 1:
    Pick a set S ⊆ V such that no edge in X crosses between S and V − S.
    Let e be a lightest edge in G(V, E) that crosses between S and V − S.
    X := X ∪ {e}.

The difference between minimum spanning tree algorithms lies in how we pick the set S at each step.


Prim’s algorithm:
In the case of Prim’s algorithm, X consists of a single tree, and the set S is the set of vertices of that tree. One way to think of the algorithm is that it grows a single tree, adding a new vertex at each step, until it has the minimum spanning tree. In order to find the lightest edge crossing between S and V − S, Prim’s algorithm maintains a heap containing all those vertices in V − S which are adjacent to some vertex in S. The priority of a vertex v, according to which the heap is ordered, is the weight of its lightest edge to a vertex in S. This is reminiscent of Dijkstra’s algorithm (where distance was used for the heap instead of the edge weight). As in Dijkstra’s algorithm, each vertex v will also have a parent pointer prev(v) which is the other endpoint of the lightest edge from v to a vertex in S. The pseudocode for Prim’s algorithm is almost identical to that for Dijkstra’s algorithm:

Procedure Prim(G(V, E), s)
    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    S: set of vertices, initially empty
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        S := S ∪ {v}
        for (v, w) ∈ E and w ∈ V − S do
            if dist[w] > length(v, w)
                dist[w] := length(v, w), prev[w] := v, insert(w, dist[w], H)
            fi
        rof
    end while
end Prim

Note that each vertex is “inserted” on the heap at most once; other insert operations simply change the value on the heap. The vertices that are removed from the heap form the set S for the cut property. The set X of edges chosen to be included in the MST is given by the parent pointers of the vertices in the set S. Since the smallest key in the heap at any time gives the lightest edge crossing between S and V − S, Prim’s algorithm follows the generic outline for a MST algorithm presented above, and therefore its correctness follows from the cut property. The running time of Prim’s algorithm is clearly the same as Dijkstra’s algorithm, since the only change is how we prioritize nodes in the heap. Thus, if we use d-heaps, the running time of Prim’s algorithm is O(m log_{m/n} n).
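The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is assumed to be connected and given as a dictionary mapping each vertex to a list of (neighbor, weight) pairs, and the function name prim is ours. Python's heapq module is a binary heap with no operation to change a key, so the sketch pushes a fresh entry and skips stale ones when they are popped, rather than using the d-heap from the running-time discussion.

```python
import heapq

def prim(graph, s):
    """Return the MST edges as (prev[v], v, weight) triples.

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs
    (an assumed input format; each undirected edge is listed at both endpoints).
    """
    dist = {v: float("inf") for v in graph}
    prev = {v: None for v in graph}
    dist[s] = 0
    in_tree = set()            # the set S of the cut property
    heap = [(0, s)]
    while heap:
        _, v = heapq.heappop(heap)
        if v in in_tree:
            continue           # stale entry left behind by an earlier push
        in_tree.add(v)
        for w, length in graph[v]:
            if w not in in_tree and dist[w] > length:
                dist[w] = length
                prev[w] = v
                heapq.heappush(heap, (length, w))
    return [(prev[v], v, dist[v]) for v in graph if prev[v] is not None]

example = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2), ("d", 5)],
    "c": [("a", 4), ("b", 2), ("d", 3)],
    "d": [("b", 5), ("c", 3)],
}
tree = prim(example, "a")
```

With lazy deletion the heap can hold up to m entries, so this sketch runs in O(m log m) time rather than the d-heap bound.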


Kruskal’s algorithm:
Kruskal’s algorithm uses a different strategy from Prim’s algorithm. Instead of growing a single tree, Kruskal’s algorithm attempts to put the lightest edge possible in the tree at each step. Kruskal’s algorithm starts with the edges sorted in increasing order by weight. Initially X = { }, and each vertex in the graph is regarded as a trivial tree (with no edges). Each edge in the sorted list is examined in order, and if its endpoints are in the same tree, then the edge is discarded; otherwise it is included in X and this causes the two trees containing the endpoints of this edge to merge into a single tree. Note that, by this process, we are implicitly choosing a set S ⊆ V with no edge in X crossing between S and V − S, so this fits in our basic outline of a minimum spanning tree algorithm.

To implement Kruskal’s algorithm, given a forest of trees, we must decide given two vertices whether they belong to the same tree. For the purposes of this test, each tree in the forest can be represented by a set consisting of the vertices in that tree. We also need to be able to update our data structure to reflect the merging of two trees into a single tree. Thus our data structure will maintain a collection of disjoint sets (disjoint since each vertex is in exactly one tree), and support the following three operations:

• MAKESET(x): Create a new set containing only the element x.
• FIND(x): Given an element x, which set does it belong to?
• UNION(x, y): Replace the set containing x and the set containing y by their union.

The pseudocode for Kruskal’s algorithm follows:

Function Kruskal(graph G(V, E))
    set X
    X := { }
    E := sort E by weight
    for u ∈ V
        MAKESET(u)
    rof
    for (u, v) ∈ E (in increasing order) do
        if FIND(u) ≠ FIND(v) do
            X := X ∪ {(u, v)}
            UNION(u, v)
    rof
    return(X)
end Kruskal


The correctness of Kruskal’s algorithm follows from the following argument: Kruskal’s algorithm adds an edge e into X only if it connects two trees; let S be the set of vertices in one of these two trees. Then e must be the first edge in the sorted edge list that has one endpoint in S and the other endpoint in V − S, and is therefore the lightest edge that crosses between S and V − S. Thus the cut property of MST implies the correctness of the algorithm.

The running time of the algorithm, assuming the edges are given in sorted order, is dominated by the set operations: UNION and FIND. There are n − 1 UNION operations (one corresponding to each edge in the spanning tree), and 2m FIND operations (2 for each edge). Thus the total time of Kruskal’s algorithm is O(m × FIND + n × UNION). We will soon show that this is O(m log* n). Note that, if the edges are not initially given in sorted order, then to sort them in the obvious way takes O(m log m) time, and this would be the dominant part of the running time of the algorithm.
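As a rough illustration of Kruskal's algorithm, here is a minimal Python sketch. The input format (a list of (weight, u, v) triples) and the function name kruskal are assumptions made for the example. The set operations are implemented naively with bare parent pointers, without the heuristics that give the log* bound.

```python
def kruskal(vertices, edges):
    """Return the list X of chosen edges as (u, v, weight) triples.

    edges: iterable of (weight, u, v) triples (an assumed input format).
    """
    parent = {u: u for u in vertices}      # MAKESET for every vertex

    def find(x):
        # Follow parent pointers up to the root, which names the set.
        while parent[x] != x:
            x = parent[x]
        return x

    chosen = []
    for weight, u, v in sorted(edges):     # increasing order of weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # endpoints lie in different trees
            chosen.append((u, v, weight))
            parent[root_u] = root_v        # UNION of the two trees
    return chosen

edge_list = [(1, "a", "b"), (4, "a", "c"), (2, "b", "c"),
             (5, "b", "d"), (3, "c", "d")]
mst = kruskal("abcd", edge_list)
```

Because find here can walk a long chain of pointers, this version only demonstrates correctness, not the running time discussed above.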


Exchange Property
Actually spanning trees satisfy an even stronger property than the cut property: the exchange property. The exchange property is quite remarkable since it implies that we can “walk” from any spanning tree T to a minimum spanning tree T̂ by a sequence of exchange moves; each such move consists of throwing an edge out of the current tree that is not in T̂, and adding a new edge into the current tree that is in T̂. Moreover, each successive tree in the “walk” is guaranteed to weigh no more than its predecessor.

Exchange property: Let T and T′ be spanning trees in G(V, E). Given any e′ ∈ T′ − T, there exists an edge e ∈ T − T′ such that (T − {e}) ∪ {e′} is also a spanning tree.

The proof is quite similar to that of the cut property. Adding e′ into T results in a unique cycle. There must be some edge in this cycle that is not in T′ (since otherwise T′ must have a cycle). Call this edge e. Then deleting e restores a spanning tree, since connectivity is not affected, and the number of edges is restored to n − 1.

To see how one may use this exchange property to “walk” from any spanning tree to a MST: let T be any spanning tree and let T̂ be a MST in G(V, E). Let e be the lightest edge that is not in both trees (that is, the lightest edge belonging to exactly one of them). Perform an exchange using this edge. Since the exchange was done with the lightest such edge, the new tree must be lighter than the old one. Since T̂ is already a MST, it follows that the exchange must have been performed upon T and results in a lighter spanning tree which has more edges in common with T̂ (if there are several edges of the same weight, then the new tree might not be lighter, but it still has more edges in common with T̂).


[Figure 5.1: An example of Prim’s algorithm and Kruskal’s algorithm. Which is which?]

CS124

Lecture 6

Spring 2002

Disjoint set (Union-Find)
For Kruskal’s algorithm for the minimum spanning tree problem, we found that we needed a data structure for maintaining a collection of disjoint sets. That is, we need a data structure that can handle the following operations:

• MAKESET(x) - create a new set containing the single element x
• UNION(x, y) - replace two sets containing x and y by their union
• FIND(x) - return the name of the set containing the element x

Naturally, this data structure is useful in other situations, so we shall consider its implementation in some detail. Within our data structure, each set is represented by a tree, so that each element points to a parent in the tree. The root of each tree will point to itself. In fact, we shall use the root of the tree as the name of the set itself; hence the name of each set is given by a canonical element, namely the root of the associated tree.

It is convenient to add a fourth operation LINK(x, y) to the above, where we require for LINK that x and y are two roots. LINK changes the parent pointer of one of the roots, say x, and makes it point to y. It returns the root of the now composite tree y. With this addition, we have UNION(x, y) = LINK(FIND(x), FIND(y)), so the main problem is to arrange our data structure so that FIND operations are very efficient.


Notice that the time to do a FIND operation on an element corresponds to its depth in the tree. Hence our goal is to keep the trees short. Two well-known heuristics for keeping trees short in this setting are UNION BY RANK and PATH COMPRESSION.

We start with the UNION BY RANK heuristic. The idea of UNION BY RANK is to ensure that when we combine two trees, we try to keep the overall depth of the resulting tree small. This is implemented as follows: the rank of an element x is initialized to 0 by MAKESET. An element’s rank is only updated by the LINK operation. If x and y have the same rank r, then invoking LINK(x, y) causes the parent pointer of x to be updated to point to y, and the rank of y is then updated to r + 1. On the other hand, if x and y have different ranks, then when invoking LINK(x, y) the parent pointer of the element with smaller rank is updated to point to the element with larger rank. The idea is that the rank of the root is associated with the depth of the tree, so this process keeps the depth small. (Exercise: Try some examples by hand with and without using the UNION BY RANK heuristic.)

The idea of PATH COMPRESSION is that, once we perform a FIND on some element, we should adjust its parent pointer so that it points directly to the root; that way, if we ever do another FIND on it, we start out much closer to the root. Note that, until we do a FIND on an element, it might not be worth the effort to update its parent pointer, since we may never access it at all. Once we access an item, however, we must walk through every pointer to the root, so modifying the pointers only changes the cost of this walk by a constant factor.


procedure MAKESET(x)
    p(x) := x
    rank(x) := 0
end

function FIND(x)
    if x ≠ p(x) then
        p(x) := FIND(p(x))
    return(p(x))
end

function LINK(x, y)
    if rank(x) > rank(y) then x ↔ y
    if rank(x) = rank(y) then rank(y) := rank(y) + 1
    p(x) := y
    return(y)
end
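The three procedures above translate almost line for line into Python. The sketch below is one possible rendering, with hypothetical names (DisjointSets, makeset, find, link, union); the check in union that the two roots differ is our addition, so that a UNION of two elements already in the same set does not raise a rank needlessly.

```python
class DisjointSets:
    """Union-find with UNION BY RANK and PATH COMPRESSION (illustrative sketch)."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def makeset(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find(self, x):
        # Path compression: every element on the path ends up pointing at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def link(self, x, y):
        # x and y must both be roots.
        if self.rank[x] > self.rank[y]:
            x, y = y, x
        if self.rank[x] == self.rank[y]:
            self.rank[y] += 1
        self.parent[x] = y
        return y

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:       # added guard: already in the same set
            return root_x
        return self.link(root_x, root_y)

ds = DisjointSets()
for element in "abcde":
    ds.makeset(element)
ds.union("a", "b")
ds.union("c", "d")
ds.union("a", "c")
```

After these three unions, a, b, c, and d share one root of rank 2, while e remains in its own set.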


In our analysis, we show that any sequence of m UNION and FIND operations on n elements takes at most O((m + n) log* n) steps, where log* n is the number of times you must iterate the log₂ function on n before getting a number less than or equal to 1. (So log* 4 = 2, log* 16 = 3, log* 65536 = 4.) We should note that this is not the tightest analysis possible; however, this analysis is already somewhat complex!

Note that we are going to do an amortized analysis here. That is, we are going to consider the cost of the algorithm over a sequence of steps, instead of considering the cost of a single operation. In fact a single UNION or FIND operation could require O(log n) operations. (Exercise: Prove this!) Only by considering an entire sequence of operations at once can we obtain the above bound. Our argument will require some interesting accounting to total the cost of a sequence of steps.
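To get a feel for how slowly log* grows, the definition can be evaluated directly. The helper name log_star is ours; the sketch simply iterates the base-2 logarithm until the value drops to 1 or below.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

values = [log_star(n) for n in (1, 2, 4, 16, 65536)]
```

For any input that could arise in practice the value is at most 5, since log* of 2^65536 is 5.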


We first make a few observations about rank.

• if v ≠ p(v) then rank(p(v)) > rank(v)
• whenever p(v) is updated, rank(p(v)) increases
• the number of elements with rank k is at most n/2^k
• the number of elements with rank at least k is at most n/2^(k−1)

The first two assertions are immediate from the description of the algorithm. The third assertion follows from the fact that the rank of an element v changes only if LINK(v, w) is executed, rank(v) = rank(w), and v remains the root of the combined tree; in this case v’s rank is incremented by 1. A simple induction then yields that when rank(v) is incremented to k, the resulting tree has at least 2^k elements. Since no element can lie below two different elements of rank k, there are at most n/2^k elements of rank k. The last assertion then follows from the third assertion, as

$$\sum_{j=k}^{\infty} \frac{n}{2^j} = \frac{n}{2^{k-1}}.$$

Exercise: Show that the maximum rank an item can have is log n.


As soon as an element becomes a non-root, its rank is fixed. Let us divide the (non-root) elements into groups according to their ranks. Group i contains all elements whose rank r satisfies log* r = i. For example, elements in group 3 have ranks in the range (4, 16], and in general the ranks associated with a group form a range (k, 2^k], where k = 1, 2, 4, 16, 65536, ... for groups 1, 2, 3, 4, 5, ... For convenience we shall write this more simply by saying group (k, 2^k] to mean the group with these ranks.

It is easy to establish the following assertions about these groups:

• The number of distinct groups is at most log* n. (Use the fact that the maximum rank is log n.)
• The number of elements in the group (k, 2^k] is at most n/2^k.

Let us assign 2^k tokens to each element in group (k, 2^k]. The total number of tokens assigned to all elements from that group is then at most 2^k · (n/2^k) = n, and the total number of groups is at most log* n, so the total number of tokens given out is at most n log* n. We use these tokens to account for the work done by FIND operations.

Recall that the number of steps for a FIND operation is proportional to the number of pointers that the FIND operation must follow up the tree. We separate the pointers into two groups, depending on the groups of u and p(u) = v, as follows:

• Type 1: a pointer is of Type 1 if u and v belong to different groups, or v is the root.
• Type 2: a pointer is of Type 2 if u and v belong to the same group.

We account for the two types of pointers in two different ways. Type 1 links are “charged” directly to the FIND operation; Type 2 links are “charged” to u, who “pays” for the operation using one of the tokens. Let us consider these charges more carefully.


The number of Type 1 links each FIND operation goes through is at most log* n, since there are only log* n groups, and the group number increases as we move up the tree.

What about Type 2 links? We charge these links directly back to u, who is supposed to pay for them with a token. Does u have enough tokens? The point here is that each time a FIND operation goes through an element u, its parent pointer is changed to the current root of the tree (by PATH COMPRESSION), so the rank of its parent increases by at least 1. If u is in the group (k, 2^k], then the rank of u’s parent can increase fewer than 2^k times before it moves to a higher group. Therefore the 2^k tokens we assign to u are sufficient to pay for all FIND operations that go through u to a parent in the same group.


We now count the total number of steps for m UNION and FIND operations. Clearly LINK requires just O(1) steps, and since a UNION operation is just a LINK and 2 FIND operations, it suffices to bound the time for at most 2m FIND operations. Each FIND operation is charged at most log* n for a total of O(m log* n). The total number of tokens used is at most n log* n, and each token pays for a constant number of steps. Therefore the total number of steps is O((m + n) log* n).

Let us give a more equation-oriented explanation. The total time spent over the course of m UNION and FIND operations is just

$$\sum_{\text{all FIND ops}} (\text{\# links passed through}).$$

We split this sum up into two parts:

$$\sum_{\text{all FIND ops}} (\text{\# links in same group}) + \sum_{\text{all FIND ops}} (\text{\# links in different groups}).$$

(Technically, the case where a link goes to the root should be handled explicitly; however, this is just O(m) links in total, so we don’t need to worry!) The second term is clearly O(m log* n). The first term can be upper bounded by

$$\sum_{\text{all elements } u} (\text{\# ranks in the group of } u),$$

because each element u can be charged only once for each rank in its group. (Note here that this is because the links to the root count in the second sum!) This last sum is bounded above by

$$\sum_{\text{all groups}} (\text{\# items in group}) \cdot (\text{\# ranks in group}) \le \sum_{\text{groups } (k,\, 2^k]} \frac{n}{2^k} \cdot 2^k \le n \log^* n.$$

This completes the proof.


[Figure 6.1: Examples of UNION BY RANK and PATH COMPRESSION. One panel shows UNION(x, y); the other shows FIND(d) on the elements a, b, c, d.]

CS124

Lecture 7

In today’s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it doesn’t, it can provide a useful result.

Horn Formulae
A simple application of the greedy paradigm solves an important special case of the SAT problem. We have already seen that 2SAT can be solved in linear time. Now consider SAT instances where in each clause, there is at most one positive literal. Such formulae are called Horn formulae; for example, this is an instance:

(x ∨ ¬y ∨ ¬z ∨ ¬w) ∧ (¬x ∨ ¬y ∨ ¬w) ∧ (¬x ∨ ¬z ∨ w) ∧ (¬x ∨ y) ∧ (x) ∧ (¬z) ∧ (¬x ∨ ¬y ∨ w).

Given a Horn formula, we can separate its clauses into two parts: the pure negative clauses (those without a positive literal) and the implications (those with a positive literal). We call clauses with a positive literal implications because they can be rewritten suggestively as implications; (x ∨ ¬y ∨ ¬z ∨ ¬w) is equivalent to (y ∧ z ∧ w) → x. Note the trivial clause (x) can be thought of as a trivial implication → x. Hence, in the example above, we have the implications

(y ∧ z ∧ w → x), (x ∧ z → w), (x → y), (→ x), (x ∧ y → w)

and these two pure negative clauses

(¬x ∨ ¬y ∨ ¬w), (¬z).

We can now develop a greedy algorithm. The idea behind the algorithm is that we start with all variables set to false, and we only set variables to T if an implication forces us to. Recall that an implication is not satisfied if all variables to the left of the arrow are true and the one to the right is false. This algorithm is greedy, in the sense that it (greedily) tries to ensure the pure negative clauses are satisfied, and only changes a variable if absolutely forced.


Algorithm Greedy-Horn(φ: CNF formula with at most one positive literal per clause)
    Start with the truth assignment t := FFF···F
    while there is an implication that is not satisfied do
        make the implied variable T in t
    if all pure negatives are satisfied then return t
    else return “φ is unsatisfiable”

Once we have the proposed truth assignment, we look at the pure negatives. If there is a pure negative clause that is not satisfied by the proposed truth assignment, the formula cannot be satisfied. This follows from the fact that all the pure negative clauses will be satisfied if any of their variables are set to F. If such a clause is unsatisfied, all of its variables must be set to T. But we only set a variable to T if we are forced to by the implications, so every assignment that satisfies the implications also sets these variables to T and violates the same clause. If all the pure negative clauses are satisfied, then we have found a truth assignment.

On the example above, Greedy-Horn first flips x to true, forced by the implication → x. Then y gets forced to true (from x → y), and similarly w is forced to true. (Why?) Looking at the pure negative clauses, we find that the first is not satisfied, and hence we conclude the original formula had no truth assignment.

Exercise: Show that the Greedy-Horn algorithm can be implemented in linear time in the length of the formula (i.e., the total number of appearances of all literals).
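A direct, unoptimized rendering of Greedy-Horn in Python might look like the sketch below. The representation is an assumption made for the example: each implication is a pair (premises, conclusion), and each pure negative clause is given by the tuple of variables it negates. The sketch rescans all implications after every change, so it is not the linear-time implementation the exercise asks for.

```python
def greedy_horn(implications, pure_negatives):
    """Return the set of variables set to T, or None if the formula is unsatisfiable.

    implications: list of (premises, conclusion) pairs, premises a tuple of variables.
    pure_negatives: list of tuples of variables, each standing for the OR of their negations.
    """
    true_vars = set()              # every variable starts out F
    changed = True
    while changed:
        changed = False
        for premises, conclusion in implications:
            # An implication is violated when all premises are T and the conclusion is F.
            if conclusion not in true_vars and all(p in true_vars for p in premises):
                true_vars.add(conclusion)
                changed = True
    for clause in pure_negatives:
        if all(v in true_vars for v in clause):
            return None            # every literal in this clause is false
    return true_vars

implications = [(("y", "z", "w"), "x"), (("x", "z"), "w"),
                (("x",), "y"), ((), "x"), (("x", "y"), "w")]
result = greedy_horn(implications, [("x", "y", "w"), ("z",)])
relaxed = greedy_horn(implications, [("z",)])
```

On the example formula from the notes, result is None (unsatisfiable); dropping the first pure negative clause leaves a satisfiable formula with x, y, and w true.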


Huffman Coding
Suppose that you must store the map of a chromosome which consists of a sequence of 130 million symbols of the form A, C, G, or T. To store the sequence efficiently, you represent each character with just 2 bits: A as 00, C as 01, G as 10, and T as 11. Such a representation is called an encoding. With this encoding, the sequence requires 260 megabits to store.

Suppose, however, that you know that some symbols appear more frequently than others. For example, suppose A appears 70 million times, C 3 million times, G 20 million times, and T 37 million times. In this case it seems wasteful to use two bits to represent each A. Perhaps a more elaborate encoding assigning a shorter string to A could save space.

We restrict ourselves to encodings that satisfy the prefix property: no assigned string is the prefix of another. This property allows us to avoid backtracking while decoding. For an example without the prefix property, suppose we represented A as 1 and C as 101. Then when we read a 1, we would not know whether it was an A or the beginning of a C! Clearly we would like to avoid such problems, so the prefix property is important.

You can picture an encoding with the prefix property as a binary tree. For example, the binary tree below corresponds to an optimal encoding in the above situation. (There can be more than one optimal encoding! Just flip the left and right hand sides of the tree.) Here a branch to the left represents a 0, and a branch to the right represents a 1. Therefore A is represented by 1, C by 001, G by 000, and T by 01. This encoding requires only 213 million bits, an improvement of about 18% over the balanced tree (the encoding 00, 01, 10, 11). (This does not include the bits that might be necessary to store the form of the encoding!)
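The bit counts quoted above are easy to check. The short sketch below (helper names total_bits and has_prefix_property are ours) computes the cost of the fixed-length and variable-length encodings from the stated frequencies, in millions of bits, and tests the prefix property.

```python
freq = {"A": 70, "C": 3, "G": 20, "T": 37}      # millions of occurrences
fixed_code = {"A": "00", "C": "01", "G": "10", "T": "11"}
variable_code = {"A": "1", "C": "001", "G": "000", "T": "01"}

def total_bits(freq, code):
    """Total encoded length: each symbol costs len(codeword) bits per occurrence."""
    return sum(freq[s] * len(code[s]) for s in freq)

def has_prefix_property(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

fixed_cost = total_bits(freq, fixed_code)        # 2 bits for each of 130 million symbols
variable_cost = total_bits(freq, variable_code)  # 70*1 + 3*3 + 20*3 + 37*2
```

The two costs come out to 260 and 213 million bits, a saving of 47 million bits.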


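[Figure 7.1: A Huffman tree. The root has an internal node of weight 60 on branch 0 and the leaf A (70) on branch 1. The node of weight 60 has an internal node of weight 23 on branch 0 and the leaf T (37) on branch 1. The node of weight 23 has the leaf G (20) on branch 0 and the leaf C (3) on branch 1.]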


Let us note some properties of the binary trees that represent encodings. The symbols must correspond to leaves; an internal node that represents a character would violate the prefix property. The code words are thus given by all root-to-leaf paths. All internal nodes must have exactly two children, as an internal node with only one child could be deleted to yield a better code. Hence if there are n leaves there are n − 1 internal nodes. Also, if we assign frequencies to the internal nodes, so that the frequency of an internal node is the sum of the frequencies of its children, then the total length produced by the encoding is the sum of the frequencies of all nodes except the root. (A one line proof: each edge corresponds to a bit that is written as many times as the frequency of the node to which it leads.)

One final property allows us to determine how to build the tree: the two symbols with the smallest frequencies are together at the lowest level of the tree. Otherwise, we could improve the encoding by swapping a more frequently used character at the lowest level up. (This is not a full proof; feel free to complete one.)

This tells us how to construct the optimum tree greedily. Take the two symbols with the lowest frequency, delete them from the list of symbols, and replace them with a new meta-character; this new meta-character will lie directly above the two deleted symbols in the tree. Repeat this process until the whole tree is constructed.

We can prove by induction that this gives an optimal tree. It works for 2 symbols (base case). We also show that if it works for n letters, it must also work for n + 1 letters. After deleting the two least frequent symbols and replacing them with a meta-character, it is as though we have just n symbols. This process yields an optimal tree for these n symbols (by the inductive hypothesis). Expanding the meta-character back into the two deleted nodes must now yield an optimal tree, since otherwise we could have found a better tree for the n symbols.
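The greedy construction can be sketched in a few lines of Python using a heap to find the two lowest frequencies. The function name huffman_code and the representation of a meta-character as a pair (left, right) are assumptions of this sketch; symbols are assumed not to be tuples themselves. A running counter breaks ties so that the heap never has to compare subtrees.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Return a dict mapping each symbol to its codeword (a string of 0s and 1s)."""
    tiebreak = count()
    heap = [(f, next(tiebreak), symbol) for symbol, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent (meta-)symbols into a new meta-character.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):        # internal node: recurse into both children
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                              # leaf: record the root-to-leaf path
            code[node] = prefix or "0"     # a lone symbol still needs one bit
    assign(tree, "")
    return code

dna = {"A": 70, "C": 3, "G": 20, "T": 37}
dna_code = huffman_code(dna)
dna_cost = sum(dna[s] * len(dna_code[s]) for s in dna)
```

For the chromosome frequencies used earlier, the resulting code has codeword lengths 1, 2, 3, 3 and a total cost of 213 million bits; the exact 0/1 labels may differ from the figure, since flipping branches gives an equally good code.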


[Figure 7.2: The first few steps of building a Huffman tree. Starting from the frequencies A 60, E 70, I 40, O 50, U 20, Y 30, the steps shown merge U and Y into [UY] 50, then I and [UY] into [I[UY]] 90, then O and A into [OA] 110.]


It is important to realize that when we say that a Huffman tree is optimal, this does not mean that it gives the best way to compress a string. It only says that we cannot do better by encoding one symbol at a time. By encoding frequently used blocks of letters (such as, in this section, the block “frequen”) we can obtain much better encodings. (Note that finding the right blocks of letters can be quite complicated.) Given this, one might expect that Huffman coding is rarely used. In fact, many compression schemes use Huffman codes at some point as a basic building block. For example, image and video transmission often use Huffman encoding somewhere along the line.

Exercise: Find a Huffman compressor and another compressor, such as a gzip compressor. Test them on some files. Which compresses better?

It is straightforward to write code to generate the appropriate tree, and then use this tree to encode and decode messages. For encoding, we simply build a table with a codeword for each symbol. To decode, we could read bits in one at a time, and walk down the tree in the appropriate manner. When we reach a leaf, we output the appropriate symbol and return to the top of the tree.

In practice, however, if we want to use Huffman coding, there are much faster ways to decode than to explicitly walk down the tree one bit at a time. Using an explicit tree is slow, for a variety of reasons. Exercise: Think about this. One approach is to design a system that performs several steps at a time by reading several bits of input and determining what actions to take according to a big lookup table. For example, we could have a table that represents the information, “If you are currently at this point in the tree, and the next 8 bits are 00110101, then output AC and move to this point in the tree.” This lookup table, which might be huge, encapsulates the information needed to handle eight bits at once. Since computers naturally handle eight bit blocks more easily than single bits, and because table lookups are faster than following pointers down a Huffman tree, substantial speed gains are possible. Notice that this gain in speed comes at the expense of the space required for the lookup table.

There are other solutions that work particularly well for very large dictionaries. For example, if you were using Huffman codes on a library of newspaper articles, you might treat each word as a symbol that can be encoded. In this case, you would have a lot of symbols! We will not go over these other methods here; a useful paper on the subject is “On the Implementation of Minimum-Redundancy Prefix Codes,” by Moffat and Turpin. The key to keep in mind is that while thinking of decoding on the Huffman tree as happening one bit at a time is useful conceptually, good engineering would use more sophisticated methods to increase efficiency.
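For concreteness, here is a minimal sketch of the conceptual one-bit-at-a-time scheme (not the faster table-driven decoders mentioned above). The names encode and decode are ours, and bit strings are represented as Python strings of "0" and "1" purely for readability. Instead of an explicit tree, the decoder accumulates bits until they match a codeword, which is equivalent to walking down the tree because of the prefix property.

```python
def encode(message, code):
    """Concatenate the codeword of each symbol in the message."""
    return "".join(code[symbol] for symbol in message)

def decode(bits, code):
    """Decode a bit string produced by a prefix code, reading one bit at a time."""
    inverse = {word: symbol for symbol, word in code.items()}
    output = []
    current = ""
    for bit in bits:
        current += bit                 # one step down the tree
        if current in inverse:         # reached a leaf
            output.append(inverse[current])
            current = ""               # return to the top of the tree
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(output)

code = {"A": "1", "C": "001", "G": "000", "T": "01"}
bits = encode("GATTACA", code)
message = decode(bits, code)
```

Here bits is the 13-character string 0001010110011, and decoding it recovers GATTACA.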


The Set Cover Problem
The inputs to the set cover problem are a finite set X = {x_1, ..., x_n}, and a collection S of subsets of X whose union is X, that is,

$$\bigcup_{A \in S} A = X.$$

The problem is to find the smallest subcollection T ⊆ S such that the sets of T cover X, that is,

$$\bigcup_{A \in T} A = X.$$

Notice that such a cover exists, since S is itself a cover. The greedy heuristic suggests that we build a cover by repeatedly including the set in S that will cover the maximum number of as yet uncovered elements. In this case, the greedy heuristic does not yield an optimal solution. Interestingly, however, we can prove that the greedy solution is a good solution, in the sense that it is not too far from the optimal. This is an example of an approximation algorithm. Loosely speaking, with an approximation algorithm, we settle for a result that is not the correct answer. Instead, however, we try to prove a guarantee on how close the algorithm is to the right answer. As we will see later in the course, sometimes this is the best we can hope to do.
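The greedy heuristic is short enough to state as code. In the sketch below the function name greedy_set_cover and the input format (a list of Python sets) are assumptions; ties are broken in favor of the set listed first. The small instance shows greedy choosing three sets where two suffice.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most still-uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("the subsets do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

universe = {1, 2, 3, 4, 5, 6}
subsets = [{1, 2, 3, 4}, {1, 3, 5}, {2, 4, 6}]
cover = greedy_set_cover(universe, subsets)
```

Greedy first takes {1, 2, 3, 4} and then needs both remaining sets, while {1, 3, 5} and {2, 4, 6} alone already form a cover of size 2.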


Claim 7.1 Let k be the size of the smallest set cover for the instance (X, S). Then the greedy heuristic finds a set cover of size at most k ln n.

Proof: Let Y_i ⊆ X be the set of elements that are still not covered after i sets have been chosen with the greedy heuristic. Clearly Y_0 = X. We claim that there must be a set A ∈ S such that |A ∩ Y_i| ≥ |Y_i|/k. To see this, consider the sets in the optimal set cover of X. These sets cover Y_i, and there are only k of them, so one of these sets must cover at least a 1/k fraction of Y_i. Hence

|Y_{i+1}| ≤ |Y_i| − |Y_i|/k = (1 − 1/k)|Y_i|,

and by induction,

|Y_i| ≤ (1 − 1/k)^i |Y_0| = n(1 − 1/k)^i < n e^(−i/k),

where the last inequality uses the fact that 1 + x ≤ e^x with equality iff x = 0. Hence when i ≥ k ln n we have |Y_i| < 1, meaning there are no uncovered elements, and hence the greedy algorithm finds a set cover of size at most k ln n.

Exercise: Show that this bound is tight, up to constant factors. That is, give a family of examples where the set cover has size k and the greedy algorithm finds a cover of size Ω(k ln n).

CS124

Lecture 8

Spring 2000

Divide and Conquer
We have seen one general paradigm for ﬁnding algorithms: the greedy approach. We now consider another general paradigm, known as divide and conquer. We have already seen an example of divide and conquer algorithms: mergesort. The idea behind mergesort is to take a list, divide it into two smaller sublists, conquer each sublist by sorting it, and then combine the two solutions for the subproblems into a single solution. These three basic steps – divide, conquer, and combine – lie behind most divide and conquer algorithms. With mergesort, we kept dividing the list into halves until there was just one element left. In general, we may divide the problem into smaller problems in any convenient fashion. Also, in practice it may not be best to keep dividing until the instances are completely trivial. Instead, it may be wise to divide until the instances are reasonably small, and then apply an algorithm that is fast on small instances. For example, with mergesort, it might be best to divide lists until there are only four elements, and then sort these small lists quickly by insertion sort.


Maximum/minimum
Suppose we wish to find the minimum and maximum items in a list of numbers. How many comparisons does it take? A natural approach is to try a divide and conquer algorithm. Split the list into two sublists of equal size. (Assume that the initial list size is a power of two.) Find the maxima and minima of the sublists. Two more comparisons then suffice to find the maximum and minimum of the list. Hence, if T(n) is the number of comparisons, then T(n) = 2T(n/2) + 2. (The 2T(n/2) term comes from conquering the two problems into which we divide the original; the 2 term comes from combining these solutions.) Also, clearly T(2) = 1. By induction we find T(n) = (3n/2) − 2, for n a power of 2.
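The recurrence can be checked by instrumenting the algorithm to count its comparisons. The sketch below (the function name min_max is ours) returns the minimum, the maximum, and the number of comparisons between items that were performed.

```python
def min_max(items):
    """Return (minimum, maximum, number of comparisons) by divide and conquer."""
    n = len(items)
    if n == 1:
        return items[0], items[0], 0
    if n == 2:
        a, b = items
        return (a, b, 1) if a < b else (b, a, 1)   # T(2) = 1
    mid = n // 2
    low1, high1, count1 = min_max(items[:mid])
    low2, high2, count2 = min_max(items[mid:])
    low = low1 if low1 < low2 else low2            # one comparison to combine minima
    high = high1 if high1 > high2 else high2       # one comparison to combine maxima
    return low, high, count1 + count2 + 2

low, high, comparisons = min_max([5, 3, 8, 1, 9, 2, 7, 4])
```

For this list of 8 items the count is 10, matching (3 · 8)/2 − 2.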


Integer Multiplication
The standard multiplication algorithm takes time Θ(n^2) to multiply together two n digit numbers. This algorithm is so natural that we may think that no algorithm could be better. Here, we will show that better algorithms exist (at least in terms of asymptotic behavior).

Imagine splitting each number x and y into two parts: x = 10^(n/2) a + b, y = 10^(n/2) c + d. Then

xy = 10^n ac + 10^(n/2) (ad + bc) + bd.

The additions and the multiplications by powers of 10 (which are just shifts!) can all be done in linear time. We have therefore reduced our multiplication problem into four smaller multiplication problems, so the recurrence for the time T(n) to multiply two n-digit numbers becomes T(n) = 4T(n/2) + O(n). The 4T(n/2) term arises from conquering the smaller problems; the O(n) is the time to combine these problems into the final solution (using additions and shifts). Unfortunately, when we solve this recurrence, the running time is still Θ(n^2), so it seems that we have not gained anything.


The key thing to notice here is that four multiplications is too many. Can we somehow reduce it to three? It may not look like it is possible, but it is, using a simple trick. The trick is that we do not need to compute ad and bc separately; we only need their sum ad + bc. Now note that

(a + b)(c + d) = (ad + bc) + (ac + bd).

So if we calculate ac, bd, and (a + b)(c + d), we can compute ad + bc by subtracting the first two terms from the third! Of course, we have to do a bit more addition, but since the bottleneck to speeding up this multiplication algorithm is the number of smaller multiplications required, that does not matter. The recurrence for T(n) is now T(n) = 3T(n/2) + O(n), and we find that T(n) = Θ(n^(log₂ 3)), which is about n^1.59, improving on the quadratic algorithm.

If one were to implement this algorithm, it would probably be best not to divide the numbers down to one digit. The conventional algorithm, because it uses fewer additions, is probably more efficient for small values of n. Moreover, on a computer, there would be no reason to continue dividing once the length n is so small that the multiplication can be done in one standard machine multiplication operation!

It also turns out that using a more complicated algorithm (based on a similar idea) the asymptotic time for multiplication can be made arbitrarily close to linear; that is, for any ε > 0 there is an algorithm that runs in time O(n^(1+ε)).
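The three-multiplication trick can be sketched as follows for nonnegative integers. The function name karatsuba is ours (the notes do not name the algorithm), and the sketch recurses all the way down to single digits only to keep the example short; as noted above, a practical version would stop much earlier. Python's built-in integers are used for the additions and shifts.

```python
def karatsuba(x, y):
    """Multiply nonnegative integers using three recursive multiplications."""
    if x < 10 or y < 10:
        return x * y                      # base case: a single-digit factor
    half = max(len(str(x)), len(str(y))) // 2
    shift = 10 ** half
    a, b = divmod(x, shift)               # x = shift * a + b
    c, d = divmod(y, shift)               # y = shift * c + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    # (a + b)(c + d) - ac - bd equals ad + bc, saving one multiplication.
    cross = karatsuba(a + b, c + d) - ac - bd
    return ac * shift * shift + cross * shift + bd

product = karatsuba(1234, 5678)
```

Comparing the result against the built-in product, as in karatsuba(1234, 5678) == 1234 * 5678, is a quick sanity check.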


Strassen’s algorithm
Divide and conquer algorithms can similarly improve the speed of matrix multiplication. Recall that when multiplying two matrices, A = (a_ij) and B = (b_jk), the resulting matrix C = (c_ik) is given by

$$c_{ik} = \sum_{j} a_{ij} b_{jk}.$$

In the case of multiplying together two n by n matrices, this gives us a Θ(n^3) algorithm; computing each c_ik takes Θ(n) time, and there are n^2 entries to compute.

Let us again try to divide up the problem. We can break each matrix into four submatrices, each of size n/2 by n/2. Multiplying the original matrices can be broken down into eight multiplications of the submatrices, with some additions.

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{pmatrix}$$

Letting T(n) be the time to multiply together two n by n matrices by this algorithm, we have T(n) = 8T(n/2) + Θ(n^2). Unfortunately, this does not improve the running time; it is still Θ(n^3).


As in the case of multiplying integers, we have to be a little tricky to speed up matrix multiplication. (Strassen deserves a great deal of credit for coming up with this trick!) We compute the following seven products:

• P1 = A(F − H)
• P2 = (A + B)H
• P3 = (C + D)E
• P4 = D(G − E)
• P5 = (A + D)(E + H)
• P6 = (B − D)(G + H)
• P7 = (A − C)(E + F)

Then we can find the appropriate terms of the product by addition:

• AE + BG = P5 + P4 − P2 + P6
• AF + BH = P1 + P2
• CE + DG = P3 + P4
• CF + DH = P5 + P1 − P3 − P7

Now we have T(n) = 7T(n/2) + Θ(n^2), which gives a running time of T(n) = Θ(n^(log₂ 7)). Asymptotically faster algorithms requiring more complex splits exist; however, they are generally too slow to be useful in practice. Strassen’s algorithm, however, can improve the standard matrix multiplication algorithm for reasonably sized matrices, as we will see in our second programming assignment.
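The seven products and four combinations can be tried out directly. The sketch below works on square matrices stored as lists of lists whose size is a power of two (an assumption of the sketch), and recurses down to 1 by 1 matrices for brevity; a practical implementation would switch to the standard algorithm below some cutoff size. The helper names add, sub, and split are ours.

```python
def add(X, Y):
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def split(M):
    """Return the four n/2 by n/2 blocks of M in the order used in the notes."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])

def strassen(X, Y):
    """Multiply two n by n matrices, n a power of two, with seven recursive products."""
    if len(X) == 1:
        return [[X[0][0] * Y[0][0]]]
    A, B, C, D = split(X)
    E, F, G, H = split(Y)
    P1 = strassen(A, sub(F, H))
    P2 = strassen(add(A, B), H)
    P3 = strassen(add(C, D), E)
    P4 = strassen(D, sub(G, E))
    P5 = strassen(add(A, D), add(E, H))
    P6 = strassen(sub(B, D), add(G, H))
    P7 = strassen(sub(A, C), add(E, F))
    top_left = add(sub(add(P5, P4), P2), P6)       # AE + BG
    top_right = add(P1, P2)                        # AF + BH
    bottom_left = add(P3, P4)                      # CE + DG
    bottom_right = sub(sub(add(P5, P1), P3), P7)   # CF + DH
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom

small = strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

For the 2 by 2 example the result is [[19, 22], [43, 50]], the same as the standard algorithm.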

CS124

Lecture 9

Spring 2000

9.1 The String reconstruction problem
The greedy approach doesn’t always work, as we have seen. It lacks flexibility; if at some point it makes a wrong choice, it becomes stuck. For example, consider the problem of string reconstruction. Suppose that all the blank spaces and punctuation marks have inadvertently been removed from a text file. You would like to reconstruct the file, using a dictionary. (We will assume that all words in the file are standard English.) For example, the string might begin “thesearethereasons”. A greedy algorithm would spot that the first two words were “the” and “sea”, but then it would run into trouble. We could backtrack; we have found that “sea” is a mistake, so looking more closely, we might find the first three words “the”, “sear”, and “ether”. Again there is trouble. In general, we might end up spending exponential time traveling down false trails. (In practice, since English text strings are so well behaved, we might be able to make this work, but probably not in other contexts, such as reconstructing DNA sequences!)


This problem has a nice structure, however, that we can take advantage of. The problem can be broken down into entirely similar subproblems. For example, we can ask whether the strings “theseare” and “thereasons” both can be reconstructed with a dictionary. If they can, then we can glue the reconstructions together. Notice, however, that this is not a good problem for divide and conquer. The reason is that we do not know where the right dividing point is. In the worst case, we could have to try every possible break! The recurrence would be
$$T(n) = \sum_{i=1}^{n-1} \bigl[ T(i) + T(n-i) \bigr].$$

You can check that the solution to this recurrence grows exponentially. Although divide and conquer directly fails, we still want to make use of the subproblems. The attack we now develop is called dynamic programming. The way to understand dynamic programming is to see that divide and conquer fails because we might recalculate the same thing over and over again. (Much like we saw very early on with the Fibonacci numbers!) If we try divide and conquer, we will repeatedly solve the same subproblems (the case of small substrings) over and over again. The key will be to avoid the recalculations. To avoid recalculations, we use a lookup table.


In order for this approach to be effective, we have to think of subproblems as being ordered by size. We solve the subproblems bottom-up, from the smallest to the largest, until we reach the original problem. For this dictionary problem, think of the string as being an array s[1 . . . n]. Then there is a natural subproblem for each substring s[i . . . j]. Consider a two dimensional array D(i, j) that will denote whether s[i . . . j] is the concatenation of words from the dictionary. The size of a subproblem is naturally d = j − i. So now we write a simple loop which solves the subproblems in order of increasing size. The loop starts at d = 0, the single-character substrings, and splits s[i . . . j] into the adjacent pieces s[i . . . k] and s[k + 1 . . . j]:

for d := 0 to n − 1 do
    for i := 1 to n − d do
        j := i + d;
        if indict(s[i . . . j]) then D(i, j) := true
        else for k := i to j − 1 do
            if D(i, k) and D(k + 1, j) then D(i, j) := true;

This algorithm runs in time $O(n^3)$; the three loops each run over at most n values. Pictorially, we can think of the algorithm as filling in the upper diagonal triangle of a two-dimensional array, starting along the main diagonal and moving up, diagonal by diagonal.

We need to add a bit to actually find the words. Let F(i, j) be the position of the end of the first word in s[i . . . j] when this string is a proper concatenation of dictionary words. Initially all F(i, j) should be set to nil. The value for F(i, j) can be set whenever D(i, j) is set to true. Given the F(i, j), we can reconstruct the words simply by finding the words that make up the string in order. Note also that we can use this to improve the running time; as soon as we find a match for the entire string, we can exit the loop and return success! Further optimizations are possible.

Let us highlight the aspects of the dynamic programming approach we used. First, we used a recursive description based on subproblems: D(i, j) is true if s[i . . . j] is in the dictionary, or if D(i, k) and D(k + 1, j) are true for some k. Second, we built up a table containing the answers of the problems, in some natural bottom-up order. Third, we used this table to find a way to determine the actual solution. Dynamic programming generally involves these three steps.
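The table-filling and reconstruction steps can be sketched together in Python. This is a sketch under assumptions: the names `reconstruct` and `first_end` are chosen here, the dictionary is a Python set, and substrings are indexed as half-open ranges s[i:j] (0-based), so the two pieces of a split are s[i:k] and s[k:j]. The table `first_end` plays the role of F, with `None` standing for nil, so a separate D table is not needed.

```python
def reconstruct(s, dictionary):
    """Return a list of dictionary words whose concatenation is s, or None."""
    n = len(s)
    if n == 0:
        return []
    # first_end[i][j] is the (exclusive) end of the first word of s[i:j],
    # or None if s[i:j] is not a concatenation of dictionary words.
    first_end = [[None] * (n + 1) for _ in range(n + 1)]

    for length in range(1, n + 1):           # subproblems by increasing size
        for i in range(n - length + 1):
            j = i + length
            if s[i:j] in dictionary:
                first_end[i][j] = j
                continue
            for k in range(i + 1, j):         # try every split point
                if first_end[i][k] is not None and first_end[k][j] is not None:
                    first_end[i][j] = first_end[i][k]
                    break

    if first_end[0][n] is None:
        return None
    words, i = [], 0
    while i < n:                              # peel off one word at a time
        k = first_end[i][n]
        words.append(s[i:k])
        i = k
    return words
```

With the small dictionary `{"these", "are", "the", "reasons"}`, the string `"thesearethereasons"` should come back as `["these", "are", "the", "reasons"]`; with a larger dictionary several decompositions may exist and this sketch returns just one of them.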


9.2 Edit distance
A problem that arises in biology is to measure the distance between two strings (of DNA). We will examine the problem in English; the ideas are the same. There are many possible meanings for the distance between two strings; here we focus on one natural measure, the edit distance. The edit distance measures the number of editing operations it would be necessary to perform to transform the first string into the second. The possible operations are as follows:
• Insert: Insert a character into the first string.
• Delete: Delete a character from the first string.
• Replace: Replace a character from the first string with another character.

Another possibility is to not edit a character, when there is a Match. For example, a transformation from activate to caveat can be represented by the following alignment, where a dash marks a gap:

D M R D M I M M D
a c t i v - a t e
- c a - v e a t -

The top line represents the operation performed. So the a in activate is deleted, and the t is replaced. The e in caveat is explicitly inserted. The edit distance is the minimal number of edit operations (that is, the number of Inserts, Deletes, or Replaces) necessary to transform one string to the other. Note that Matches do not count. Also, it is possible to have a weighted edit distance, if the different edit operations have different costs. We currently assume all operations have weight 1.


We will show how to compute the edit distance using dynamic programming. Our first step is to define appropriate subproblems. Let us represent our strings by A[1 . . . n] and B[1 . . . m]. Suppose we want to consider what we do with the last character of A. To determine that, we need to know how we might have transformed the first n − 1 characters of A. These n − 1 characters might have transformed into any number of symbols of B, up to m. Similarly, to compute how we might have transformed the first n − 1 characters of A into some part of B, it makes sense to consider how we transformed the first n − 2 characters, and so on. This suggests the following subproblems: we will let D(i, j) represent the edit distance between A[1 . . . i] and B[1 . . . j].

We now need a recursive description of the subproblems in order to use dynamic programming. Here the recurrence is:

$$D(i, j) = \min\bigl[ D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + I(A[i] \ne B[j]) \bigr].$$

In the above, I(A[i] ≠ B[j]) represents the value 1 if A[i] ≠ B[j] and 0 if A[i] = B[j].

We obtain the above expression by considering the possible edit operations available. Suppose our last operation is a Delete, so that we deleted the ith character of A to transform A[1 . . . i] to B[1 . . . j]. Then we must have transformed A[1 . . . i − 1] to B[1 . . . j], and hence the edit distance would be D(i − 1, j) + 1, or the cost of the transformation from A[1 . . . i − 1] to B[1 . . . j] plus one for the cost of the final Delete. Similarly, if the last operation is an Insert, the cost would be D(i, j − 1) + 1. The other possibility is that the last operation is a Replace of the ith character of A with the jth character of B, or a Match between these two characters. If there is a Match, then the two characters must be the same, and the cost is D(i − 1, j − 1). If there is a Replace, then the two characters should be different, and the cost is D(i − 1, j − 1) + 1. We combine these two cases in our formula, using D(i − 1, j − 1) + I(A[i] ≠ B[j]). Our recurrence takes the minimum of all these possibilities, expressing the fact that we want the best possible choice for the final operation!


It is worth noticing that our recursive description does not work when i or j is 0. However, these cases are trivial. We have D(i, 0) = i, since the only way to transform the ﬁrst i characters of A into nothing is to delete them all. Similarly, D(0, j) = j.

Again, it is helpful to think of the computation of the D(i, j) as ﬁlling up a two-dimensional array. Here, we begin with the ﬁrst column and ﬁrst row ﬁlled. We can then ﬁll up the rest of the array in various ways: row by row, column by column, or diagonal by diagonal! Besides computing the distance, we may want to compute the actual transformation. To do this, when we ﬁll the array, we may also picture ﬁlling the array with pointers. For example, if the minimal distance for D(i, j) was obtained by a ﬁnal Delete operation, then the cell (i, j) in the table should have a pointer to (i − 1, j). Note that a cell can have multiple pointers, if the minimum distance could have been achieved in multiple ways. Now any path back from (n, m) to (0, 0) corresponds to a sequence of operations that yields the minimum distance D(n, m), so the transformation can be found by following pointers. The total computation time and space required for this algorithm is O(nm).
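The table and the pointer-following step can be sketched as follows. The function name `edit_distance` and the one-letter operation codes in the returned string are choices made here for illustration; instead of storing pointers explicitly, this sketch recomputes which neighbor achieved the minimum while walking back from (n, m) to (0, 0), and it reports just one of the possibly many optimal transformations.

```python
def edit_distance(a: str, b: str):
    """Return (distance, ops) where ops uses M, R, D, I as in the text."""
    n, m = len(a), len(b)
    # D[i][j] is the edit distance between a[:i] and b[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(m + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):            # fill row by row
        for j in range(1, m + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + mismatch)

    # Walk back from (n, m) to (0, 0), choosing any move that explains D[i][j].
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append("M" if a[i - 1] == b[j - 1] else "R")
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return D[n][m], "".join(reversed(ops))
```

For the pair activate and caveat this gives a distance of 5, matching the number of non-Match operations in the alignment shown earlier.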


9.3 All pairs shortest paths
Let G be a graph with positive edge weights. We want to calculate the shortest paths between every pair of nodes. One way to do this is to run Dijkstra’s algorithm several times, once for each node. Here we develop a different dynamic programming solution. Our subproblems will be shortest paths using only nodes 1 . . . k as intermediate nodes. Of course when k equals the number of nodes in the graph, n, we will have solved the original problem. We let the matrix $D_k[i, j]$ represent the length of the shortest path between i and j using intermediate nodes 1 . . . k. Initially, we set a matrix $D_0$ with the direct distances between nodes, given by $d_{ij}$. Then $D_k$ is easily computed from the subproblems $D_{k-1}$ as follows:

$$D_k[i, j] = \min(D_{k-1}[i, j],\; D_{k-1}[i, k] + D_{k-1}[k, j]).$$

The idea is the shortest path using intermediate nodes 1 . . . k either completely avoids node k, in which case it has the same length as $D_{k-1}[i, j]$; or it goes through k, in which case we can glue together the shortest paths found from i to k and k to j using only intermediate nodes 1 . . . k − 1 to find it. It might seem that we need at least two matrices to code this, but in fact it can all be done with one matrix, in a single loop nest. (Exercise: think about it!)

D = (d_{ij}), distance array, with weights from all i to all j
for k = 1 to n do
    for i = 1 to n do
        for j = 1 to n do
            D[i, j] = min(D[i, j], D[i, k] + D[k, j])

Note that again we can keep an auxiliary array to recall the actual paths. We simply keep track of the last intermediate node found on the path from i to j. We reconstruct the path by successively reconstructing intermediate nodes, until we reach the ends.
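A minimal Python sketch of the loop together with the auxiliary array might look as follows. The names `all_pairs_shortest_paths`, `mid`, and `path` are assumptions made here; nodes are numbered from 0, missing edges are represented by `math.inf`, and `path` assumes that j is reachable from i.

```python
import math

def all_pairs_shortest_paths(d):
    """d[i][j] is the direct distance from i to j (math.inf if no edge)."""
    n = len(d)
    D = [row[:] for row in d]
    # mid[i][j] is the last intermediate node that improved the i-to-j path,
    # or None if the direct edge is still the best known path.
    mid = [[None] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
                    mid[i][j] = k
    return D, mid

def path(mid, i, j):
    """List the nodes on a shortest path from i to j (assumed reachable)."""
    k = mid[i][j]
    if k is None:
        return [i, j]
    # Recursively expand both halves; drop the duplicated copy of k.
    return path(mid, i, k)[:-1] + path(mid, k, j)
```

Because the single matrix D is updated in place, this also illustrates the one-matrix version hinted at in the exercise above.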


9.4 Traveling salesman problem
Suppose that you are given n cities and the distances $d_{ij}$ between them. The traveling salesman problem (TSP) is to find the shortest tour that takes you from your home city to all the other cities and back again. As there are (n − 1)! possible tours, this can clearly be done in O(n!) time by trying all of them. Of course this is not very efficient. Since the TSP is NP-complete, we cannot really hope to find a polynomial time algorithm. But dynamic programming gives us a much better algorithm than trying all the tours.

The key is to define the appropriate subproblem. Suppose that we label our home city by the symbol 1, and other cities are labeled 2, . . . , n. In this case, we use the following: for a subset S of vertices including 1 and at least one other city, let C(S, j) be the length of the shortest path that starts at 1, visits all other nodes in S, and ends at j. Note that our subproblems here look slightly different: instead of finding tours, we are simply finding paths. The important point is that the shortest path from 1 to j through all the vertices in S consists of some shortest path from 1 to a vertex x through all the vertices in S − { j}, where x ∈ S − { j}, and the additional edge from x to j.

for all j ≠ 1 do C({1, j}, j) := d_{1j}
for s = 3 to n do    % s is the size of the subset
    for all subsets S of {1, . . . , n} of size s containing 1 do
        for all j ∈ S, j ≠ 1 do
            C(S, j) := min_{i ∈ S, i ≠ 1, i ≠ j} [C(S − { j}, i) + d_{ij}]
opt := min_{j ≠ 1} [C({1, . . . , n}, j) + d_{j1}]

The idea is to build up paths one node at a time, not worrying (at least temporarily) where they will end up. Once we have paths that go through all the vertices, it is easy to check the tours, since they consist of a shortest path through all the vertices plus an additional edge. The algorithm takes time $O(n^2 2^n)$, as there are $O(n 2^n)$ entries in the table (one for each pair of set and city), and each takes O(n) time to fill. Of course we can add in structures so that we can actually find the tour as well. Exercise: Consider how memory-efficient you can make this algorithm.
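The table C(S, j) can be sketched in Python with a dictionary keyed by (subset, end city). The function name `tsp` is chosen here, city 0 plays the role of the home city 1 in the notes, and no attempt is made to save memory, so this is only meant for small n.

```python
from itertools import combinations

def tsp(d):
    """Length of the shortest tour, where d[i][j] is the distance from i to j."""
    n = len(d)
    if n == 1:
        return 0
    # C[(S, j)] is the length of the shortest path that starts at city 0,
    # visits every city in S, and ends at j.
    C = {}
    for j in range(1, n):
        C[(frozenset([0, j]), j)] = d[0][j]
    for size in range(3, n + 1):                       # size of the subset
        for rest in combinations(range(1, n), size - 1):
            S = frozenset(rest) | {0}
            for j in rest:
                prev = S - {j}
                C[(S, j)] = min(C[(prev, i)] + d[i][j]
                                for i in rest if i != j)
    full = frozenset(range(n))
    # Close the tour with the edge back to the home city.
    return min(C[(full, j)] + d[j][0] for j in range(1, n))
```

For the four-city instance with distances 10, 15, 20 from the home city and 35, 25, 30 among the others (as in the test data), the shortest tour has length 80.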

CS124

Lecture 10

Spring 1999

How many people do there need to be in a room before with probability greater than 1/2 some two of them have the same birthday? (Assume birthdays are distributed uniformly at random.) Surprisingly, only 23. This is easily determined as follows: the probability the first two people have different birthdays is (1 − 1/365). The probability that the third person in the room then has a birthday different from the first two, given the first two people have different birthdays, is (1 − 2/365), and so on. So the probability that all of the first k people have different birthdays is the product of these terms, or

$$\left(1 - \frac{1}{365}\right) \cdot \left(1 - \frac{2}{365}\right) \cdot \left(1 - \frac{3}{365}\right) \cdots \left(1 - \frac{k-1}{365}\right).$$

Determining the right value of k is now a simple exercise.
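The product above is easy to evaluate numerically. The short sketch below uses function names chosen here for illustration and simply searches for the smallest k at which the probability of a shared birthday exceeds 1/2.

```python
def prob_all_distinct(k, days=365):
    """Probability that k people all have different birthdays."""
    p = 1.0
    for i in range(1, k):
        p *= 1 - i / days
    return p

def smallest_group(days=365):
    """Smallest k for which a shared birthday has probability above 1/2."""
    k = 1
    while 1 - prob_all_distinct(k, days) <= 0.5:
        k += 1
    return k
```

Calling `smallest_group()` returns 23, in agreement with the value quoted above.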


10.2 Balls into Bins
Mathematically, the birthday paradox is an example of a more general mathematical question, often formulated in terms of balls and bins. Some number of balls n are thrown into some number of bins m. What does the distribution of balls and bins look like? The birthday paradox is focused on the first time a ball lands in a bin with another ball. One might also ask how many of the bins are empty, how many balls are in the most full bin, and other sorts of questions.

Let us consider the question of how many bins are empty. Look at the first bin. For it to be empty, it has to be missed by all n balls. Since each ball hits the first bin with probability 1/m, the probability the first bin remains empty is

$$\left(1 - \frac{1}{m}\right)^n \approx e^{-n/m}.$$

Since the same argument holds for all bins, on average a fraction $e^{-n/m}$ of the bins will remain empty. Exercise: How many bins have 1 ball? 2?
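As a quick sanity check on the approximation, one can compare the exact expression, the exponential, and a simulation. The sketch below is illustrative only; the function names and the fixed random seed are choices made here.

```python
import math
import random

def empty_fraction_exact(n, m):
    """Probability that a given bin is missed by all n balls."""
    return (1 - 1 / m) ** n

def empty_fraction_approx(n, m):
    return math.exp(-n / m)

def simulate_empty_fraction(n, m, seed=0):
    """Throw n balls into m bins uniformly at random; return the empty fraction."""
    rng = random.Random(seed)
    hit = [False] * m
    for _ in range(n):
        hit[rng.randrange(m)] = True
    return hit.count(False) / m
```

With n = m = 10,000, all three values should be close to 1/e, roughly 0.37.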


10.3 Hash functions
A hash function is a deterministic mapping from one set into another that appears random. For example, mapping people into their birthdays can be thought of as a hash function. In general, a hash function is a mapping f : {0, . . . , n − 1} → {0, . . . , m − 1}. Generally n >> m; for example, the number of people in the world is much bigger than the number of possible birthdays. There is a great deal of theory behind designing hash functions that “appear random.” We will not go into that theory here, and instead assume that the hash functions we have available are in fact completely random. In other words, we assume that for each i (0 ≤ i ≤ n − 1), the probability that f(i) = j is 1/m (for 0 ≤ j ≤ m − 1). Notice that this does not mean that every time we look at f(i), we get a different random answer! The value of f(i) is fixed for all time; it is just equally likely to take on any value in the range. While such completely random hash functions are unavailable in practice, they generally provide a good rough idea of how hashing schemes perform. (An aside: in reality, birthdays are not completely random either. Seasonal distributions skew the calculation. How might this affect the birthday paradox?)



The total space used is merely hm bits. Notice that the Bloom ﬁlter sometimes returns the wrong answer – we may reject a proposed password, even though it is not a common password. This sort of error is probably acceptable, as long as it doesn’t happen so frequently as to bother users. Fortunately this error is one-sided; a common password is never accepted. One must set the parameters m and h appropriately to trade off this error probability against space and time requirements. For example, consider a dictionary of 100,000 common passwords, each of which is on average 7 characters long. Uncompressed this would be 700,000 bytes. Compression might reduce it substantially, to around 300,000 bytes. Of course, then one has the problems of searching efﬁciently on a compressed list. Instead, one could keep a 100,000 byte Bloom ﬁlter, consisting of 5 tables of 160,000 bits. The probability of rejecting a reasonable password is just over 2%. The cost for checking a password is at most 5 hashes and 5 lookups into the table.
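The 2% figure can be checked with a short calculation. The sketch below assumes the filter is organized as the paragraph above suggests: h separate tables of m bits, one completely random hash function per table, and each stored password setting one bit in each table. Under that assumption a given bit of a table is set with probability 1 − (1 − 1/m)^n after n passwords are stored, and a password that is not in the dictionary is rejected only if its bit is set in all h tables. The function name is chosen here for illustration.

```python
def false_rejection_rate(n_items, bits_per_table, n_tables):
    """Chance that a password outside the dictionary hits a set bit in every table."""
    # Fraction of bits set in one table after n_items insertions.
    fill = 1 - (1 - 1 / bits_per_table) ** n_items
    return fill ** n_tables

rate = false_rejection_rate(100_000, 160_000, 5)
total_bytes = 5 * 160_000 // 8
```

With these parameters `total_bytes` is 100,000 and `rate` comes out a little under 0.022, consistent with "just over 2%".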

CS 124

Lecture 11

11.1 Applications: Fingerprinting for pattern matching
Suppose we are trying to find a pattern string P in a long document D. How can we do it quickly and efficiently? Hash the pattern P into say a 16 bit value. Now, run through the file, hashing each set of |P| consecutive characters into a 16 bit value. If we ever get a match for a pattern, we can check to see if it corresponds to an actual pattern match. (In this case, we want to double-check and not report any false matches!) Otherwise we can just move on. We can use more than 16 bits, too; we would like to use enough bits so that we will obtain few false matches.

This scheme is efficient, as long as hashing is efficient. Of course hashing can be a very expensive operation, so in order for this approach to work, we need to be able to hash quickly on average. In fact, a simple hashing technique allows us to do so in constant time per operation! The easiest way to picture the process is to think of the file as a sequence of digits, and the pattern as a number. Then we move a pointer in the file one character at a time, seeing if the next |P| digits give us a number equal to the number corresponding to the pattern. Each time we read a character in the file, the number we are looking at changes in a natural way: the leftmost digit a is removed, and a new rightmost digit b is inserted. Hence, we update an old number N and obtain a new number N′ by computing

$$N' = 10 \cdot (N - 10^{|P|-1} \cdot a) + b.$$

When dealing with a string, we will be reading characters (bytes) instead of numbers. Also, we will not want to keep the whole pattern as a number. If the pattern is large, then the corresponding number may be too large to do effective comparisons! Instead, we hash all numbers down into say 16 bits, by reducing them modulo some appropriate prime p. We then do all the mathematics (multiplication, addition) modulo p, i.e.

$$N' = [10 \cdot (N - 10^{|P|-1} \cdot a) + b] \bmod p.$$

All operations mod p can be made quite efficient, so each new hash value takes only constant time to compute!

This pattern matching technique is often called fingerprinting. The idea is that the hash of the pattern creates an almost unique identifier for the pattern, like a fingerprint. If we ever find two fingerprints that match, we have a good reason to expect that they come from the same pattern. Of course, unlike real fingerprints, our hashing-based fingerprints do not actually uniquely identify a pattern, so we still need to check for false matches. But since false matches should be rare, the algorithm is very efficient! See Figure 11.1 for an example of fingerprinting.
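The rolling update can be sketched for strings of decimal digits as follows. The function name `find_pattern` and its return convention (true matches and false matches reported separately) are choices made here; the default prime 251 is the one used in Figure 11.1, whereas a real implementation would pick p at random.

```python
def find_pattern(pattern, text, p=251, base=10):
    """Return (true match positions, false match positions) for digit strings."""
    k = len(pattern)
    if k == 0 or k > len(text):
        return [], []

    target = 0
    window = 0
    for ch in pattern:
        target = (target * base + int(ch)) % p
    for ch in text[:k]:
        window = (window * base + int(ch)) % p
    high = pow(base, k - 1, p)            # base^(|P|-1) mod p

    matches, false_matches = [], []
    for start in range(len(text) - k + 1):
        if window == target:
            # Fingerprints agree; double-check against the actual digits.
            if text[start:start + k] == pattern:
                matches.append(start)
            else:
                false_matches.append(start)
        if start + k < len(text):
            a = int(text[start])          # leftmost digit leaving the window
            b = int(text[start + k])      # new rightmost digit
            window = (base * (window - high * a) + b) % p
    return matches, false_matches
```

On the data of Figure 11.1, `find_pattern("17935", "6386179357342")` reports a true match at offset 4 and a false match at offset 8 (the window 57342).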


P = 17935, p = 251, P mod p = 114
Document: 6386179357342...

63861 mod p = 107
38617 mod p = 214
86179 mod p = 86
61793 mod p = 47
17935 mod p = 114
79357 mod p = 41
93573 mod p = 201
35734 mod p = 92
57342 mod p = 114

Figure 11.1: A fingerprinting example. The pattern P is a 5 digit number. Note successive calculations take constant time: 38617 mod p = (((63861 mod p) − (60000 mod p)) · 10 + 7) mod p. Also note that false matches are possible (but unlikely); 57342 = 17935 mod p.

One question remains. How should we choose the prime p? We would like the prime we choose to work well, in that it should have few false matches. The problem is that for every prime, there are certainly some bad patterns and documents. If we choose a prime in advance, then someone can try to set up a document and pattern that will cause a lot of false matches, making our fingerprinting algorithm go very slowly. A natural approach is to choose the prime p randomly. This way, nobody can set up a bad pattern and document in advance, since they are not sure what prime we will choose.

Let us make this a bit more rigorous. Let π(x) represent the number of primes that are less than or equal to x. It will be helpful to use the following fact:

Fact: for x ≥ 17,

$$\frac{x}{\ln x} \le \pi(x) \le 1.26\,\frac{x}{\ln x}.$$

Consider any point in the algorithm where the pattern and document do not match. If our pattern has length |P|, then at that point we are comparing two numbers that are each less than $10^{|P|}$. In particular, their difference (in absolute value) is less than $10^{|P|}$. What is the probability that a random prime divides this difference? That is, what is the probability that for the random prime we choose, the two numbers corresponding to the pattern and the current |P| digits in the document are equal modulo p? First, note that there are at most $\log_2 10^{|P|}$ distinct primes that divide the difference, since the difference is at most $10^{|P|}$ (in absolute value), and each distinct prime divisor is at least 2. Hence, if we choose our prime randomly from all primes up to Z, the probability we have a false match is at most

$$\frac{\log_2 10^{|P|}}{\pi(Z)}.$$

Now the probability that we have a false match anywhere is at most |D| times the probability that we have a false match in any single location, by the union bound. Hence the probability that we have a false match anywhere is at most

$$\frac{|D| \log_2 10^{|P|}}{\pi(Z)}.$$

Exercise: How big should we make Z in order to make the probability of a false match anywhere in the algorithm less than 1/100?


How could we reduce the probability of a false match? One way is to choose from a larger set of primes. Another way is to choose not just one random prime, but several random primes from the primes up to Z. This is like choosing several hash functions in the Bloom filter problem. There is a false match only if there is a false match at every random prime we choose. If we choose k primes (with replacement) from the primes up to Z, the probability of a false match at a specific point is at most

$$\left(\frac{\log_2 10^{|P|}}{\pi(Z)}\right)^k.$$

CS124

Lecture 12

12.1 Near duplicate documents [1]
Suppose we are designing a major search engine. We would like to avoid answering user queries with multiple copies of the same page. That is, there may be several pages with exactly the same text. These duplicates occur for a variety of reasons. Some are mirror sites, some are copies of common pages (such as Unix man pages), some are multiple spam advertisements, etc. Returning just one of the duplicates should be sufficient for the end user; returning all of them will clutter the response page, wasting valuable real estate and frustrating the user. How can we cope with duplicate pages?

Determining exact duplicates has a simple solution, based on hashing. Use the text of each page and an appropriate hash function to hash the text into a 64 bit signature. If two documents have the same signature, it is reasonable to assume that they share the same text. (Why? How often is this assumption wrong? Is it a terrible thing if the assumption turns out to be false?) By comparing signatures on the fly, we can avoid returning duplicates.

This solution works extremely well if we want to catch exact duplicates. What if, however, we want to capture the idea of “near duplicate” documents, or similar documents? For example, consider two mirror sites on the Web. It may be that the documents share the same text, except that the text corresponding to the links on the page is different, with each referring to the correct mirror site. In this case, the two pages will not yield the same signature, although again, we would not want to return both pages to the end user, because they are so similar. As another example, consider two copies of a newspaper article, one with a proper copyright notice added, and one without. We do not need to return both pages to the user. Again, hashing the document appears to be of no help. Finally, consider the case of advertisers who submit slightly modified versions of their ads over and over again, trying to get more or better spots on the response pages sent back to users. We want to stop their nefarious plans!

We will describe a scheme used to detect similar documents efficiently, using a hashing based scheme. Like the Bloom filter solution for password dictionaries, our solution is highly efficient in terms of space and time. The cost for this efficiency is accuracy; our algorithm will sometimes make mistakes, because it uses randomness.

12.2 Set resemblance
We describe a more general problem that will relate to our document similarity problem. Consider two sets of numbers, A and B. For concreteness, we will assume that A and B are subsets of 64 bit numbers. We may define the resemblance of A and B as

$$\mathrm{resemblance}(A, B) = R(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

The resemblance is a real number between 0 and 1. Intuitively, the resemblance accurately captures how close the two sets are. Sets and documents will be related, as we will see later.
[1] This lecture is based on the work of Andrei Broder, who developed these ideas, and convinced Altavista to use them! (The second feat may have been even more difficult than the first.)


How quickly can we determine the resemblance of two sets? If the sets are each of size n, the natural approach (compare each element in A to each element in B) is $O(n^2)$. We can do better by sorting the sets. Still, these approaches are all rather slow, when we consider that we will have many sets to deal with and hence many pairs of sets to consider.

Instead we should consider relaxing the problem. Suppose that we do not need an exact calculation of the resemblance R(A, B). A reasonable estimate or approximation of the resemblance will suffice. Also, since we will be answering a variety of queries over a long period of time, it makes sense to consider algorithms that first do a preprocessing phase, in order to handle the queries much more quickly. That is, we will first do some work, preparing the appropriate data structures and data in a preprocessing phase. The advantage of doing all this work in advance will be that queries regarding resemblance can then be quickly answered.

Our estimation process will require a black box that does the following: it produces an effective random permutation on the set of 64 bit numbers. What do we mean by a random permutation? Let us consider just the case of four bit numbers, of which there are 16. Suppose we write each number on a card. Generating a random permutation is like shuffling this deck of 16 cards and looking at the order in which the numbers appear after the shuffling. For example, if we find the number 0011 on the first card, then our permutation maps the number 3 to the number 1. We write this as π(3) = 1, where π is a function that represents the permutation.

Suppose we have an efficient implementation of random permutations, which we think of as a black box procedure. That is, when we invoke the black box procedure BB(1, x) on a 64 bit number x, we get out $y = \pi_1(x)$ for some fixed, completely random permutation $\pi_1$. Similarly, if we invoke the black box BB(2, x), we get out $\pi_2(x)$ for some different random permutation $\pi_2$. (In fact in practice we cannot achieve this black box, but we can get close enough that it is useful to think in these terms for analysis.) Let us use the notation $\pi_1(A)$ to denote the set of elements obtained by computing BB(1, x) for every x in A.

Consider the following procedure: we compute the sets $\pi_1(A)$ and $\pi_1(B)$, and record the minimum of each set. When does $\min\{\pi_1(A)\} = \min\{\pi_1(B)\}$? This happens only when there is some element x satisfying $\pi_1(x) = \min\{\pi_1(A)\} = \min\{\pi_1(B)\}$. In other words, the element x whose image is the minimum of $\pi_1(A \cup B)$ has to lie in the intersection A ∩ B. If $\pi_1$ is a random permutation, then every element in A ∪ B has equal probability of mapping to the minimum element after the permutation is applied. That is, for all x and y in A ∪ B,

$$\Pr[\pi_1(x) = \min\{\pi_1(A \cup B)\}] = \Pr[\pi_1(y) = \min\{\pi_1(A \cup B)\}].$$

Thus, for the minimum of $\pi_1(A)$ and $\pi_1(B)$ to be the same, the minimum element must lie in $\pi_1(A \cap B)$ (see Figure 12.1). Hence

$$\Pr[\min\{\pi_1(A)\} = \min\{\pi_1(B)\}] = \frac{|A \cap B|}{|A \cup B|}.$$

But this is just the resemblance R(A, B)! This gives us a way to estimate the resemblance. Instead of taking just one permutation, we take many, say 100. For each set A, we preprocess by computing $\min\{\pi_j(A)\}$ for j = 1 to 100, and store these values. To estimate the resemblance of two sets A and B, we count how often the minima are the same, and divide by 100. It is like each permutation gives us a coin flip, where the probability of a heads (a match) is exactly the resemblance R(A, B) of the two sets.
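The estimation procedure can be tried out on a small universe, where truly random permutations are affordable. The sketch below is an illustration under assumptions: the universe is {0, . . . , universe_size − 1} rather than all 64 bit numbers, the permutations are stored explicitly as shuffled lists (standing in for the black box BB), the sets are non-empty, and all function names are chosen here.

```python
import random

def make_permutations(universe_size, count, seed=0):
    """Return `count` random permutations of range(universe_size) as lists."""
    rng = random.Random(seed)
    perms = []
    for _ in range(count):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)           # perm[x] plays the role of pi_j(x)
    return perms

def sketch(s, perms):
    """For each permutation, record the minimum image of the set s."""
    return [min(perm[x] for x in s) for perm in perms]

def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of permutations on which the two minima agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

def resemblance(a, b):
    """Exact resemblance |A intersect B| / |A union B|, for comparison."""
    return len(a & b) / len(a | b)
```

For instance, with A = {0, . . . , 599} and B = {300, . . . , 899} the exact resemblance is 1/3, and an estimate from 100 permutations should typically land within a few hundredths of that value.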


[Figure: the sets A and B, with their intersection A ∩ B marked.]
Figure 12.1: If the minimum elements of π1(A) and π1(B) are the same, the minimum element must lie in π1(A ∩ B).

Four score and seven years ago, our founding

Four score and seven
score and seven years
and seven years ago
seven years ago our
years ago our founding

Figure 12.2: Shingling: the document is broken up into all segments of k consecutive words; each segment leads to a 64 bit hash value.

12.3 Turning Document Similarity into a Set Resemblance Problem
We now return to the original application. How do we turn document similarity into a set resemblance problem? The key idea is to hash pieces of the document– say every four consecutive words– into 64 bit numbers. This process has been called shingling, and each set of consecutive words is called a shingle. (See Figure 12.2.) Using hashing, the shingles give rise to the resulting numbers for the set resemblance problem, so that for each document D there is a set SD . There are many possible variations and improvements possible. For example, one could modify the number of bits in a shingle or the method for shingling. Similarly, one could throw out all shingles that are not 0 mod 16, say, in order to reduce the number of shingles per document. This approach obscures some important information in the document– such as the order paragraphs appear in, say. However, it seems reasonable to say that if the resulting sets have high resemblance, the documents are reasonably similar. Once we have the shingles for the document, we associate a document sketch with each document. The sketch of a document SD is a list of say 100 numbers: (min{π1 (SD )}, min{π2 (SD )}, min{π3 (SD )}, . . . , min{π100 (SD )}). Now we choose a threshold– for example, we might say that two documents are the similar if 90 out of the 100 entries in the sketch match. Now whenever a user queries the search engine, we check the sketches of the documents we wish to return. If two sketches share 90 entries, we only send one of them. (Alternatively, we could catch the duplicates on the crawling side– we check all the documents as we crawl the Web, and whenever two sketches share more than 90 entries, we assume the associated documents are similar, so that we only need to store one of them!) Recall that our scheme uses random permutations. So, if we set our sketch threshold to 90 out of 100 entries,

Lecture 12

12-4

this does not guarantee that any pair of documents with high resemblance are caught. Also, some pairs of documents that do not have high resemblance may get marked as having high resemblance. How well does this scheme do? We analyze how well the scheme does with the following argument. For each permutation π, the probability i that two documents A and B have the same value in the ith position of the sketch is just the resemblance of the two documents R(A, B) = r. (Here the resemblance R(A, B) of course refers to the resemblance of the sets of numbers obtained by shingling A and B.) Hence, the probability p(r) that at least 90 out of the 100 entries in the sketch match is 100 100 k r (1 − r)100−k . p(r) = ∑ k k=90 What does p(r) look like as a function of r? The graph is shown in Figure 12.3. Notice that p(r) stays very small until r approaches 0.9, and then quickly grows towards 1. This is exactly the property we want our scheme to have– if two documents are not similar, we will rarely mistake them for being similar, and if they are similar, we are likely to catch them! For example, even if the resemablance is 0.8, we will only get 90 matches with probability less than 0.006! −18 When the resemblance is only 0.5, the probability of having 90 entries in the sketch match falls to almost 10 ! If documents are not alike, we will rarely mistake them as being similar. If documents are alike, we will most likely catch them. If the resemblance is 0.95, the documents will have 90 or more entries in common in the sketch with probability greater than .988; if the resemblance is 0.96, the probability jumps to over .997. We are dealing with a very large number of dcouments– most search engines currently index twenty-ﬁve to over one hundred million Web pages! So even though the probability of making a mistake is small, it will happen. 
The worst that happens, though, is that the search engine fails to index a few pages that it should, and it fails to catch a few duplicates that it should. These problems are not a big deal.
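The numbers quoted above can be checked directly from the formula for p(r). The following is a minimal sketch; the function name `match_probability` and its parameters are illustrative choices, not part of the notes.

```python
from math import comb

def match_probability(r: float, entries: int = 100, threshold: int = 90) -> float:
    """Probability that at least `threshold` of `entries` sketch positions agree,
    when each position agrees independently with probability r."""
    return sum(
        comb(entries, k) * r**k * (1 - r) ** (entries - k)
        for k in range(threshold, entries + 1)
    )

# Tabulate p(r) at the resemblance values discussed in the text.
for r in (0.5, 0.8, 0.9, 0.95, 0.96):
    print(f"r = {r:.2f}   p(r) = {match_probability(r):.3g}")
```

Changing `threshold` or `entries` shows how the sharpness of the transition depends on the sketch size.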


[Plot: probability of 90 or more matches (vertical axis) versus resemblance (horizontal axis), both ranging from 0 to 1.]

Figure 12.3: Making the threshold for document similarity 90 out of 100 matches in the sketch leads to the following graph relating resemblance to the probability two documents are considered similar. Notice the sharp change in behavior near where the resemblance is 0.90. Essentially, the procedure behaves like a low pass filter.

CS124

Lecture 13

Hopefully the ideas we saw in our hashing problems have convinced you that randomness is a useful tool in the design and analysis of algorithms. Just to make sure, we will consider several more examples of how to use randomness to design algorithms.

13.1 Primality testing
A great deal of modern cryptography is based on the fact that factoring is apparently hard. At least nobody has published a fast way to factor yet. (It is rumored the NSA knows how to factor, and is keeping it a secret. Some of you might well have worked or will work for the NSA, at which point you will be required to keep this secret. Shame on you.) Of course, certain numbers are easy to factor: numbers with small prime factors, for example. So often, for cryptographic purposes, we may want to generate very large prime numbers and multiply them together. How can we find large prime numbers?

We are fortunate to find that prime numbers are pretty dense. That is, there are an awful lot of them. Let $\pi(x)$ be the number of primes less than or equal to $x$. Then

$$\pi(x) \approx \frac{x}{\ln x},$$

or more exactly,

$$\lim_{x \to \infty} \frac{\pi(x)}{x / \ln x} = 1.$$

This means that on average about one out of every $\ln x$ numbers is prime, if we are looking for primes about the size of $x$. So if we want to find prime numbers of say 250 digits, we would have to check about $\ln 10^{250} \approx 576$ numbers on average before finding a prime. (We can search smarter, too, throwing out multiples of 2, 3, 5, etc. in order to check fewer numbers.) Hence, all we need is a good method for testing if a number is prime. With such a test, we can generate large primes easily: just keep generating random large numbers, and test them for primality until we find a suitable prime number.

How can we test if a number $n$ is prime? The pedantic way is to try dividing $n$ by all smaller numbers. Alternatively, we can try to divide $n$ by all primes up to $\sqrt{n}$. Of course, both of these approaches are quite slow; when $n$ is about $10^{250}$, the value of $\sqrt{n}$ is still huge! The point is that $10^{250}$ has only 250 (or more generally $O(\log n)$) digits, so we'd like the running time of the algorithm to be based on the size 250, not $10^{250}$!

How can we quickly test if a number is prime? Let's start by looking at some ways that work pretty well, but have a few problems. We will use the following result from number theory:

Theorem 13.1 If $p$ is a prime and $1 \le a < p$, then $a^{p-1} = 1 \bmod p$.

Proof: There are two nice proofs for this fact. One uses a simple induction to prove the equivalent statement that $a^p = a \bmod p$. This is clearly true when $a = 1$. Now

$$(a+1)^p = \sum_{i=0}^{p} \binom{p}{i} a^{p-i}.$$

The coefficient $\binom{p}{i}$ is divisible by $p$, unless $i = 0$ or $i = p$. Hence

$$(a+1)^p = a^p + 1 \bmod p = a + 1 \bmod p,$$

where the last step follows by the induction hypothesis.

An alternative proof uses the following idea. Consider the numbers $1, 2, \ldots, p-1$. Multiply them all by $a$, so now we have $a, 2a, \ldots, (p-1)a$. Each of these numbers is distinct mod $p$, and there are $p-1$ such numbers, so in fact the sequence $a, 2a, \ldots, (p-1)a$ is the same as the sequence $1, 2, \ldots, p-1$ when considered modulo $p$, except for the order. Hence

$$1 \cdot 2 \cdots (p-1) = a \cdot 2a \cdots (p-1)a \bmod p = a^{p-1} \cdot 1 \cdot 2 \cdots (p-1) \bmod p.$$

Thus we have $a^{p-1} = 1 \bmod p$.

This immediately suggests one way to check if a number $n$ is prime. Compute $2^{n-1} \bmod n$. If it is not 1, then $n$ is certainly not prime! Note that we can compute $2^{n-1} \bmod n$ quite efficiently, using our previously discussed methods for exponentiation, which require only $O(\log n)$ multiplications! Thus this test is efficient.

But so far this test is just one-way; if $n$ is composite, we may have that $2^{n-1} = 1 \bmod n$, so we cannot assume that $n$ is prime just because it passes the test. For example, $2^{340} = 1 \bmod 341$, and 341 is not prime. Such a number is called a 2-pseudoprime, and unfortunately there are infinitely many of them. (Of course, even though there are infinitely many 2-pseudoprimes, they are not as dense as the primes; that is, there are relatively very few of them. So if we generate a large number $n$ randomly, and see if $2^{n-1} = 1 \bmod n$, we will most likely be right if we then say $n$ is prime if it passes this test. In practice, this might be good enough! This is not a good primality test, however, if an NSA official you know gives you a number to test for primality, and you think they might be trying to fool you. The NSA might be purposely giving you a 2-pseudoprime. They can be tricky that way.)

You might think to try a different base, other than 2. For example, you might choose 3, or a random value of $a$. Unfortunately, there are infinitely many 3-pseudoprimes.
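To see the one-way nature of this test concretely, here is a small sketch that runs the base-2 test on every number below 1000 and lists the composites that slip through. The helper names are illustrative, and trial division is used only as a slow reference check.

```python
def passes_fermat_test(n: int, a: int = 2) -> bool:
    """Return True if a^(n-1) = 1 mod n, that is, n passes the base-a test."""
    return pow(a, n - 1, n) == 1

def is_prime_by_trial_division(n: int) -> bool:
    """Slow but certain primality check, used only to label small examples."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Composite numbers below 1000 that nevertheless pass the base-2 test.
pseudoprimes = [
    n for n in range(3, 1000)
    if passes_fermat_test(n) and not is_prime_by_trial_division(n)
]
print(pseudoprimes)
```

The list starts with 341, the example from the text; only a handful of composites below 1000 pass, which matches the remark that 2-pseudoprimes are sparse.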
In fact, there are infinitely many composite numbers $n$ such that $a^{n-1} = 1 \bmod n$ for all $a$ that do not share a factor with $n$. (That is, for all $a$ such that the greatest common divisor of $a$ and $n$ is 1.) Such $n$ are called Carmichael numbers; the smallest such number is 561. So a test based on this approach is destined to fail for some numbers.

There is a way around this problem, due to Rabin. Let $n - 1 = 2^t u$, with $u$ odd. Suppose we choose a random base $a$ and compute $a^{n-1}$ by first computing $a^u$ and then repeatedly squaring. Along the way, we will check to see for the values $a^u, a^{2u}, \ldots$ whether they have the following property:

$$a^{2^{i-1} u} \neq \pm 1 \bmod n, \qquad a^{2^i u} = 1 \bmod n.$$

That is, suppose we find a non-trivial square root of 1 modulo $n$. It turns out that only composite numbers have non-trivial square roots; prime numbers don't. In fact, if we choose $a$ randomly, and $n$ is composite, for at least 3/4 of the values of $a$, one of two things will happen: we will either find a non-trivial square root of 1 using this process, or we will find that $a^{n-1} \neq 1 \bmod n$. In either case, we know that $n$ is composite!

A value of $a$ for which either $a^{n-1} \neq 1 \bmod n$ or the computation of $a^{n-1}$ yields a non-trivial square root of 1 is called a witness to the compositeness of $n$. We have said that 3/4 of the possible values of $a$ are witnesses (we will not prove this here!). So if we pick a single value of $a$ randomly, and $n$ is composite, we will determine that $n$ is composite with probability at least 3/4. How can we improve the probability of catching when $n$ is composite?

The simplest way is just to repeat the test several times, each time choosing a value of $a$ randomly. (Note that we do not even have to go to the trouble of making sure we try different values of $a$ each time; we can choose values with replacement!) Each time we try this we have a probability of at least 3/4 of catching that $n$ is composite, so if


we try the test $k$ times, we will return the wrong answer in the case where $n$ is composite with probability at most $(1/4)^k$. For $k = 25$, the probability of the algorithm itself making an error is thus at most $(1/2)^{50}$; the probability that a random cosmic ray affected your arithmetic unit is probably higher!

This trick comes up again and again with randomized algorithms. If the probability of catching an error on a single trial is $p$, the probability of failing to catch an error after $t$ trials is $(1-p)^t$, assuming each trial is independent. By making $t$ sufficiently large, the probability of error can be reduced. Since the probability shrinks exponentially in $t$, a few trials can produce a great deal of security in the answer.
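Rabin's test with repeated trials can be sketched as follows. This is a minimal illustration under stated assumptions, not a hardened implementation: the function names are made up here, bases are drawn from 2 to n − 2, and the 3/4 witness bound is taken from the text rather than proved.

```python
import random

def is_witness(a: int, n: int) -> bool:
    """Return True if base a proves that the odd number n > 2 is composite."""
    # Write n - 1 = 2^t * u with u odd.
    t, u = 0, n - 1
    while u % 2 == 0:
        t += 1
        u //= 2
    x = pow(a, u, n)                  # a^u mod n
    for _ in range(t):
        y = x * x % n                 # next value in a^u, a^(2u), a^(4u), ...
        if y == 1 and x != 1 and x != n - 1:
            return True               # x is a non-trivial square root of 1
        x = y
    return x != 1                     # a^(n-1) != 1 mod n also proves compositeness

def probably_prime(n: int, trials: int = 25, rng=random) -> bool:
    """Answer False only with proof; a composite n slips through
    with probability at most (1/4)^trials."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    return not any(is_witness(rng.randrange(2, n - 1), n) for _ in range(trials))

print(is_witness(2, 341))     # 341 passes the plain base-2 test, but is caught here
print(probably_prime(561))    # the smallest Carmichael number
```

Bases are chosen with replacement, as the text suggests, so no bookkeeping of previously tried values is needed.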

CS 124

Lecture 14

14.1 Cryptography Fundamentals
Cryptography is concerned with the following scenario: two people, Alice and Bob, wish to communicate privately in the presence of an eavesdropper, Eve. In particular, suppose Alice wants to send Bob a message x. (For convenience, we will always assume our message has been converted into a bit string.) Using cryptography, Alice would compute a function e(x), the encoding of x, using some secret key, and transmit e(x) to Bob. Bob receives e(x), and using his own secret key, would compute a function d(e(x)) = x. The function d provides the decoding of the encoding e(x). Eve is presumably unable to recover x from e(x) because she does not have the key – without the key, computing x is either impossible or computationally difﬁcult.

14.1.1 One-Time Pad
A classical cryptographic method is the one-time pad. A one-time pad is a random string of bits $r$, equal in length to the message $x$, that Alice and Bob share and keep secret. By random, here we mean that $r$ is equally likely to be any bit string of the right length, $|r|$. Alice computes $e(x) = x \oplus r$; Bob computes $d(e(x)) = e(x) \oplus r = x \oplus r \oplus r = x$.

The claim is that Eve gets absolutely no information about the message by seeing $e(x)$. More concretely, we claim Pr(message is $x$ | $e(x)$) = Pr(message is $x$); that is, knowing $e(x)$ gives no more information to Eve than she already had. This is a nice exercise in conditional probabilities.

Since $e(x)$ provides no information, the one-time pad is completely secure. (Notice that this does not rely on notions of computational difficulty; Eve really obtains no additional information!) There are, however, crucial drawbacks.
• The key $r$ has to be as long as $x$.
• The key $r$ can only be used once. (To see this, suppose we use the same key $r$ to encode $x$ and $y$. Then Eve can compute $e(x) \oplus e(y) = x \oplus y$, which might yield useful information!)


• The key r has to be exchanged, by some other means. (Private courier?)
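The one-time pad and the danger of key reuse can be illustrated in a few lines. This is a sketch on bytes rather than individual bits; the helper `xor_bytes` and the sample messages are hypothetical.

```python
import secrets

def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(data) != len(pad):
        raise ValueError("the pad must be exactly as long as the message")
    return bytes(d ^ p for d, p in zip(data, pad))

message = b"ATTACK AT DAWN"
pad = secrets.token_bytes(len(message))    # the shared secret r
ciphertext = xor_bytes(message, pad)        # e(x) = x XOR r
recovered = xor_bytes(ciphertext, pad)      # d(e(x)) = e(x) XOR r = x

# Reusing the pad on a second message y lets Eve compute x XOR y
# from the two ciphertexts alone, without ever seeing r.
other = b"RETREAT AT TEN"
leak = xor_bytes(ciphertext, xor_bytes(other, pad))
```

Here `leak` equals `xor_bytes(message, other)`, which is exactly the quantity the second drawback warns about.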

14.1.2 DES
The Data Encryption Standard, or DES, is a U.S. government sponsored cryptographic method proposed in 1976. It uses a 56-bit key, again shared by Alice and Bob, and it encodes blocks of 64 bits using a complicated sequence of bit operations. Many have suspected that the government engineered the DES standard so that they could break it easily, but nobody has shown a simpler method for breaking DES other than trying the $2^{56}$ possible keys. These days, however, trying even this large number of keys can be accomplished in just a few days with specialized hardware. Hence DES is widely considered no longer secure.

14.1.3 RSA
RSA (named after its inventors, Ron Rivest, Adi Shamir, and Len Adleman) was developed around the same time as DES. RSA is an example of public key cryptography. In public key cryptography, Bob has two keys: a public key, $k_e$, known to everyone, and a private key, $k_d$, known only to Bob. If Alice (or anyone else) wants to send a message $x$ to Bob, she encrypts it as $e(x)$ using the public key; Bob then decrypts it using his private key. For this to be secure, the private key must be hard to compute from the public key, and similarly $x$ must be hard to compute from $e(x)$. The RSA algorithm depends on some number theory and simple algorithms, which we will consider before describing RSA. We will then describe how RSA is efficient and secure.

14.2 Tools for RSA
14.2.1 Primality
For the time being, we will assume that it is possible to generate large prime numbers. In fact, there are simple and efﬁcient randomized algorithms for generating large primes, that we will consider later in the course.


14.2.2 Euclid’s Greatest Common Divisor Algorithm
Definition: The greatest common divisor (or gcd) of integers $a, b \ge 0$ is the largest integer $d \ge 0$ such that $d \mid a$ and $d \mid b$, where $d \mid a$ denotes that $d$ divides $a$.

Example: gcd(360, 84) = 12.

One way of computing the gcd is to factor the two numbers, and find the common prime factors (with the right multiplicity). Factoring, however, is a problem for which we do not have general efficient algorithms. The following algorithm, due to Euclid, avoids factoring. Assume $a \ge b \ge 0$.

    function Euclid(a, b)
        if b = 0 return(a)
        return(Euclid(b, a mod b))
    end Euclid

Euclid's algorithm relies on the fact that gcd(a, b) = gcd(b, a mod b). You should prove this as an exercise.

We need to check that this algorithm is efficient. We will assume that mod operations are efficient (in fact they can be done in $O(\log^2 a)$ bit operations). How many mod operations must be performed? To analyze this, we notice that in the recursive calls of Euclid's algorithm, the numbers always get smaller. For the algorithm to be efficient, we'd like to have only about $O(\log a)$ recursive calls. This will require the numbers to shrink by a constant factor after a constant number of rounds. In fact, we can show that the larger number shrinks by a factor of 2 every 2 rounds.

Claim 1: $a \bmod b \le a/2$.
Proof: The claim is trivially true if $b \le a/2$, since $a \bmod b < b$. If $b > a/2$, then $a \bmod b = a - b \le a/2$.

Claim 2: On calling Euclid(a, b), after the second recursive call Euclid(a′, b′) has $a' \le a/2$.
Proof: For the second recursive call, we will have $a' = a \bmod b$, which is at most $a/2$ by Claim 1.
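A direct transcription of Euclid's algorithm, together with a counter for the number of mod operations, might look like the sketch below. The counting helper `count_mod_operations` is an illustrative addition for checking the O(log a) bound on small inputs.

```python
def euclid(a: int, b: int) -> int:
    """Greatest common divisor of a >= b >= 0, following the recursive definition."""
    if b == 0:
        return a
    return euclid(b, a % b)

def count_mod_operations(a: int, b: int) -> int:
    """Number of mod operations Euclid(a, b) performs (one per recursive call)."""
    count = 0
    while b != 0:
        a, b = b, a % b
        count += 1
    return count

print(euclid(360, 84))                 # the example from the text
print(count_mod_operations(360, 84))
print(count_mod_operations(89, 55))    # consecutive Fibonacci numbers shrink slowly
```

Consecutive Fibonacci numbers are a useful stress case, since each mod operation only steps down to the previous pair.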

14.2.3 Extended Euclid’s Algorithm
Euclid’s algorithm can be extended to give not just the greatest common divisor d = gcd(a, b), but also two integers x and y such that ax + by = d. This will prove useful to us subsequently, as we will explain.


    function Extended-Euclid(a, b)
        if b = 0 return(a, 1, 0)
        Compute k such that a = bk + (a mod b)
        (d, x, y) = Extended-Euclid(b, a mod b)
        return((d, y, x − ky))
    end Extended-Euclid

Claim 3: The Extended Euclid's algorithm returns the correct answer.
Proof: By induction on $a + b$. It clearly works if $b = 0$. (Note the understanding that all numbers divide 0!) If $b \neq 0$, then we may assume the recursive call provides the correct answer by induction, as $a \bmod b < a$. Hence we have $x$ and $y$ such that $bx + (a \bmod b)y = d$. But $(a \bmod b) = a - bk$, and hence by substitution we get $bx + (a - bk)y = d$, or $ay + b(x - ky) = d$. This shows the algorithm provides the correct output.

Note that the Extended Euclid's algorithm is clearly efficient, as it requires only a few extra arithmetic operations per recursive call over Euclid's algorithm.

The Extended Euclid's algorithm is useful if we wish to compute the inverse of a number. That is, suppose we wish to find $a^{-1} \bmod n$. The number $a$ has a multiplicative inverse modulo $n$ if and only if the gcd of $a$ and $n$ is 1. Moreover, the Extended Euclid's algorithm gives us that number. Since in this case computing gcd(a, n) gives $x, y$ such that $ax + ny = 1$, we have that $x = a^{-1} \bmod n$.
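The extended algorithm and its use for modular inverses can be sketched as follows. The wrapper `inverse_mod` is a hypothetical helper name; it reduces x into the range 0 to n − 1, since the x returned by the recursion may be negative.

```python
def extended_euclid(a: int, b: int):
    """Return (d, x, y) with d = gcd(a, b) and a*x + b*y = d."""
    if b == 0:
        return a, 1, 0
    k = a // b                          # a = b*k + (a mod b)
    d, x, y = extended_euclid(b, a % b)
    return d, y, x - k * y

def inverse_mod(a: int, n: int) -> int:
    """Multiplicative inverse of a modulo n; requires gcd(a, n) = 1."""
    d, x, _ = extended_euclid(a, n)
    if d != 1:
        raise ValueError("a has no inverse modulo n")
    return x % n

d, x, y = extended_euclid(360, 84)
print(d, x, y, 360 * x + 84 * y)
print(inverse_mod(5, 7))
```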

14.2.4 Exponentiation
Suppose we have to compute $x^y \bmod z$, for integers $x, y, z$. Multiplying $x$ by itself $y$ times is one possibility, but it is too slow. A more efficient approach is to repeatedly square from $x$, to get $x^2 \bmod z$, $x^4 \bmod z$, $x^8 \bmod z$, $\ldots$, $x^{2^{\lfloor \log y \rfloor}} \bmod z$. Now $x^y$ can be computed by multiplying together modulo $z$ the powers that correspond to ones in the binary representation of $y$.
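Repeated squaring can be written as a short loop over the bits of y. This is a sketch with an illustrative function name; Python's built-in three-argument `pow` computes the same thing and is used here only for comparison.

```python
def mod_exp(x: int, y: int, z: int) -> int:
    """Compute x^y mod z for y >= 0 by repeated squaring."""
    result = 1 % z
    power = x % z                  # holds x^(2^i) mod z at step i
    while y > 0:
        if y & 1:                  # this bit of y is a one: include the power
            result = result * power % z
        power = power * power % z  # square to get the next power of two
        y >>= 1
    return result

print(mod_exp(2, 340, 341), pow(2, 340, 341))
```

The loop runs once per bit of y, so it uses O(log y) multiplications modulo z.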

14.3 The RSA Protocol
To create a public key, Bob ﬁnds two large primes, p and q, of roughly the same size. (Large should be a few hundred decimal digits. Recently, with a lot of work, 512-bit RSA has been broken; this corresponds to n = pq being 512


bits long.) Bob computes $n = pq$, and also chooses a random integer $e$ such that $\gcd((p-1)(q-1), e) = 1$. (An alternative to choosing $e$ randomly often used in practice is to choose $e = 3$, in which case $p$ and $q$ cannot equal 1 modulo 3.)

The pair $(n, e)$ is Bob's public key, which he announces to the world. Bob's private key is $d = e^{-1} \bmod (p-1)(q-1)$, which can be computed by the Extended Euclid's algorithm. More specifically, $(p, q, d)$ is Bob's private key.

Suppose Alice wants to send a message to Bob. We think of the message as being a number $x$ from the range $[1, n-1]$. (If the message is too big to be represented by a number this small, it must be broken up into pieces; for example, the message could be broken into bit strings of length $\lfloor \log n \rfloor$.) To encode the message, Alice computes and sends to Bob

$$e(x) = x^e \bmod n.$$

Upon receipt, Bob computes

$$d(e(x)) = (e(x))^d \bmod n.$$

To show that this operation decodes correctly, we must prove:

Claim 4: $d(e(x)) = x$.
Proof: We use the steps:

$$e(x)^d = x^{de} = x^{1 + k(p-1)(q-1)} = x \bmod n.$$

The first equation recalls the definition of $e(x)$. The second uses the fact that $d = e^{-1} \bmod (p-1)(q-1)$, and hence $de = 1 + k(p-1)(q-1)$ for some integer $k$. The last equality is much less trivial. It will help us to have the following lemma:

Claim 5: (Fermat's Little Theorem) If $p$ is prime, then for $a \neq 0 \bmod p$, we have $a^{p-1} = 1 \bmod p$.
Proof: Look at the numbers $1, 2, \ldots, p-1$. Suppose we multiply them all by $a$ modulo $p$, to get $a \cdot 1 \bmod p, a \cdot 2 \bmod p, \ldots, a \cdot (p-1) \bmod p$. We claim that the two sets of numbers are the same! This is because every pair of numbers in the second group is different; this follows since if $a \cdot i = a \cdot j \bmod p$, then by multiplying by $a^{-1}$, we must have $i = j \bmod p$. But if all the numbers in the second group are different modulo $p$, since none of them are 0, they must just be $1, 2, \ldots, p-1$. (To get a feel for this, take an example: when $p = 7$ and $a = 5$, multiplying $a$ by the numbers $\{1, 2, 3, 4, 5, 6\}$ yields $\{5, 3, 1, 6, 4, 2\}$.) From the above equality of sets of numbers, we conclude

$$1 \cdot 2 \cdots (p-1) = (a \cdot 1) \cdot (a \cdot 2) \cdots (a \cdot (p-1)) \bmod p.$$


Multiplying both sides by $1^{-1}, 2^{-1}, \ldots, (p-1)^{-1}$, we have $1 = a^{p-1} \bmod p$. This proves Claim 5.

We now return to the end of Claim 4, where we must prove $x^{1+k(p-1)(q-1)} = x \bmod n$. We first claim that $x^{1+k(p-1)(q-1)} = x \bmod p$. This is clearly true if $x = 0 \bmod p$. If $x \neq 0 \bmod p$, then by Fermat's Little Theorem, $x^{p-1} = 1 \bmod p$, and hence $x^{k(p-1)(q-1)} = 1 \bmod p$, from which we have $x^{1+k(p-1)(q-1)} = x \bmod p$. By the same argument we also have $x^{1+k(p-1)(q-1)} = x \bmod q$. But if a number is equal to $x$ both modulo $p$ and modulo $q$, it is equal to $x$ modulo $n = p \cdot q$. Hence $x^{1+k(p-1)(q-1)} = x \bmod n$, and Claim 4 is proven.

We have shown that the RSA protocol allows for correct encoding and decoding. We also should be convinced it is efficient, since it requires only operations that we know to be efficient, such as Euclid's algorithm and modular exponentiation. One thing we have not yet asked is why the scheme is secure. That is, why can't the eavesdropper Eve recover the message $x$ also?

The answer, unfortunately, is that there is no proof that Eve cannot compute $x$ efficiently from $e(x)$. There is simply a belief that this is a hard problem. It is an unproven assumption that there is no efficient algorithm for computing $x$ from $e(x)$. There is the real but unlikely possibility that someone out there can read all messages sent using RSA!

Let us seek some idea of why RSA is believed to be secure. If Eve obtains $e(x) = x^e \bmod n$, what can she do? She could try all possible values of $x$ to try to find the correct one; this clearly takes too long. Or she could try to factor $n$ and compute $d$. Factoring, however, is a widely known and well studied problem, and nobody has come up with a polynomial time algorithm for the problem. In fact, it is widely believed that no such algorithm exists.

It would be nice if we could make some sort of guarantee. For example, suppose that breaking RSA allowed us to factor $n$. Then we could say that RSA is as hard as factoring.
Unfortunately, this is not the case either. It is possible that RSA could be broken without providing a general factoring algorithm, although it seems that any natural approach for breaking RSA would also provide a way to factor n.
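The whole protocol can be traced on a toy example. The sketch below uses deliberately tiny primes (p = 61, q = 53) and public exponent e = 17, all chosen here purely for illustration and far too small to be secure; it relies on Python 3.8 or later, where `pow(e, -1, m)` returns a modular inverse.

```python
# Toy parameters, assumed for illustration only; real keys use primes
# hundreds of digits long.
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # (p - 1)(q - 1)
e = 17                         # public exponent, gcd(e, phi) = 1
d = pow(e, -1, phi)            # private exponent d = e^(-1) mod (p - 1)(q - 1)

def encode(x: int) -> int:
    """Alice's step: e(x) = x^e mod n, using only the public key (n, e)."""
    return pow(x, e, n)

def decode(c: int) -> int:
    """Bob's step: d(e(x)) = e(x)^d mod n, using the private exponent d."""
    return pow(c, d, n)

x = 65
c = encode(x)
print(n, d, c, decode(c))
```

With numbers this small, Eve could simply factor n by trial division, which is exactly why the primes must be large.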

CS124

Lecture 15

15.1 2SAT
We begin by showing yet another possible way to solve the 2SAT problem. Recall that the input to 2SAT is a logical expression that is the conjunction (AND) of a set of clauses, where each clause is the disjunction (OR) of two literals. (A literal is either a Boolean variable or the negation of a Boolean variable.) For example, the following expression is an instance of 2SAT:

$$(x_1 \lor \overline{x_2}) \land (\overline{x_1} \lor \overline{x_3}) \land (x_1 \lor x_2) \land (x_4 \lor \overline{x_3}) \land (x_4 \lor \overline{x_1}).$$

A solution to an instance of a 2SAT formula is an assignment of the variables to the values T (true) and F (false) so that all the clauses are satisfied; that is, there is at least one true literal in each clause. For example, the assignment $x_1 = T$, $x_2 = F$, $x_3 = F$, $x_4 = T$ satisfies the 2SAT formula above.

Here is a simple randomized solution to the 2SAT problem. Start with some truth assignment, say by setting all the variables to false. Find some clause that is not yet satisfied. Randomly choose one of the variables in that clause, say by flipping a coin, and change its value. Continue this process, until either all clauses are satisfied or you get tired of flipping coins.

In the example above, when we begin with all variables set to F, the clause $(x_1 \lor x_2)$ is not satisfied. So we might randomly choose to set $x_1$ to be T. In this case this would leave the clause $(x_4 \lor \overline{x_1})$ unsatisfied, so we would have to flip a variable in the clause, and so on.

Why would this algorithm tend to lead to a solution? Let us suppose that there is a solution, call it $S$. Suppose we keep track of the number of variables in our current assignment $A$ that match $S$. Call this number $k$. We would like to get to the point where $k = n$, the number of variables in the formula, for then $A$ would match the solution $S$. How does $k$ evolve over time?

At each step, we choose a clause that is unsatisfied. Hence we know that $A$ and $S$ disagree on the value of at least one of the variables in this clause; if they agreed on both, the clause would have to be satisfied!
If they disagree on both, then clearly changing either one of the values will increase $k$. If they disagree on the value of exactly one of the two variables, then with probability 1/2 we choose that variable and increase $k$ by 1; with probability 1/2 we choose the other variable and decrease $k$ by 1. Hence, in the worst case, $k$ behaves like a random walk: it either goes up or down by 1, randomly. This leaves us with the following question: if we start $k$ at 0, how many steps does it take (on average, or with high probability) for $k$ to stumble all the way up to $n$, the number of variables?
We can check that the average number of steps to walk (randomly) from 0 to $n$ is just $n^2$. In fact, the average amount of time to walk from $i$ to $n$ is $n^2 - i^2$. Note that the average time $T(i)$ to walk from $i$ to $n$ is given by:

$$T(n) = 0,$$
$$T(i) = \frac{T(i-1)}{2} + \frac{T(i+1)}{2} + 1, \quad 1 \le i \le n-1,$$
$$T(0) = T(1) + 1.$$


These equations completely determine $T(i)$, and our solution satisfies these equations! Hence, on average, we will find a solution in at most $n^2$ steps. (We might do better: we might not start with all of our variables wrong, or we might have some moves where we must improve the number of matches!)

We can run our algorithm for say $100n^2$ steps, and report that no solution was found if none was found. This algorithm might return the wrong answer: there may be a satisfying truth assignment, and we have just been unlucky. But most of the time it will be right.
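The random walk can be sketched in a few lines. In this illustration a literal is encoded as a nonzero integer (+i for x_i, −i for its negation), which is an encoding chosen here; the placement of the negations in `formula` is likewise an assumption, picked so that the formula agrees with the walkthrough and the satisfying assignment given in the text.

```python
import random

def random_walk_2sat(clauses, n, rng=None, factor=100):
    """Try to satisfy a 2SAT formula on variables 1..n by random flips.

    Returns a dict {variable: bool}, or None if no satisfying assignment
    was found within factor * n^2 flips.
    """
    rng = rng or random.Random()
    assignment = {i: False for i in range(1, n + 1)}   # start all false

    def literal_true(lit):
        return assignment[abs(lit)] == (lit > 0)

    def first_unsatisfied():
        for clause in clauses:
            if not (literal_true(clause[0]) or literal_true(clause[1])):
                return clause
        return None

    for _ in range(factor * n * n):
        clause = first_unsatisfied()
        if clause is None:
            return assignment
        var = abs(rng.choice(clause))      # coin flip between the two literals
        assignment[var] = not assignment[var]
    return assignment if first_unsatisfied() is None else None

# Example formula; the signs are an assumed reading of the example in the text.
formula = [(1, -2), (-1, -3), (1, 2), (4, -3), (4, -1)]
solution = random_walk_2sat(formula, 4, random.Random(0))
print(solution)
```

A return value of None only means that no solution was found within the step budget; as the text notes, it is not a proof that the formula is unsatisfiable.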

CS124

Lecture 16

An introductory example
Suppose that a company that produces three products wishes to decide the level of production of each so as to maximize profits. Let $x_1$ be the amount of Product 1 produced in a month, $x_2$ that of Product 2, and $x_3$ that of Product 3. Each unit of Product 1 yields a profit of 100, each unit of Product 2 a profit of 600, and each unit of Product 3 a profit of 1400. There are limitations on $x_1$, $x_2$, and $x_3$ (besides the obvious one, that $x_1, x_2, x_3 \ge 0$). First, $x_1$ cannot be more than 200, and $x_2$ cannot be more than 300, presumably because of supply limitations. Also, the sum of the three must be, because of labor constraints, at most 400. Finally, it turns out that Products 2 and 3 use the same piece of equipment, with Product 3 using three times as much, and hence we have another constraint $x_2 + 3x_3 \le 600$. What are the best levels of production? We represent the situation by a linear program, as follows:

$$\begin{array}{rl}
\max & 100x_1 + 600x_2 + 1400x_3 \\
 & x_1 \le 200 \\
 & x_2 \le 300 \\
 & x_1 + x_2 + x_3 \le 400 \\
 & x_2 + 3x_3 \le 600 \\
 & x_1, x_2, x_3 \ge 0
\end{array}$$

The set of all feasible solutions of this linear program (that is, all vectors in 3-d space that satisfy all constraints) is precisely the polyhedron shown in Figure 16.1.

We wish to maximize the linear function $100x_1 + 600x_2 + 1400x_3$ over all points of this polyhedron. Geometrically, the linear equation $100x_1 + 600x_2 + 1400x_3 = c$ can be represented by a plane parallel to the one determined by the equation $100x_1 + 600x_2 + 1400x_3 = 0$. This means that we want to find the plane of this type that touches the polyhedron and is as far towards the positive orthant as possible. Obviously, the optimum solution will be a vertex (or the optimum solution will not be unique, but a vertex will do). Of course, two other possibilities with linear programming are that (a) the optimum solution may be infinity, or (b) that there may be no feasible solution at all.

[Figure 16.1: The feasible region, a polyhedron drawn on axes $x_1$, $x_2$, $x_3$, with the optimum vertex marked "opt".]

In this case, an optimal solution exists, and moreover we shall show that it is easy to find.
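Because the optimum is attained at a vertex, a program this small can be solved by brute force: intersect every triple of constraint planes, keep the feasible intersection points, and take the best one. The sketch below does this with exact rational arithmetic. It is not the simplex method, and all helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (coefficients, bound), meaning coefficients . x <= bound.
constraints = [
    ((1, 0, 0), 200), ((0, 1, 0), 300), ((1, 1, 1), 400), ((0, 1, 3), 600),
    ((-1, 0, 0), 0), ((0, -1, 0), 0), ((0, 0, -1), 0),   # x1, x2, x3 >= 0
]
profit = (100, 600, 1400)

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if it is singular."""
    D = det3(rows)
    if D == 0:
        return None
    solution = []
    for col in range(3):
        m = [list(r) for r in rows]
        for r in range(3):
            m[r][col] = rhs[r]        # replace one column by the right-hand side
        solution.append(Fraction(det3(m), D))
    return tuple(solution)

def feasible(x):
    return all(sum(c * v for c, v in zip(coef, x)) <= bound
               for coef, bound in constraints)

def value(x):
    return sum(p * v for p, v in zip(profit, x))

vertices = set()
for trio in combinations(constraints, 3):
    x = solve3([c for c, _ in trio], [b for _, b in trio])
    if x is not None and feasible(x):
        vertices.add(x)

best = max(vertices, key=value)
print(best, value(best))
```

Enumeration examines every triple of constraints, so it only makes sense for tiny programs; the simplex method discussed next avoids this by walking from vertex to neighboring vertex.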

Linear programs
Linear programs, in general, have the following form: there is an objective function that one seeks to optimize, along with constraints on the variables. The objective function and the constraints are all linear in the variables; that is, all equations have no powers of the variables, nor are the variables multiplied together. As we shall see, a great many problems can be represented by linear programs, and for many problems it is an extremely convenient representation. So once we explain how to solve linear programs, the question then becomes how to reduce other problems to linear programming (LP). There are polynomial time algorithms for solving linear programs. In practice, however, such problems are solved by the simplex method devised by George Dantzig in 1947. The simplex method starts from a vertex (in this


case the vertex (0, 0, 0)) and repeatedly looks for a vertex that is adjacent, and has better objective value. That is, it is a kind of hill-climbing in the vertices of the polytope. When a vertex is found that has no better neighbor, simplex stops and declares this vertex to be the optimum. For example, in the figure one of the possible paths followed by simplex is shown. No known variant of the simplex algorithm has been proven to take polynomial time, and most of the variations used in practice have been shown to take exponential time on some examples. Fortunately, in practice, bad cases rarely arise, and the simplex algorithm runs extremely quickly. There are now implementations of simplex that routinely solve linear programs with many thousands of variables and constraints. Of course, given a linear program, it is possible either that (a) the optimum solution may be infinity, or (b) that there may be no feasible solution at all. If this is the case, the simplex algorithm will discover it.

Reductions between versions of simplex
A general linear programming problem may involve constraints that are equalities or inequalities in either direction. Its variables may be nonnegative, or could be unrestricted in sign. And we may be either minimizing or maximizing a linear function. It turns out that we can easily translate any such version to any other. One such translation that is particularly useful is from the general form to the one required by simplex: minimization, nonnegative variables, and equality constraints. To turn an inequality ∑ ai xi ≤ b into an equality constraint, we introduce a new variable s (the slack variable for this inequality), and rewrite this inequality as ∑ ai xi + s = b, s ≥ 0. Similarly, any inequality ∑ ai xi ≥ b is rewritten as ∑ ai xi − s = b, s ≥ 0; s is now called a surplus variable.
We handle an unrestricted variable $x$ as follows: we introduce two nonnegative variables, $x^+$ and $x^-$, and replace $x$ by $x^+ - x^-$ everywhere. The idea is that we let $x = x^+ - x^-$, where we may restrict both $x^+$ and $x^-$ to be nonnegative. This way, $x$ can take on any value, but there are only nonnegative variables. Finally, to turn a maximization problem into a minimization one, we just multiply the objective function by $-1$.
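As a worked instance of these translations, here is the introductory three-product program rewritten in the form required by simplex. The slack variable names $s_1, \ldots, s_4$ are introduced here for illustration; no variable splitting is needed, since all three original variables are already nonnegative.

```latex
\begin{align*}
\min\ & -100x_1 - 600x_2 - 1400x_3 \\
\text{subject to}\ & x_1 + s_1 = 200 \\
& x_2 + s_2 = 300 \\
& x_1 + x_2 + x_3 + s_3 = 400 \\
& x_2 + 3x_3 + s_4 = 600 \\
& x_1, x_2, x_3, s_1, s_2, s_3, s_4 \ge 0
\end{align*}
```

The minimum of this program is the negative of the maximum profit of the original one, and the values of $x_1, x_2, x_3$ at the optimum are unchanged.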

A production scheduling example
We have the demand estimates for our product for all months of 1997, $d_i : i = 1, \ldots, 12$, and they are very uneven, ranging from 440 to 920. We currently have 30 employees, each of whom produces 20 units of the product each month at a salary of 2,000; we have no stock of the product. How can we handle such fluctuations in demand? Three ways:


• overtime, but this is expensive since it costs 80% more than regular production, and has limitations, as workers can only work 30% overtime.
• hire and fire workers, but hiring costs 320, and firing costs 400.
• store the surplus production, but this costs 8 per item per month.

This rather involved problem can be formulated and solved as a linear program. As in all such reductions, the crucial first step is defining the variables:
• Let $w_i$ be the number of workers we have in the $i$th month; we start with $w_0 = 30$.
• Let $x_i$ be the production for month $i$.
• $o_i$ is the number of items produced by overtime in month $i$.
• $h_i$ and $f_i$ are the number of workers hired/fired in the beginning of month $i$.
• $s_i$ is the amount of product stored after the end of month $i$; since we have no stock, $s_0 = 0$.

We now must write the constraints:
• $x_i = 20w_i + o_i$: the amount produced is the one produced by regular production, plus overtime.
• $w_i = w_{i-1} + h_i - f_i$, $w_i \ge 0$: the changing number of workers.
• $s_i = s_{i-1} + x_i - d_i \ge 0$: the amount stored at the end of this month is what we started with, plus the production, minus the demand.
• $o_i \le 6w_i$: only 30% overtime.

Finally, what is the objective function? It is

$$\min\ 2000 \sum w_i + 400 \sum f_i + 320 \sum h_i + 8 \sum s_i + 180 \sum o_i,$$

where the summations are from $i = 1$ to 12.

A Communication Network Problem


We have a network whose lines have the bandwidth shown in Figure 16.2. We wish to establish three calls: one between A and B (call 1), one between B and C (call 2), and one between A and C (call 3). We must give each call at least 2 units of bandwidth, but possibly more. The call between A and B pays 3 per unit of bandwidth, the call between B and C pays 2, and the call between A and C pays 4. Notice that each call can be routed in two ways (the long and the short path), or by a combination (for example, two units of bandwidth via the short route, and three via the long route). Suppose we are a shady network administrator, and our goal is to maximize the network's income (rather than minimize the overall cost). How do we route these calls to maximize the network's income?

[Figure 16.2: A communication network on the terminals A, B, and C, with link bandwidths 6, 8, 10, 11, 12, and 13.]

This is also a linear program. We have variables for each call and each path (long or short); for example, $x_1$ is the short path for call 1, and $x_2'$ the long path for call 2. We demand that (1) no edge bandwidth is exceeded, and (2) each call gets a bandwidth of at least 2.

$$\begin{array}{rl}
\max & 3x_1 + 3x_1' + 2x_2 + 2x_2' + 4x_3 + 4x_3' \\
 & x_1 + x_1' + x_2 + x_2' \le 10 \\
 & x_1 + x_1' + x_3 + x_3' \le 12 \\
 & x_2 + x_2' + x_3 + x_3' \le 8 \\
 & x_1 + x_2' + x_3' \le 6 \\
 & x_1' + x_2 + x_3' \le 13 \\
 & x_1' + x_2' + x_3 \le 11 \\
 & x_1 + x_1' \ge 2 \\
 & x_2 + x_2' \ge 2 \\
 & x_3 + x_3' \ge 2 \\
 & x_1, x_1', \ldots, x_3' \ge 0
\end{array}$$

The solution, obtained via simplex in a few milliseconds, is the following: $x_1 = 0$, $x_1' = 7$, $x_2 = x_2' = 1.5$, $x_3 = 0.5$, $x_3' = 4.5$.

Question: Suppose that we removed the constraints stating that each call should receive at least two units. Would the optimum change?
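The stated solution can be checked against the constraints mechanically. In the sketch below, a trailing `p` in a variable name stands for a prime (the long path). The assignment of primes to variables in the three edge constraints is an assumed reading of the program, with each call's short path using one inner link and its long path using the other two. The check confirms feasibility and computes the income; it does not prove optimality.

```python
# Bandwidth on the short path (x1, x2, x3) and the long path (x1p, x2p, x3p)
# of each call, as given in the stated solution.
x1, x1p = 0, 7
x2, x2p = 1.5, 1.5
x3, x3p = 0.5, 4.5

# (left-hand side, capacity) for each bandwidth constraint.
capacity_rows = [
    (x1 + x1p + x2 + x2p, 10),
    (x1 + x1p + x3 + x3p, 12),
    (x2 + x2p + x3 + x3p, 8),
    (x1 + x2p + x3p, 6),
    (x1p + x2 + x3p, 13),
    (x1p + x2p + x3, 11),
]
call_totals = [x1 + x1p, x2 + x2p, x3 + x3p]

is_feasible = (
    all(lhs <= cap for lhs, cap in capacity_rows)
    and all(total >= 2 for total in call_totals)
    and min(x1, x1p, x2, x2p, x3, x3p) >= 0
)
income = 3 * (x1 + x1p) + 2 * (x2 + x2p) + 4 * (x3 + x3p)
tight = [cap for lhs, cap in capacity_rows if lhs == cap]

print(is_feasible, income, tight)
```

Five of the six bandwidth constraints hold with equality at this point, which is what one expects at a vertex of the feasible region.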

Approximate Separation
An interesting last application: Suppose that we have two sets of points in the plane, the black points $(x_i, y_i) : i = 1, \ldots, m$ and the white points $(x_i, y_i) : i = m+1, \ldots, m+n$. We wish to separate them by a straight line $ax + by = c$, so that for all black points $ax + by \le c$, and for all white points $ax + by \ge c$. In general, this would be impossible. Still, we may want to separate them by a line that minimizes the sum of the "displacement errors" (distance from the boundary) over all misclassified points. Here is the LP that achieves this:

$$\begin{array}{rl}
\min & e_1 + e_2 + \cdots + e_m + e_{m+1} + \cdots + e_{m+n} \\
 & e_1 \ge ax_1 + by_1 - c \\
 & e_2 \ge ax_2 + by_2 - c \\
 & \quad \vdots \\
 & e_m \ge ax_m + by_m - c \\
 & e_{m+1} \ge c - ax_{m+1} - by_{m+1} \\
 & \quad \vdots \\
 & e_{m+n} \ge c - ax_{m+n} - by_{m+n} \\
 & e_i \ge 0
\end{array}$$

Network Flows
Suppose that we are given the network at the top of Figure 16.3, where the numbers indicate capacities, that is, the amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go through this network, from S to T.

[Figure: the example network on the nodes S, A, B, C, D, T with its edge capacities (top), the flow after each successive augmentation, and the final flow with the minimum cut of capacity 6 marked (bottom).]
Figure 16.3: Max flow

This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, representing the flow through this edge. These variables are denoted $f_{SA}$, $f_{SB}$, ... We have two kinds of constraints: capacity constraints such as $f_{SA} \le 5$ (a total of 9 such constraints, one for each edge), and flow conservation constraints (one for each node except S and T), such as $f_{AD} + f_{BD} = f_{DC} + f_{DT}$ (a total of 4 such constraints). We wish to maximize $f_{SA} + f_{SB}$, the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear program is equivalent to the max-flow problem. The simplex method would correctly solve it.

In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it. How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search), and moves flow along this path of total value equal to the minimum capacity of an edge on the path (it can obviously do no better). This is the first iteration of simplex (see Figure 16.3).

How would simplex continue? It would look for another path from S to T. Since this time we already partially (or totally) use some of the edges, we should do depth-first search on the edges that have some residual capacity, above and beyond the flow they already carry. Thus, the edge C-T would be ignored, as if it were not there. The depth-first search would now find the path S-A-D-T, and augment the flow by two more units, as shown in Figure 16.3. Next, simplex would again try to find a path from S to T. The path is now S-A-B-D-T (the edges C-T and A-D are full and are therefore ignored), and we augment the flow as shown in the bottom of Figure 16.3. Next simplex would again try to find a path. But since edges A-D, C-T, and S-B are full, they must be ignored, and therefore depth-first search would fail to find a path, after marking the nodes S, A, C as reachable from S. Simplex then returns the flow shown, of value 6, as maximum.

How can we be sure that it is the maximum? Notice that these reachable nodes define a cut (a set of nodes containing S but not T), and the capacity of this cut (the sum of the capacities of the edges going out of this set) is 6, the same as the max-flow value. (It must be the same, since this flow passes through this cut.) The existence of this cut establishes that the flow is optimum!

There is a complication that we have swept under the rug so far: when we do depth-first search looking for a path, we use not only the edges that are not completely full, but we must also traverse in the opposite direction all edges that already have some non-zero flow. This would have the effect of canceling some flow; canceling may be necessary to achieve optimality, see Figure 16.4. In this figure the only way to augment the current flow is via the path S-B-A-T, which traverses the edge A-B in the reverse direction (a legal traversal, since A-B is carrying non-zero flow).

[Figure: a network on the nodes S, A, B, T with edges S-A, S-B, A-B, A-T, and B-T, each of capacity 1.]
Figure 16.4: Flows may have to be canceled

In general, a path from the source to the sink along which we can increase the flow is called an augmenting path. We can look for an augmenting path by doing, for example, a depth-first search along the residual network, which we now describe. For an edge $(u, v)$, let $c(u, v)$ be its capacity, and let $f(u, v)$ be the flow across the edge. Note that we adopt the following convention: if 4 units flow from u to v, then $f(u, v) = 4$, and $f(v, u) = -4$. That is, we interpret the fact that we could reverse the flow across an edge as being equivalent to a “negative flow”. Then the residual capacity of an edge $(u, v)$ is just $c(u, v) - f(u, v)$. The residual network has the same vertices as the original graph; the edges of the residual network consist of all weighted edges with strictly positive residual capacity. The idea is then that if we find a path from the source to the sink in the residual network, we have an augmenting path to increase the flow in the original network. As an exercise, you may want to consider the residual network at each step in Figure 16.3.

Suppose we look for a path in the residual network using depth-first search. In the case where the capacities are integers, we will always be able to push an integral amount of flow along an augmenting path. Hence, if the maximum flow is $f^*$, the total time to find the maximum flow is $O(E f^*)$, since we may have to do an $O(E)$ depth-first search up to $f^*$ times. This is not so great. Note that we do not have to do a depth-first search to find an augmenting path in the residual network. In fact, using a breadth-first search each time yields an algorithm that provably runs in $O(V E^2)$ time, regardless of whether or not the capacities are integers. We will not prove this here. There are also other algorithms and approaches to the max-flow problem that improve on this running time.

To summarize: the max-flow problem can be easily reduced to linear programming and solved by simplex. But it is easier to understand what simplex would do by following its iterations directly on the network. It repeatedly finds a path from S to T along edges that are not yet full (have non-zero residual capacity), and also along any reverse edges with non-zero flow. If an S-T path is found, we augment the flow along this path, and repeat. When a path cannot be found, the set of nodes reachable from S defines a cut of capacity equal to the max-flow. Thus, the value of the maximum flow is always equal to the capacity of the minimum cut. This is the important max-flow min-cut theorem. One direction (that max-flow ≤ min-cut) is easy (think about it: the capacity of any cut is at least the value of any flow); the other direction is proved by the algorithm just described.
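The augmenting-path procedure summarized above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: it uses breadth-first search on the residual network, stores flow with the convention $f(v, u) = -f(u, v)$ so that canceling flow needs no special case, and returns the set of nodes still reachable from the source, which defines the cut. The function name `max_flow` and the dictionary-based graph encoding are choices made here; the sample network assumes the unit-capacity edges described for Figure 16.4.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """capacity maps directed edges (u, v) to capacities.

    Returns (flow value, flow dict, nodes reachable from source at the end).
    """
    cap = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        cap[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)        # reverse direction, used to cancel flow
    flow = defaultdict(int)        # invariant: flow[(v, u)] == -flow[(u, v)]

    def residual(u, v):
        return cap[(u, v)] - flow[(u, v)]

    def search():
        """Breadth-first search over edges with positive residual capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual(u, v) > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    value = 0
    while True:
        parent = search()
        if sink not in parent:     # no augmenting path: the flow is maximum
            return value, dict(flow), set(parent)
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += bottleneck
            flow[(v, u)] -= bottleneck
        value += bottleneck


# Assumed encoding of the network of Figure 16.4 (all capacities 1).
FIGURE_16_4 = {("S", "A"): 1, ("S", "B"): 1, ("A", "B"): 1,
               ("A", "T"): 1, ("B", "T"): 1}
```

On `FIGURE_16_4` the returned value is 2 and the reachable set is just `{"S"}`; the two edges leaving that set have total capacity 2, matching the flow value as the max-flow min-cut theorem requires.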

CS124

Lecture 17


Duality
As it turns out, the max-flow min-cut theorem is a special case of a more general phenomenon called duality. Basically, duality means that for each maximization problem there is a corresponding minimization problem with the property that any feasible solution of the min problem is greater than or equal to any feasible solution of the max problem. Furthermore, and more importantly, they have the same optimum. Consider the network shown in Figure 17.3, and the corresponding max-flow problem. We know that it can be written as a linear program as follows:

[Figure: a network on the nodes S, A, B, T with edges S-A, S-B, A-B, A-T, and B-T of capacities 3, 2, 1, 1, and 3.]
Figure 17.3: A simple max-flow problem

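\[
\begin{aligned}
\text{(P)}\qquad \max\quad & f_{SA} + f_{SB} \\
& f_{SA} \le 3 \\
& f_{SB} \le 2 \\
& f_{AB} \le 1 \\
& f_{AT} \le 1 \\
& f_{BT} \le 3 \\
& f_{SA} - f_{AB} - f_{AT} = 0 \\
& f_{SB} + f_{AB} - f_{BT} = 0 \\
& f \ge 0
\end{aligned}
\]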

Consider now the following linear program:

\[
\begin{aligned}
\text{(D)}\qquad \min\quad & 3y_{SA} + 2y_{SB} + y_{AB} + y_{AT} + 3y_{BT} \\
& y_{SA} + u_A \ge 1 \\
& y_{SB} + u_B \ge 1 \\
& y_{AB} - u_A + u_B \ge 0 \\
& y_{AT} - u_A \ge 0 \\
& y_{BT} - u_B \ge 0 \\
& y \ge 0
\end{aligned}
\]

This LP describes the min-cut problem! To see why, suppose that the $u_A$ variable is meant to be 1 if A is in the cut with S, and 0 otherwise, and similarly for B (naturally, by the definition of a cut, S will always be with S in the cut, and T will never be with S). Each of the y variables is to be 1 if the corresponding edge contributes to the cut capacity, and 0 otherwise. Then the constraints make sure that these variables behave exactly as they should. For example, the first constraint states that if A is not with S, then SA must be added to the cut. The third one states that if A is with S and B is not (this is the only case in which the sum $-u_A + u_B$ becomes $-1$), then AB must contribute to the cut. And so on. Although the y’s and u’s are free to take values larger than one, they will be “slammed” by the minimization down to 1 or 0.
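For a network this small, the claim that the max-flow LP and the min-cut LP agree can be checked by brute force. The sketch below enumerates the four possible cuts (A and B each either with S or not), builds the corresponding 0/1 values of the y and u variables, confirms that each satisfies the min-cut constraints, and compares the cheapest cut with a hand-picked feasible flow. The capacities are those appearing in the max-flow LP for Figure 17.3; all function and variable names are illustrative.

```python
from itertools import product

CAPACITY = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1,
            ("A", "T"): 1, ("B", "T"): 3}


def cut_solution(u_a, u_b):
    """0/1 values (y, u) for the cut that puts A and/or B on S's side."""
    side = {"S": 1, "A": u_a, "B": u_b, "T": 0}
    y = {(p, q): int(side[p] == 1 and side[q] == 0) for (p, q) in CAPACITY}
    return y, {"A": u_a, "B": u_b}


def dual_feasible(y, u):
    """Constraints of the min-cut LP, one per edge of the network."""
    return (all(v >= 0 for v in y.values())
            and y[("S", "A")] + u["A"] >= 1
            and y[("S", "B")] + u["B"] >= 1
            and y[("A", "B")] - u["A"] + u["B"] >= 0
            and y[("A", "T")] - u["A"] >= 0
            and y[("B", "T")] - u["B"] >= 0)


def dual_objective(y):
    return sum(CAPACITY[e] * y[e] for e in CAPACITY)


def primal_feasible(f):
    """Capacity and conservation constraints of the max-flow LP."""
    return (all(0 <= f[e] <= CAPACITY[e] for e in CAPACITY)
            and f[("S", "A")] - f[("A", "B")] - f[("A", "T")] == 0
            and f[("S", "B")] + f[("A", "B")] - f[("B", "T")] == 0)


# A feasible flow found by inspection, and all four cuts.
FLOW = {("S", "A"): 2, ("S", "B"): 2, ("A", "B"): 1,
        ("A", "T"): 1, ("B", "T"): 3}
FLOW_VALUE = FLOW[("S", "A")] + FLOW[("S", "B")]
CUTS = [cut_solution(a, b) for a, b in product((0, 1), repeat=2)]
BEST_CUT = min(dual_objective(y) for y, _ in CUTS)
```

Every cut yields a feasible point of the min-cut LP, the four cut capacities are 5, 6, 4, and 4, and the cheapest cut has capacity 4, the same as `FLOW_VALUE`. Since no flow can exceed any cut, both are optimal.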

Let us now make a remarkable observation: these two programs have strikingly symmetric, dual, structure. This structure is most easily seen by putting the linear programs in matrix form. The first program, which we call the primal (P), we write as:

\[
\begin{array}{l|rrrrr|l}
\max & 1 & 1 & 0 & 0 & 0 & \\ \hline
 & 1 & 0 & 0 & 0 & 0 & \le 3 \\
 & 0 & 1 & 0 & 0 & 0 & \le 2 \\
 & 0 & 0 & 1 & 0 & 0 & \le 1 \\
 & 0 & 0 & 0 & 1 & 0 & \le 1 \\
 & 0 & 0 & 0 & 0 & 1 & \le 3 \\
 & 1 & 0 & -1 & -1 & 0 & = 0 \\
 & 0 & 1 & 1 & 0 & -1 & = 0 \\ \hline
 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \ge 0 &
\end{array}
\]

Here we have removed the actual variable names, and we have included an additional row at the bottom denoting that all the variables are non-negative. (An unrestricted variable will be denoted by unr.) The second program, which we call the dual (D), we write as:

\[
\begin{array}{l|rrrrrrr|l}
\min & 3 & 2 & 1 & 1 & 3 & 0 & 0 & \\ \hline
 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & \ge 1 \\
 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & \ge 1 \\
 & 0 & 0 & 1 & 0 & 0 & -1 & 1 & \ge 0 \\
 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & \ge 0 \\
 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & \ge 0 \\ \hline
 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \text{unr} & \text{unr} &
\end{array}
\]

Each variable of P corresponds to a constraint of D, and vice-versa. Equality constraints correspond to unrestricted variables (the u’s), and inequality constraints to restricted variables. Minimization becomes maximization. The matrices are transposes of one another, and the roles of right-hand side and objective function are interchanged.

Such LP’s are called dual to each other. It is mechanical, given an LP, to form its dual. Suppose we start with a maximization problem. Change all inequality constraints into ≤ constraints, negating both sides of a constraint if necessary. Then

• transpose the coefficient matrix
• invert maximization to minimization
• interchange the roles of the right-hand side and the objective function
• introduce a nonnegative variable for each inequality, and an unrestricted one for each equality
• for each nonnegative variable introduce a ≥ constraint, and for each unrestricted variable introduce an equality constraint.

If we start with a minimization problem, we instead begin by turning all inequality constraints into ≥ constraints, we make the dual a maximization, and we change the last step so that each nonnegative variable corresponds to a ≤ constraint. Note that it is easy to show from this description that the dual of the dual is the original primal problem!

By the max-flow min-cut theorem, the two LP’s P and D above have the same optimum. In fact, this is true for general dual LP’s! This is the duality theorem, which can be stated as follows (we shall not prove it; the best proof comes from the simplex algorithm, very much as the max-flow min-cut theorem comes from the max-flow algorithm): If an LP has a bounded optimum, then so does its dual, and the two optimal values coincide.
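The recipe for forming a dual is mechanical enough to write down directly. The sketch below handles only the case spelled out in the text: a maximization problem whose variables are all nonnegative and whose constraints are ≤ inequalities or equalities. The function name `dual_of_max_lp` and the tuple encoding of constraints are assumptions made for this example. Applied to the primal max-flow LP (P) of this section, it yields the data of the min-cut LP (D).

```python
def dual_of_max_lp(objective, rows):
    """Dual of: max objective.x  s.t. rows, x >= 0.

    Each row is (coefficients, sense, rhs) with sense "<=" or "=".
    Returns (dual objective, dual rows, kind of each dual variable).
    """
    n_vars = len(objective)
    # Right-hand sides become the objective of the minimization.
    dual_objective = [rhs for _, _, rhs in rows]
    # Transposed matrix: one ">=" constraint per (nonnegative) primal variable,
    # with the primal objective coefficient as right-hand side.
    dual_rows = [([coeffs[j] for coeffs, _, _ in rows], ">=", objective[j])
                 for j in range(n_vars)]
    # Inequalities give nonnegative variables, equalities unrestricted ones.
    kinds = [">=0" if sense == "<=" else "unr" for _, sense, _ in rows]
    return dual_objective, dual_rows, kinds


# The primal (P); variable order f_SA, f_SB, f_AB, f_AT, f_BT.
P_OBJECTIVE = [1, 1, 0, 0, 0]
P_ROWS = [
    ([1, 0, 0, 0, 0], "<=", 3),
    ([0, 1, 0, 0, 0], "<=", 2),
    ([0, 0, 1, 0, 0], "<=", 1),
    ([0, 0, 0, 1, 0], "<=", 1),
    ([0, 0, 0, 0, 1], "<=", 3),
    ([1, 0, -1, -1, 0], "=", 0),   # conservation at A
    ([0, 1, 1, 0, -1], "=", 0),    # conservation at B
]
D_OBJECTIVE, D_ROWS, D_KINDS = dual_of_max_lp(P_OBJECTIVE, P_ROWS)
```

The result has objective `[3, 2, 1, 1, 3, 0, 0]`, five ≥ constraints with right-hand sides 1, 1, 0, 0, 0, and two unrestricted variables at the end, which are the u’s of the min-cut LP.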


Matching
It is often useful to compose reductions. That is, we can reduce a problem A to B, and B to C, and since we know how to solve C, we end up solving A. A good example is the matching problem. Suppose that the bipartite graph shown in Figure 17.4 records the compatibility relation between four boys and four girls. We seek a maximum matching, that is, a set of edges that is as large as possible, and in which no two edges share a node. For example, in Figure 17.4 there is a complete matching (a matching that involves all nodes).

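[Figure: a bipartite graph with the boys Al, Bob, Charlie, Dave on one side and the girls Eve, Fay, Grace, Helen on the other, together with an added source S and sink T.]
Figure 17.4: Reduction from matching to max-flow (all capacities are 1)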

To reduce this problem to max-flow, we create a new source and a new sink, connect the source with all boys and all girls with the sink, and direct all edges of the original bipartite graph from the boys to the girls. All edges have capacity one. It is easy to see that the maximum flow in this network corresponds to the maximum matching.

Well, the situation is slightly more complicated than was stated above: what is easy to see is that the optimum integer-valued flow corresponds to the optimum matching. We would be at a loss interpreting as a matching a flow that ships 0.7 units along the edge Al-Eve! Fortunately, what the algorithm in the previous section establishes is that if the capacities are integers, then the maximum flow is integer. This is because we only deal with integers throughout the algorithm. Hence integrality comes for free in the max-flow problem.

Unfortunately, max-flow is about the only problem for which integrality comes for free. It is a very difficult problem to find the optimum solution (or any solution) of a general linear program with the additional constraint that (some or all of) the variables be integers. We will see why in forthcoming lectures.
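When every capacity is 1, the augmenting-path method specializes to a short routine: an augmenting path either gives a boy a free girl, or re-assigns an already matched girl after finding her current partner someone else, which is exactly the cancellation of one unit of flow. The sketch below follows that idea. The compatibility lists are hypothetical, since the edges of Figure 17.4 are not recoverable from the text, and the function name `maximum_matching` is a choice made here.

```python
def maximum_matching(compatible):
    """compatible maps each boy to the girls he may be matched with.

    Returns a dict girl -> boy describing a maximum matching.
    """
    match = {}  # girl -> boy currently assigned to her

    def try_assign(boy, visited):
        """Search for an augmenting path starting at an unmatched boy."""
        for girl in compatible[boy]:
            if girl in visited:
                continue
            visited.add(girl)
            # Either the girl is free, or her partner can be moved elsewhere.
            if girl not in match or try_assign(match[girl], visited):
                match[girl] = boy
                return True
        return False

    for boy in compatible:
        try_assign(boy, set())
    return match


# Hypothetical compatibility relation (not the one drawn in Figure 17.4).
COMPATIBLE = {
    "Al": ["Eve", "Fay"],
    "Bob": ["Eve"],
    "Charlie": ["Fay", "Grace", "Helen"],
    "Dave": ["Grace"],
}
MATCHING = maximum_matching(COMPATIBLE)
```

For this input all four boys are matched. Al is first paired with Eve and then moved to Fay so that Bob, who is compatible only with Eve, can be matched too.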


Games
We can represent various situations of conflict in life in terms of matrix games. For example, the game shown below is the rock-paper-scissors game. The Row player chooses a row strategy, the Column player chooses a column strategy, and then Column pays to Row the value at the intersection (if it is negative, Row ends up paying Column).

\[
\begin{array}{c|rrr}
 & r & p & s \\ \hline
r & 0 & -1 & 1 \\
p & 1 & 0 & -1 \\
s & -1 & 1 & 0
\end{array}
\]

Games do not necessarily have to be symmetric (that is, Row and Column have the same strategies, or, in terms of matrices, $A = -A^T$). For example, in the following fictitious Clinton-Dole game the strategies may be the issues on which a candidate for office may focus (the initials stand for “economy,” “society,” “morality,” and “tax-cut”) and the entries are the number of voters lost by Column.

\[
\begin{array}{c|rr}
 & m & t \\ \hline
e & 3 & -1 \\
s & -2 & 1
\end{array}
\]

We want to explore how the two players may play these games “optimally”. It is not clear what this means. For example, in the first game there is no such thing as an optimal “pure” strategy (it very much depends on what your opponent does; similarly in the second game). But suppose that you play this game repeatedly. Then it makes sense to randomize. That is, consider a game given by an $m \times n$ matrix $G_{ij}$; define a mixed strategy for the row player to be a vector $(x_1, \ldots, x_m)$, such that $x_i \ge 0$, and $\sum_{i=1}^m x_i = 1$. Intuitively, $x_i$ is the probability with which Row plays strategy i. Similarly, a mixed strategy for Column is a vector $(y_1, \ldots, y_n)$, such that $y_j \ge 0$, and $\sum_{j=1}^n y_j = 1$.

Suppose that, in the Clinton-Dole game, Row decides to play the mixed strategy $(0.5, 0.5)$. What should Column do? The answer is easy: If the $x_i$’s are given, there is a pure strategy (that is, a mixed strategy with all $y_j$’s zero except for one) that is optimal. It is found by comparing the n numbers $\sum_{i=1}^m G_{ij} x_i$, for $j = 1, \ldots, n$ (in the Clinton-Dole game, Column would compare 0.5 with 0, and of course choose the smallest; remember, the entries denote what Column pays). That is, if Column knew Row’s mixed strategy, s/he would end up paying the smallest among the n outcomes $\sum_{i=1}^m G_{ij} x_i$, for $j = 1, \ldots, n$. On the other hand, Row will seek the mixed strategy that maximizes this minimum; that is,

\[
\max_x \min_j \sum_{i=1}^m G_{ij} x_i .
\]

This maximum would be the best possible guarantee about an expected outcome that Row can have by choosing a mixed strategy. Let us call this guarantee z; what Row is trying to do is solve the following LP:

\[
\begin{aligned}
\max\quad & z \\
& z - 3x_1 + 2x_2 \le 0 \\
& z + x_1 - x_2 \le 0 \\
& x_1 + x_2 = 1 \\
& x_1, x_2 \ge 0
\end{aligned}
\]

Symmetrically, it is easy to see that Column would solve the following LP:

\[
\begin{aligned}
\min\quad & w \\
& w - 3y_1 + y_2 \ge 0 \\
& w + 2y_1 - y_2 \ge 0 \\
& y_1 + y_2 = 1 \\
& y_1, y_2 \ge 0
\end{aligned}
\]

The crucial observation now is that these LP’s are dual to each other, and hence have the same optimum, call it V.

Let us summarize: By solving an LP, Row can guarantee an expected income of at least V, and by solving the dual LP, Column can guarantee an expected loss of at most the same value. It follows that this is the uniquely defined optimal play (it was not a priori certain that such a play exists). V is called the value of the game. In this case, the optimum mixed strategy for Row is $(3/7, 4/7)$, and for Column $(2/7, 5/7)$, with a value of $1/7$ for the Row player.

The existence of mixed strategies that are optimal for both players and achieve the same value is a fundamental result in Game Theory called the min-max theorem. It can be written in equations as follows:

\[
\max_x \min_y \sum x_i y_j G_{ij} = \min_y \max_x \sum x_i y_j G_{ij}
\]

It is surprising, because the left-hand side, in which Column optimizes last, and therefore has presumably an advantage, should be intuitively smaller than the right-hand side, in which Column decides first. Duality equalizes the two, as it does in max-flow min-cut.
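The stated optimum of the Clinton-Dole game can be checked with exact arithmetic. The sketch below computes, for a given mixed strategy, the outcome against the opponent's best pure reply: the minimum over columns for Row, the maximum over rows for Column. It verifies the quoted strategies rather than solving the LPs. The payoff matrix is the one for rows e, s and columns m, t; the function names are illustrative.

```python
from fractions import Fraction as F

# G[i][j]: amount Column pays Row; rows e, s and columns m, t.
G = [[3, -1],
     [-2, 1]]


def row_guarantee(G, x):
    """Expected income Row secures with mixed strategy x (Column replies best)."""
    return min(sum(G[i][j] * x[i] for i in range(len(G)))
               for j in range(len(G[0])))


def column_guarantee(G, y):
    """Expected loss Column cannot exceed with mixed strategy y (Row replies best)."""
    return max(sum(G[i][j] * y[j] for j in range(len(G[0])))
               for i in range(len(G)))


ROW_OPT = [F(3, 7), F(4, 7)]
COL_OPT = [F(2, 7), F(5, 7)]
```

Both `row_guarantee(G, ROW_OPT)` and `column_guarantee(G, COL_OPT)` equal 1/7, so neither player can do better and 1/7 is the value of the game. For the even mix (1/2, 1/2) Row's guarantee drops to 0, which is the comparison of 0.5 with 0 made in the text.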


Circuit Evaluation

[Figure: a Boolean circuit built from OR, AND, and NOT gates over input gates with the values T, F, F, T.]
Figure 17.5: A Boolean circuit

We have seen many interesting and diverse applications of linear programming. In some sense, the next one is the ultimate application. Suppose that we are given a Boolean circuit, that is, a DAG of gates, each of which is either an input gate (indegree zero, and has a value T or F), or an OR gate (indegree two), or an AND gate (indegree two), or a NOT gate (indegree one). One of them is designated as the output gate. We wish to tell if this circuit evaluates (following the laws of Boolean values bottom-up) to T. This is known as the circuit value problem.

There is a very simple and automatic way of translating the circuit value problem into an LP: for each gate g we have a variable $x_g$. For all gates we have $0 \le x_g \le 1$. If g is a T input gate, we have the equation $x_g = 1$; if it is F, $x_g = 0$. If it is an OR gate, say of the gates h and h′, then we have the inequality $x_g \le x_h + x_{h'}$. If it is an AND gate of h and h′, we have the inequalities $x_g \le x_h$, $x_g \le x_{h'}$ (notice the difference). For a NOT gate we say $x_g = 1 - x_h$. Finally, we want to max $x_o$, where o is the output gate. It is easy to see that the optimum value of $x_o$ will be 1 if the circuit value is T, and 0 if it is F.

This is a rather straightforward reduction to LP, from a problem that may not seem very interesting or hard at first. However, the circuit value problem is in some sense the most general problem solvable in polynomial time! Here is a justification of this statement: after all, a polynomial time algorithm runs on a computer, and the computer is ultimately a Boolean combinational circuit implemented on a chip. Since the algorithm runs in polynomial time, it can be rendered as a circuit consisting of polynomially many superpositions of the computer’s circuit. Hence, the fact that the circuit value problem reduces to LP means that all polynomially solvable problems do!

In our next topic, Complexity and NP-completeness, we shall see that a class that contains many hard problems reduces, much the same way, to integer programming.
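To make the circuit value problem concrete, here is a small sketch that evaluates a circuit bottom-up and then checks that the resulting 0/1 values satisfy the gate constraints of the LP translation. It only shows that the Boolean evaluation is a feasible point of that LP; it does not solve the LP. The circuit `CIRCUIT` is a made-up example rather than the one in Figure 17.5, and the tuple encoding of gates is an assumption of this sketch.

```python
def evaluate(circuit, gate, values=None):
    """Evaluate `gate` bottom-up, caching each gate's Boolean value in `values`."""
    if values is None:
        values = {}
    if gate not in values:
        kind, *inputs = circuit[gate]
        if kind == "IN":
            values[gate] = inputs[0]
        elif kind == "NOT":
            values[gate] = not evaluate(circuit, inputs[0], values)
        elif kind == "AND":
            values[gate] = all([evaluate(circuit, h, values) for h in inputs])
        elif kind == "OR":
            values[gate] = any([evaluate(circuit, h, values) for h in inputs])
        else:
            raise ValueError("unknown gate type: " + kind)
    return values[gate]


def satisfies_lp(circuit, values):
    """Check the LP constraints of every gate for x_g = 1 (T) or 0 (F)."""
    x = {g: int(v) for g, v in values.items()}
    for g, (kind, *inputs) in circuit.items():
        if kind == "IN":
            ok = x[g] == int(inputs[0])
        elif kind == "NOT":
            ok = x[g] == 1 - x[inputs[0]]
        elif kind == "AND":
            ok = all(x[g] <= x[h] for h in inputs)
        else:  # "OR"
            ok = x[g] <= sum(x[h] for h in inputs)
        if not (ok and 0 <= x[g] <= 1):
            return False
    return True


# Hypothetical circuit: out = AND(NOT(AND(a, b)), OR(c, d)).
CIRCUIT = {
    "a": ("IN", True), "b": ("IN", False),
    "c": ("IN", False), "d": ("IN", True),
    "g1": ("AND", "a", "b"),
    "g2": ("OR", "c", "d"),
    "g3": ("NOT", "g1"),
    "out": ("AND", "g3", "g2"),
}
VALUES = {}
for name in CIRCUIT:
    evaluate(CIRCUIT, name, VALUES)
```

For this circuit the output gate evaluates to T, and `satisfies_lp(CIRCUIT, VALUES)` holds, so the Boolean values give a feasible LP point with $x_o = 1$.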

CS124

NP-Completeness Review

Up to this point, we have generally assumed that if we were given a problem, we could ﬁnd a way to solve it. Unfortunately, as most of you know, there are many fundamental problems for which we have no efﬁcient algorithms. In fact, by classifying these hard problems, we can show that there is a large class of simple problems for which there is (probably) no efﬁcient algorithm– the NP-complete problems. Moreover, if you could design an efﬁcient algorithm for any one of these problems, you could design an algorithm for all of them! It’s an all or none proposition, so if you could solve just one of them, you would become rich and famous overnight. These notes will review the main concepts behind the theory of NP-complete problems. One might ask why it is important to study what problems we cannot solve, instead of focusing on problems we can solve. Especially for an algorithms course. There are several possible responses, but perhaps the best is that if you do not know what is impossible, you might waste a great deal of time trying to solve it, instead of coming to terms with its impossibility and ﬁnding suitable alternatives (such as, for example, approximations instead of exact answers).

Lecture 18

Polynomial Running Times
The faster the running time, the better. Linear is great, quadratic is all right, cubic is perhaps a bit slow. But how exactly should we classify which problems have efficient algorithms? Where is the cut off point? The choice computer scientists have made is to group together all problems that are solvable in polynomial time. That is, we define a class of problems P as follows:

Definition: P is the set of all problems Z with a yes-no answer such that there is an algorithm A and a positive integer k such that A solves Z in $O(n^k)$ steps (on inputs of size n).

Let us clarify some points in the definition. The restriction to problems with a yes-no answer is really just a technical convenience. For example, the problem of finding the minimum spanning tree (on a graph with integer weights) can be recast as the problem of answering the following question: is the size of the minimum spanning tree at least j? If you can answer one question, you can answer the other; considering only yes-no problems proves more convenient.

From the definition, all problems with linear, quadratic, or cubic time algorithms are in P. But so are problems with algorithms that require time $\Theta(n^{100})$. This may seem a little strange; for example, would a problem with an algorithm that runs in time $\Theta(n^{100})$ really be said to have an efficient solution? But the main point of defining the class P is to separate these problems from those that require exponential time, or $\Omega(2^{n^{\varepsilon}})$ steps (for some $\varepsilon > 0$). Problems that require this much time to solve are clearly asymptotically inefficient, compared with polynomial time algorithms. The class P is also useful because, as we shall see below, it is closed under polynomial time reductions.
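The remark that a yes-no version is as good as the original problem can be illustrated with binary search: a procedure that answers "is the optimum at least j?" recovers the optimum itself with a logarithmic number of questions. In the sketch below the name `optimum_from_decisions` is hypothetical and the oracle is a stand-in closure around a hidden number, not an actual minimum spanning tree computation.

```python
def optimum_from_decisions(at_least, low, high):
    """Largest j in [low, high] with at_least(j) true.

    Assumes at_least(low) is true and that at_least is monotone:
    once it answers "no" for some j, it answers "no" for all larger j.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if at_least(mid):
            low = mid          # the optimum is mid or larger
        else:
            high = mid - 1     # the optimum is below mid
    return low


# Stand-in oracle: pretend the unknown optimum is 17 and count the questions.
QUESTIONS = []


def oracle(j):
    QUESTIONS.append(j)
    return 17 >= j


RESULT = optimum_from_decisions(oracle, 0, 100)
```

Here `RESULT` is 17 after at most 7 questions for a range of 101 candidate values. In general the number of questions grows with the number of bits of the range, so it stays polynomial in the input size.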


Reductions
Let A and B be two problems whose instances require a “yes” or “no” answer. (For example, 2SAT is such a problem, as is the question of whether a bipartite graph has a perfect matching.) A (polynomial time) reduction from A to B is a polynomial time algorithm R which transforms an input of problem A into an input for problem B. That is, given an input x to problem A, R will produce an input $R(x)$ to problem B, such that the answer to x is yes for problem A if and only if the answer for $R(x)$ is yes for problem B.

This idea of reduction should not seem unfamiliar; all along we have seen the idea of reducing one problem to another. (For example, we recently saw how to reduce the matching problem into the max-flow problem, which could be reduced to linear programming.) The only difference is, right now, for convenience we are only considering yes-no type problems.

A reduction from A to B, together with a polynomial time algorithm for B, yields a polynomial time algorithm for A. (See Figure 18.1.) Let us explain this in more detail. For any input x of A of size n, the reduction R takes time $p(n)$, where p is a polynomial, to produce an input $R(x)$ for B. This input $R(x)$ can have size at most $p(n)$, since this is the largest input R could possibly construct in $p(n)$ time! We now submit $R(x)$ as an input to the algorithm for B, which we assume runs in time $q(m)$ on inputs of size m, where q is another polynomial. The algorithm for B gives us the right answer for B on $R(x)$, and hence also the right answer for A on x. The total time taken was at most $p(n) + q(p(n))$, which is itself just a polynomial in n!

This idea of reduction explains why the class P is so useful. If we have a problem A in P, and some other problem B reduces to it, then B is in P as well. Hence we say that P is closed under polynomial time reductions. If we can reduce A to B, we are essentially establishing that, give or take a polynomial, A is no harder than B. We can write this as

\[
A \le B
\]

where the inequality represents a fact about the complexities of the two problems. If we know that B is easy, then $A \le B$ establishes that A is easy.

We can also look at this inequality the other way. If we know that A is hard, then the inequality establishes that B is hard. It is this implication that we will now use, to show that problems are hard. This way of using reductions is very different from the way we have used reductions so far; it is also much more sophisticated and counter-intuitive.

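[Figure: an input x for A is passed through the reduction R to give R(x), an input for B; the algorithm for B answers yes/no on R(x), and that output for B is also the output for A. The reduction followed by the algorithm for B forms the algorithm for A.]
Figure 18.1: Reductions lead to algorithms.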
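The pipeline of Figure 18.1 is just function composition. The sketch below uses two deliberately simple, hypothetical yes-no problems so that every piece is easy to check: problem A asks whether a list contains a repeated element, problem B asks whether a list has two equal adjacent entries, and the reduction R sorts the list. The answer to x for A is yes exactly when the answer to R(x) for B is yes, and both R and the algorithm for B run in polynomial time.

```python
def reduction(items):
    """R: turn an input of problem A into an input of problem B."""
    return sorted(items)


def solve_b(items):
    """Algorithm for B: are two adjacent entries equal?"""
    return any(left == right for left, right in zip(items, items[1:]))


def solve_a(items):
    """Algorithm for A obtained by composing R with the algorithm for B."""
    return solve_b(reduction(items))
```

For example, `solve_a([3, 1, 4, 1, 5])` answers yes because sorting brings the two 1s together, while `solve_a([3, 1, 4])` answers no.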


Short Certiﬁcates and the Class NP
We will now begin to examine a class of problems that includes several “hard” problems. What we mean by “hard” in this setting is that although nobody has yet shown that there are no polynomial time algorithms to solve these problems, there is overwhelming evidence that this is the case.

Recall that the class P is the class of yes-no problems that can be solved in polynomial time. The new class we define, NP, consists of yes-no problems with a different property: if the answer to the problem is yes, then there is a short certificate that can be checked to show that the answer is correct. A bit more formally, a short certificate must have the following properties:

• It must be short: the length of the certificate is no more than polynomial in the length of the input.
• It must certify: there is a polynomial time checker (an algorithm!) that takes the input and the short certificate and checks that the certificate is valid.

The idea of the short certificate is the following: a problem is in NP if someone else can convince you in polynomial time that the answer is yes when the answer is yes, and they cannot fool you into thinking the answer is yes when the answer is no. Let us move from the abstract to some specific problems.

Compositeness: Testing whether a number is composite is in NP, since if somebody wanted to convince you a number is composite, they could give you a factorization of it into smaller numbers (the short certificate). You could then check that the factorization was correct by doing the multiplication, in polynomial time. (Notice you can’t be fooled!)

3SAT: 3SAT is like the 2SAT problem we have seen in the homework, except that there can be up to three literals in each clause. 3SAT is in NP, since if somebody wanted to convince you that a formula is satisfiable, they could give you a satisfying truth assignment (the short certificate). You could then check the proposed truth assignment in polynomial time by plugging it in and checking each clause. (Again, notice you can’t be fooled!)

Finally, note that P is a subset of NP. To see why, note that if a problem is in P, we don’t even need a short certificate; someone can convince themselves of the correct answer just by running the polynomial time algorithm!

Now, let us see an example of a problem which does not appear to have short certificates.

not-satisfiable-3SAT: This is like 3SAT, but now the answer is yes if there is no satisfying assignment for the formula. Given a formula with no solution, how can we convince people there is no solution? The obvious way is to list all possible truth assignments, and show that they do not work, but this would not yield a short certificate.
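The two certificate checks described for Compositeness and 3SAT can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the representation of a literal as a (variable, is_positive) pair are assumptions made here, not notation from the course.

```python
def check_composite(n, factors):
    """Checker for Compositeness.

    Assumed certificate format: a list of at least two integers, each
    greater than 1, whose product is claimed to equal n.
    """
    if len(factors) < 2 or any(f <= 1 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f  # one multiplication per factor: polynomial time
    return product == n


def check_3sat(clauses, assignment):
    """Checker for 3SAT.

    Assumed formats: a clause is a list of (variable, is_positive) pairs,
    and the certificate is a dict mapping each variable to True or False.
    """
    return all(
        any(assignment.get(var) == is_positive for var, is_positive in clause)
        for clause in clauses
    )
```

Both checkers run in time polynomial in the input size, and neither can be made to answer yes on a no instance, whatever certificate it is handed.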

NP-completeness
The “hard” problems we will be looking at will be the hardest problems in NP; we call these problems NP-complete. An NP-complete problem will have two properties:

• it is in NP
• all other problems in NP reduce to it

Thus, our concept of “being the hardest” is based on reductions. If all other problems in NP reduce to a problem, it must be at least as hard as any of them! It may seem surprising that there are problems in NP that have this property. We will start by proving (well, sketching a proof) that an easily stated problem, circuit SAT, is NP-complete. Once we have a first problem done, it will turn out to be much easier to prove that other problems are NP-complete:

Claim 18.1 Suppose problem A is NP-complete, problem B is in NP, and problem A reduces to problem B. Then problem B is NP-complete.

Intuitively, this must be true because if A reduces to B, then B is at least as hard as A. So as long as B is in NP, and the hardest problems in NP are the NP-complete ones, then B must also be NP-complete. Slightly more formally, we have to show that every problem in NP reduces to B. But we already know that every problem in NP reduces to A, and A reduces to B. By combining reductions, as in the picture below, we have that every problem in NP reduces to B. So once we have one problem, we can start building up “chains” of NP-complete problems easily.

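[Figure: an input x for A is passed through the reduction R to give R(x), an input for B. The algorithm for B then produces a yes/no output for B, which is also the output for A. Taken together, the reduction and the algorithm for B form an algorithm for A.]
Figure 18.2: If C reduces to A, and A reduces to B, then C reduces to B. (Transitivity!)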

Cook’s Theorem
The problem circuit SAT is deﬁned as follows: given a Boolean circuit and the values of some of its inputs, is there a way to set the rest of its inputs so that the output is T? It is easy to show that circuit SAT is in NP.

Claim 18.2 A problem is in NP if and only if it can be reduced to circuit SAT.

This statement is known as Cook’s theorem, and it is one of the most important results in Computer Science.

One direction is easy. If a problem A can be reduced to circuit SAT, it can easily be shown to be in NP. A short certificate for an input to problem A consists of the short certificate for the circuit that results from running the reduction from A to circuit SAT on the input. Given this short certificate, a polynomial time algorithm could run the reduction on the input to A to get the appropriate circuit, and then use the short certificate to check the circuit.

The other direction is more complicated, so we offer a somewhat informal explanation. Suppose that we have a problem A in NP. We need to show that it reduces to circuit SAT. Since A is in NP, there is a polynomial time algorithm that checks the validity of inputs of A together with the appropriate certificates. But we could program this algorithm on a computer, and this program would really be just a huge Boolean circuit. (After all, computers are just big Boolean circuits themselves!) The input to this circuit is the input to problem A along with a short certificate. Now suppose we are given a specific instance x of A. The question of whether x is a yes instance or a no instance is exactly the question of whether there is an appropriate short certificate, which is exactly the same question as asking if there is some way of setting the rest of the inputs to the Boolean circuit so that the answer is T. Hence, the construction of the circuit we described is the sought reduction from A to circuit SAT!

More NP-complete problems
Now that we have proved that circuit SAT is NP-complete, we will build on this to find other NP-complete problems. For example, we will now show that circuit SAT reduces to 3SAT, and since 3SAT is clearly in NP, this shows that 3SAT is NP-complete.

Suppose we are given a circuit C with some input gates unset. We must (quickly, in polynomial time) construct from this circuit a 3SAT formula R(C) which is satisfiable if and only if there is a satisfying assignment of the circuit inputs. In essence, we want to mimic the actions of the circuit with a suitable formula. The formula R(C) will have one variable for each gate (that is, each input, and each output of an AND, OR, or NOT), and each gate will also lead to certain clauses, as described below (¬x denotes the negation of x):

1. If x is a T input gate, then add the clause (x).
2. If x is a F input gate, then add the clause (¬x).
3. If x is an unknown input gate, then no clauses are added for it.
4. If x is the OR of gates y and z, then add the clauses (¬y ∨ x), (¬z ∨ x), and (¬x ∨ y ∨ z). (It is easy to see that the conjunction of these clauses is equivalent to x = (y ∨ z).)
5. If x is the AND of gates y and z, then add the clauses (¬x ∨ y), (¬x ∨ z), and (¬y ∨ ¬z ∨ x). (It is easy to see that the conjunction of these clauses is equivalent to x = (y ∧ z).)
6. If x is the NOT of gate y, then add the clauses (x ∨ y) and (¬x ∨ ¬y). (It is easy to see that the conjunction of these clauses is equivalent to x = ¬y.)
7. Finally, if gate x is the output gate, add the clause (x), expressing the condition that the output gate should be T.

The conjunction of all of these clauses yields the formula R(C). It should be apparent that this reduction R can be accomplished in polynomial (in fact, in linear) time. To verify it is a valid reduction, we must now show that C has a setting of the unknown input gates that makes the output T if and only if R(C) is satisfiable.

Suppose C has a valid setting. Then we claim R(C) can be satisfied by the truth assignment that gives each variable the same value as the appropriate gate when C is run on this valid setting. This truth assignment must satisfy all the clauses of R(C), since we constructed R(C) to compute the same values as the circuit. Note that the output gate is T for C, and hence the final clause listed above is also satisfied.

Conversely (and this is more subtle!), if there is a satisfying truth assignment for R(C), then there is a valid setting for the inputs of C that makes the output T. Just set the unknown input gates in the manner prescribed by the truth assignment for R(C). Since R(C) effectively mimics the computation of the circuit, we know the output gate must be T when these inputs are applied.
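The gate-by-gate construction of R(C) is mechanical enough to sketch in code. The following Python sketch assumes a hypothetical circuit encoding chosen for illustration: a dict from gate name to a tuple such as ("T",), ("F",), ("IN",) for an unknown input, ("OR", y, z), ("AND", y, z), or ("NOT", y). A literal is a (variable, is_positive) pair.

```python
def circuit_to_3sat(gates, output):
    """Build the clauses of R(C), one variable per gate.

    `gates` and the tuple encoding of gates are assumptions of this sketch.
    """
    clauses = []
    for x, gate in gates.items():
        kind = gate[0]
        if kind == "T":                      # x must be true
            clauses.append([(x, True)])
        elif kind == "F":                    # x must be false
            clauses.append([(x, False)])
        elif kind == "IN":                   # unknown input: no clauses
            pass
        elif kind == "OR":                   # x = (y OR z)
            _, y, z = gate
            clauses.append([(y, False), (x, True)])
            clauses.append([(z, False), (x, True)])
            clauses.append([(x, False), (y, True), (z, True)])
        elif kind == "AND":                  # x = (y AND z)
            _, y, z = gate
            clauses.append([(x, False), (y, True)])
            clauses.append([(x, False), (z, True)])
            clauses.append([(y, False), (z, False), (x, True)])
        elif kind == "NOT":                  # x = NOT y
            _, y = gate
            clauses.append([(x, True), (y, True)])
            clauses.append([(x, False), (y, False)])
        else:
            raise ValueError(f"unknown gate type: {kind}")
    clauses.append([(output, True)])         # the output gate should be T
    return clauses
```

Each gate contributes at most three clauses of at most three literals each, which is why the reduction runs in linear time.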

From 3SAT to Integer Linear Programming

We must take a 3SAT formula and convert it to an integer linear program. This reduction is easy. Restrict all variables so that they are either 0 or 1 by including the constraint 0 ≤ x ≤ 1. Now a clause such as (x ∨ ¬y ∨ z) can be turned into a linear constraint by replacing ∨ by +, a literal x by x, and a literal ¬x by (1 − x), and then forcing the whole thing to be at least 1. For example, the above clause becomes x + (1 − y) + z ≥ 1. The appropriate clause is clearly satisfied if and only if this constraint is; all terms on the left of the inequality are either 0 or 1, and there is at least one 1 if and only if one of the literals of the clause is true.

It is somewhat strange that linear programming can be solved in polynomial time, but when we try to restrict the solutions to be integers, then the problem appears not to be solvable in polynomial time (since it is NP-complete).
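As a small illustration of the clause-to-constraint translation, here is a Python sketch. The helper names and the (variable, is_positive) literal encoding are assumptions for this example. The constant terms coming from negated literals are moved to the right-hand side, so the clause (x ∨ ¬y ∨ z) becomes x − y + z ≥ 0, which is the same constraint as x + (1 − y) + z ≥ 1.

```python
def clause_to_constraint(clause):
    """Turn a clause into a linear constraint  sum(coeffs[v] * v) >= bound.

    A positive literal x contributes +x; a negated literal contributes
    (1 - x), i.e. -x on the left and a constant 1 moved to the right.
    """
    coeffs = {}
    constants = 0
    for var, is_positive in clause:
        if is_positive:
            coeffs[var] = coeffs.get(var, 0) + 1
        else:
            coeffs[var] = coeffs.get(var, 0) - 1
            constants += 1
    return coeffs, 1 - constants


def holds(coeffs, bound, values):
    """Check the constraint for a 0/1 assignment given as a dict."""
    return sum(c * values[v] for v, c in coeffs.items()) >= bound
```

The 0 ≤ x ≤ 1 bounds and the integrality requirement are not represented here; they would be passed separately to whatever integer programming solver is used.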

From 3SAT to Independent Set

In an input to Independent Set we are given a graph G = (V, E) and an integer K. We are asked if there is a set I ⊆ V with |I| ≥ K such that if u, v ∈ I then (u, v) ∉ E. That is, we are asked to find a set of vertices of size at least K such that no two are connected by an edge. The problem is clearly in NP. (Why?)

We reduce 3SAT to Independent Set. That is, given a Boolean formula φ with at most 3 literals in each clause, we must (in polynomial time) come up with a graph G = (V, E) and an integer K so that G has an independent set of size K or more if and only if the formula φ is satisfiable.

The reduction is illustrated in Figure 18.3. For each clause, we have a group of vertices, one for each literal in the clause, connected by all possible edges. Between groups of vertices, we connect two vertices if they correspond to opposite literals (like x and ¬x). We let K be the number of clauses. This completes the reduction, and it is clear that it can be accomplished in polynomial time. We now show there is a satisfying truth assignment for φ if and only if there is an independent set of size at least K.

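[Figure: each clause of the formula is drawn as a group of fully connected vertices, one vertex per literal; additional edges join opposite literals in different groups.]
Figure 18.3: Turning formulae into graphs.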

If there is a satisfying truth assignment for φ, then there is at least one true literal in each clause. Pick just one for each clause in any way. The set I of corresponding vertices must give an independent set of size K. This is because we use only one vertex per clause, so the only way I could fail to be independent is if it included two opposite literals, which is impossible, because the satisfying assignment cannot set two opposite literals to T.

Now suppose G has an independent set I of size K. Since there are K groups, and each group is completely interconnected, there must be exactly one vertex from each group in I. Consider the assignment that sets all literals in I to T, their opposites to F, and any unused variables arbitrarily. This is a valid truth assignment: since I is independent, it never contains two opposite literals, so no variable is asked to be both T and F. It satisfies φ, because every clause contains a literal in I, and that literal is set to T.
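The graph construction in this reduction can be sketched as follows. The encoding is an assumption made for illustration: clauses are lists of (variable, is_positive) literals, and a vertex is the pair (clause index, position within the clause). The brute-force checker is exponential and is included only so the sketch can be tried on tiny formulas.

```python
from itertools import combinations


def sat_to_independent_set(clauses):
    """Return (vertices, edges, K) for the 3SAT -> Independent Set reduction."""
    vertices = [(i, j) for i, clause in enumerate(clauses)
                for j in range(len(clause))]
    edges = set()
    for u, v in combinations(vertices, 2):
        var_u, pos_u = clauses[u[0]][u[1]]
        var_v, pos_v = clauses[v[0]][v[1]]
        same_group = u[0] == v[0]                       # same clause
        opposite = var_u == var_v and pos_u != pos_v    # x and not-x
        if same_group or opposite:
            edges.add((u, v))
    return vertices, edges, len(clauses)


def has_independent_set(vertices, edges, k):
    """Brute-force search over all k-subsets; for tiny examples only."""
    for subset in combinations(vertices, k):
        chosen = set(subset)
        if not any(u in chosen and v in chosen for u, v in edges):
            return True
    return False
```

For example, the formula (x ∨ y) ∧ (¬x) is satisfiable and yields a graph with an independent set of size 2, while adding the clause (¬y) makes the formula unsatisfiable and leaves no independent set of size 3.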

From Independent Set to Vertex Cover and Clique

Let G = (V, E) be a graph. A vertex cover of G is a set C ⊆ V such that all edges in E have at least one endpoint in C. That is, each edge is adjacent to at least one vertex in the vertex cover. The Vertex Cover problem is, given a graph G and a number K, to determine if G has a vertex cover of size at most K.

The reduction from Independent Set to Vertex Cover is immediate from the following observation: C is a vertex cover of G = (V, E) if and only if V − C is an independent set! (For example, suppose I is an independent set, and consider some edge (u, v). Both u and v can’t be in the independent set, so V − I contains either u or v or both, and the edge is covered.) So the reduction is trivial; given an instance (G, K) of Independent Set, we produce the instance (G, |V| − K) of Vertex Cover.

A clique in a graph is a set of fully connected nodes: every possible edge between every pair of the nodes is there. The clique problem asks whether there is a clique of size K or larger in the graph. Again, the reduction from Independent Set is immediate from a simple observation. Let Ḡ be the complement of G, which is the graph with the same nodes as G, but the edges of Ḡ are precisely those edges that are missing from G. Then C is a clique in Ḡ = (V, Ē) if and only if C is an independent set in G. (See Figure 18.4.)

Figure 18.4: Independent sets become cliques in the complement.
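Both reductions from Independent Set amount to a line or two of code. The sketch below uses an assumed representation (a list of vertices and a list of undirected edges as pairs); the function names are illustrative only.

```python
from itertools import combinations


def is_independent_set(edges, s):
    return not any(u in s and v in s for u, v in edges)


def is_vertex_cover(edges, c):
    return all(u in c or v in c for u, v in edges)


def is_clique(edges, s):
    present = {frozenset(e) for e in edges}
    return all(frozenset(pair) in present for pair in combinations(s, 2))


def complement_edges(vertices, edges):
    """Edges of the complement graph: exactly the pairs missing from G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(vertices, 2)
            if frozenset((u, v)) not in present]


def independent_set_to_vertex_cover(vertices, edges, k):
    # Same graph; an independent set of size >= k leaves a cover of size <= |V| - k.
    return vertices, edges, len(vertices) - k


def independent_set_to_clique(vertices, edges, k):
    # Complement graph; independent sets of G are cliques of the complement.
    return vertices, complement_edges(vertices, edges), k
```

On the path 1 - 2 - 3, the set {1, 3} is independent, its complement {2} is a vertex cover, and {1, 3} is a clique in the complement graph, whose only edge is (1, 3).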

CS124

Lecture 19

We have defined the class of NP-complete problems, which have the property that if there is a polynomial time algorithm for any one of these problems, there is a polynomial time algorithm for all of them. Unfortunately, nobody has found a polynomial time algorithm for any NP-complete problem, and it is widely believed that it is impossible to do so. This might seem like a big hint that we should just give up, and go back to solving problems like MAX-FLOW, where we can find a polynomial time solution. Unfortunately, NP-complete problems show up all the time in the real world, and people want solutions to these problems. What can we do?

What Can We Do?
Actually, there is a great deal we can do. Here are just a few possibilities:

• Restrict the inputs. NP-completeness refers to the worst case inputs for a problem. Often, inputs are not as bad as those that arise in NP-completeness proofs. For example, although the general SAT problem is hard, we have seen that the cases of 2SAT and Horn formulae have simple polynomial time algorithms.
• Provide approximations. NP-completeness results often arise because we want an exact answer. If we relax the problem so that we only have to return a good answer, then we might be able to develop a polynomial time algorithm. For example, we have seen that a greedy algorithm provides an approximate answer for the SET COVER problem.
• Develop heuristics. Sometimes we might not be able to make absolute guarantees, but we can develop algorithms that seem to work well in practice, and have arguments suggesting why they should work well. For example, the simplex algorithm for linear programming is exponential in the worst case, but in practice it’s generally the right tool for solving linear programming problems.
• Use randomness. So far, all our algorithms have been deterministic; they always run the same way on the same input. Perhaps if we let our algorithm do some things randomly, we can avoid the NP-completeness problem? Actually, the question of whether one can use randomness to solve an NP-complete problem is still open, though it appears unlikely. (As is, of course, the problem of whether one can solve an NP-complete problem in polynomial time!) However, randomness proves a useful tool when we try to come up with approximation algorithms and heuristics. Also, if one can assume the input comes from a suitable “random distribution”, then often one can develop an algorithm that works well on average.

To begin, we will look at heuristic methods. The amount we can prove about these methods is (as yet) very limited. However, these techniques have had some success in practice, and there are arguments for why they are a reasonable thing to try for some problems.

Local Search
“Local search” is meant to represent a large class of similar techniques that can be used to find a good solution for a problem. The idea is to think of the solution space as being represented by an undirected graph. That is, each possible solution is a node in the graph. An edge in the graph represents a possible move we can make between solutions.

For example, consider the Number Partition problem from the homework assignment. Each possible solution, or division of the set of numbers into two groups, would be a vertex in the graph of all possible solutions. For our possible moves, we could move between solutions by changing the sign associated with a number. So in this case, in our graph of all possible solutions, we have an edge between any two possible solutions that differ in only one sign. Of course this graph of all possible solutions is huge; there are 2^n possible solutions when there are n numbers in the original problem! We could never hope to even write this graph down. The idea of local search is that we never actually try to write the whole graph down; we just move from one possible solution to a “nearby” possible solution, either for as long as we like, or until we happen to find an optimal solution.

To set up a local search algorithm, we need to have the following:

1. A set of possible solutions, which will be the vertices in our local search graph.
2. A notion of what the neighbors of each vertex in the graph are. For each vertex x, we will call the set of adjacent vertices N(x). The neighbors must satisfy several properties: N(x) must be easy to compute from x (since if we try to move from x we will need to compute the neighbors), if y ∈ N(x) then x ∈ N(y) (so it makes sense to represent neighbors as undirected edges), and N(x) cannot be too big, or more than polynomial in the input size (so that the neighbors of a node are easy to search through).
3. A cost function f, from possible solutions to the real numbers.

The most basic local search algorithm (say to minimize the cost function) is easily described:

1. Pick a starting point x.
2. While there is a neighbor y of x with f(y) < f(x), move to it; that is, set x to y and continue.
3. Return the final solution.
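The basic local search loop can be sketched for the Number Partition example, using the single-sign-flip neighborhood described earlier. This is a minimal sketch under assumptions made here: a solution is a list of +1/−1 signs, the cost is the residue (the absolute value of the signed sum), and the first improving neighbor found is taken.

```python
def residue(numbers, signs):
    """Cost function f: absolute value of the signed sum."""
    return abs(sum(s * a for s, a in zip(signs, numbers)))


def neighbors(signs):
    """N(x): all solutions that differ from x in exactly one sign."""
    for i in range(len(signs)):
        flipped = list(signs)
        flipped[i] = -flipped[i]
        yield flipped


def hill_climb(numbers, signs):
    """Move to an improving neighbor until none exists (a local optimum)."""
    x = list(signs)
    while True:
        better = next(
            (y for y in neighbors(x) if residue(numbers, y) < residue(numbers, x)),
            None,
        )
        if better is None:
            return x
        x = better
```

The loop always terminates, because the residue strictly decreases with each move, but the solution returned is only guaranteed to be a local optimum for this neighborhood.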

The Reasoning Behind Local Search
The idea behind local search is clear; if we keep getting better and better solutions, we should end up with a good one. Pictorially, if we “project” the state space down to a two dimensional curve, we are hoping that the picture has a sink, or global optimum, and that we will quickly move toward it. See Figure 19.1.

[Figure: plot of the cost f(x) against x, with the global optimum x* marked.]
Figure 19.1: A very nice state space.

There are two possible problems with this line of thinking. First, even if the space does look this way, we might not move quickly enough toward the right solution. For example, for the number partition problem from the homework, it might be that each move improves our solution, but only by improving the residue by 1 each time. If we start with a bad solution, it will take a lot of moves to reach the minimum. Generally, however, this is not much of a problem, as long as the cost function is reasonably simple.

The more important problem is that the solution space might not look this way at all. For example, our cost function might not change smoothly when we move from a state to its neighbor. Also, it may be that there are several local optima, in which case our local search algorithm will home in on a local optimum and get stuck. See Figure 19.2.

[Figure: plot of the cost f(x) against x, with several local optima and the global optimum x* marked.]
Figure 19.2: A state space with many local optima; it will be hard to find the best solution.

This second problem, that the solution space might not “look nice”, is crucial, and it underscores the importance of setting up the problem. When we choose the possible moves between solutions (that is, when we construct the mapping that gives us the neighborhood of each node) we are setting up how local search will behave, including how the cost function will change between neighbors, and how many local optima there are. How well local search will work depends tremendously on how smart one is in setting up the right neighborhoods, so that the solution space really does look the way we would like it to.

Examples of Neighborhoods
We have already seen an example of a neighborhood for the homework problem. Here are possible neighborhoods for other problems:

• MAX3SAT: A possible neighborhood structure is that two truth assignments are neighbors if they differ in only one variable. A more extensive neighborhood could make two truth assignments neighbors if they differ in at most two variables; this trades increased flexibility for increased size of the neighborhood.
• Travelling Salesperson: The k-opt neighborhood of x is given by all tours that differ in at most k edges from x. In practice, using the 3-opt neighborhood seems to perform better than the 2-opt neighborhood, and using 4-opt or larger increases the neighborhood size to a point where it is inefficient.
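As a rough illustration, the smallest versions of these two neighborhoods can be generated as follows. The representations are assumptions of this sketch: a truth assignment is a tuple of booleans, and a tour is a list of cities read as a cycle. Reversing a contiguous segment of the tour is one common way to produce the tours that differ from it in two edges.

```python
def flip_neighbors(assignment):
    """Truth assignments that differ from `assignment` in exactly one variable."""
    for i in range(len(assignment)):
        yield assignment[:i] + (not assignment[i],) + assignment[i + 1:]


def two_opt_neighbors(tour):
    """Tours obtained by reversing one contiguous segment of `tour`.

    Reversing tour[i..j] removes two edges of the cycle and adds two new
    ones. The first city is kept fixed, and the reversal of everything
    after it is skipped because it gives back the same cycle.
    """
    n = len(tour)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            if i == 1 and j == n - 1:
                continue  # same tour traversed in the opposite direction
            yield tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
```

The first neighborhood has n members and the second has on the order of n^2, which illustrates the trade-off between flexibility and neighborhood size mentioned above.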

Lots of Room to Experiment
There are several aspects of local search algorithms that we can vary, and all can have an impact on performance. For example:

1. What are the neighborhoods N(x)?
2. How do we choose an initial starting point?
3. How do we choose a neighbor y to move to? (Do we take the first one we find, a random neighbor that improves f, the neighbor that improves f the most, or do we use other criteria?)
4. What if there are ties?

There are other practical considerations to keep in mind. Can we re-run the algorithm several times? Can we try several of the algorithms on different machines? Issues like these can have a big impact on actual performance. However, perhaps the most important issue is to think of the right neighborhood structure to begin with; if this is right, then other issues are generally secondary, and if this is wrong, you are likely to fail no matter what you do.

Local Search Variations
There are many variations on the local search technique (below, assume the goal is to minimize the cost function):

• Hill-climbing: this is the name for the basic variation, where one moves to a vertex of lower (or possibly equal) cost.
• Metropolis rule: pick a random neighbor, and if the cost is lower, move there. If the cost is higher, move there with some probability (that is usually set to depend on the cost differential). The idea is that possibly moving to a worse state helps avoid getting trapped at local minima.
• Simulated annealing: this method is similar to the Metropolis rule, except that the probability of going to a higher cost neighbor varies with time. This is analogous to a physical system (such as a chemical polymer) being cooled down over time.
• Tabu search: this adds some memory to hill climbing. As with the Metropolis rule and simulated annealing, you can go to worse solutions. A penalty function is added to the cost function to try to prevent cycling and promote searching new areas of the search space.
• Parallel search (“go with the winners”): do multiple searches in parallel, occasionally killing off searches that appear less successful and replacing them with copies of searches that appear to be doing better.
• Genetic algorithms: this trendy area is actually quite related to local search. An important difference is that instead of keeping one solution at a time, a group of them (called a population) is kept, and the population changes at each step.

It is still quite unclear what exactly each of these techniques adds to the pot. For example, some people swear that genetic algorithms lead to better solutions more quickly than other methods, while others claim that by choosing the right neighborhood function one can do as well with hill climbing. In the years to come, hopefully more will become understood about all of these methods. If you’re interested, you might try looking for genetic algorithms and simulated annealing in Yahoo. They’re both there.
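To make the Metropolis rule and simulated annealing concrete, here is one possible sketch. The acceptance probability exp(−Δ/T), where Δ is the cost increase and T a “temperature”, is a common choice but is an assumption here, as are all of the parameter names; the notes only say that the probability depends on the cost differential and, for annealing, on time. A constant temperature function gives the Metropolis rule.

```python
import math
import random


def simulated_annealing(start, cost, random_neighbor, temperature, steps, rng=None):
    """Minimize `cost` by a random walk that sometimes accepts worse moves.

    random_neighbor(x, rng) returns a random neighbor of x;
    temperature(t) returns the temperature at step t (hypothetical schedule).
    """
    rng = rng or random.Random()
    x = best = start
    for t in range(steps):
        y = random_neighbor(x, rng)
        delta = cost(y) - cost(x)
        temp = temperature(t)
        # Always accept improvements; accept a worse neighbor with
        # probability exp(-delta / temp), which shrinks as temp falls.
        if delta <= 0 or (temp > 0 and rng.random() < math.exp(-delta / temp)):
            x = y
        if cost(x) < cost(best):
            best = x  # remember the best solution seen so far
    return best
```

Keeping track of the best solution seen is a practical addition, since the walk itself may wander away from a good solution after finding it.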

CS124

Lecture 20

Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that approximation algorithms do not always perform as well as heuristic-based algorithms. Other times they provide insight into the problem, so they can help determine good heuristics.

Often when we talk about an approximation algorithm, we give an approximation ratio. The approximation ratio gives the ratio between the value of our solution and the value of the optimal solution. The goal is to obtain an approximation ratio as close to 1 as possible. If the problem involves a minimization, the approximation ratio will be at least 1; if it involves a maximization, the approximation ratio will be at most 1.

Vertex Cover Approximations
In the Vertex Cover problem, we wish to find a set of vertices of minimal size such that every edge is adjacent to some vertex in the cover. That is, given an undirected graph G = (V, E), we wish to find U ⊆ V such that every edge e ∈ E has an endpoint in U. We have seen that Vertex Cover is NP-complete.

A natural greedy algorithm for Vertex Cover is to repeatedly choose a vertex with the highest degree, and put it into the cover. When we put the vertex in the cover, we remove the vertex and all its adjacent edges from the graph, and continue. Unfortunately, in this case the greedy algorithm gives us a rather poor approximation, as can be seen with the following example:

[Figure: a layered graph in which the vertices chosen by greedy and the vertices in the min cover are marked.]
Figure 20.1: A bad greedy example.

In the example, all edges are connected to the base level; there are m/2 vertices at the next level, m/3 vertices at the next level, and so on. Each vertex at the base level is connected to one vertex at each other level, and the connections are spread as evenly as possible at each level. A greedy algorithm could always choose a rightmost vertex, whereas the optimal cover consists of the leftmost vertices. This example shows that, in general, the greedy approach could be off by a factor of Ω(log n), where n is the number of vertices.

A better algorithm for vertex cover is the following: repeatedly choose an edge, and throw both of its endpoints into the cover. Throw the two vertices and their adjacent edges out of the graph, and continue. It is easy to show that this second algorithm uses at most twice as many vertices as the optimal vertex cover. This is because each edge that gets chosen during the course of the algorithm must have at least one of its endpoints in any vertex cover, including the optimal one, and no two chosen edges share an endpoint; hence we have merely always thrown two vertices in where we might have gotten away with throwing in 1. Somewhat surprisingly, this simple algorithm is still essentially the best known approximation algorithm for the vertex cover problem. That is, no polynomial time algorithm is known that guarantees a constant factor better than 2.
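The edge-picking algorithm takes only a few lines. In this sketch (the edge-list representation is an assumption), scanning the edges once and skipping any edge that already has an endpoint in the cover has the same effect as removing covered edges from the graph.

```python
def approx_vertex_cover(edges):
    """Return a vertex cover at most twice the size of an optimal one.

    `edges` is an iterable of (u, v) pairs of an undirected graph.
    """
    cover = set()
    for u, v in edges:
        # An edge with an endpoint already in the cover has been "removed".
        if u not in cover and v not in cover:
            cover.update((u, v))  # throw both endpoints into the cover
    return cover
```

On the path 1 - 2 - 3 - 4 the sketch returns all four vertices, while the optimal cover {2, 3} has two, so the factor of 2 can actually be attained.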

Maximum Cut Approximation
We will provide both a randomized and a deterministic approximation algorithm for the MAX CUT problem. The MAX CUT problem is to divide the vertices in a graph into two disjoint sets so that the number of edges between vertices in different sets is maximized. This problem is NP-hard. Notice that the MIN CUT problem can be solved in polynomial time by repeatedly using the min cut-max flow algorithm. (Exercise: Prove this!)

The randomized version of the algorithm is as follows: we divide the vertices into two sets, HEADS and TAILS. We decide where each vertex goes by flipping a (fair) coin. What is the probability an edge crosses between the sets of the cut? This will happen only if its two endpoints lie on different sides, which happens 1/2 of the time. (There are 4 possibilities for the two endpoints, HH, HT, TT, and TH, and two of these put the vertices on different sides.) So, on average, we expect 1/2 the edges in the graph to cross the cut. Since the most we could have is for all the edges to cross the cut, this random assignment will, on average, be within a factor of 2 of optimal.

We now examine a deterministic algorithm with the same “approximation ratio”. (In fact, the two algorithms are intrinsically related, but this is not so easy to see!) The algorithm implements the hill climbing approximation heuristic. We will split the vertices into sets S1 and S2. Start with all vertices on one side of the cut. Now, if you can switch a vertex to a different side so that it increases the number of edges across the cut, do so. Repeat this action until the cut can no longer be improved by this simple switch.

We switch vertices at most |E| times (since each time, the number of edges across the cut increases). Moreover, when the process finishes we are within a factor of 2 of the optimal, as we shall now show. In fact, when the process finishes, at least |E|/2 edges lie in the cut.

We can count the edges in the cut in the following way: consider any vertex v ∈ S1. For every vertex w in S2 that it is connected to by an edge, we add 1/2 to a running sum. We do the same for each vertex in S2. Note that each edge crossing the cut contributes 1 to the sum: 1/2 for each vertex of the edge. Hence the number of edges C in the cut satisfies

$$C = \frac{1}{2}\left( \sum_{v \in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v \in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right).$$

Since we are using the local search algorithm, at least half the edges from any vertex v must lie in the set opposite from v; otherwise, we could switch what side vertex v is on, and improve the cut! Hence, if vertex v has degree δ(v), then

$$\begin{aligned}
C &= \frac{1}{2}\left( \sum_{v \in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v \in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right) \\
&\ge \frac{1}{2}\left( \sum_{v \in S_1} \frac{\delta(v)}{2} + \sum_{v \in S_2} \frac{\delta(v)}{2} \right) \\
&= \frac{1}{4} \sum_{v \in V} \delta(v) \\
&= \frac{1}{2}|E|,
\end{aligned}$$

where the last equality follows from the fact that if we sum the degree of all vertices, we obtain twice the number of edges, since we have counted each edge twice. In practice, we might expect that the hill climbing algorithm would do better than just getting a cut within a factor of 2.
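The hill climbing algorithm for MAX CUT might be sketched as follows. The representation is an assumption made here: vertices are hashable labels, edges are pairs with no self-loops, and the cut is stored as a dict mapping each vertex to side 0 or 1.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def local_search_max_cut(vertices, edges):
    """Switch vertices across the cut until no single switch improves it."""
    side = {v: 0 for v in vertices}  # start with all vertices on one side
    improved = True
    while improved:
        improved = False
        for v in vertices:
            incident = [(a, b) for a, b in edges if v in (a, b)]
            same = sum(1 for a, b in incident if side[a] == side[b])
            opposite = len(incident) - same
            if same > opposite:       # switching v gains same - opposite edges
                side[v] = 1 - side[v]
                improved = True
    return side
```

When the loop stops, every vertex has at least half of its edges crossing the cut, which is exactly the condition used in the counting argument above.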

Euclidean Travelling Salesperson Problem
In the Euclidean Travelling Salesperson Problem, we are given n points (cities) in the x-y plane, and we seek the tour (cycle) of minimum length that travels through all the cities. This problem is NP-hard (showing this is somewhat difficult). Our approximation algorithm involves the following steps:

1. Find a minimum spanning tree T for the points.
2. Create a pseudo tour by walking around the tree. The pseudo tour may visit some vertices twice.
3. Remove repeats from the tour by short-cutting through the repeated vertices. (See Figure 20.2.)

[Figure: three panels showing the minimum spanning tree, the constructed pseudo tour, and the constructed tour, with the starting vertex marked X.]
Figure 20.2: Building an approximate tour. Start at X, move in the direction shown, short-cutting repeated vertices.

We now show the following inequalities:

length of tour ≤ length of pseudo tour ≤ 2 (size of T) ≤ 2 (length of optimal tour)

Short-cutting edges can only decrease the length of the tour, so the tour given by the algorithm is at most the length of the pseudo tour. The length of our pseudo tour is at most twice the size of the spanning tree, since this pseudo tour consists of walking through each edge of the tree at most twice. Finally, the length of the optimal tour is at least the size of the minimum spanning tree, since any tour contains a spanning tree (plus an edge!).

Using a similar idea, one can come up with an approximation algorithm that returns a tour that is within a factor of 3/2 of the optimal. Also, note that this algorithm will work in any setting where short-cutting is effective. More specifically, it will work for any instance of the travelling salesperson problem that satisfies the triangle inequality for distances: that is, if d(x, y) represents the distance between vertices x and y, then d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z.
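The three steps of the algorithm can be sketched compactly. This sketch makes two choices not fixed by the notes: the minimum spanning tree is built with Prim's algorithm starting from point 0, and the walk-then-short-cut steps are combined into a preorder traversal of the tree, which visits the vertices in the same order as walking around the tree and skipping repeats. At least one point is assumed.

```python
import math


def mst_tour(points):
    """Return a tour (list of point indices) of length at most twice optimal."""
    n = len(points)
    # best[j] = (distance from j to the tree built so far, nearest tree vertex)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    children = {i: [] for i in range(n)}
    while best:  # Prim's algorithm: attach the closest outside vertex
        j = min(best, key=lambda k: best[k][0])
        _, parent = best.pop(j)
        children[parent].append(j)
        for k in best:
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    # Preorder walk of the tree = pseudo tour with repeated vertices skipped.
    tour, stack = [], [0]
    while stack:
        v = stack.pop()
        tour.append(v)
        stack.extend(reversed(children[v]))
    return tour


def tour_length(points, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )
```

For the four corners of a unit square the optimal tour has length 4, so the tour returned by this sketch is guaranteed to have length at most 8.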

MAX-SAT: Applying Randomness
Consider the MAX-SAT problem. What happens if we do the simplest random thing we can think of: we decide whether each variable should be TRUE or FALSE by flipping a coin?

Theorem 20.1 On average, at least half the clauses will be satisfied if we just flip a coin to decide the value of each variable. Moreover, if each clause has k literals, then on average a 1 − 2^(−k) fraction of the clauses will be satisfied.

The proof is simple. Look at each clause. If it has k literals in it (on distinct variables), then each literal fails to make the clause TRUE with probability 1/2, independently of the others. So the probability the clause is not satisfied is 2^(−k), and the probability it is satisfied is 1 − 2^(−k), which is at least 1/2 since k ≥ 1. Summing these probabilities over all the clauses gives the expected number of satisfied clauses.
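The theorem can be checked by brute force on a small formula: averaging the number of satisfied clauses over all truth assignments is the same as the expectation under fair coin flips. The sketch below assumes clauses are lists of (variable, is_positive) literals on distinct variables; the function names are illustrative.

```python
from itertools import product


def expected_satisfied(clauses):
    """Expected number of satisfied clauses under fair coin flips,
    using the 1 - 2^(-k) formula (assumes distinct variables per clause)."""
    return sum(1 - 2 ** -len(clause) for clause in clauses)


def average_satisfied(clauses, variables):
    """Average number of satisfied clauses over all 2^n truth assignments."""
    total = 0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        total += sum(
            any(assignment[v] == pos for v, pos in clause) for clause in clauses
        )
    return total / 2 ** len(variables)
```

For the clauses (a), (¬a ∨ b), and (a ∨ b ∨ ¬c), both functions give 1/2 + 3/4 + 7/8 = 2.125.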

Linear Programming Relaxation
The next approach we describe, linear programming relaxation, can often be used as a good heuristic, and in some cases it leads to approximation algorithms with provable guarantees. Again, we will use the MAX-SAT problem as an example of how to use this technique.

The idea is simple. Most NP-complete problems can be easily described by a natural Integer Programming problem. (Of course, all NP-complete problems can be transformed into some Integer Programming problem, since Integer Programming is NP-complete; but what we mean here is that in many cases the transformation is quite natural.) Even though we cannot solve the related Integer Program, if we pretend it is a linear program, then we can solve it, using (for example) the simplex method. This idea is known as relaxation, since we are relaxing the constraints on the solution; we are no longer requiring that we get a solution where the variables take on integer values.

If we are extremely lucky, we might find a solution of the linear program where all the variables are integers, in which case we will have solved our original problem. Usually, we will not. In this case we will have to try to somehow take the linear programming solution, and modify it into a solution where all the variables take on integer values. Randomized Rounding is one technique for doing this.


MAX-SAT
We may formulate MAX-SAT as an integer programming problem in a straightforward way (in fact, we have seen a similar reduction before, back when we examined reductions; it is repeated here). Suppose the formula contains variables $x_1, x_2, \ldots, x_n$ which must be set to TRUE or FALSE, and clauses $C_1, C_2, \ldots, C_m$. For each variable $x_i$ we associate a variable $y_i$ which should be 1 if the variable is TRUE, and 0 if it is FALSE. For each clause $C_j$ we have a variable $z_j$ which should be 1 if the clause is satisfied and 0 otherwise. We wish to maximize the number of satisfied clauses s, or

$$\sum_{j=1}^{m} z_j.$$

The constraints include that $0 \le y_i, z_j \le 1$; since this is an integer program, this forces all these variables to be either 0 or 1. Finally, we need a constraint for each clause saying that its associated variable $z_j$ can be 1 if and only if the clause is actually satisfied. If the clause $C_j$ is $(x_2 \vee \overline{x_4} \vee x_6 \vee \overline{x_8})$, for example, then we need the restriction:

$$y_2 + y_6 + (1 - y_4) + (1 - y_8) \ge z_j.$$

This forces $z_j$ to be 0 unless the clause is satisfied. In general, we replace $x_i$ by $y_i$, $\overline{x_i}$ by $1 - y_i$, $\vee$ by $+$, and set the whole thing $\ge z_j$ to get the appropriate constraint.

When we solve the linear program, we will get a solution that might have $y_1 = 0.7$ and $z_1 = 0.6$, for instance. This initially appears to make no sense, since a variable cannot be 0.7 TRUE. But we can still use these values in a reasonable way. If $y_1 = 0.7$, it suggests that we would prefer to set the variable $x_1$ to TRUE (1). In fact, we could try just rounding each variable up or down to 0 or 1, and use that as a solution! This would be one way to turn the non-integer solution into an integer solution.

Unfortunately, there are problems with this method. For example, suppose we have the clause $C_1 = (x_1 \vee x_2 \vee x_3)$, and $y_1 = y_2 = y_3 = 0.4$. Then by simple rounding, this clause will not be TRUE, even though it "seems satisfied" to our linear program (that is, $z_1 = 1$). If we have a lot of these clauses, regular rounding might perform very poorly.

It turns out that there is an interpretation for 0.7 that suggests a better way than simple rounding. We think of the 0.7 as a probability. That is, we interpret $y_1 = 0.7$ as meaning that $x_1$ would like to be TRUE with probability 0.7. So we take each variable $x_i$, and independently we set it to 1 with the probability given by $y_i$ (and with probability $1 - y_i$ we set $x_i$ to 0). This process is known as randomized rounding.
One reason randomized rounding is useful is that it allows us to prove that the expected number of clauses we satisfy using this rounding is within a constant factor of the true optimum.
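The rounding step itself is short enough to sketch. The code below does not solve the linear program; it takes a fractional solution as given (the values of `y` are a hypothetical LP output matching the 0.4 example in the text) and shows both the random rounding and the exact probability that a clause survives it. The signed-integer clause encoding and all function names are assumptions made for illustration.

```python
import random

def randomized_rounding(y, rng):
    """Set x_i = True with probability y[i], independently for each i."""
    return {i: rng.random() < p for i, p in y.items()}

def prob_clause_satisfied(clause, y):
    """Exact probability that a clause is satisfied after rounding.
    Literal +i is true with probability y[i]; literal -i with 1 - y[i]."""
    p_all_false = 1.0
    for lit in clause:
        p_true = y[abs(lit)] if lit > 0 else 1 - y[abs(lit)]
        p_all_false *= 1 - p_true
    return 1 - p_all_false

# Hypothetical fractional LP solution for the clause (x1 v x2 v x3).
y = {1: 0.4, 2: 0.4, 3: 0.4}
clause = [1, 2, 3]
exact = prob_clause_satisfied(clause, y)      # equals 1 - 0.6**3
print(exact)

# Empirical check with a seeded generator so the run is repeatable.
rng = random.Random(0)
trials = 20000
hits = 0
for _ in range(trials):
    x = randomized_rounding(y, rng)
    hits += any(x[abs(lit)] == (lit > 0) for lit in clause)
print(hits / trials)
```

Simple rounding would set all three variables to FALSE here, whereas randomized rounding satisfies the clause most of the time.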


First, note that whatever the maximum number of clauses s we can satisfy is, the value found by the linear program, or $\sum_{j=1}^{m} z_j$, is at least as big as s. This is because the linear program could achieve a value of at least s simply by using as the values for $y_i$ the truth assignment that makes satisfying s clauses possible.

Now consider a clause with k variables; for convenience, suppose the clause is just $C_1 = (x_1 \vee x_2 \vee \ldots \vee x_k)$. Suppose that when we solve the linear program, we find $z_1 = \beta$. Then we claim that the probability that this clause is satisfied after the rounding is at least $(1 - 1/e)\beta$. This can be checked (using a bit of sophisticated math), but it follows by noting (with experiments) that the worst possibility is that $y_1 = y_2 = \ldots = y_k = \beta/k$. In this case, each $x_i$ is FALSE with probability $(1 - \beta/k)$, and so $C_1$ ends up being unsatisfied with probability $(1 - \beta/k)^k$. Hence the probability it is satisfied is at least (again using some math)

$$1 - (1 - \beta/k)^k \ge (1 - 1/e)\beta.$$

Hence the ith clause is satisfied with probability at least $(1 - 1/e) z_i$, so the expected number of satisfied clauses after randomized rounding is at least $(1 - 1/e) \sum_{j=1}^{m} z_j$. This is within a factor of $(1 - 1/e)$ of our upper bound on the maximum number of satisfiable clauses, $\sum_{j=1}^{m} z_j$. Hence we expect to get within a constant factor of the maximum.
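The inequality $1 - (1 - \beta/k)^k \ge (1 - 1/e)\beta$ is stated above without proof. In the spirit of the "experiments" the notes mention, here is a small numerical check over a grid of values; it is evidence rather than a proof, and the grid ranges and function name are arbitrary choices made for this sketch.

```python
import math

def rounding_bound(beta, k):
    """Probability that a k-literal clause is satisfied when every
    y_i equals beta / k, the worst case described in the text."""
    return 1 - (1 - beta / k) ** k

c = 1 - 1 / math.e

# Smallest value of (left side - right side) over the grid:
# k = 1..50 and beta = 0.00, 0.01, ..., 1.00.
worst_gap = min(
    rounding_bound(b / 100, k) - c * (b / 100)
    for k in range(1, 51)
    for b in range(0, 101)
)
print(worst_gap)
```

On this grid the gap is never negative, and it reaches zero only at $\beta = 0$.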


Combining the Two
Surprisingly, by combining the simple coin flipping algorithm with the randomized rounding algorithm, we can get an even better algorithm. The idea is that the coin flipping algorithm does best on long clauses, since each additional literal in the clause makes it more likely the clause gets set to TRUE. On the other hand, randomized rounding does best on short clauses; the bound on the probability that the clause is satisfied, $1 - (1 - \beta/k)^k$, decreases with k. It turns out that if we try both algorithms and take the better result, on average we will satisfy at least 3/4 of the maximum number of satisfiable clauses.

We also point out that there are even more sophisticated approximation algorithms for MAX-SAT, with better approximation ratios. However, the algorithms described here illustrate some very interesting and useful general techniques.
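One standard way to see where 3/4 comes from (the argument is not spelled out in these notes, so treat this as a sketch under that assumption) is to compare the two per-clause guarantees. For a clause with k literals and LP value $\beta \le 1$, coin flipping satisfies it with probability $1 - 2^{-k} \ge (1 - 2^{-k})\beta$, and randomized rounding with probability at least $(1 - (1 - 1/k)^k)\beta$, using the concavity of $1 - (1 - \beta/k)^k$ in $\beta$. The better of the two algorithms does at least as well as their average, so it is enough to check that the two factors average to at least 3/4 for every k. The function names below are illustrative.

```python
def coin_flip_factor(k):
    """Probability that k fair coin flips satisfy a k-literal clause."""
    return 1 - 2 ** -k

def rounding_factor(k):
    """Factor multiplying beta in the randomized rounding guarantee."""
    return 1 - (1 - 1 / k) ** k

def average_bound(k):
    """Average of the two per-clause factors."""
    return (coin_flip_factor(k) + rounding_factor(k)) / 2

for k in range(1, 8):
    print(k, round(coin_flip_factor(k), 4), round(rounding_factor(k), 4),
          round(average_bound(k), 4))
```

The average is exactly 3/4 for k = 1 and k = 2 and larger for every other k, which is why neither algorithm alone reaches 3/4 but the combination does.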

CS 124

Lecture 21

We now consider a natural problem that arises in many applications, particularly in conjunction with suffix trees, which we will study later. Suppose we have a rooted tree T with n nodes. We would like to be able to answer questions of the following form: what is the least common ancestor of nodes u and v; that is, what is the common ancestor of u and v farthest from the root? In this setting, we will not be answering a single question, but many questions on the same fixed tree T. If we are given the tree T in advance, we can design an appropriate data structure for answering future queries.

Our algorithm will therefore be measured on several criteria. Of course one important criterion is the query time, or the time to answer a specific query. However, a second consideration is how much preprocessing time, or time to set up the data structure, is required to answer the questions. A third related aspect to study is the memory required to store the data structure.

For example, a trivial algorithm for the problem is to consider each pair of vertices, and compute their least common ancestor by following both paths toward the root until the first shared vertex is found. Then all the answers can be stored in a table. There are $\binom{n}{2}$ pairs of vertices, so our table will require $\Theta(n^2)$ space. Queries can be answered by a table lookup, which is constant time. Preprocessing, however, can require $\Theta(n^3)$ time.

The problem of designing an appropriate data structure for this is called the Least Common Ancestor (LCA) Problem. We will show that there is an algorithm for LCA that requires only linear preprocessing time and memory, but still answers any query in constant time! This result is as efficient as we could hope for.

We will reduce the LCA problem to a seemingly different but in fact quite related problem, called the Range Minimum Query (RMQ) Problem. The RMQ problem applies to an array A of n numbers. We would like to be able to answer questions of the following form: given two indices i and j, what is the index of the smallest element in the subarray A[i . . . j]? Again, we may preprocess the array A to derive some alternative data structure to answer the questions quickly. There is a trivial solution for the RMQ problem entirely analogous to the one above for the LCA problem.
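The trivial table-based algorithm can be sketched as follows. The tree is represented by a parent array, which is an assumption of this sketch (the notes do not fix a representation), and the sample tree is a hypothetical ten-node example.

```python
def naive_lca(parent, u, v):
    """Follow both paths toward the root until the first shared vertex.
    parent[x] is the parent of node x, or None for the root."""
    ancestors = set()
    while u is not None:            # record u and all of its ancestors
        ancestors.add(u)
        u = parent[u]
    while v not in ancestors:       # climb from v until the paths meet
        v = parent[v]
    return v

def lca_table(parent):
    """Precompute every answer: Theta(n^2) space, up to Theta(n^3) time."""
    n = len(parent)
    return [[naive_lca(parent, u, v) for v in range(n)] for u in range(n)]

# Hypothetical tree: 0 is the root with children 1, 4, 5; node 1 has
# children 2, 3; node 5 has children 6, 7, 9; node 7 has child 8.
parent = [None, 0, 1, 1, 0, 0, 5, 5, 7, 5]
table = lca_table(parent)
print(table[2][3], table[6][8], table[2][9])
```

After the table is built, each query is a single lookup `table[u][v]`.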


21.1 Reduction: From LCA to RMQ
How do we convert an LCA problem to an RMQ problem? Note that we must do the conversion in linear time, if we are going to complete the preprocessing for the LCA problem in linear time. Linear time suggests that we want to do a tree traversal. In fact, the observation we will use is that the LCA of nodes u and v is just the shallowest node encountered between visiting u and v during a depth first search of the tree starting at the root.

So let us do a DFS on the tree, recording in an array V the nodes we visit. An example is shown in Figure 1. Notice each node can appear multiple times, but the total length of the array is 2n − 1, where n is the number of nodes in the tree. Each of the n − 1 edges yields two values in the array, one when we go down the edge and one when we go up the edge. The first value is the root. Also, from now on we will refer to each node by its number in the DFS order.

We will also require two further arrays. The Level Array L is derived from V; L[i] is the distance from the root of V[i]. Adjacent elements in L can only differ by +1 or −1, since adjacent steps in the DFS are connected by an edge. Finally, R is the representative array; R[i] contains the first index of V that contains the value i. (Actually, any occurrence of i can be stored in R[i], but we might as well choose a specific one.)

Clearly, to compute LCA(u, v) it suffices to compute RMQ(R[u], R[v]) over the array L (swapping the two indices if R[u] > R[v]). This gives us the index of the shallowest node between u and v, and the array V can be used to determine the actual node from the index.


[Figure: drawing of the example tree, with nodes numbered 0 through 9.]

V: 0 1 2 1 3 1 0 4 0 5 6 5 7 8 7 5 9 5 0
L: 0 1 2 1 2 1 0 1 0 1 2 1 2 3 2 1 2 1 0
R: 0 1 2 4 7 9 10 12 13 16

Figure 1: Changing an LCA problem into an RMQ problem.
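The arrays in Figure 1 can be reproduced with a short depth first search. The tree shape used below is inferred from the V and L arrays of the figure, and the child-list dictionary, the function names, and the linear-scan range minimum are assumptions of this sketch; a real implementation would replace the scan with one of the RMQ structures.

```python
def euler_tour(children, root=0):
    """Return (V, L, R): nodes in DFS visit order, their levels, and the
    first index in V at which each node appears. Recursive, so suitable
    only for trees of modest depth."""
    V, L, first = [], [], {}

    def dfs(u, depth):
        first.setdefault(u, len(V))
        V.append(u)
        L.append(depth)
        for c in children.get(u, []):
            dfs(c, depth + 1)
            V.append(u)             # we come back up the edge to u
            L.append(depth)

    dfs(root, 0)
    R = [first[u] for u in sorted(first)]
    return V, L, R

def lca(u, v, V, L, R):
    """LCA via a range minimum over L (found here by a linear scan)."""
    i, j = sorted((R[u], R[v]))
    k = min(range(i, j + 1), key=L.__getitem__)
    return V[k]

# Tree inferred from Figure 1; nodes are numbered in DFS order.
children = {0: [1, 4, 5], 1: [2, 3], 5: [6, 7, 9], 7: [8]}
V, L, R = euler_tour(children)
print(V)
print(L)
print(R)
print(lca(2, 3, V, L, R), lca(6, 8, V, L, R), lca(4, 9, V, L, R))
```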

21.2 Solutions for RMQ
We first note that we can do better than the naive $\Theta(n^3)$ preprocessing time for RMQ on an array A by doing a trivial dynamic programming, using the recurrence

$$\mathrm{RMQ}(i, j) = A^{-1}[\min(A[\mathrm{RMQ}(i, j-1)], A[j])].$$

Here we are using convenient notation. Clearly $\min(A[\mathrm{RMQ}(i, j-1)], A[j])$ gives the value $A[k]$, where $A[k]$ is the smallest value in the subarray A[i . . . j]. However, we want the index of this value. We use the notation $A^{-1}$ to represent that we want the index of this value; note that if multiple indices have this value, we do not particularly care which index we obtain. Each table entry can be calculated in constant time by building the table in order of ranges [i, j] of increasing size, leading to preprocessing time $\Theta(n^2)$.

In fact, we can reduce our table size and memory using a different dynamic program, and by using a few additional operations per query. Let us create a table M(i, j) such that

$$M(i, j) = A^{-1}\left[\min_{k \in [i, i + 2^j)} A[k]\right].$$

That is, M(i, j) contains the location of the minimum value over the $2^j$ positions starting from i. This table has size O(n log n), and it can easily be filled in O(n log n) steps by using dynamic programming, based on the fact that M(i, j) can be determined from $M(i, j-1)$ and $M(i + 2^{j-1}, j-1)$.

How do we use the M(i, j) to compute RMQ(i, j), if the length j − i + 1 of the range is not a power of 2? We may use two overlapping intervals that cover the range [i, j] as follows. Let $k = \lfloor \log(j - i + 1) \rfloor$, so that $2^k$ is the largest power of 2 such that $i + 2^k \le j + 1$. Then

$$\mathrm{RMQ}(i, j) = A^{-1}[\min(A[M(i, k)], A[M(j - 2^k + 1, k)])],$$

and this can be computed in constant time from the table M. The second interval starts at $j - 2^k + 1$ so that its $2^k$ positions end exactly at j.

We have shown that we can achieve preprocessing time and memory size $\Theta(n \log n)$ while maintaining constant query time. Interestingly, this method can be enhanced so as to require preprocessing time and memory size $\Theta(n \log \log n)$ through a recursive construction. (This will be an exercise.) In practice, such a result would probably be good enough, since log log n is quite small for reasonable values of n. By continuing the recursive construction for further levels, we could even achieve $\Theta(n \log \log \log n)$ preprocessing time and memory size, and so on for any fixed number of logs, while maintaining constant query time. However, this recursive construction would add significant complexity to an actual program, and it still would not lead us to a linear preprocessing time solution.
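A minimal sketch of the table M and the two-interval query follows. It uses 0-based indices and stores the table as `M[j][i]` (power first, position second), which is the transpose of the M(i, j) notation in the text; these choices and the function names are assumptions of the sketch.

```python
def build_sparse_table(A):
    """M[j][i] = index of a minimum of the 2**j entries A[i : i + 2**j]."""
    n = len(A)
    M = [list(range(n))]                 # ranges of length 1
    j = 1
    while (1 << j) <= n:
        prev, half = M[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]   # the two halves of the range
            row.append(a if A[a] <= A[b] else b)
        M.append(row)
        j += 1
    return M

def rmq(A, M, i, j):
    """Index of a minimum element of A[i..j], inclusive, with i <= j."""
    k = (j - i + 1).bit_length() - 1     # floor(log2(j - i + 1))
    a = M[k][i]                          # covers i .. i + 2**k - 1
    b = M[k][j - (1 << k) + 1]           # covers j - 2**k + 1 .. j
    return a if A[a] <= A[b] else b

A = [0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2, 3, 2, 1, 2, 1, 0]
M = build_sparse_table(A)
print(rmq(A, M, 2, 4), rmq(A, M, 10, 13))
```

The two intervals in `rmq` may overlap, which is harmless because taking a minimum twice over the same element does not change the answer.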


21.3 ±1 RMQ
In order to achieve linear preprocessing, we will use an additional fact about the RMQ problem we obtain from the reduction from LCA. Recall that our RMQ problem is on the Level Array obtained from the LCA problem. The Level Array has one additional property that we are not yet taking advantage of: each entry differs from the previous entry by +1 or −1.

We can take advantage of this fact to split the RMQ problem into a different set of small subproblems in such a way that we can avoid some work by doing table look-ups. The split works as follows: partition A into blocks of size $\frac{\log n}{2}$. Let $X[1, \ldots, 2n/\log n]$ and $Y[1, \ldots, 2n/\log n]$ be arrays such that X[i] stores the minimum element in the ith block of A, and Y[i] stores the position in the ith block where the element X[i] occurs.

Now to answer an RMQ query for indices i and j with i < j on the array A, we can do the following:

1. If i and j are in the same block, we can perform an RMQ on this block. Notice that this requires that each block be preprocessed.
2. If i and j are in different blocks, we have to compute the following values, and take the minimum of them:
   (a) The minimum from position i to the end of i's block.
   (b) The minimum from the beginning of j's block to position j.
   (c) The minimum of all blocks between i's block and j's block.

Steps 2a and 2b also require that we preprocess for RMQ queries on each block. Step 2c requires that we perform an RMQ over the array X. Since X has only 2n/ log n entries, the $\Theta(n \log n)$ method of the previous section preprocesses it in O(n) time and memory. Assuming we have done all this preprocessing, the total query time is still constant. However, if we preprocess each block separately in order to do RMQs, we have not saved on the running time. We need a faster way to deal with preprocessing each block.

How can we possibly avoid preprocessing each block separately? We use the following observation. Consider two arrays X and X′. Suppose that these two arrays differ by a constant at each position; for example, the arrays might be 1, 2, 3, 4, 3, 2, . . . and 3, 4, 5, 6, 5, 4, . . . . Then the RMQ answers, which give the index of the minimum element, will be the same for these two arrays. Hence we can "share" the preprocessing used for these two arrays! Another way to explain this is that in the ±1 RMQ problem, the initial value of the array does not matter; only the sequence of +1 and −1 values is necessary to determine the answer.

Now, how many different such sequences are there? Since there are only (log n)/2 elements in a block, there are only (log n)/2 − 1 values in the sequence of +1 and −1 values. Hence there are only $2^{(\log n)/2 - 1} = \sqrt{n}/2$ possible sequences. This number is so small, we can afford to compute and store tables for every possible sequence! Even if we use quadratic preprocessing time and memory, these tables would take time $O(\sqrt{n} \log^2 n)$ to preprocess and $O(\sqrt{n} \log^2 n)$ memory. For each block in A, we have to determine which table to use; this can easily be done in linear time.
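The block decomposition can be sketched as follows. This is one possible arrangement under stated assumptions, not the notes' own code: indices are 0-based, the block size is about $(\log_2 n)/2$, each block is identified by a bitmask of its +1 steps, a short final block is padded with +1 steps, in-block tables are built lazily for the masks that actually occur, and Y stores global rather than in-block positions. All names are illustrative.

```python
def sparse_table(X):
    """M[k][i] = index of a minimum of X[i : i + 2**k]."""
    M = [list(range(len(X)))]
    k = 1
    while (1 << k) <= len(X):
        prev, half = M[-1], 1 << (k - 1)
        M.append([min(prev[i], prev[i + half], key=X.__getitem__)
                  for i in range(len(X) - (1 << k) + 1)])
        k += 1
    return M

def sparse_query(X, M, i, j):
    k = (j - i + 1).bit_length() - 1
    return min(M[k][i], M[k][j - (1 << k) + 1], key=X.__getitem__)

class PlusMinusOneRMQ:
    def __init__(self, A):
        """A must be nonempty, with adjacent entries differing by exactly 1."""
        self.A, n = A, len(A)
        self.b = b = max(1, (n.bit_length() - 1) // 2)  # about (log2 n) / 2
        self.sig, self.Y, self.tables = [], [], {}
        for s in range(0, n, b):
            blk = A[s:s + b]
            sig = 0         # bit t set when the step t -> t + 1 is +1
            for t in range(b - 1):
                if t + 1 >= len(blk) or blk[t + 1] > blk[t]:
                    sig |= 1 << t               # also pads a short last block
            self.sig.append(sig)
            if sig not in self.tables:          # blocks with equal masks share
                self.tables[sig] = self._block_table(sig, b)
            self.Y.append(s + self.tables[sig][0][len(blk) - 1])
        self.X = [A[p] for p in self.Y]         # block minima
        self.M = sparse_table(self.X)

    @staticmethod
    def _block_table(sig, b):
        """T[i][j] = offset of a minimum of positions i..j in the block
        that starts at 0 and follows the steps encoded by sig."""
        norm = [0]
        for t in range(b - 1):
            norm.append(norm[-1] + (1 if (sig >> t) & 1 else -1))
        T = [[0] * b for _ in range(b)]
        for i in range(b):
            T[i][i] = i
            for j in range(i + 1, b):
                T[i][j] = T[i][j - 1] if norm[T[i][j - 1]] <= norm[j] else j
        return T

    def query(self, i, j):
        """Index of a minimum of A[i..j], inclusive, with i <= j."""
        b, A = self.b, self.A
        bi, bj = i // b, j // b
        if bi == bj:                            # case 1: same block
            return bi * b + self.tables[self.sig[bi]][i - bi * b][j - bi * b]
        cands = [bi * b + self.tables[self.sig[bi]][i - bi * b][b - 1],  # 2a
                 bj * b + self.tables[self.sig[bj]][0][j - bj * b]]      # 2b
        if bj - bi >= 2:                        # 2c: whole blocks in between
            cands.append(self.Y[sparse_query(self.X, self.M, bi + 1, bj - 1)])
        return min(cands, key=A.__getitem__)

level = [0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2, 3, 2, 1, 2, 1, 0]
structure = PlusMinusOneRMQ(level)
print(structure.query(2, 4), structure.query(10, 13))
```

Building tables only for masks that occur keeps the sketch short; precomputing all $2^{b-1}$ masks, as the text describes, would give the same answers.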


21.4 Back to the standard RMQ
We have shown that ±1 RMQ problems can be solved with linear time preprocessing, and therefore we have a linear time preprocessing solution for LCA. What about the general RMQ problem? It turns out that we can also reduce the RMQ problem to the LCA problem in linear time. So we can obtain a linear time solution for the general RMQ problem, by turning it into an LCA problem, and solving that as a ±1 RMQ problem! The details of this reduction are omitted here.

CS 124

Lecture 22

Spring 2000

Suffix trees are an old data structure that has become new again, thanks to a recent linear time algorithm for constructing suffix trees due to Ukkonen that proves more useful for many applications. Here, we will describe suffix trees and discuss their classical use, pattern matching.

22.1 Definition
A suffix tree T is built for a string S[1 . . . m]. The tree is rooted and directed with m leaves, which are numbered from 1 to m. Each edge is labeled with a nonempty substring of S. The internal nodes of the tree (other than the root) all have at least two outgoing edges, and the labels of the edges leaving any one node begin with different characters. By following the path from the root to leaf i and concatenating the edge labels, one obtains the suffix S[i . . . m]. An example of a suffix tree for the string xyzxzxy\$ is given in Figure 22.1.

The figure helps in understanding some important points about the suffix tree. First, each internal node has two or more children with different starting characters along the edges, since otherwise the node could be removed or moved in order to make this the case. Also, it is important that the last character of the string be a "unique" character, as this guarantees that the suffix tree as defined actually exists. For example, suppose our string was just xyzxzxy. The tree would remain largely the same, but in the not-quite-suffix tree in Figure 22.1 the path for the suffix xy does not end at a leaf, violating the definition. The problem is that the suffix xy is also a prefix of the string. This problem can be avoided by terminating the string with a special character that does not appear elsewhere, since then no suffix can be a prefix of a different suffix. Hence, from now on, we will assume all strings end with a special character \$.

It is also worth noting that a more convenient representation of the suffix tree does not actually label the edges with characters. Instead, these labels can be represented by a pair of indices; labeling an edge [i, j] represents that the edge label corresponds to characters S[i . . . j]. Besides saving space and ensuring that each edge is conveniently represented by two numbers, this scheme is important for the linear time algorithm for suffix tree construction.


22.2 Construction algorithm


Fortunately, there are slightly more complex construction algorithms that require only O(m) time. We will not discuss the algorithm at this point; the details and the subsequent analysis would require a non-trivial amount of time. A reasonable introduction to the algorithm, however, has been written by Mark Nelson and has appeared in Dr. Dobb's Journal. You can currently find it at http://www.dogma.net/markn/articles/suffixt/suffixt.htm.


22.3 Using suffix trees for pattern matching
Once we have constructed our suffix tree, we can use it to efficiently solve pattern matching problems. There are of course other methods for pattern matching, but using suffix trees has an interesting advantage. Once the suffix tree has been constructed, finding all the occurrences of any pattern P[1 . . . n] in the string S takes time O(n + k), where k is the number of times that the pattern P appears in S. So by incurring a one-time preprocessing charge to establish the suffix tree, we can handle any pattern matching problem after that in time essentially proportional to the length of the pattern, independent of the length of the original string! This is quite powerful, particularly for things like DNA databases, where the underlying database is large and fixed but must be able to deal with lots of queries.

Suppose that P lies in the string S; for example, suppose P corresponds to S[i . . . i + n − 1]. Then P is a prefix of the suffix S[i . . . m]. Hence, if we start matching characters in P against the labels in the suffix tree for S, we will follow part of the path from the root to the leaf vertex labeled i.

Hence, to find all occurrences of P in S, start at the root, and match down the tree as far as possible. This takes time O(n). If P does not match some path in the tree, then P does not lie in S. If P does match some path in the tree, it matches down to some point z. All the leaves in the subtree below z correspond to suffixes for which P is a prefix, so the labels on these leaves correspond to locations that begin an occurrence of P. To find these positions, we just traverse the subtree below z, using for example depth first search. If there are k leaves, the depth first search takes only O(k) time.
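The matching procedure can be sketched on top of a simple suffix tree. The construction below is a naive quadratic-time insertion of one suffix at a time, used here only so that the example is self-contained; it is not the linear time algorithm mentioned in these notes. Edge labels are stored as index pairs, and the class and function names are assumptions of the sketch.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first char -> (start, end, child); label s[start:end]
        self.leaf = leaf    # 1-based start of the suffix if this is a leaf

def build_suffix_tree(s):
    """Naive construction: insert the suffixes of s one at a time.
    Assumes s ends with a character that appears nowhere else in s."""
    root = Node()
    for i in range(len(s)):
        node, pos = root, i
        while True:
            edge = node.children.get(s[pos])
            if edge is None:            # no edge starts with s[pos]: add a leaf
                node.children[s[pos]] = (pos, len(s), Node(leaf=i + 1))
                break
            start, end, child = edge
            k = 0
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k < end:         # mismatch inside the edge: split it
                mid = Node()
                mid.children[s[start + k]] = (start + k, end, child)
                node.children[s[pos]] = (start, start + k, mid)
                child = mid
            node, pos = child, pos + k
    return root

def find_occurrences(root, s, p):
    """All 1-based positions where the nonempty pattern p occurs in s."""
    node, i = root, 0
    while i < len(p):                   # match down the tree: O(len(p)) steps
        edge = node.children.get(p[i])
        if edge is None:
            return []
        start, end, child = edge
        k = 0
        while start + k < end and i + k < len(p):
            if s[start + k] != p[i + k]:
                return []
            k += 1
        node, i = child, i + k
    found, stack = [], [node]           # collect the leaves below the match
    while stack:
        n = stack.pop()
        if n.leaf is not None:
            found.append(n.leaf)
        stack.extend(child for _, _, child in n.children.values())
    return sorted(found)                # sorted only to make the output stable

S = "xyzxzxy$"
tree = build_suffix_tree(S)
print(find_occurrences(tree, S, "xy"), find_occurrences(tree, S, "zx"))
```

Sorting the result costs an extra logarithmic factor in k; it is done here only for readable output and can be dropped.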


22.4 Representation
An important point about suffix trees: to make sure everything takes linear time, it is important to use the correct representation. For example, we do not explicitly label each edge with a group of characters, since for a string of length m this could take as much as $\Omega(m^2)$ time just to write down! Instead, each edge is labeled with a pair of values that identifies its characters. For example, an edge labeled [a, b] should be thought of as being labeled by the characters S[a] . . . S[b]. Hence each edge is just labeled by two numbers, and only linear space is required.


[Figure omitted: the suffix tree for xyzxzxy\$ (top) and the not-quite-suffix tree for xyzxzxy (bottom).]

Figure 22.1: A true suffix tree (top); why we need the \$ character (bottom).


22.5 Generalized suffix trees
You may want to put a set $\{S_1, S_2, \ldots, S_k\}$ of strings in a suffix tree data structure. (Note: we assume each string ends with the special character \$.) The structure in this case is called a generalized suffix tree. There are two primary differences. First, now each leaf node may contain multiple pairs of numbers. Each pair of numbers identifies a string $S_i$ and a location where the suffix from the root to that leaf starts in $S_i$. Note that multiple strings can have suffixes that share a leaf node! Second, each edge label must be represented by three numbers: a number i and a pair [a, b] represent that the characters on the edge label are $S_i[a] \ldots S_i[b]$.

Constructing a generalized suffix tree can easily be done by extending our quadratic time algorithm. However, the linear time algorithm for suffix trees can also be used to build a generalized suffix tree. Hence if $m = \sum_{i=1}^{k} |S_i|$, constructing the generalized suffix tree can be done in O(m) time.


22.6 Longest common extension
Using generalized suffix trees and the LCA algorithm, we can solve a very general problem called the longest common extension problem. Given strings $S_1$ and $S_2$, we wish to pre-process the strings so that we can answer questions of the following form: given a pair (i, j), find the longest substring of $S_1$ that begins at position i that matches a substring of $S_2$ that begins at position j. We will use linear time pre-processing and linear space, after which we can answer queries in constant time.

The solution is to build a generalized suffix tree for $S_1$ and $S_2$. When we build this tree, we should also compute the string depth of each node. The string depth of a node is simply the number of characters along the edges from the root to that node. Notice the string depth is not the same as the tree depth. Also, after building the tree, we precompute the information necessary to do LCA queries on the tree.

Given a pair (i, j) we compute the least common ancestor u of the leaf nodes corresponding to the suffix beginning at i in $S_1$ and the suffix beginning at j in $S_2$. The path from the root to u spells out the longest common extension, and hence the string depth of this node is all we need.
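The connection between longest common extensions and the string depth of a least common ancestor can be sketched as follows. Several simplifications are assumed and should not be read as the method of the notes: the tree is built naively in quadratic time, a single tree for the concatenation `s1 + "#" + s2 + "$"` stands in for the generalized suffix tree (the two separator characters are assumed not to occur in either string), and the LCA is found by walking up parent pointers rather than by the constant-time structure. All names are illustrative.

```python
class Node:
    def __init__(self, start, depth, parent=None):
        self.start = start      # the path to this node spells s[start:start+depth]
        self.depth = depth      # string depth
        self.parent = parent
        self.children = {}      # first character of edge label -> child

def build(s):
    """Naive suffix tree of s, which must end in a unique character.
    Returns (root, leaves), where leaves[i] is the leaf for suffix s[i:]."""
    root, leaves = Node(0, 0), []
    for i in range(len(s)):
        node = root
        while True:
            child = node.children.get(s[i + node.depth])
            if child is None:
                break
            d = node.depth
            while d < child.depth and s[child.start + d] == s[i + d]:
                d += 1
            if d < child.depth:             # mismatch inside the edge: split
                mid = Node(child.start, d, node)
                node.children[s[i + node.depth]] = mid
                mid.children[s[child.start + d]] = child
                child.parent = mid
                child = mid
            node = child
        leaf = Node(i, len(s) - i, node)
        node.children[s[i + node.depth]] = leaf
        leaves.append(leaf)
    return root, leaves

def lca_depth(u, v):
    """String depth of the least common ancestor, found by walking up.
    The deeper node cannot be the ancestor, so it is safe to move it."""
    while u is not v:
        if u.depth >= v.depth:
            u = u.parent
        else:
            v = v.parent
    return u.depth

def make_lce(s1, s2):
    """Return a function lce(i, j) using 1-based positions into s1 and s2."""
    text = s1 + "#" + s2 + "$"  # assumes '#' and '$' occur in neither string
    _, leaves = build(text)
    offset = len(s1) + 1        # index in text where s2 begins

    def lce(i, j):
        return lca_depth(leaves[i - 1], leaves[offset + j - 1])

    return lce

lce = make_lce("abcab", "cabd")
print(lce(3, 1), lce(1, 2), lce(1, 1))
```

The function returns the length of the extension; the extension itself is the substring of that length starting at position i of the first string.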


22.7 Maximal palindromes
A palindrome is a string that reads the same forwards as backwards, such as axbccbxa. A substring U of a string S is a maximal palindrome if and only if it is a palindrome and extending it by one character in both directions either yields a string that is not a palindrome or is impossible because U touches an end of S. Generally we separate even-length maximal palindromes, or even palindromes for short, and odd-length maximal palindromes (odd palindromes) for convenience. For example, in S = axbccbbbaa, the maximal even palindromes are bccb, bb, and aa. The string bbb is a maximal odd palindrome, and we will skip writing the maximal odd palindromes of length 1. Note that every palindrome is contained in a maximal palindrome.

Here is a simple way to find all even-length maximal palindromes in linear time. (Finding odd-length maximal palindromes is similar.) Consider S and $S^r$, the reversal of S, and let n be the length of S. There is a palindrome of length 2k with the middle just after position q if the string of length k starting from position q + 1 of S matches the string of length k starting from position n − q + 1 of $S^r$. In particular, this palindrome will be maximal if k is the length of the longest match from these positions. Thus, solving the even-length maximal palindrome problem corresponds to computing the longest common extension of (q + 1, n − q + 1) for all possible q. The data structure can be built in linear time, and each of the linear number of queries can be answered in constant time, so the total time is linear.
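The reduction to longest common extension queries can be sketched directly. To keep the example short, the queries are answered here by plain character comparison, which is a stand-in for the constant-time suffix tree query and makes this sketch quadratic in the worst case rather than linear. Positions are 1-based as in the text, and the function names are illustrative.

```python
def lce(a, i, b, j):
    """Length of the longest common extension of a from position i and
    b from position j (1-based), computed by direct comparison."""
    k = 0
    while (i - 1 + k < len(a) and j - 1 + k < len(b)
           and a[i - 1 + k] == b[j - 1 + k]):
        k += 1
    return k

def maximal_even_palindromes(s):
    """Maximal even-length palindromes of s, one per middle position."""
    n, rev = len(s), s[::-1]
    result = []
    for q in range(1, n):               # the middle lies just after position q
        k = lce(s, q + 1, rev, n - q + 1)
        if k > 0:
            # 1-based positions q - k + 1 .. q + k of s
            result.append(s[q - k:q + k])
    return result

print(maximal_even_palindromes("axbccbbbaa"))
```

The string bb is reported twice for the example from the text because it occurs around two different middles.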