On the Complexity of Protein Folding
Pierluigi Crescenzi, Deborah Goldman, Christos Papadimitriou, Antonio Piccolboni, Mihalis Yannakakis
Abstract
We show that the protein folding problem in the two-dimensional HP model is NP-complete.
1 Introduction
Proteins are polymer chains consisting of monomers of twenty different kinds. Much of the genetic information in the DNA contains the sequence information of proteins, with three nucleotides encoding one monomer. In turn, proteins in an organism fold to form a very specific geometric pattern, known as the protein's native state. It is this geometric pattern that determines the macroscopic properties, behavior, and function of a protein. It is in general reasonably stable and unique. The mapping from DNA sequences to monomer sequences is simple and very well understood. In contrast, the mapping from the sequence of a protein to the geometric configuration of its native state (the "second half of the genetic code" [8]) is much more intricate and complex, and less understood; it has been the subject of intense investigation for decades. It seems clear that the forces underlying protein folding are the interactions between their monomers; recently, the view that nonlocal interactions dominate this process has been gaining ground [4]. To test this and other hypotheses concerning protein folding, researchers resorted to simplified models of proteins, mathematical abstractions of proteins that hide many aspects and exaggerate the effect of others; analysis and computer simulation of such models can then be compared to experimental results with actual proteins, to determine whether the emphasized aspects are indeed the dominant ones. Perhaps the most successful and best-studied such model, and the one with apparently the best match with experiments¹, is the two-dimensional hydrophilic-hydrophobic model, or HP model, proposed by Dill [3]. In this model it is assumed that the protein is a sequence of 0s and 1s, and folding entails embedding the sequence in the two-dimensional lattice (see Figure 1).
Each such folding is evaluated with a score, equal to the number of pairs of 1s that are adjacent in the lattice without being adjacent in the sequence; for example, in Figure 1 the score is five, corresponding to the five pairs of 1s connected by dotted lines. The score captures a simple model of energy minimization, in which the "hydrophobic" 1s tend to be close to each other and thus avoid exposure, while 0s are neutral. It is assumed in this model that the native folded state is the one that maximizes score. It is therefore an interesting problem, given a sequence of 0s and 1s, to find the embedding on the lattice that maximizes score. In this paper we prove that this problem is NP-complete (Theorem 3).

Figure 1: Embedding in the two-dimensional lattice.

¹ Chan and Dill [2] state that "for chain lengths for which exhaustive enumeration is possible (up to about 30 monomers), two-dimensional models more accurately represent the physically important surface-interior ratios of proteins than do three-dimensional models."

That proteins fold so as to minimize energy has been accepted for decades. This view quickly leads to a puzzling aspect of the problem, known as Levinthal's paradox, which can be paraphrased as follows: "How can a folding protein choose so quickly among so many possible foldings the one with minimum energy?" [4]. Our result can thus be seen as a more compelling restatement of that paradox, since it implies that finding the optimum folding in the two-dimensional HP model (the simplest abstraction of the protein folding problem one finds in the literature, and presumably a vast simplification of the true detailed 3-dimensional energy minimization problem in actual proteins) is NP-complete, that is to say, among the provably hardest problems of the sort alluded to by the paradox, in which we must optimize among an astronomical population of states.

There have been several NP-completeness results related to protein folding in the literature. A few years ago, several authors pointed out that certain general restatements of the problem, in which monomers attract or repel each other in ways that are general and can be used in encoding, are NP-complete [5, 10, 13]. More interestingly, it was proved in [11] that a combinatorial generalization of the HP model to an infinite alphabet, of which one symbol is neutral like HP's 0 symbol, and the score counts the number of adjacencies of elements with the same symbol, is NP-complete. More recently, [9] improved this to a finite, albeit very large alphabet. The present result is the first to settle the complexity of the simple two-dimensional HP model actually proposed in the literature as the ultimate simplification of the protein folding problem.
The HP problem has been attacked from the point of view of approximation algorithms [6]; the present result sheds little light on this aspect of the problem, as our reduction is not in any interesting way approximation-preserving.

Our reduction is from the Hamilton cycle problem. As is common in previous proofs of weaker results, we start by showing that the folding problem for sets of sequences (that is, when many sequences are to be optimally folded) is NP-complete (Theorem 1 in Section 2). We then proceed to establish the result for a single sequence, by resorting to certain interesting variants of the planar Hamilton cycle problem (Theorem 3 in Section 3). In our proof we utilize an idea of Trevisan [12], whereby graphs can be embedded in the hypercube so that adjacency is captured by Hamming distance. Our proof captures one of the basic intuitions of the HP model, namely that hydrophobic monomers will tend to form a large "sphere" (in the two-dimensional lattice, a large hydrophobic square). Impurities in this sphere then must be aligned optimally to maximize score, and it is the complexity of this alignment that our proof captures.

Finally, in Section 4 we briefly discuss a version of our proof (in fact, without the planarity complication) which settles the NP-completeness of the three-dimensional version of the protein folding problem in the HP model and, in fact, the MAX SNP-completeness of the problem of minimizing losses in three dimensions. We were recently informed that, independently, Berger and Leighton [1] proved that the three-dimensional protein folding problem in the HP model is NP-complete; in fact, the approximability implications of their result are stronger than ours.
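The scoring rule of the HP model can be made concrete with a small sketch. The Python below is an illustration only: the names `hp_score` and `best_fold`, and the representation of a folding as a list of lattice points, are assumptions made here and do not come from the paper. The search simply enumerates all self-avoiding walks, so it runs in exponential time and is usable only for very short sequences.

```python
from typing import List, Tuple

Point = Tuple[int, int]
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # the four lattice directions


def hp_score(seq: str, path: List[Point]) -> int:
    """Number of pairs of 1s adjacent in the lattice but not in the sequence."""
    index = {p: k for k, p in enumerate(path)}
    score = 0
    for k, (x, y) in enumerate(path):
        if seq[k] != "1":
            continue
        # Look only right and up so that every lattice edge is seen once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and seq[j] == "1" and abs(j - k) > 1:
                score += 1
    return score


def best_fold(seq: str) -> Tuple[int, List[Point]]:
    """Exhaustive search for a maximum-score embedding (exponential time)."""
    if not seq:
        return 0, []
    best: Tuple[int, List[Point]] = (-1, [])

    def extend(path: List[Point], used: set) -> None:
        nonlocal best
        if len(path) == len(seq):
            s = hp_score(seq, path)
            if s > best[0]:
                best = (s, list(path))
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            q = (x + dx, y + dy)
            if q not in used:  # the embedding must be one-to-one
                path.append(q)
                used.add(q)
                extend(path, used)
                path.pop()
                used.remove(q)

    extend([(0, 0)], {(0, 0)})
    return best
```

For instance, `best_fold("110011")` finds a folding in which both the outer pair and the inner pair of 1s are in contact, while `best_fold("101")` can never score, because two positions at even distance along the chain are never adjacent in the lattice.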
2 The multistring folding problem
The two-dimensional lattice is the graph $(\mathbb{Z}^2, L)$, with node set $\mathbb{Z}^2$ (all points in the Euclidean plane with integer coordinates), and edges all pairs in $L = \{((x, y), (x', y')) : |x - x'| + |y - y'| = 1\}$. Consider a set of strings $S = \{s_1, \ldots, s_m\}$ from the alphabet $\{0, 1\}$. A folding of these strings is an embedding of $S$ into the lattice, that is to say, a one-to-one mapping $f$ from the set $\{(i, j) : 1 \le i \le m,\ 1 \le j \le |s_i|\}$ to $\mathbb{Z}^2$ such that for all $1 \le i \le m$, $1 \le j \le |s_i| - 1$ we have $(f(i, j), f(i, j + 1)) \in L$. Fix a folding $f$; the points $f(i, j)$ and $f(i, j + 1)$ are called $f$-neighbors. An edge of the lattice $\{(x, y), (x', y')\} \in L$ is said to be a loss if (a) these points are not $f$-neighbors, and (b) exactly one of these two points is the image under $f$ of a pair $(i, j)$ such that the $j$th symbol of $s_i$ is a 1. Each position in a string containing a one, and which is not the first or the last, can participate in zero, one, or two losses.

The multistring folding problem is the following: given a set of strings $s_1, \ldots, s_m \in \{0, 1\}^*$ and an integer $E$, is there a folding with $E$ or fewer losses? If, as is the case in the strings we construct, no string starts or ends in a 1, then it is easy to see that the total score of a folding is equal to twice the number of 1's, minus the losses, divided by two; hence, minimizing losses is the same as maximizing score, the traditional way of stating the protein folding problem. In this section we prove the following theorem:
Theorem 1 The multistring folding problem is NP-complete.

In the next section we shall show that the problem remains NP-complete even if there is only one string.
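The relation between losses and score stated above can be checked on small examples. The following Python is a minimal sketch under stated assumptions: the function names `count_losses` and `count_score`, and the encoding of a folding as one list of lattice points per string, are chosen here for illustration and are not part of the paper.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Point = Tuple[int, int]
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def _layout(strings: List[str], fold: List[List[Point]]):
    """Validate the folding; return the symbol at each point and the f-neighbor edges."""
    symbol: Dict[Point, str] = {}
    f_edges: Set[FrozenSet[Point]] = set()
    for s, path in zip(strings, fold):
        assert len(s) == len(path)
        for ch, p in zip(s, path):
            assert p not in symbol, "the mapping must be one-to-one"
            symbol[p] = ch
        for a, b in zip(path, path[1:]):
            assert abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1, "not a lattice edge"
            f_edges.add(frozenset((a, b)))
    return symbol, f_edges


def count_losses(strings: List[str], fold: List[List[Point]]) -> int:
    """Lattice edges that are not f-neighbor edges and have exactly one endpoint a 1."""
    symbol, f_edges = _layout(strings, fold)
    losses = 0
    for (x, y), ch in symbol.items():
        if ch != "1":
            continue
        for dx, dy in STEPS:
            q = (x + dx, y + dy)
            # The other endpoint is a zero or an empty point, so each
            # such edge is counted exactly once, from its 1 endpoint.
            if frozenset(((x, y), q)) not in f_edges and symbol.get(q) != "1":
                losses += 1
    return losses


def count_score(strings: List[str], fold: List[List[Point]]) -> int:
    """Pairs of 1s that are adjacent in the lattice but are not f-neighbors."""
    symbol, f_edges = _layout(strings, fold)
    score = 0
    for (x, y), ch in symbol.items():
        if ch != "1":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # right and up: each edge once
            q = (x + dx, y + dy)
            if symbol.get(q) == "1" and frozenset(((x, y), q)) not in f_edges:
                score += 1
    return score
```

When no string starts or ends in a 1, every 1 has exactly two lattice edges that are not f-neighbor edges, and each of them is either a contact (shared with another 1) or a loss, which gives `2 * ones == 2 * score + losses`. Placing the strings `010` and `010` side by side, for example, yields one contact and two losses.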
2.1 Description of the reduction
We start from the following NP-complete problem: Given a graph $G = (V, E)$ with nodes of degree four or less, and two nodes $v_1, v_n \in V$, is there a Hamilton path from $v_1$ to $v_n$?

As a preliminary step in our reduction, we first map the nodes in $G$ to the hypercube according to a map used by Trevisan in [12]. Using Hadamard codes, he showed that there exists a function, which we call $T$, mapping the $n$ nodes of the graph to codewords in $\{0, 1\}^{8n}$ such that the images of two unconnected nodes have Hamming distance strictly greater than two nodes connected by an edge. Formally, applying $T$ to the nodes of $G$ we find, for $i \ne j$ in $\{1, \ldots, n\}$:

1. If $\{v_i, v_j\}$ is an edge in $G$, then $d_H(T(v_i), T(v_j)) = \frac{7}{2} n$.
2. If $\{v_i, v_j\}$ is not an edge, then $d_H(T(v_i), T(v_j)) = 4n$.

(Here we assume that $n$ is even.) Notice that if there is a Hamilton path from $v_1$ to $v_n$ in $G$, then there is a Hamilton path from $T(v_1)$ to $T(v_n)$ in the Hamming space of length $\frac{7}{2}(n - 1)n$; otherwise, if there is no Hamilton path from $v_1$ to $v_n$ in $G$, then any Hamilton path from $T(v_1)$ to $T(v_n)$ must have length strictly greater than $\frac{7}{2}(n - 1)n$. We note, finally, that the function $T$ may be chosen so that $T(v_1)$ and $T(v_n)$ contain at most as many zeros as $T(v_i)$, for any $i \in \{1, \ldots, n\}$.

We now construct a set of strings $S$ and an integer $E$, such that there is a Hamilton path from $v_1$ to $v_n$ in $G$ if and only if there is a folding of the strings in $S$ with at most $E$ losses. The allowed number of losses is $E = 7(n - 1)n$. As for the set of strings $S$, let $L = 180 n^{14}$. $S$ will contain $L$ strings, $s_1, \ldots, s_L$, such that the first $L/n$ strings correspond to node $v_1$, the second $L/n$ to node $v_2$, and so on. All strings, with the exception of $s_1$ and $s_L$, will be constructed similarly and so that, in particular, strings corresponding to the same city are identical.

Define $q = \lceil \sqrt{LE} \rceil$ and let $c$ be an even positive integer to be specified later. In what follows, products and powers of strings denote concatenation. Let $d$ (suggesting dense) denote the string $d = \prod_{i=1}^{L/90} 1^8 0$, let $m$ (suggesting middle) denote the string $m = \prod_{i=1}^{2E - 16n} 1^{cq} 0$, and, for $i \in [n]$, let $t(i)$ (suggesting the Trevisan code) denote the string $t(i) = \prod_{l=1}^{8n} (1^{cq}\, T(v_i)_l)^2$. The description of the strings $s_2$ through $s_{L-1}$ follows: for $k \in \{2, \ldots, L - 1\}$ and $i = \lceil nk/L \rceil$,

$$s_k = 0^{L^4}\, d\, 1^{L/10}\, d\, 1^{L/5 - 2E(cq+1)}\, m\, t(i)\, 1^{2L/5}\, d\, 0^{L^4}.$$

Notice that each string contains a prefix and suffix of $L^4$ zeros. The substring lying between the prefix and suffix, called the internal substring, has total length $L$. There are three dense substrings of ones and zeros in each string, and, toward the middle of the string, there is the substring $m\, t(i)$ of length $2E(c \lceil \sqrt{LE} \rceil + 1)$, which we call the sparse substring, and which consists of the sparse string $m$ and the sparse string $t(i)$ containing two copies of the Trevisan code (interspersed between strings of ones). Notice that, by placing two copies of each bit of the Trevisan code next to each other (separated by ones), we have arranged for all maximal substrings of adjacent ones to have even length.

The strings $s_1$ and $s_L$, called the flanks, are identical, respectively, to strings $s_2$ and $s_{L-1}$ except for the fact that 4 zeros are inserted between every other pair of two adjacent ones in $s_2$ and $s_{L-1}$, beginning with the first pair of ones in each maximal substring of adjacent ones. Formally, setting $d' = \prod_{i=1}^{L/90} (1 0^4 1)^4 0$, $m' = \prod_{i=1}^{2E - 16n} (1 0^4 1)^{cq/2} 0$, and, for $i \in [n]$,

$$t(i)' = \prod_{l=1}^{8n} \begin{cases} \left( (1 0^4 1)^{cq/2}\, T(v_i)_l \right)^2 & \text{if } T(v_i)_l = 0, \\ (1 0^4 1)^{cq/2}\, T(v_i)_l\, 0^4 1\, (1 0^4 1)^{cq/2 - 1}\, 1 0^4\, T(v_i)_l & \text{if } T(v_i)_l = 1, \end{cases}$$

we have, for each $k \in \{1, L\}$ and $i = \lceil nk/L \rceil$,

$$s_k = 0^{L^4}\, d'\, (1 0^4 1)^{L/20}\, d'\, (1 0^4 1)^{L/10 - E(cq+1)}\, m'\, t(i)'\, (1 0^4 1)^{L/5}\, d'\, 0^{L^4}.$$

This completes the description of the reduction.

2.2 The intended folding

In this subsection we show that if there is a Hamilton path from $v_1$ to $v_n$ in $G$, then there is a folding with at most $E$ losses. This is rather easy; the hard direction is the opposite, sketched in the next subsection (and proved in the appendix).

Let $(v_{i_1}, v_{i_2}, \ldots, v_{i_n})$ be the order of the nodes contained in a Hamilton path, where $v_{i_1} = v_1$ and $v_{i_n} = v_n$. We construct a folding, called the intended folding, by arranging the non-flank strings, $s_2, \ldots, s_{L-1}$, vertically to form a $(2L^4 + L) \times (L - 2)$ rectangle as follows: all the strings corresponding to node $v_{i_1}$ are placed adjacent with their first bits in the same horizontal line at the left side of the rectangle, then the strings corresponding to node $v_{i_2}$ are placed next, and so on until the strings corresponding to node $v_{i_n}$ complete the rectangle. The flank strings $s_1$ and $s_L$, which differ from $s_2$ and $s_{L-1}$ only through the addition of groups of 4 zeros, also have vertical orientation except for the fact that the 4-zero groups are bent to the left and right, respectively. In this way, the flanks $s_1$ and $s_L$ can be placed, respectively, adjacent to the left and right sides of the rectangle so that their first bits are in line with the first bits of the other strings and so that, since the 4-zero groups have been excluded, the resulting patterns of bits adjacent to the rectangle are exactly the strings corresponding to nodes $v_1$ and $v_n$. A schematic drawing of the intended folding appears in Figure 2.

Figure 2: The intended folding. (Labels in the figure: prefixes, flanks, dense regions, sparse region, suffixes.)

Note that in the intended folding the central $L \times L$ square is composed primarily of ones with some horizontal lines of zeros and horizontal lines containing code bits running through it. It should be clear that the only place where a loss may occur is where two code bits $T(v_{i_j})_l$ and $T(v_{i_{j+1}})_l$ from the Trevisan code corresponding to two different nodes $v_{i_j}$ and $v_{i_{j+1}}$ are adjacent. However, by the properties of the Trevisan code and because we have arranged the order of the identical groups of strings to be the same as the order $(v_{i_1}, v_{i_2}, \ldots, v_{i_n})$ of the Hamilton path, we are guaranteed that the two copies of the Trevisan code result in at most $2 \cdot \frac{7}{2}(n - 1)n = E$ adjacent but not neighboring zero-one pairs. Therefore, the folding has at most $E$ losses, and one direction has been proved.

2.3 The converse

In this section we summarize the (quite long and involved) proof of the converse; the full proof can be found in the appendix. We consider a folding of the strings with at most $E$ losses; we have to show that it is the intended folding corresponding to a Hamilton path of $G$.

We define the region $R$ to consist of all points within the $L \times L$ square of the intended folding, as well as all points surrounded by such points. We first prove that the largest component of this region has area at least $L^2 - O(E)$ and perimeter at most $4L + O(E)$ (Lemmas 4 through 8 in the appendix), and therefore the smallest surrounding rectangle has sides of length $L \pm O(\sqrt{LE})$, and there is a square of sides $L - O(\sqrt{LE})$ contained in $R$ (Lemmas 9 and 10). We then consider a string passing through the center of this rectangle, and prove that it is "relatively straight," proceeding without too many bendings, from one end of the square to the opposite (Lemma 11 and Corollary 12). We then prove that any string that passes through a narrow horizontal strip traverses the square from its top to the bottom side, and in fact that almost all strings so traverse the square (Lemmas 13 and 14 and Corollary 15). It follows that the folding is the intended one, and corresponds to a Hamilton path in $G$.

3 The string folding problem

In this section we show that the string folding problem (the special case of the multistring problem with $|S| = 1$, which captures the protein folding problem in the 2-dimensional HP model) is also NP-complete.

Fix a planar graph $G$. Two Hamilton cycles are called orthogonal if they have the following property: Their disjoint union (where we duplicate edges in their intersection) is a degree-4 planar graph with multiple edges which can be embedded in the plane in such a way that the edges around each node alternate between the two cycles.

Let us call a planar graph special if it consists of disjoint faces with nodes of degree three, connected together by paths of length two, and becomes triply connected if all nodes of degree two are collapsed. See Figure 3 for an example.

Figure 3: A special planar graph.

Theorem 2 The Hamilton cycle problem remains NP-complete even if restricted to special planar graphs.

Proof: The reduction from exact cover to planar Hamilton cycle in [7] produces a special planar graph, if the 2-input and 3-input "or" gadgets are replaced by the ones shown in Figure 4.

Figure 4: New 2-input and 3-input gadgets.

Suppose that a graph contains the diamond graph depicted in Figure 5 (ignore the node $G$ standing for the rest of the graph). The diamond graph has four endpoints, denoted N, S, E, W, whereby it is connected to the rest of the graph. Any Hamilton cycle of the overall graph must traverse the diamond either from N to S, or from E to W (but not, e.g., from E to N). Figure 5 depicts the two Hamilton cycles of the diamond graph (plus another node $G$); they are orthogonal because, by duplicating the three paths of length two, one obtains a degree-four graph around each node of which edges of the two Hamilton cycles alternate.

Figure 5: The diamond graph. (Labels in the figure: N, W, E, S, G.)

Theorem 3 The string folding problem is NP-complete.

Proof: We start from the Hamilton cycle problem for special planar graphs. Given any special planar graph $G$, we shall modify the graph so that it contains a "standard" Hamilton cycle $H_0$, such that any Hamilton cycle of the original graph corresponds to a cycle of the modified graph that is orthogonal to $H_0$.

Starting from $G$ and its embedding, take only the degree-2 nodes of $G$, and consider two such nodes adjacent if they are on the same face of the embedding. Since the original graph is special, the resulting graph $G'$ is connected. Consider thus a cycle $C$ of $G'$ (allowing repeated nodes but no self-loops) that visits all nodes of $G'$ at least once. Let $C = (v_0, v_1, \ldots, v_k = v_0)$. If the two nodes adjacent to an occurrence of $v$ are on the same face, repeat that occurrence twice. Now for each node $v$, count the occurrences of $v$ on $C$ and let $a(v)$ be the resulting number.

Replace now each degree-2 node $v$ of $G$, and its adjacent edges, by $a(v)$ copies of the diamond; the copies are disjoint, and the N and S nodes of the two outermost ones (or the unique one, if $a(v) = 1$) coincide with the nodes of $G$ adjacent to $v$; see Figure 6. For $i = 0, \ldots, k - 1$, suppose the $i$th element of $C$ is the $b_i$th occurrence of node $v_i$ (let $b_k = 1$). Then, for each $i = 1, \ldots, k$, join the E or W node of the $b_{i-1}$th copy of the diamond replacing node $v_{i-1}$, whichever has not been considered up to now, with the E or W node of the $b_i$th copy of the diamond replacing node $v_i$, whichever is in the same face with the previous node (for $v_0$, we start with the endpoint, E or W, that is on the same face as $v_1$, if $v_1 \ne v_0$, and if $v_1 = v_0$, we start with the endpoint which is not on the same face as $v_2$).

Figure 6: Replacing a degree-2 node by diamonds. (The example in the figure has $a(v) = 3$.)

Notice that these new edges do not harm the planarity of the graph, and they, together with the E-W traversal of the diamonds, form the standard Hamilton cycle, $H_0$, of the resulting graph $G''$. $H_0$ is the only Hamilton cycle of $G''$ utilizing an E-W traversal of the diamonds. Any Hamilton cycle utilizing an N-S traversal must correspond to a Hamilton cycle of the original graph $G$. It is easy to see that any such cycle will be orthogonal to $H_0$ because the E-W and the N-S traversals of the diamond are orthogonal.

We shall now construct the instance of the string folding problem. We take any degree-2 node and replace it with two degree-1 nodes, and make these nodes the endpoints of the Hamilton path sought; $H_0$ becomes a Hamilton path as well. We now perform Trevisan's transformation having deleted the E-W edges of the graph (that is, the endpoints of these edges have large Hamming distance in the Trevisan code). We then perform the multistring reduction, with the following modifications: The number of strings corresponding to each city, $L/n$, is odd. This is trivial to accomplish by adding one string to each set. All strings corresponding to the same city are connected together in one string, by ordering them arbitrarily, and connecting the end of the suffix of string $2i - 1$ to the end of the suffix of the string $2i$, and the beginning of the prefix of string $2i$ to the beginning of the prefix of string $2i + 1$, $i = 1, \ldots, \frac{L/n - 1}{2}$. Finally, all of these $n$ long strings are connected together in the order suggested by the Hamilton path $H_0$ by long (of length $2L^4 + 2L^2$) strings of zeros.

We claim that there is a folding with $E$ losses if and only if the original special graph had a Hamilton cycle. Suppose that indeed there is a folding with $E$ or fewer losses. By the precise same argument as in the proof of Theorem 1, there is a Hamilton path in the graph $G''$ that does not utilize the E-W edges, and hence there is a Hamilton cycle in the original graph. Conversely, suppose that $G$ did have a Hamilton cycle. Then $G''$ has a Hamilton cycle $H$ other than $H_0$, and in fact one that is orthogonal to $H_0$. But this means that we can arrange the $n$ strings corresponding to the cities as in the intended folding in the proof of Theorem 1, joined together as they are via their prefixes and suffixes in the order suggested by $H_0$, because $H_0$ is orthogonal to $H$.

4 Conclusion and further work

Our NP-completeness result settles in the affirmative the widespread conjecture that the protein folding problem, even in its most simple yet realistic two-dimensional HP simplification, is NP-complete.

By a simple modification of our techniques (and, in fact, one that is free of the planarity complications of the present proof) we can show that the three-dimensional version of the protein folding problem in the HP model is NP-complete. The same result was proven independently, and a few months earlier, by Berger and Leighton [1]. The appropriate modification of our proof is this: The string consists of roughly $L \times L \times L$ ones, which form a cube protected from all sides (except for the eight corners, where 8 losses must occur) by zeros. The Hamilton cycle problem is encoded in the part of the cube that lies just below one particular face of the cube. This part is traversed in one direction by "tubes" whose cross-section is a square of four zeros. Mismatches in the alignment of these tubes correspond to mismatches in the Trevisan code of the underlying graph, and contribute the only extra losses, beyond the eight mandatory ones. Thus the intended folding has a total of $4E + 8$ losses, and the steps in establishing the converse are analogous to the ones in the present proof.

References

[1] B. Berger, F. T. Leighton. Manuscript submitted to J. Mol. Biol., July 1997.
[2] H. S. Chan, K. A. Dill. The protein folding problem. Physics Today (1993), pp. 24-32.
[3] K. A. Dill. Dominant forces in protein folding. Biochemistry 29 (1990), pp. 7133-7155.
[4] K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, H. S. Chan. Principles of protein folding: A perspective from simple exact models. Protein Science 4 (1995), pp. 561-602.
[5] A. S. Fraenkel. Complexity of protein folding. Bulletin of Mathematical Biology 55 (1993), pp. 1199-1210.
[6] W. E. Hart, S. Istrail. Fast protein folding in the hydrophobic-hydrophilic model within three-eighths of optimal. Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, 1995, pp. 157-168.
[7] D. S. Johnson, C. H. Papadimitriou. "Computational Complexity," in The Traveling Salesman Problem, edited by E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, D. B. Shmoys. Wiley Interscience, 1985.
[8] J. King. Deciphering the rules of protein folding. Chemical Engineering News 67 (1989), pp. 32-54.
[9] A. Nayak, A. Sinclair, U. Zwick. Spatial Codes and the Hardness of String Folding. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, to appear.
[10] J. T. Ngo, J. Marks, M. Karplus. Computational complexity, protein structure prediction, and the Levinthal paradox. In The protein folding problem and tertiary structure prediction, edited by K. Merz and S. Le Grand. Birkhäuser, Boston, 1994.
[11] M. Paterson, T. Przytycka. On the complexity of string folding. Discrete Applied Mathematics 71 (1996), pp. 217-230.
[12] L. Trevisan. When Hamming Meets Euclid: The Approximability of Geometric TSP and MST [Extended Abstract]. Proceedings of the 29th Annual ACM Symposium on the Theory of Computing, 1997, pp. 21-29.
[13] R. Unger, J. Moult. Finding the lowest free energy conformation of a protein is an NP-hard problem: proof and implications. Bulletin of Mathematical Biology 55 (1993), pp. 1183-1198.
the point will be called empty. A point p 2 Z 2 is said to have a loss within distance d (vertical distance d) if there exists a loss involving two points within distance d (vertical distance d) of p. It will be helpful to visualize R not only as a collection of points but also as a collection of continuous regions. We include points surrounded by a discrete curve of internal points in R so that R will not contain any \holes. the point is bent. then. on the boundary of R which is not one of the 4L 4 intended boundary points. denoted Bdary (A). where for connectivity we allow only vertical or horizontal edges. for a point of R which is not internal must be surrounded by a curve of internal points and hence cannot be on the boundary. where. strings passing through RC are approximately straight and parallel.A The converse
In this section we prove that if there is a folding of the strings in S with at most E losses. We will prove that RC looks very much like an L L square and that. and call a point in Z 2 a zero or a one if it has a preimage which is. the rst inequality is clear. a zero or a one. denoted by the functions Perim() and Area(). then subsets of R may be said to have perimeter and area. that there is an internal point. Suppose. Finally. p. A point q 2 Z 2 is said to be within distance d (vertical distance d) of a point p 2 Z 2 if q is reachable from p using a path of at most d vertical and horizontal edges (at most d vertical edges and any number of horizontal edges). To prove the second inequality. respectively. We make a few further de nitions before we state our rst lemma. two points (x1. The rst lemma follows:
Lemma 4 There exists a positive constant c such that jBdary(RC )j jBdary(R)j 4L + c E 4:
1 1
Proof: Since any boundary point of RC is also a boundary point of R. or if jx1 x2 j = 1 and jy1 y2j = 1 that is. We prove that there
i
. y1) and (x2 . to consist of all points in A adjacent to at least one point not in A. we rst note that all points on the boundary of R must be internal points. By visualizing each point contained in R as a unit square centered at the point and parallel to the axes. a set of points is said to have a loss within distance d (vertical distance d) if it contains a point which has a loss within distance d (vertical distance d). Our immediate goal will be to prove lemmas bounding the perimeter and area of RC . Let f be a folding of the strings in Z 2 such that the resulting number of losses is less than or equal to E . then there is a Hamilton path from v1 to vn in G. Since the embedding is injective. Consider the region R in Z 2 which consists of all internal points (points in the L L square in the intended folding) as well as all points surrounded by a discrete closed curve of internal points. We begin by bounding the perimeter. we allow diagonal edges. otherwise. for the purposes of a discrete curve. De ne the boundary of a region A." Let RC be the largest connected subset of R. Then the perimeter of any subset of R consisting of connected components of R is equal to the number of boundary points it contains plus the number of convex corners formed by points on the boundary of the region. We call an internal point straight if its two neighbors lie in the same vertical or horizontal line. y2) in Z 2 may be joined if they are adjacent (joined by a vertical or horizontal edge in the lattice Z 2 ). for the most part. we will identify bits in the strings with their images in Z 2 . otherwise.
We show that in every potential con guration involving z these assumptions lead to a contradiction. be a loss within distance 6 of z . then it must either be contained in the larger substring 1010 or 0101. However.ank internal zero. which is not an internal point. that p must have a loss within distance 6 or an intended corner within distance 2. empty or a noninternal zero. in fact. z. Since p is on the boundary and. where the isolated pipe represents the region boundary. in addition. for there must. We next state a lemma involving non. If q is a zero.must either be a loss within distance 6 of p or an intended corner within distance 2 of p. which is not adjacent to two other non. the picture of the neighborhood around p (up to rotation) must be as follows: 1 1 1 p 1. then there must be a loss within distance 6 of z . Since there are at most a constant number of points within distance 6 of the two points involved in a loss and there are at most E points (for n 3) within distance 2 of an intended corner.
. both lead to losses within distance two of p. We have completed showing that there must either be a loss within distance 6 of p or an intended corner within distance 2. thus. stated below. it follows from Lemma 5.ank Proof: Suppose that there is a bent non. that there is no loss within distance 6 of z . Since p is a non. cannot be adjacent to two other internal zeros. the picture of its neighborhood under folding f (up to rotation) must be as follows:
ii
internal zeros nor within distance 2 of an intended corner.ank internal zeros nor within distance 2 of an intended corner. Point p must be adjacent to a point. the two possibilities for the point between the two uppermost ones. Case 3: p is a bent internal zero. then there is a loss within distance one. There are three cases to consider: Case 1: p is a one. Assuming there is no loss within distance 6 of z .ank internal zeros referred to above. Suppose.
Lemma 5 If z is a bent non. The assumption of no intended corners close to z implies that if there is a one within distance 2 of z which is contained in the substring 010. then there must also be a loss within distance one because no one which is not on the intended boundary has a noninternal zero as a neighbor in any of the strings. we can set c1 equal to the constant plus four and the lemma follows.ank zero and there is no loss within distance 6. If q is empty. q .ank internal zero which is not adjacent to two other non. Case 2: p is a straight internal zero.
The same separation is true for internal zeros and pre x or su x zeros.ank internal zero and the other adjacent zero is not a bent noninternal zero. a contradiction. the picture must be as follows: 1X
j
0 1
0 z 1 1
X
j j
1 1: No matter where the zero following the starred one is placed. this case is impossible. The necessary picture follows: 1 1
j j
0 0 1 0 z 1 1
j j
1 1: It should be clear that the adjacent zero to the left of z cannot be straight (there are no maximal substrings of zeros of length two).ank internal zero is the upper adjacent zero. iii
. Next. there will be a loss (located at one of the two X 's). Since there are no losses within distance 6.ank (though on anks they may be separated by some groups of 4 zeros). Assume without loss of generality the bent non.0 1 0 z 1 1
j j
1 1: The reason for the appearance of the ones other than the two neighbors of z is that there are always at least eight ones which lie between internal zeros on any string. ank or non. Since we assumed it was not a bent noninternal zero and it is not a bent internal zero. Case 2: One of the adjacent zeros is a bent non. This fact will play a role in almost all of the pictures of con gurations which follow in the proof of this lemma and in the proof of Lemma 6. we consider the classi cations of the zeros lying adjacent to z . Note that the two zeros cannot both be straight. Case 1: One of the adjacent zeros is a bent ank internal zero: Assume without loss of generality that it is the upper adjacent zero.
We next examine the potential locations for the zero following the zero marked with the superscript one (zero 1). It must be contained in one of the groups of 4 zeros on a ank. Assume it is the upper zero. 02
j
01 0 1 0 1 0 z 1 1
j j
j j
1 1 We examine the potential locations for the zero following the zero marked with the superscript two (zero 2).Only one case remains: Case 3: One of the adjacent zeros is a bent noninternal zero. as appears below. Case 3A: The zero following zero 1 goes up. The picture must be the following: 01 0 0 1 0 z 1 1
j j
j j
1 1: From the picture it should be clear that the upper zero adjacent to z cannot be a prefix or suffix zero due to our assumption about intended corners. Case 3Ai: The zero following zero 2 goes left: 1 0 02 1 0 01 0 1
j
X j j j
0 0 0 1 0 z 1 1
j j
1 1: There must be a loss at the X. As well, the zero which is a neighbor to the 1-neighbor of the upper zero may also not be a prefix or suffix zero.
.
Case 3Aii: The zero following zero 2 goes up:
1 1
j j j j
02
0 0 01 0 13
0 1 0 z 1 1
j j
j j
1 1:
Case 3Aiia: The zero following one 3 goes up:
1 1
j j j j
02
1
0 0 0 0 1 1
3
j
0 0
j
0 1 0 z 1 1
j j
j j
1 1:
This configuration is impossible because, no matter how the picture is completed, the two starred zeros must belong to the same string, but no string contains either a maximal substring of zeros of length three or the substring 10101. Case 3Aiib: The zero following one 3 goes right:
Case 3Bii: The zero following zero 4 goes left: 1 0 04 01 0 1 0 0 1 0 z 1 1
j j
j j
1 1:
. Case 3B: The zero following zero 1 goes left: 04 01 0 0 1 0 z 1 1
j j
j j
Case 3Bi: The zero following zero 4 goes down:
1 1:
1X 04 01 0 1 0 0 1 0 z 1 1
j j j
j j
1 1: There must be a loss at the X . this con guration is impossible because the two starred zeros must be in the same string. but no string contains a maximal substring of zeros of length two.1 1
j j j j
0 0 02 0 1 01 0 13 0 0 1 0 z 1 1
j j
j j
1 1: Similarly to the previous subcase.
ank internal zero which is not adjacent to two other non. the assumption of no nearby intended corner implies any one within distance 4 of the corner point which is contained in the substring 010 must be contained in the larger substring 1010 or 0101. there exists a positive constant.
Lemma 6 If there is a concave corner on the boundary of R. Thus. this implies the loss is within distance 7 of the corner point. contradiction. again. which is the nal step toward limiting Perim(RC ). In all cases. we show these assumptions must lead to a contradiction. we showed that under our assumptions the potential con gurations must. However. c2 . Case 3Biii: The zero following zero 4 goes up: 1 1
j j j
0 0 0 0
04 01 0 1
X
j j
0 1 0 z 1 1
j j
1 1: There must be a loss at the X . The following lemma. We represent this con guration using the following diagram: vii
Proof: Suppose there is a concave corner on the boundary of R such that no intended corner is
. Also similarly to the proof of Lemma 5. Case 1: Suppose two ones form the sides of the concave corner. the con guration is impossible because the two starred zeros must be contained in the same string in a maximal substring of zeros of length three. As in the proof of Lemma 5. in fact.ank internal zero may be a side of the concave corner.ank internal zeros). a bent non. bounds the number of concave corners on the boundary of any subset of R. either be impossible or contain a loss within distance 6 of z (all points shown in the gures were within distance 6 of z ). for since there can be no intended corner in distance 2 of the zero (and it is not adjacent to two non. We have completed our case analysis. Consequently.Here. We perform a case analysis based on the classi cation (zero or one) of the sides of the the concave corner. then there must either be a loss within distance 7 of the point of R at the corner or an intended corner within distance 4 of the corner point. Note that no bent non.
within distance 4 of the corner point and there are no losses within distance 7 of the corner point. there must be a loss within distance 6 of the zero. such that there are at most c2 E concave corners on the boundary of any subset consisting of connected components of R.ank internal zeros nor within distance 2 of an intended corner must have a loss within distance 6.
both lead to a loss within distance 2 of the corner point. the necessary picture is the following:
1 0
j j0 1
1
1
j
1: Case 2A: The zero following one 1 goes right:
1 00 1 0 0
j j0 1 0
1
1
jj j
jj j
0:
11 1 0
00 0 One of the starred zeros must be followed by a one. Case 2: Two zeros form the sides of the concave corner. either empty or a noninternal zero. The two possibilities for the classi cation of the point not in R which is adjacent to the two ones. contradiction. so there must be a loss within distance 4 of the corner point.1:
j1
The two pipes represent where the ones meet the perimeter of R. Case 2B: The zero following one 1 goes down: 1 1 0
j0 1 1 j j
1
j
j
0 0
1 0 1 0
. contradiction. Since there is no loss within distance 7 of the corner point.
but then there would be a maximal substring of zeros of length two. Case 3A: Suppose the zero is straight. for otherwise there would be a loss as in Case 2A. Case 3Aii: The zero following zero 2 is right: ix
. Thus the con guration is impossible. Case 3: A zero and a one form the sides of the concave corner. One of the starred zeros must be bent.Note the starred zero not adjacent to one 1 must go to the right as shown. The necessary picture follows:
1 1 0 1 1 1 0 0 1 1 0 0 0j + 0 1 1 0 0 0 02 1 0 0 1:
j
j
j j
j
The plus sign denotes the perimeter as well as the link that crosses it. Case 3Ai: The zero following zero 2 is up:
1 1 0 1 1 1 0 X1 0 1 1 0 0 1 0 0j + 0 1 1
j
j
j j j
0 0 0 02
j
1 0 0 1:
There must be a loss at the X .
Case 3B: Suppose the side zero is bent so that its neighbors go up and to the right. Case 3Aiii: The zero following zero 2 is down: 1 1 0 1 1 1 0 0 1 1 0 0 0 j 0 0 0 02 X 1 + j j 0 1 1 0 0 0 1 1 1 0:
j
j
j j
j
Again.1 1 1X 0 0 1 1 1 0 0 0 1 1 0 0 0j + 0 1 1
j
j
j
1 0 0 1:
j
0 0 0 02 0 1
j j
0 1
j
Again. nally. The necessary picture follows: 0 1 0 0j 0 1 X + 1 1 0: There must be a loss at the X . there must be a loss at the X . the side zero is bent so that its neighbors go right and down: x
j
. Case 3C: Suppose. there must be a loss.
00 0
j j j j
0:
0 j0 13 +j 11 1
Case 3Ci: The zero following one 3 goes right: 1 0 0 j0 13 0 0 +j 11 1 0
00 0
j j j j
0:
There must be a loss within distance 4 of the corner point due to the one which must follow one of the starred zeros. Case 3Cii: The zero following one 3 goes up: 0 0 0
0 0 j0 13 1 j +j 1 1 1 14 00 0
j
j
j
j j j j
0:
Case 3Ciia: The bit following one 4 is a zero: xi
.
Case 3Ciib: The bit following one 4 is a one: 0 1 0 0 1
0
0 0 j0 13 1 1 j +j 1 1 1 14 1 1 00 0 1
j
j j
5
j j j j
0:
Case 3Ciib1: The zero following zero 5 is up:
0 0 j0 13 1 1 j +j 1 1 1 14 1 1: Case 3Ciib1A: The zero following zero 6 is left: xii
j
0
j
06 0 1 1
j j
05 0 1 1
.0 0 0 0 j0 13 1 1 1 j +j j jj 4 1 1 1 1 0 00
j
0 0 00
j
j j jj j
j j j j
0
00 0 0 1
j
0X 1:
There must be a loss at the X .
Case 3Ciib1Aii: The zero following zero 7 is up: 1 1
j j
j j j j
0 0 07 06 0 1 1 0 0 j0 13 1 1 j +j 1 1 1 14 1 1: The two starred zeros must lie in the same string. the con guration is impossible. Case 3Ciib1B: The zero following zero 6 is up: xiii 0 05 0 1 1
j
j j
. but no string contains either a maximal substring of zeros of length three or the substring 10101. Thus.07 06 0 1 1 0 0 j0 13 1 1 j +j 1 1 1 14 1 1: Case 3Ciib1Ai: The zero following zero 7 is left: 1 1 0 0
j
0
j
j j
05 0 1 1
j X j
0 07 06 0 1 1
j
05 0 1 1
0 0 j 0 13 1 1 j + j 1 1 1 14 1 1: There must be a loss at the X .
j
0
j j
05 0 1 1
.0 0 0 0
j j j
8
j
0 0 j0 13 1 1 j +j 1 1 1 14 1 1: Case 3Ciib1Bi: The zero following zero 8 is left: 0 1 0 0 0 0 1
j
0
j
06 0 1 1
j j
05 0 1 1
j j j j j j
8
j
X
0
06 0 1 1 05 0 1 1
0 0 j 0 13 1 1 j + j 1 1 1 14 1 1: There must be a loss at the X . Case 3Ciib1Bii: The zero following zero 8 is up: 1 1 0 0
j
j X j j j
08 0 0 06 0 1 1
j j j
0 0 j0 13 1 1 j +j 1 1 1 14 1 1: There must be a loss at the X .
Case 3Ciib2: The zero following zero 5 is left: 09
0 0 05 0 1 0 0 j0 13 1 1 j +j 1 1 1 14 1 1: Case 3Ciib2A: The zero following zero 9 is right: 1 1
j
0 1
j
j j
X j
0 0 05 0 1 0 0 j0 13 1 1 j +j 1 1 1 14 1 1: There must be a loss at the X . Case 3Ciib2B: The zero following zero 9 is up: 11 1
j
09 0 0 1
j
j j
j j j j
09
00 0 0 0 05 0 1 0 0 j0 13 1 1 j +j 1 1 1 14 1 1: This configuration is impossible because the two starred zeros must belong to the same string, but no string contains a maximal substring of zeros of length three (or the substring 10101). Case 3Ciib2C: The zero following zero 9 is left:
j
0 1
j
j j
.
by limiting ourselves to the connected region RC . such that
$\mathrm{Perim}(R_C) \le 4L + c_3 E.$
corners.
j j
Corollary 7 There exists a positive constant. c .ank strings. c . we showed that under our assumptions the potential con gurations must either be impossible or contain a loss within distance 7 of the corner point (all points shown in the gures were within distance 7 of the corner point). Finally. This completes our case analysis. The corollary. Using Lemmas 4 and 6. in particular. a concave corner on the boundary without an intended corner within distance 4 of the corner point must have a loss within distance 7 of the corner point. such that
$\mathrm{Area}(R_C) \ge L^2 - c_4 E.$
More speci cally. Consequently.1 1
j X j j
0 09
0 1
0 0 0 05 0 1 0 0 j 0 13 1 1 j + j 1 1 1 14 1 1: Again. a contradiction. In all cases. we do not lose many internal points. we nd a loss. Since there are at most E points within distance 4 of an intended corner (for n 4) and there are at most a constant number of points within distance 7 of the two points involved in a loss. RC contains all the internal substrings of the non. setting c2 equal to the constant plus four. and there
. we prove that. each component of RN must contain at least L points. We now prove a lower bound on Area(RC ) and. we may take c3 = c1 + c2. Since all the internal points from a non.ank strings and all but at most c4 E internal points contained in the anks. follows from the observation that the perimeter of RC is equal to the number of boundary points plus the number of convex corners it contains.
Proof: Note that the number of convex corners contained in RC is four plus the number of concave
Lemma 8 There exists a positive constant. any subset consisting of connected components of R has at most as many concave corners as R. then.
Proof: Let RN consist of the connected components of R which contain the internal points of the L 2 non. This completes the proof of the lemma. we conclude that there are at most c2 E concave corners on the boundary of R.ank string must lie in the same connected component.
since any boundary point of RN is also a boundary point of R.
4
Proof: Let a = maxfjx1 x2j + 1 : (x1. and thus Area(RC ) L2 2L. RN contains only one connected component. where c LE
c4 LE . Then perim(RC ) < 4L + c3 E (c3 +1)E = 4L E since we have lost the use of more than (c3 +1)E intended boundary points. and the number of convex corners contained in RN is at most 4x plus the number of concave corners contained in RN . We next show that RC is. Since 4 L 16 > (c1 + c2 )E for n 10. in fact. using Lemmas 4 and 6. we see that 4L + c1E 4 + 4x + c2 E is an upper bound on Perim(RN ). On the other hand. Suppose RN consists of x L 2 connected components. we nd a lower bound on Perim(RN ) by making use of the fact that any p component of area A must have perimeter greater than or equal to 4 A. which leads to a contradiction since E 8 for n 2. A lower bound on p Perim(RN ) is. y1). for a positive integer x. y2) 2 RC g L.are at most L 2 such components. (L + a)(L + b) L2 c4E ) (a + b)L + ab xvii
c4 E:
. Therefore. y2) 2 RC g L and let b = maxfjy1 y2j + 1 : (x1. we derive the following inequalities:
1. Suppose more than (c3 + 1)E internal ank points are not contained in RC . Comparing the area and perimeter of the bounding rectangle to those of (the continuous version of) RC . We show RN RC by proving that. (4 L 8)x 4 L + (c1 + c2 )E . it must be true that q p 4x + 4L + (c1 + c2 )E 4 4(x 1) L + 4 L2 (x + 1)L p 4(p 1) L + 4(L (x + 1)) x p = (4 L 4)x + 4L 4 L 4: p p p Consequently. we must have RN RC . Further. given by 4(x 1) L + 4 L2 (x + 1)L. in fact. since RC is the largest connected component of R. thus. Therefore.ank strings are contained in RC . meaning the internal substrings of the non. b
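The comparison of upper and lower bounds on Perim(RN) uses the fact that a lattice region of area A has perimeter at least 4 sqrt(A). The sketch below is only a sanity check of that fact on small examples. It assumes the perimeter is measured by counting exposed edges of unit cells, which may differ in detail from the definition of Perim used in the paper, and the names `edge_perimeter` and `meets_bound` are hypothetical.

```python
import math


def edge_perimeter(cells):
    """Count unit edges that separate a cell in the set from a cell outside it."""
    cells = set(cells)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return sum(
        (x + dx, y + dy) not in cells
        for (x, y) in cells
        for (dx, dy) in steps
    )


def meets_bound(cells):
    """Check the bound perimeter >= 4 * sqrt(area) for one region."""
    cells = set(cells)
    return edge_perimeter(cells) >= 4 * math.sqrt(len(cells))


# A 3 x 3 square attains the bound; a 1 x 4 strip exceeds it.
square = [(x, y) for x in range(3) for y in range(3)]
strip = [(x, 0) for x in range(4)]
print(edge_perimeter(square), meets_bound(square))
print(edge_perimeter(strip), meets_bound(strip))
```

The square is the extremal case, which is why splitting RN into several components forces the total perimeter above the available upper bound.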
p Lemma 9pThe smallest rectangle containing RC has sides of length L+a and L+b. we still must have p Perim(RC ) 4 L2 2L 4L 8. we must have x = 1. Using this inequality to nd a lower bound on the perimeter. its area is at most L2 + c5LE . (x2. Comparing upper and lower bounds. However. where c5 is a positive constant.
a. First. the limiting situation is when x 1 components are as small as possible and the remaining points arep contained in the last component. From the de nitions it should be clear that the smallest rectangle containing RC has sides of length exactly L + a and L + b. we can nd an upper bound on Perim(RN ) by nding upper bounds on jBdary (RN )j and the number of convex corners contained in RN . very similar to an L L square by bounding the dimensions and area of the smallest rectangle containing RC and the dimensions of a square strictly contained in RC . (x2. The lemma follows by taking c4 = c3 + 1. y1 ). Clearly jBdary (RN )j jBdary (R)j.
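The quantities a and b in this proof measure how far the bounding rectangle of RC deviates from an L x L square. A minimal sketch of that definition follows, assuming RC is given as a set of integer lattice points and reading the garbled definition as a = max(|x1 - x2| + 1) - L and b = max(|y1 - y2| + 1) - L; the function name `rectangle_excess` is hypothetical.

```python
def rectangle_excess(points, L):
    """Return (a, b) such that the bounding rectangle has sides L + a and L + b."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs) + 1   # equals max |x1 - x2| + 1 over all pairs
    height = max(ys) - min(ys) + 1  # equals max |y1 - y2| + 1 over all pairs
    return width - L, height - L


# An exact 3 x 3 square has no excess; one extra point on the right gives a = 1.
square = {(x, y) for x in range(3) for y in range(3)}
print(rectangle_excess(square, 3))
print(rectangle_excess(square | {(3, 0)}, 3))
```

Either value may be negative, which is why the proof treats the sign cases of a and b separately.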
and p 4 . SQ1 and the 4 points are pictured in Figure 7. Since RC contains at least L2 c4E internal points by Lemma 8. Taking c5 = c23 + c2 . we derive:
Since ( c23 pLE + c23 E )( c23 LE ) = c43 LE + c43 E LE c23 LE + c4E . which implies b c4 LE using inequality 1. it is certainly true that c4 E a. regardless of the signs of a and b. At this point.) Now consider the a square. (xi . since there are no holes in RC . SQ2 (also pictured in Figure 7).0) p with sides of length L c6 LE . jyij 2 (L c4 LE ) 2c5LE .0). we must have a c23 LE + p c3 E c LE . Substituting bounds for a + b and b in inequality 1 using inequality 2. for each i 2 4]. 2(L + a + L + b) 4L + c3E ) a + b
c3 E:
2
If a and b are both positive or both negative. This isp p clearly impossible due p Corollary 7. We note for later use that ab c2E 2. SQ. We may now state a lemma which nds a large square strictly contained in RC . centered at (0. in RC y (x 1 such that. (x2. it will help to have the bounding rectangle and RC oriented in the coordinate plane. then. (x3. we must have that Perim(RC ) Perim(SQ2) = 4L 4(c4 + 2 2c5) LE . Now let us 4 assume without loss of generality that a is positive and b negative. the 4 4 bound stated in the lemma follows. clearly any region contained in the bounding rectangle of area 2c5LE must contain an internal point contained in RC . That is. Moreover. 4 2 To verify the bound on the area of the rectangle.2. where c6 = 29c5. the square. then there must be a point of RC contained in each of those 4 squares. In particular. p3). Since SQ1 is contained p in the bounding rectangle we know that if we consider the 4 squares with sides of length 2c5LE (and area 2c5LE ) which share a corner with SQ1. (x1. and since c4 > c3. Since RC is connected and we have found the points (xi. if therep pa point in the interior of SQ2 but not in RC which is at a distance of more than exists 4(c4 +2 2c5) LE +c3E from all the sides of SQ2. is contained in RC . we must have Perim(RC ) > 4L+c3 E . y1). for a constant c5. b c4 E .
p
c3 EL + a( c3 E a) 2 2
p
c4 E ) a(a c23 E ) c23 LE + c4E:
2
2
p
p
p Lemma 10 For a positive constant c . any square p sides of p to p with length less than L (c4 +2 2c5) LE 2(4(c4 +2 2c5) LE +c3 E ) = L (9c4 +18 2c5) LE 2c3 E centered at the origin must be contained in RC . which has sides of length L (c4 +2 2c5) LE and is centered at the origin. we note that (L + a)(L + b) = L2 +(a + b)L + ab L2 + c23 LE + c2E 2 using inequality 2 and the bound on ab given above. yi) is in quadrant i and jxi j.0).
tained in RC .
. p p i 2 4]. y2). y4). there exists a square with sides of length L c LE con6 6
Proof: We proved in the previous lemma that the smallest rectangle containing RC has area less than or equal to L2 + c5 LE . p p Consider the square. there exist 4 points. (This is just p translation of the the embedding f . SQ1. Let us assume without loss of generality that one of the up to 4 potential central points of the bounding rectangle is (0. yi ) 2 RC . Thus. with sides of length L c4 LE which is centered at (0.
Figure 7: SQ1, SQ2, and 4 points of RC.
There are then 3 possibilities for the way in which string s exits SQ: it either exits at a single side (forms a Ushape).ank strings: there are 3 dense substrings of length 10 . p Consider the square with side length 2c5LE which is centered at the origin. This square must contain an internal point. the middle of which begins with bit 3L number 10 + 1 of the internal substring. forcing its perimeter to be too high. L p LE must be used to get from one side of c6 the square to the other. For purposes of illustration we'll assume s exits through the top side of SQ. of RC . leaving a slack of size only c6 LE . facing bits 3L + 1 through 25L 10 p p on the left must be ones. so we have found a non.ank string passing near the origin which is approximately straight. we see that bits 3L + 1 through 25L may only be opposite bits in the range 10
7L + 1 (c + 2p2c )pLE. again because the string (c6 +2 5 p p L has slack only at most (c6 +2 2c5) LE .We now move closer to our goal of showing that many strings pass through RC in an approximately straight and parallel fashion by rst showing that there exists a non. the bits on the opposite side of the U. 4L + (c + 2p2c )pLE ]: 5 6 6 5 10 5
L We may assume (c6 + 2 2c5) LE 180 E 1. containing p cannot be a ank because we would lose the use of too many intended boundary points inside SQ.ank string has length L.
Proof: It should be clear that the second sentence follows from the rst since the internal substring p of a non. Now. s. the string must then be relatively straight. Recall the structure of the L non. if there is no loss in the window containing a vertical zero and its neighbors and the three points adjacent to them on the right. The string must have at least L (c6 + 2 2c5) LE bitsp pSQ. for illustration. 0) which exists exits SQ within O( LE ) distance of the centers of two opposite sides. which is true for n 10. then the point adjacent to the vertical zero on the right must also be a vertical zero. at two points. in particular. We claim that no matter how s is folded into a Ushape. Point p is at a distance of at least 2 (L c6pLE )p 2c5LE from each side of SQ.
p Lemma 11 There p a non.ank string p p 1 passing near (0. we derive that the anks must lie at opposite sides of RC . Of the L bits. Suppose.0). SQ. p. As a corollary to the placement of this string. exits at adjacent sides (forms an Lshape). The string. and inside by evaluating the two limiting con gurations of the U (where the at most (c6 + 2 2c5) LE bits are outside SQ). Note that it is impossible for any pre x or su x zero to be in RC since that would imply the whole pre x or su x is in RC . then. 0) exits SQ (not to return) are near the centers of two opposite sides of SQ. the right side. Since a side of SQ has p p
length L c6 LE . at most that many zeros out of the 90 in the middle dense substring may be horizontal or bent.ank zero without a loss
p
p
. We prove the rst two cases are impossible. Next. or exits at opposite sides as desired. Thus s cannot exit SQ at any point farther than (c6 + 3 2c5) LE from the centers of the sides of SQ. containing O( LE ) bent points.ank string passing within an O( LE ) distance of (0. Clearly. Thus the string must exit RC and. p2c )pLE L=10 and the bits in the above range are ones. We show the two points at which the string near (0. Assuming the adjacent vertical zero is also a non. that s forms a Ushape. rst. Thus we may assume there are at least E + 1 vertical zeros.
Since a loss terminating a line may be uniquely associated with that line. one of the E + 1 lines of zeros must have no associated losses. then the pattern will continue and there will be a horizontal line of vertical zeros: 1 1 1 0 0 0 1 1 1: Thus. (This is just a rotation of the folding f . and since at most (c6 + 2 2c5) LE of the rst 180 zeros in the dense substring are vertical or bent. say exiting through the right side of SQ.
Proof: We use horizontal lines of vertical zeros to place the anks. One half of the L (for illustration we assume it exits through the top side of SQ) contains the middle. Therefore. or unless one of the zeros in the line has a loss within its window containing six points. resulting in a loss and a contradiction.) We may now place the anks:
Corollary 12 The flanks both intersect the lines $y = \pm\left(\frac{1}{2}(L - c_6\sqrt{LE}) - c_7\sqrt{LE}\right)$, where $c_7$ is a positive constant. Further, in the vertical interval stretching between the two lines, one flank lies to the left of and one flank lies to the right of the lines $x = \pm\left(\frac{1}{2}(L - c_6\sqrt{LE}) - \frac{3}{2}c_4E\right)$.
We may assume without loss of generality that string s exits at the top and bottom of SQ and thus its orientation is vertical.
this line must encounter the horizontal line of vertical zeros originating from the first half of the L. dense substring and. using a symmetrical argument. we are guaranteed that at least one of the lines will be without an associated loss in distance two. This second half contains a portion of at least 10 (c6 +2 2c5) LE 3L 40 p p L of the rightmost dense substring. We prove this is impossible by using a similar argument involving lines of zeros. the only way for a horizontal line of vertical zeros to end without a loss in vertical distance one is for it to end in a vertical flank zero:
. impossible due to the resulting loss of intended boundary points. Let us. next. However. in the vertical interval stretching between the two lines. which contains the p p L rightmost dense substring. a contradiction.in a similar window. the string cannot form an L.
that is. where c7 = 17c6. Then 3c4E points of the ank must be inside SQ.) In addition. the ank completing the lines of zeros on the left must lie to p 3 the left of the line x = 1 (L c6 LE ) + 2 c4E in the vertical interval and p ank completing the 2 1 the lines of zeros on the right must lie to the right of the line x = 2 (L c6 LE ) 3 c4E in the 2 vertical interval. RT 0. then it may not be adjacent to any other zero in another string in the collection other than the zero (if it is a zero) in bit number l of the other string.ank string which passes through the region. p p p p 1 1 Suppose that in the vertical interval 2 (L c6 LE ) + c7 LE ). we show we may assume all zeros contained in the sparse substrings of the non. which is bounded by the p two anks and the lines y = ((2cE + c + 2c ))d LE e + 2E ) must exit RT through its top and
6 7
. These facts will allow us to nd a Hamilton path using the collection of strings. both p p 1 anks intersect the lines y = ( 2 (L c6 LE ) c7 LE ). The central string s prohibits the anks from stretching across the top or bottom of RC to complete these lines. Note. so 2 that the shortest number of ones between two zeros in the sparse substring is 4(c6 + 2c7)d LE e. p p Let RT be the region bounded by the two anks and the lines y = ( 1 (L c6 LE ) c7 LE ). if the zero is bit number l of the internal substring of one of our collection of strings. They will also be important in the two lemmas and two corollary below where we nd the collection of strings. An important fact to note is that the y coordinate (vertical coordinate) of any point contained in the sparse substring of these strings is con ned to lie in an p interval of length (c6 + 2c7) LE + 1. Consequently.1 1 1 0 0 0 0 1 1 1 0: Using the rst and last dense substrings contained in the vertical string. but this implies that at least c4E intended boundary points have been lost. the pre x and/or su x of s would cause the loss of too many intended boundary points. (We will show that no pre x or su x bit may be contained in RT . 2 We now show that greater than L L non. p 2E (cd LE e + 1) (c6 + 2c7p LE.ank strings pass through (RC and) RT and exit n through its top and bottom sides. from the top and bottom of SQ. bit number l of the internal substring must lie in p c the vertical interval of length (c6 + 2c7) LE + 1 centered at L l. (2cE + c6 + 2c7))d LE e + 2E ]. This is the large collection of approximately straight and parallel strings we have been searching for. Therefore. Setting c = 4(c6 + 2p7).ank strings are straight. 2 (L c6 LE ) c7 LE ] a ank 3 is further than 2 c4 E from the closest side of SQ inside the interior of SQ. Thus. nally. Further. we can nd horizontal lines p p of vertical zeros with no associated losses at distances of no more than 8c6 LE +9(E +1) c7 LE . which is impossible. 
and very importantly.
j j j
j j j
Lemma 13 Any non. it is clearly impossible for a zero in the sparse substring to be adjacent to a zero from the same string. that the zeros and all the code bits in the sparse substrings of the strings in the collection must lie in p p p p the vertical interval. 2E (cd LE e + 1) + (c6 + 2c7) LE ] ) (2cE + c6 + 2c7))d LE e 2E. one ank must complete both of the lines on the left and the other ank must complete both of the lines on the right.
No pre x or su x bit may lie on these lines. p horizontal zero in the sparse substring must have a Any loss withinpvertical distance (c6 + 2c7)d LE e + 1 since. since no pre x or su x bit may lie in SQ.ank string passing through RT 0 . there are certainly at least p L c6d LE e 6c4E strings passing through RT . the only way for a sparse bent zero in s not to have a loss within vertical distance two (or distance 2). next. Recall that lines of
Lemma 14 There are at least $L - c_6\lceil\sqrt{LE}\rceil - 6c_4E$ strings which pass through (RC and) RT and
exit through its top and bottom sides. resulting in a loss.ank strings (contained in RT 0) exit RT through xxiii
p Proof: We prove that L c LE 6c E strings pass through the region RT 0 \ SQ contained
4
j j
. Further. It su ces to show that s has a sparse vertical zero without a loss in vertical distance one. the next bit in the line must be a one or empty. Let s0 be a non.) We verify that there is always at least one vertical zero without a loss in vertical distance one contained in the sparse substring of s.
Proof: We rst show that no pre x or su x bit may be contained in RT . (We removed an extra factor of 3c4E 2 because later it will be useful to assume these strings are far from the anks. String s0 must exit RT through two points contained in its top and/or bottom sides. eliminate the possibility of s0 forming a Ushape with respect to either the top or bottom side using the same argument as used in Lemma 11. no pre x or su x bit may be contained in RT (and we also note for future reference that there can be no intended corners inside RT ). the whole pre x or su x must be contained in the region bounded by the closest ank and side of SQ. because of the large distance between zeros noted in the paragraph before the previous lemma. since no ank string may pass through SQ further than 3 c4E from the 2 corresponding side of SQ (as noted in Corollary 12) to end the line. However. Further. Thus s0 exits RT through its top and bottom sides as desired. For then.
vertical zeros without associated losses and completed by anks are guaranteed to lie on the top and bottom sides of RT or just above and below these sides. if there is no loss within vertical distance p (c6 +2c7 )d LE e. Thus. is if it is adjacent to two other zeros: 1 1 0 0 1 0 1: However. neither of which may be ank zeros. a horizontal line of vertical zeros must be formed and we are guaranteed by the previous lemma that all strings containing the zeros which pass through RT 0 \ SQ must exit RT through its top and bottom sides. which is true for n 10. the two zeros which are adjacent to the bent zero.bottom sides. However. then a vertical line of zeros extending upward of length (c6 +2c7 )d LE e +1 must be formed. and the two lines. must be contained in the same string since non. the strings in RT 0 \ SQ containing the zeros are distinct. Finally.
6 6 4
in RT by again making use of horizontal lines of vertical zeros. this would cause the loss of too many intended boundary points contained in the closest ank. by the observations made before the previous lemma. We. Next. p p L We need only the assumption that (c6 + 2c7) LE + (4cE + 2c6 + 4c7)d LE e + 4E 10 .
The reason is that there are always two losses attributable to sparse horizontal zeros (one above the zero and one below). vj2 . However. sk2 . using similar windows. we have found our desired Hamilton path from v1 to vn . since all n nodes are represented in the collection of strings. p Thus at least L c d LE ep 6c E 8E strings pass vertically through RT such that all their zeros are straight. Assuming c d LE e + 6c E + 8E < L . it follows by symmetry that vjn = vn. must correspond to a Hamilton path in G. : : :. we note that all
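The final step reads a Hamilton path off the left-to-right order of the strings by deleting adjacent occurrences of the same node. The following sketch illustrates only that bookkeeping. It assumes nodes are given as hashable labels and G as an undirected edge list, and the names `collapse_runs` and `is_hamilton_path` are hypothetical; it does not reproduce the loss-counting argument that forces consecutive nodes to be adjacent in G.

```python
from itertools import groupby


def collapse_runs(nodes):
    """Delete adjacent occurrences of the same node, keeping one per run."""
    return [label for label, _ in groupby(nodes)]


def is_hamilton_path(path, edges, n_nodes):
    """Check that path visits every node once and follows edges of G."""
    edge_set = {frozenset(e) for e in edges}
    visits_all_once = len(path) == n_nodes and len(set(path)) == n_nodes
    follows_edges = all(
        frozenset((u, v)) in edge_set for u, v in zip(path, path[1:])
    )
    return visits_all_once and follows_edges


# Node labels of the strings in left-to-right order (illustrative data only).
order = [1, 1, 2, 2, 2, 3, 4, 4]
path = collapse_runs(order)
print(path, is_hamilton_path(path, [(1, 2), (2, 3), (3, 4)], 4))
```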
1
the sparse zeros contained in sk1 must be vertical.) We prove:
Proof: There are at most 8 points within distance 2 of a loss. Recall the Hamming distance of the two strings is equal to the number of points where a zero bit from the Trevisan code of one node corresponds a one bit from the Trevisan code of the other node (the bits are at the same position in the Trevisan code). the ank is s1 . there must be a loss involving points between the strings p within vertical distance at most (c6 + 2c7 )d LE e + 1. we are done. : : :. there is at least one representative of every node passing vertically through RT . (vj1 . The loss may also be uniquely attributed to the pair of di ering code bits.its top and bottom sides. if skl and skl+1 correspond to di erent nodes. Then there must be a line of vertical zeros extending from the code zero to the right. and only one loss may xxiv
. the number of losses involving strings skl and skl+1 and points in between must be at least the Hamming distance of the two strings.
Corollary 15 There exist greater than L
such that all their sparse zeros are vertical or horizontal. None of the zeros in the line can be in string $s_{k_{l+1}}$ and the loss ending the line must be contained in the window consisting of the last zero in the line and its neighbors and the three adjacent points to the right. flank strings which pass vertically through n
$R_T$
Therefore. Thus. there are at least $7n$ losses which can be attributed to the two strings due to the Trevisan code. there must be a vertical sparse zero without a loss in vertical distance one. Suppose there is a code vertical zero in $s_{k_l}$ which corresponds to a code one in $s_{k_{l+1}}$. that is. n
6 4
Lemma 16 It must be true that $v_{j_1} = v_1$ and $v_{j_n} = v_n$. Similarly. $v_{j_n}$). (This is just a reflection of the folding $f$. Let ($s_{k_1}$. and this line must end before or at the time it reaches $s_{k_{l+1}}$. In fact. true for $n \ge 10$. of the nodes corresponding to the strings above. and the order. Assume without loss of generality that the flank on the left side of $R_C$ corresponds to $v_1$.
6 4
L non. $s_{k_m}$) be the left-to-right order of the collection of strings in Corollary 15. Therefore. Therefore. so there are at most $8E$ bent zeros. all sparse bent zeros must have a loss within vertical distance two (and distance two). the total number of losses contained in $R_T$ is at least $7(n-1)n = E$. First. there may only be $n-1$ transitions between different nodes. the loss may be uniquely attributed to the pair of differing code bits in the two strings. if the code zero is horizontal. deleting adjacent occurrences of the same node. since the loss must involve points in the strings $s_{k_l}$ and $s_{k_{l+1}}$ and/or points in between the strings and the loss is within vertical distance one of the code zero.
Proof: We show that $v_{j_1} = v_1$. Since losses within vertical distance $(c_6 + 2c_7)\lceil\sqrt{LE}\rceil + 1$ are uniquely attributable to the sparse zeros in a string and there are at least $E + 1$ sparse zeros. and let us say $s_{k_l}$ is to the left of $s_{k_{l+1}}$. We claim that for $l \in [m-1]$. Consequently. Yet this is impossible due to the limitations on the proximity of zeros.
for then $s_1$ must have zeros in exactly the same Trevisan code positions as $s''$. we may rule out any sparse bent zeros facing toward the right using the same argument. This fact will complete the proof. we show that a sparse zero contained in $s_1$ which is (horizontally) adjacent to a point of $s''$ must lie in a vertical interval of length $(c_6 + 2c_7)\lceil\sqrt{LE}\rceil$ centered at L l. there can be no other losses outside this region. be lie between the strings $s_{k_1}$ and $s_{k_2}$. there can be no intended corner within distance one of the bent zero's neighbors (this was noted in Lemma 13). it suffices to show that at least $L - (c_6 + 2c_7)\lfloor\sqrt{LE}\rfloor$ internal points of $s_1$ with distinct $y$ coordinates are horizontally adjacent to points of $s''$ and are contained in a vertical interval of length $L - (c_6 + 2c_7)\lfloor\sqrt{LE}\rfloor$. call it $s''$. flank internal zeros. Finally. we show that all the strings passing through $R_T'$ between $s_1$ and $s_{k_1}$ also represent $v_{j_1}$. there must be a loss within distance 6 of the zero. Since we are already guaranteed $E$ losses which lie between the strings $s_{k_1}$ and $s_{k_m}$. however. $v_{j_1} = v_1$. $L - (c_6 + 2c_7)\lfloor\sqrt{LE}\rfloor$ internal points of $s''$ with distinct $y$ coordinates face $s_1$ horizontally (in a vertical interval of length $L - (c_6 + 2c_7)\lfloor\sqrt{LE}\rfloor$). Thus. Since all the points must then be ones or vertical zeros and no losses within distance two of $s''$ are allowable. To show that a sparse zero of $s_1$ which is (horizontally) adjacent to $s''$ must lie in the desired interval. using the proof of Lemma 5. Next. none of these points may be a bent zero.
. the codes must then be identical. In terms of any sparse bent zero which faces toward the left. Regarding the string next to $s_1$. This completes the proof. For all the strings except for the string closest to $s_1$ this is easy. Therefore. Since the Trevisan code contained in $s_1$ has at most as many zeros as the code in $s''$. It suffices to verify that there are no sparse bent zeros in any of these strings. and the zero may not be adjacent to two other non. Again making use of Lemma 5. We simply use the same argument as was used in Lemma 14 to show that there must be a loss within distance one of a string containing a sparse bent zero. this is impossible since $s''$ is far from $s_{k_1}$. Inside $R_T$. if the zero is the 2 lth internal bit contained in $s_1$. the points must all be adjacent horizontally to distinct internal points of $s_1$.
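The argument above reads a node order off the left-to-right sequence of strings by deleting adjacent occurrences of the same node, and then argues that this order must be a Hamilton path in G. The following is a minimal Python sketch of that bookkeeping step only, not of the folding construction itself. It assumes G is undirected and given as an edge list; the function names and the sample graph are hypothetical and do not come from the paper.

```python
from itertools import groupby


def collapse_adjacent(labels):
    """Delete adjacent repeats of the same node label, keeping first occurrences."""
    return [key for key, _ in groupby(labels)]


def is_hamilton_path(order, nodes, edges):
    """Check that `order` visits every node exactly once along edges of the graph.

    Assumption: the graph is undirected, so each edge is stored as an
    unordered pair.
    """
    if len(order) != len(nodes) or set(order) != set(nodes):
        return False
    undirected = {frozenset(edge) for edge in edges}
    return all(frozenset((a, b)) in undirected for a, b in zip(order, order[1:]))


# Hypothetical example: node label of each string, read left to right.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
string_nodes = [1, 1, 2, 3, 3, 3, 4]

order = collapse_adjacent(string_nodes)
path_ok = is_hamilton_path(order, nodes, edges)

# A sequence that returns to an earlier node has too many transitions.
bad_order = collapse_adjacent([1, 2, 1, 3, 4])
bad_ok = is_hamilton_path(bad_order, nodes, edges)
```

In this sketch a sequence with more than n - 1 transitions between different nodes is rejected by the length check, which mirrors the counting step in the text where only n - 1 transitions are affordable.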