
# Transforming Men into Mice (polynomial algorithm for genomic distance problem)

Sridhar Hannenhalli¹ and Pavel A. Pevzner¹

Department of Computer Science and Engineering Institute of Molecular Evolutionary Genetics The Pennsylvania State University University Park, PA 16802

Then Puss said, "I understand that you have magical powers, that you can change yourself into any kind of animal... But, it must be easy to turn yourself into something huge. However, it must be impossible to turn into something very, very small - like a mouse". Brothers Grimm, Puss N Boots

Abstract

Many people (including ourselves) believe that transformations of humans into mice happen only in fairy tales. However, despite some differences in appearance and habits, men and mice are genetically very similar. In their pioneering paper, Nadeau and Taylor, 1984 estimated that surprisingly few genomic rearrangements (178 ± 39) happened since the divergence of human and mouse 80 million years ago. However, their analysis is non-constructive and no rearrangement scenario for human-mouse evolution has been suggested yet. The problem is complicated by the fact that rearrangements in multi-chromosomal genomes include inversions, translocations, fusions and fissions of chromosomes, a rather complex set of operations. As a result, at first glance, a polynomial algorithm for the genomic distance problem with all these operations looks almost as improbable as the transformation of a (real) man into a (real) mouse. We prove a duality theorem which expresses the genomic distance in terms of easily computable parameters reflecting different combinatorial properties of sets of strings. This theorem leads to a polynomial-time algorithm for computing most parsimonious rearrangement scenarios. Based on this result and the latest comparative physical mapping data we have constructed a scenario of human-mouse evolution with 131 reversals/translocations/fusions/fissions. A combination of the genome rearrangement algorithm with the recently proposed experimental technique called ZOO-FISH suggests a new constructive approach to the 100-year-old problem of reconstructing mammalian evolution.

¹ This work is supported by NSF Young Investigator award CCR-9457784 and NIH grant 1R01 HG00987. Authors' E-mail addresses: hannenha@cse.psu.edu, pevzner@cse.psu.edu

1 Introduction

When the Brothers Grimm described a transformation of a man into a mouse they could hardly anticipate that two centuries later human and mouse would be the most genetically studied mammals. Man-mouse comparative physical mapping started 20 years ago and currently more than 1300 pairs of homologous genes are mapped in these species. As a result, biologists found that the related genes in human and mouse are not chaotically distributed over the genomes but instead form "conserved blocks" (synteny groups). Current comparative mapping data indicate that both human and mouse genomes are combined from approximately 150 blocks which are "shuffled" in human as compared to mouse (Copeland et al., 1993). Shuffling of blocks happens quite rarely (roughly once in a million years), thus making it possible to reconstruct a rearrangement scenario of human-mouse evolution. Below we present the combinatorial formulation of the problem.

In the model we consider, every gene is represented by an identification number (positive integer) and an associated sign ("+" or "-") reflecting the direction of the gene. A chromosome is defined as a sequence of genes, while a genome is defined as a set of chromosomes. Given two genomes $\Pi$ and $\Gamma$ with the same set of genes, we are interested in a most parsimonious scenario of evolution of $\Pi$ into $\Gamma$, i.e. the shortest sequence of rearrangements (defined below) to transform $\Pi$ into $\Gamma$. Fig. 1 illustrates 4 rearrangement events transforming one genome into another.

Let $\Pi = \{\pi(1), \ldots, \pi(N)\}$ be a genome consisting of $N$ chromosomes and let $\pi(i) = (\pi(i)_1 \ldots \pi(i)_{n_i})$, $n_i$ being the number of genes in the $i$th chromosome. Every chromosome $\pi$ can be viewed either from "left to right" (i.e. as $\pi = (\pi_1 \ldots \pi_n)$) or from "right to left" (i.e. as $-\pi = (-\pi_n \ldots -\pi_1)$), leading to two equivalent representations of the same chromosome. From this perspective a 3-chromosomal genome $\{\pi(1), \pi(2), \pi(3)\}$ is equivalent to $\{\pi(1), -\pi(2), \pi(3)\}$ or $\{-\pi(1), -\pi(2), -\pi(3)\}$, i.e. the directions of chromosomes are irrelevant. The four most common elementary rearrangement events in multichromosomal genomes are reversals, translocations, fusions and fissions as defined below.

Let $\pi = (\pi_1 \ldots \pi_{i-1}\, \pi_i \ldots \pi_j\, \pi_{j+1} \ldots \pi_n)$ be a chromosome and $1 \le i \le j \le n$. A reversal $\rho(\pi, i, j)$ rearranges the genes inside the chromosome and transforms $\pi$ into a chromosome $(\pi_1 \ldots \pi_{i-1}\, -\pi_j \ldots -\pi_i\, \pi_{j+1} \ldots \pi_n)$. Let $\pi = (\pi_1 \ldots \pi_{i-1}\, \pi_i \ldots \pi_n)$ and $\sigma = (\sigma_1 \ldots \sigma_{j-1}\, \sigma_j \ldots \sigma_m)$ be two chromosomes and $1 \le i \le n+1$, $1 \le j \le m+1$. A translocation $\rho(\pi, \sigma, i, j)$ exchanges genes between two chromosomes $\pi$ and $\sigma$ and transforms them into chromosomes $(\pi_1 \ldots \pi_{i-1}\, \sigma_j \ldots \sigma_m)$ and $(\sigma_1 \ldots \sigma_{j-1}\, \pi_i \ldots \pi_n)$ with $(i-1) + (m-j+1)$ and $(j-1) + (n-i+1)$ genes respectively. We denote by $\Pi\rho$ the genome obtained from $\Pi$ as a result of a rearrangement $\rho$. Given genomes $\Pi$ and $\Gamma$, the genomic sorting problem is to find a series of rearrangements (reversals and translocations) $\rho_1, \ldots, \rho_t$ such that $\Pi\rho_1 \cdots \rho_t = \Gamma$ and $t$ is minimum. We call $t$ the genomic distance between $\Pi$ and $\Gamma$ and denote it as $d(\Pi, \Gamma)$.

We distinguish between internal reversals which do not involve ends of the chromosomes (i.e. the reversals $\rho(\pi, i, j)$ of an $n$-gene chromosome with $1 < i \le j < n$) and prefix reversals involving ends of the chromosomes (i.e. either $i = 1$ or $j = n$). Also note that a translocation $\rho(\pi, \sigma, n+1, 1)$ concatenates the chromosomes $\pi$ and $\sigma$, resulting in a chromosome $\pi_1 \ldots \pi_n\, \sigma_1 \ldots \sigma_m$ and an empty chromosome $\emptyset$. This special translocation, leading to a reduction in the number of (non-empty) chromosomes, is known in molecular biology as a fusion. The translocation $\rho(\pi, \emptyset, i, 1)$ for $1 < i \le n$ "breaks" a chromosome $\pi$ into two chromosomes $(\pi_1 \ldots \pi_{i-1})$ and $(\pi_i \ldots \pi_n)$. This translocation, leading to an increase in the number of (non-empty) chromosomes, is known as a fission. A translocation is internal if it is neither a fusion nor a fission. Fusions and fissions are rather common in mammalian evolution.
For example, the only difference in the overall genome organization of human and chimpanzee is the fusion of chimpanzee chromosomes 12 and 13 into human chromosome 2.

Genome rearrangements provide a multitude of challenges for computer scientists (see Pevzner and Waterman, 1995 for a review of combinatorial problems motivated by genome rearrangements). Kececioglu and Sankoff, 1993 suggested the first approximation algorithm to analyze genome rearrangements in uni-chromosomal genomes (reversals only). The problem was further studied by Bafna and Pevzner, 1993, Kececioglu and Sankoff, 1994, Kececioglu and Gusfield, 1994, Bafna and Pevzner, 1995a and Hannenhalli and Pevzner, 1995. See also Sankoff et al., 1992, Bafna and Pevzner, 1995b and Hannenhalli et al., 1995 for biological applications, as well as Gates and Papadimitriou, 1979, Even and Goldreich, 1981, Jerrum, 1985, Aigner and West, 1987, Cohen and Blum, 1993, Heydari and Sudborough, 1993 for studies of related combinatorial problems.

Kececioglu and Ravi, 1995 made the first attempt to analyze rearrangements of multi-chromosomal genomes by devising an approximation algorithm for genomes evolving by internal reversals and internal translocations. Recently Hannenhalli, 1995 devised a polynomial algorithm for the case when only internal translocations are allowed. All these algorithms address the case when both genomes contain the same number of chromosomes. This is a serious limitation since different organisms (in particular human and mouse) have different numbers of chromosomes. From this perspective, every realistic model of genome rearrangements should include fusions and fissions. Moreover, despite some shortcomings (see the last section for a discussion on centromeres) the reversals/translocations/fusions/fissions model adequately reflects the existing biological challenges (Joe Nadeau, personal communication). It turns out that fusions and fissions present a major difficulty in analyzing genome rearrangements: the problem of devising an approximation algorithm for genome rearrangements with reversals/translocations/fusions/fissions was raised by Kececioglu and Ravi, 1995. This paper presents an exact polynomial algorithm for this problem.

Every analysis of rearrangements involves revealing "hidden obstacles" which prevent a "fast" transformation of one genome into another (like dual variables in linear programming). Sixty years ago Dobzhansky and Sturtevant, 1938 already used a notion of breakpoint, which is the most obvious example of such an obstacle. Based on this notion, Kececioglu and Sankoff, 1994, devised a 2-approximation algorithm for uni-chromosomal genomes. However, an estimate of genomic distance in terms of breakpoints is very inaccurate.
Bafna and Pevzner, 1993 revealed another obstacle (cycle decomposition) which significantly improves the bounds for genomic distance. Finally, Hannenhalli and Pevzner, 1995 found a duality theorem for uni-chromosomal genomes (reversals only) which expresses the genomic distance in terms of four parameters describing the combinatorial structure of permutations. In the case of multi-chromosomal genomes the combinatorics of rearrangements becomes rather complicated. This paper presents the duality theorem for multi-chromosomal genomes which computes the genomic distance in terms of seven (!) parameters capturing different combinatorial properties of sets of strings.

Our analysis of multi-chromosomal genomes extensively uses the duality theorem for uni-chromosomal genomes (Hannenhalli and Pevzner, 1995), which is stated in section 2. In section 3 we introduce an idea called flipping of chromosomes to analyze a relatively simple case of so-called co-tailed genomes (an approximation algorithm for this case was proposed by Kececioglu and Ravi, 1995). In section 4 we introduce another idea called capping of chromosomes. In section 5, using cappings, we take the first step towards a polynomial-time algorithm for genomic distance by proving a bound which is at most one rearrangement away from the genomic distance. This bound provides the intuition for a rather complicated potential function that leads to a polynomial-time algorithm and duality theorem for genomic sorting (section 6). Finally, in section 7 we present biological applications, formulate open problems, and discuss our result in relation to the recent experimental breakthrough in ZOO-FISH chromosome painting.
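The reversal and translocation operations defined in this introduction can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`reversal`, `translocation`, `fusion`, `fission`) and the representation of a chromosome as a list of signed integers are assumptions made here, not part of the paper. Indices are 1-based and inclusive, matching the definitions in the text.

```python
def reversal(pi, i, j):
    """Reversal on chromosome pi with 1 <= i <= j <= n:
    reverse the segment pi_i..pi_j and negate every gene in it."""
    if not 1 <= i <= j <= len(pi):
        raise ValueError("need 1 <= i <= j <= n")
    return pi[:i - 1] + [-g for g in reversed(pi[i - 1:j])] + pi[j:]


def translocation(pi, sigma, i, j):
    """Translocation on chromosomes pi and sigma with
    1 <= i <= n+1 and 1 <= j <= m+1: exchange the suffixes
    pi_i..pi_n and sigma_j..sigma_m."""
    if not (1 <= i <= len(pi) + 1 and 1 <= j <= len(sigma) + 1):
        raise ValueError("need 1 <= i <= n+1 and 1 <= j <= m+1")
    return pi[:i - 1] + sigma[j - 1:], sigma[:j - 1] + pi[i - 1:]


def fusion(pi, sigma):
    """Fusion is the special translocation with i = n+1, j = 1;
    the second resulting chromosome is empty."""
    return translocation(pi, sigma, len(pi) + 1, 1)


def fission(pi, i):
    """Fission is a translocation of pi with an empty chromosome
    (j = 1), for 1 < i <= n."""
    if not 1 < i <= len(pi):
        raise ValueError("need 1 < i <= n")
    return translocation(pi, [], i, 1)
```

For instance, `translocation([1, 2, 3], [4, 5, 6, 7], 2, 3)` yields chromosomes with $(i-1)+(m-j+1) = 3$ and $(j-1)+(n-i+1) = 4$ genes, in agreement with the gene counts stated above.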

**2 Cycles, hurdles and fortresses**

In the reversal distance problem, the order of genes in two (uni-chromosomal) genomes is represented by (unsigned) permutations $\pi = (\pi_1 \pi_2 \ldots \pi_n)$ and $\sigma = (\sigma_1 \sigma_2 \ldots \sigma_n)$ and the only considered rearrangements are reversals (i.e. both $\pi$ and $\sigma$ consist of a single chromosome and the genes do not have associated signs). Given (unsigned) permutations $\pi$ and $\sigma$, the reversal distance problem is to find a series of reversals $\rho_1, \rho_2, \ldots, \rho_t$ such that $\pi \rho_1 \rho_2 \cdots \rho_t = \sigma$ and $t$ is minimum (Fig. 2a). We call $t$ the reversal distance between $\pi$ and $\sigma$. Sorting $\pi$ by reversals is the problem of finding the reversal distance, $d(\pi)$, between $\pi$ and the identity permutation $(1\, 2 \ldots n)$.

Let $\pi = (\pi_1 \ldots \pi_n)$ be a permutation of the elements $\{1, \ldots, n\}$. Denote $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = (\pi_1 \ldots \pi_n)$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n+1$. We call a pair of consecutive elements $\pi_i$ and $\pi_{i+1}$, $0 \le i \le n$, of $\pi$ a breakpoint if $\pi_i \not\sim \pi_{i+1}$. The breakpoint graph of $\pi$ is an edge-colored graph $G(\pi)$ with $n + 2$ vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\}$. We join vertices $\pi_i$ and $\pi_j$ by a black edge if $i \sim j$ and by a gray edge if $\pi_i \sim \pi_j$ (see Fig. 2b). Later we also use the notion of breakpoint graph $G(\pi, \sigma)$ for two permutations $\pi$ and $\sigma$, which is defined as $G(\pi, \sigma) \equiv G(\pi\sigma^{-1})$ described earlier.

A cycle in an edge-colored graph $G$ is alternating if the colors of every two consecutive edges of this cycle are distinct. In the following, by cycles we mean alternating cycles. Let $\vec{\pi}$ be a signed permutation of $\{1, \ldots, n\}$, i.e. a permutation with a "+" or "-" sign associated with each element (Fig. 2c). In the signed case, every reversal of fragment $[i, j]$ changes both the order and the signs of the elements within that fragment. We are interested in the minimum number of reversals $d(\vec{\pi})$ required to transform a signed permutation $\vec{\pi}$ into the identity signed permutation $(+1\, +2 \ldots +n)$.

Define a transformation from a signed permutation $\vec{\pi}$ of order $n$ to an (unsigned) permutation $\pi$ of $\{1, \ldots, 2n\}$ as follows. To model the signs of elements in $\vec{\pi}$, replace the positive elements $+x$ by $2x-1,\ 2x$ and negative elements $-x$ by $2x,\ 2x-1$ (Fig. 2c). We call the unsigned permutation $\pi$ the image of the signed permutation $\vec{\pi}$. In the breakpoint graph $G(\pi)$, elements $2x-1$ and $2x$ are joined by both black and gray edges for $1 \le x \le n$. We define the breakpoint graph $G(\vec{\pi})$ of a signed permutation $\vec{\pi}$ as the breakpoint graph $G(\pi)$ with these $2n$ edges excluded. Observe that in $G(\vec{\pi})$ every vertex has degree 2 (Fig. 2c) and therefore the breakpoint graph of a signed permutation is a collection of disjoint cycles. Denote the number of such cycles as $c(\vec{\pi})$. We observe that the identity signed permutation of order $n$ maps to the identity (unsigned) permutation of order $2n$, and the effect of a reversal on $\vec{\pi}$ can be mimicked by a reversal on $\pi$, thus implying $d(\vec{\pi}) \ge d(\pi)$.

In the following, by sorting the image $\pi = \pi_1 \pi_2 \ldots \pi_{2n}$ of a signed permutation $\vec{\pi} = \vec{\pi}_1 \vec{\pi}_2 \ldots \vec{\pi}_n$, we mean sorting of $\pi$ by reversals $\rho(2i+1, 2j)$ which "cut" only after even positions in $\pi$. In the rest of this section, $\pi$ is an image of a signed permutation. We say that reversal $\rho(i, j)$ acts on black edges $(\pi_{i-1}, \pi_i)$ and $(\pi_j, \pi_{j+1})$ in $G(\pi)$.
We call $\rho(i, j)$ a reversal on a cycle if the black edges $(\pi_{i-1}, \pi_i)$ and $(\pi_j, \pi_{j+1})$ belong to the same cycle in $G(\pi)$. Every reversal increases $c(\pi)$ by at most 1, i.e., $c(\pi\rho) - c(\pi) \le 1$ (Bafna and Pevzner, 1993). A gray edge $g$ is oriented if for a reversal $\rho$ acting on two black edges incident to $g$, $c(\pi\rho) - c(\pi) = 1$, and unoriented otherwise. A cycle in $G(\pi)$ is oriented if it has an oriented gray edge and unoriented otherwise.

Gray edges $(\pi_i, \pi_j)$ and $(\pi_k, \pi_t)$ in $G(\pi)$ are interleaving if the intervals $[i, j]$ and $[k, t]$ overlap but neither of them contains the other. Cycles $C_1$ and $C_2$ are interleaving if there exist interleaving gray edges $g_1 \in C_1$ and $g_2 \in C_2$. See Fig. 2c for examples. Let $C_\pi$ be the set of cycles in the breakpoint graph of a permutation $\pi$. Define an interleaving graph $H_\pi(C_\pi, I_\pi)$ of $\pi$ with the edge set $I_\pi = \{(C_1, C_2) : C_1$ and $C_2$ are interleaving cycles in $\pi\}$ (Fig. 2d). The vertex set of $H_\pi$ is partitioned into oriented and unoriented vertices (cycles in $C_\pi$). A connected component of $H_\pi$ is oriented if it has at least one oriented vertex and unoriented otherwise. In the following we use the terms edge of $\pi$, cycle in $\pi$ and component of $\pi$ instead of the (more accurate) terms edge of $G(\pi)$, cycle in $G(\pi)$ and component of $H_\pi$.

A connected component $U$ corresponds to the set of integers $\bar{U} = \{i : \pi_i \in C \in U\}$ representing the set of positions of the permutation belonging to cycles of $U$. For a set of integers $U$ define $U_{\min} = \min_{u \in U} u$ and $U_{\max} = \max_{u \in U} u$. Let $\prec$ be a partial order on a set $P$. An element $x \in P$ is called a minimal element in $\prec$ if there is no element $y \in P$ with $y \prec x$. An element $x \in P$ is the greatest in $\prec$ if $y \prec x$ for every $y \in P$ and $|P| > 1$. Let $\mathcal{U}$ be a collection of sets of integers. Define a partial order on $\mathcal{U}$ by the rule $U \prec W$ iff $[U_{\min}, U_{\max}] \subset [W_{\min}, W_{\max}]$ for $U, W \in \mathcal{U}$. We say that a set $U \in \mathcal{U}$ separates sets $U'$ and $U''$ if there exists $u \in U$ such that $U'_{\max} < u < U''_{\min}$.
A hurdle for the set $\mathcal{U}$ is defined as an unoriented component $U$ in $\mathcal{U}$ which is either a minimal hurdle or the greatest hurdle, where a minimal hurdle is a minimal element in $\prec$ and the greatest hurdle satisfies the following two conditions: (i) $U$ is the greatest element in $\prec$ and (ii) $U$ does not separate any two sets in $\mathcal{U}$. A hurdle $K \in \mathcal{U}$ protects a non-hurdle $U \in \mathcal{U}$ if deleting $K$ from $\mathcal{U}$ transforms $U$ from a non-hurdle into a hurdle (i.e. $U$ is a hurdle in $\mathcal{U} \setminus K$). A hurdle in $\pi$ is a superhurdle if it protects a non-hurdle $U \in \mathcal{U}$.

Define a collection of sets of integers $\mathcal{U}_\pi = \{\bar{U} : U$ is an unoriented component of permutation $\pi\}$ and let $h(\pi)$ be the overall number of hurdles for the collection $\mathcal{U}_\pi$. Permutation $\pi$ is called a fortress if it has an odd number of hurdles and all these hurdles are superhurdles. Define

$$f(\pi) = \begin{cases} 1 & \text{if } \pi \text{ is a fortress} \\ 0 & \text{otherwise} \end{cases}$$

For a signed permutation $\vec{\pi}$ with the image $\pi$ we define $b(\vec{\pi}) = b(\pi)$, $c(\vec{\pi}) = c(\pi)$, $h(\vec{\pi}) = h(\pi)$ and $f(\vec{\pi}) = f(\pi)$.

**Theorem 1 (Hannenhalli and Pevzner, 1995)** For a signed permutation $\vec{\pi}$ of order $n$, $d(\vec{\pi}) = b(\vec{\pi}) - c(\vec{\pi}) + h(\vec{\pi}) + f(\vec{\pi})$.
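The first two terms of Theorem 1 are easy to compute directly from the definitions in this section. The following Python sketch builds the image of a signed permutation, forms the black and gray edges of its breakpoint graph (with the intra-gene edges between $2x-1$ and $2x$ excluded), and counts alternating cycles. The function names are assumptions made for illustration, and the sketch takes $b$ to be the number of remaining black edges ($n+1$ after extending the image by $0$ and $2n+1$), counting cycles of length two as well. It does not compute hurdles or fortresses, so it gives only the lower bound $b - c \le d(\vec{\pi})$, which follows from the fact that a reversal increases the number of cycles by at most 1.

```python
def image(signed):
    """Image of a signed permutation: +x -> 2x-1, 2x and -x -> 2x, 2x-1."""
    out = []
    for g in signed:
        x = abs(g)
        out.extend([2 * x - 1, 2 * x] if g > 0 else [2 * x, 2 * x - 1])
    return out


def count_cycles(signed):
    """Number of alternating cycles in the breakpoint graph of a
    signed permutation (cycles of length two included)."""
    n = len(signed)
    ext = [0] + image(signed) + [2 * n + 1]
    black, gray = {}, {}
    for k in range(n + 1):
        # Black edges join the elements at positions 2k and 2k+1.
        u, v = ext[2 * k], ext[2 * k + 1]
        black[u], black[v] = v, u
        # Gray edges join the values 2k and 2k+1.
        gray[2 * k], gray[2 * k + 1] = 2 * k + 1, 2 * k
    seen, cycles = set(), 0
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            # Follow one black edge, then one gray edge.
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray[w]
    return cycles


def cycle_lower_bound(signed):
    """b - c, a lower bound on the signed reversal distance."""
    return (len(signed) + 1) - count_cycles(signed)
```

For the signed permutation $(+2\ +1)$ the bound is 2; the hurdle and fortress terms of Theorem 1 account for any remaining gap between $b - c$ and the true distance.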

**3 Flipping the chromosomes**

For a chromosome $\pi = (\pi_1 \ldots \pi_n)$, the numbers $+\pi_1$ and $-\pi_n$ are called tails of $\pi$. Note that changing the direction of a chromosome does not change the set of its tails. Tails in an $N$-chromosomal genome $\Pi$ comprise the set $T(\Pi)$ of $2N$ tails. In this section we consider co-tailed genomes $\Pi$ and $\Gamma$ with $T(\Pi) = T(\Gamma)$. For co-tailed genomes internal reversals and translocations are sufficient for genomic sorting, i.e. prefix reversals, fusions and fissions can be ignored (the validity of this assumption will become clear later).

For chromosomes $\pi = (\pi_1 \ldots \pi_n)$ and $\sigma = (\sigma_1 \ldots \sigma_m)$ denote the fusion $(\pi_1 \ldots \pi_n\, \sigma_1 \ldots \sigma_m)$ by $\pi + \sigma$ and the fusion $(\pi_1 \ldots \pi_n\, -\sigma_m \ldots -\sigma_1)$ by $\pi - \sigma$. Given an ordering of chromosomes $(\pi(1), \ldots, \pi(N))$ in a genome $\Pi$ and a flip vector $s = (s(1), \ldots, s(N))$ with $s(i) \in \{-1, +1\}$, one can form a concatenate of $\Pi$ as a permutation $\Pi(s) = s(1)\pi(1) + \ldots + s(N)\pi(N)$ on $\sum_{i=1}^{N} n_i$ elements. Depending on the choice of a flip vector there exist $2^N$ concatenates of $\Pi$ for each of $N!$ orderings of chromosomes in $\Pi$. If an order of chromosomes in a genome $\Pi$ is fixed we call $\Pi$ an ordered genome. In this section we assume w.l.o.g. that $\Gamma = (\gamma_1, \ldots, \gamma_N)$ is an (ordered) genome and $\gamma = \gamma_1 + \ldots + \gamma_N$ is the identity permutation. We denote $d(\Pi) \equiv d(\Pi, \Gamma)$ and call a problem of genomic sorting of $\Pi$ into $\Gamma$ simply a sorting of a genome $\Pi$.

We use the following idea to analyze co-tailed genomes. Given a concatenate $\pi$ of a genome $\Pi$ one can optimally sort $\pi$ by reversals (Hannenhalli and Pevzner, 1995). Every reversal in this sorting corresponds to a reversal or a translocation in a (not necessarily optimal) sorting of the genome $\Pi$. For example, a translocation $\rho(\pi, \sigma, i, j)$ acting on chromosomes $\pi = (\pi_1 \ldots \pi_n)$ and $\sigma = (\sigma_1 \ldots \sigma_m)$ can be alternatively viewed as a reversal $\rho(\pi - \sigma, i, n + (m - j + 1))$ acting on $\pi - \sigma$ (and vice versa). Define an optimal concatenate of $\Pi$ as a concatenate $\pi$ with minimum reversal distance $d(\pi)$ among all concatenates of $\Pi$.
Below we prove that sorting of an optimal concatenate of $\Pi$ mimics an optimal sorting of a genome $\Pi$. This approach reduces the problem of sorting $\Pi$ to a problem of finding an optimal concatenate of $\Pi$.

Let $\pi$ be a concatenate of $\Pi = (\pi(1), \ldots, \pi(N))$. Every tail of $\pi(i)$ corresponds to two vertices of the breakpoint graph $G(\pi)$, exactly one of which is a boundary (either leftmost or rightmost) vertex among the vertices of the chromosome $\pi(i)$ in the concatenate $\pi$. We extend the term tail to denote such vertices in $G(\pi)$. An edge in a breakpoint graph $G(\pi)$ of a concatenate $\pi$ is interchromosomal if it connects vertices in different chromosomes of $\Pi$ and intrachromosomal otherwise. A component of $\pi$ is interchromosomal if it contains an interchromosomal edge and intrachromosomal otherwise.

Every interchromosomal black edge in $G(\pi)$ connects two tails. Let $b_{tail}(\pi)$ (notice that $b_{tail}(\pi) = N - 1$) be the number of interchromosomal black edges in $G(\pi)$. Note that for co-tailed genomes tails in $G(\pi)$ are adjacent to tails only and hence a cycle containing a tail contains only tails. Let $c_{tail}(\pi)$ be the number of cycles of $G(\pi)$ containing tails. Define $b(\Pi) = b(\pi) - b_{tail}(\pi)$ (notice that $b(\Pi) = n - N$ and $b(\pi) = n - 1$) and $c(\Pi) = c(\pi) - c_{tail}(\pi)$.

Consider the set of intrachromosomal unoriented components $IU_\pi$ in $\pi$. Hurdles, superhurdles and fortresses for the set $IU_\pi$ are called knots, superknots and fortresses-of-knots respectively. Let $k(\Pi)$ be the number of knots in a concatenate $\pi$ of $\Pi$. Define $f(\Pi) = 1$ if $\pi$ is a fortress-of-knots and $f(\Pi) = 0$ otherwise. Clearly, $b(\Pi)$, $c(\Pi)$, $k(\Pi)$ and $f(\Pi)$ do not depend on the choice of a concatenate $\pi$.

Lemma 1 For co-tailed genomes $\Pi$ and $\Gamma$, $d(\Pi) \ge b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$.

Proof A more involved version of the proof of theorem 1 from Hannenhalli and Pevzner, 1995.

Concatenates $\Pi(s)$ and $\Pi(s')$ of an (ordered) genome $\Pi$ are $i$-twins if the directions of all chromosomes but the $i$th one in $\Pi(s)$ and $\Pi(s')$ coincide, i.e. $s(i) = -s'(i)$ and $s(j) = s'(j)$ for $j \ne i$.
A chromosome $\pi(i)$ is properly flipped in $\Pi(s)$ if all interchromosomal edges originating in this chromosome belong to oriented components in $\Pi(s)$. A concatenate $\pi$ is properly flipped if every chromosome in $\pi$ is properly flipped. The following lemma proves the existence of a properly flipped concatenate.

Lemma 2 If a chromosome $\pi(i)$ is not properly flipped in $\pi = \Pi(s)$ then it is properly flipped in the $i$-twin $\pi'$ of $\pi$. Moreover, every properly flipped chromosome in $\pi$ remains properly flipped in $\pi'$.

Proof: Let $g$ be an interchromosomal gray edge in $\pi$ originating in the chromosome $\pi(i)$ and belonging to an unoriented component in $\pi$. Note that the orientation of any interchromosomal gray edge originating at $\pi(i)$ is different in $\pi$ as compared to $\pi'$ (i.e. a non-oriented edge in $\pi$ becomes oriented in $\pi'$ and vice versa). Since all edges interleaving with $g$ in $\pi$ are unoriented, every interchromosomal edge originating at $\pi(i)$ and interleaving with $g$ in $\pi$ is oriented in $\pi'$. All interchromosomal edges originating in $\pi(i)$ which are not interleaving with $g$ in $\pi$ interleave with $g$ in $\pi'$. Since $g$ is oriented in $\pi'$, all such edges belong to an oriented component containing $g$ in $\pi'$. Therefore, $\pi(i)$ is properly flipped in $\pi'$.

Let $\pi(j)$ be a properly flipped chromosome in $\pi$. If $\pi(j)$ is not properly flipped in $\pi'$ then there exists an interchromosomal unoriented component $U$ having an interchromosomal gray edge originating at $\pi(j)$ in $\pi'$. If $U$ does not have an edge originating at $\pi(i)$ in $\pi'$ then $U$ is an unoriented component in $\pi$, implying that $\pi(j)$ was not properly flipped in $\pi$, a contradiction. If $U$ has an (unoriented) gray edge $h$ originating at $\pi(i)$ then, clearly, this edge does not interleave with $g$ in $\pi'$. Therefore $h$ interleaves with $g$ in $\pi$ and $h$ is oriented in $\pi$, thus implying that $g$ belonged to an oriented component in $\pi$, a contradiction.

Lemma 2 implies existence of a properly flipped concatenate $\pi = \Pi(s)$ with $h(\pi) = k(\Pi)$ and $f(\pi) = f(\Pi)$. Below we show that there exists a sorting of $\pi$ by $b(\pi) - c(\pi) + h(\pi) + f(\pi)$ reversals which mimics a sorting of $\Pi$ by $b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$ internal reversals and translocations.

Theorem 2 For co-tailed genomes $\Pi$ and $\Gamma$, $d(\Pi, \Gamma) \equiv d(\Pi) = b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$.

Proof Assume the contrary and let $\Pi$ be a genome with the minimum value of $b(\Pi) - c(\Pi) + h(\Pi) + f(\Pi)$ among the genomes for which the theorem fails. Let $\pi$ be a properly flipped concatenate of $\Pi$ with minimal value of $b_{tail}(\pi) - c_{tail}(\pi)$ among all properly flipped concatenates of $\Pi$.

If $b_{tail}(\pi) = c_{tail}(\pi)$ (i.e. every interchromosomal black edge is involved in a cycle of length two) then there exists an optimal sorting of $\pi$ by $b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$ reversals which act on intrachromosomal black edges (Hannenhalli and Pevzner, 1995). Every such reversal can be mimicked as an internal reversal or an internal translocation on $\Pi$, thus leading to sorting $\Pi$ by $b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$ internal reversals/translocations. Since $\pi$ is a properly flipped concatenate, $b(\pi) = b(\Pi) + b_{tail}(\pi)$, $c(\pi) = c(\Pi) + c_{tail}(\pi)$, $h(\pi) = k(\Pi)$, $f(\pi) = f(\Pi)$. Therefore, optimal sorting of $\pi$ mimics an optimal sorting of $\Pi$ by $b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$ internal reversals/translocations.

If $b_{tail}(\pi) > c_{tail}(\pi)$ then there exists an interchromosomal black edge involved in a cycle of length greater than two and this edge belongs to an oriented component in $\pi$ (since every interchromosomal black edge belongs to an oriented component in a properly flipped concatenate). Hannenhalli and Pevzner, 1995 proved that if there exists an oriented component in $\pi$ then there exists a reversal $\rho$ in $\pi$ acting on the black edges of an oriented cycle in this component such that $c(\pi\rho) = c(\pi) + 1$. Moreover, this reversal does not create new unoriented components in $\pi\rho$ and $h(\pi\rho) = h(\pi)$, $f(\pi\rho) = f(\pi)$. Note that every cycle containing tails of chromosomes belongs to an oriented component in $\pi$ and consists entirely of edges between tails. Therefore $\rho$ acts either on two intrachromosomal black edges or on two interchromosomal black edges belonging to some oriented cycle of this component.

A reversal $\rho$ acting on two interchromosomal black edges can be interpreted as a transformation of a concatenate $\pi$ of $\Pi = (\pi(1), \ldots, \pi(i-1), \pi(i), \ldots, \pi(j), \pi(j+1), \ldots, \pi(N))$ into a concatenate $\pi\rho = \Pi'(s')$ where $\Pi'$ is a new ordering $(\pi(1), \ldots, \pi(i-1), \pi(j), \ldots, \pi(i), \pi(j+1), \ldots, \pi(N))$ of the chromosomes and $s' = (s(1), \ldots, s(i-1), -s(j), \ldots, -s(i), s(j+1), \ldots, s(N))$. Therefore, $b_{tail}(\pi\rho) - c_{tail}(\pi\rho) = b_{tail}(\pi) - (c_{tail}(\pi) + 1)$ and $\pi\rho$ is a properly flipped concatenate of $\Pi$, a contradiction to minimality of $b_{tail}(\pi) - c_{tail}(\pi)$.

If reversal $\rho$ acts on two intrachromosomal black edges then $\pi\rho$ is a properly flipped concatenate of $\Pi\rho$, implying that

$$\begin{aligned} b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi) &= (b(\pi) - b_{tail}(\pi)) - (c(\pi) - c_{tail}(\pi)) + h(\pi) + f(\pi) \\ &= (b(\pi\rho) - b_{tail}(\pi\rho)) - (c(\pi\rho) - 1 - c_{tail}(\pi\rho)) + h(\pi\rho) + f(\pi\rho) \\ &= b(\Pi\rho) - c(\Pi\rho) + h(\Pi\rho) + f(\Pi\rho) + 1. \end{aligned}$$

Since $b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi) > b(\Pi\rho) - c(\Pi\rho) + h(\Pi\rho) + f(\Pi\rho)$, the theorem holds for the genome $\Pi\rho$. Therefore $d(\Pi) \le d(\Pi\rho) + 1 = b(\Pi) - c(\Pi) + k(\Pi) + f(\Pi)$.
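The notions of flip vector and concatenate from this section, and the remark that a translocation can be viewed as a reversal acting on $\pi - \sigma$, can be checked with a short Python sketch. All function names here are hypothetical helpers chosen for illustration, with chromosomes represented as lists of signed integers; the sketch enumerates the $2^N$ concatenates of one fixed ordering only and makes no attempt to find an optimal concatenate.

```python
from itertools import product


def flip(chrom, sign):
    """Return chrom for sign = +1, or -chrom (reversed and negated) for -1."""
    return list(chrom) if sign == 1 else [-g for g in reversed(chrom)]


def concatenate(ordered_genome, s):
    """Concatenate Pi(s) = s(1)pi(1) + ... + s(N)pi(N) for flip vector s."""
    out = []
    for chrom, sign in zip(ordered_genome, s):
        out.extend(flip(chrom, sign))
    return out


def all_concatenates(ordered_genome):
    """All 2^N concatenates of a genome with a fixed chromosome order."""
    return {s: concatenate(ordered_genome, s)
            for s in product((+1, -1), repeat=len(ordered_genome))}


def reversal(perm, i, j):
    """Signed reversal of positions i..j (1-based, inclusive)."""
    return perm[:i - 1] + [-g for g in reversed(perm[i - 1:j])] + perm[j:]


def translocation(pi, sigma, i, j):
    """Translocation exchanging the suffixes pi_i.. and sigma_j.. ."""
    return pi[:i - 1] + sigma[j - 1:], sigma[:j - 1] + pi[i - 1:]


def translocation_via_reversal(pi, sigma, i, j):
    """The same translocation, viewed as the reversal of positions
    i .. n + (m - j + 1) in the concatenate pi - sigma."""
    n, m = len(pi), len(sigma)
    return reversal(concatenate([pi, sigma], (+1, -1)), i, n + (m - j + 1))
```

With `pi = [1, 2, 3]`, `sigma = [4, 5, 6, 7]`, `i = 2`, `j = 3`, the reversal on $\pi - \sigma$ produces exactly the concatenate $\pi' - \sigma'$ of the two chromosomes returned by `translocation`.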

**4 Capping the chromosomes**

We now turn to the general case when genomes $\Pi$ and $\Gamma$ might have different sets of tails and different numbers of chromosomes. Below we describe an algorithm for computing $d(\Pi, \Gamma)$ which is polynomial in the number of genes but exponential in the number of chromosomes. This algorithm provides an intuition for the (truly) polynomial-time algorithm which is described in the following sections.

Let $\Pi$ and $\Gamma$ be two genomes with $M$ and $N$ chromosomes respectively. W.l.o.g. assume that $M \le N$ and extend $\Pi$ by $N - M$ empty chromosomes. As a result $\Pi = \{\pi(1), \ldots, \pi(N)\}$ and $\Gamma = \{\gamma(1), \ldots, \gamma(N)\}$ contain the same number of chromosomes. Let $\{cap_0, \ldots, cap_{2N-1}\}$ be a set of $2N$ distinct positive integers (called caps) which are different from genes of $\Pi$ (or equivalently, $\Gamma$). Let $\hat{\Pi} = \{\hat{\pi}(1), \ldots, \hat{\pi}(N)\}$ be a genome obtained from $\Pi$ by adding caps to the ends of each chromosome, i.e. $\hat{\pi}(i) = cap_{2(i-1)}\, \pi(i)_1 \ldots \pi(i)_{n_i}\, cap_{2(i-1)+1}$. Note that every reversal/translocation in $\Pi$ corresponds to an internal reversal/translocation in $\hat{\Pi}$. If this translocation is a fission we assume that there are enough empty chromosomes in $\Pi$ (the validity of this assumption will become clear later).

Every sorting of $\Pi$ into $\Gamma$ induces a sorting of $\hat{\Pi}$ into a genome $\hat{\Gamma} = \{\hat{\gamma}(1), \ldots, \hat{\gamma}(N)\}$ (called a capping of $\Gamma$), where $\hat{\gamma}(i) = ((-1)^j cap_j\, \gamma(i)_1 \ldots \gamma(i)_{m_i}\, (-1)^{k+1} cap_k)$ for $0 \le j, k \le 2N - 1$. Genomes $\hat{\Pi}$ and $\hat{\Gamma}$ are co-tailed since $T(\hat{\Pi}) = T(\hat{\Gamma}) = \bigcup_{i=0}^{2N-1} (-1)^i cap_i$. There exist $(2N)!$ different cappings of $\Gamma$, each capping defined by the distribution of $2N$ caps of $\hat{\Pi}$ in $\hat{\Gamma}$. Denote the set of $(2N)!$ cappings of $\Gamma$ as $\boldsymbol{\Gamma}$. The following lemma leads to an algorithm for computing genomic distance which is polynomial in the number of genes but exponential in the number of chromosomes $N$.

Lemma 3 $d(\Pi, \Gamma) = \min_{\hat{\Gamma} \in \boldsymbol{\Gamma}} \; b(\hat{\Pi}, \hat{\Gamma}) - c(\hat{\Pi}, \hat{\Gamma}) + k(\hat{\Pi}, \hat{\Gamma}) + f(\hat{\Pi}, \hat{\Gamma})$

Proof Follows from theorem 2 and an observation that every sorting of $\hat{\Pi}$ into a genome $\hat{\Gamma} \in \boldsymbol{\Gamma}$ by internal reversals/translocations induces a sorting of $\Pi$ into $\Gamma$.

Let $\hat{\pi}$ and $\hat{\gamma}$ be arbitrary concatenates of (ordered) cappings $\hat{\Pi}$ and $\hat{\Gamma}$. Let $G(\hat{\Pi}, \hat{\Gamma})$ be a graph obtained from $G(\hat{\pi}, \hat{\gamma})$ by deleting all tails (vertices of $G(\hat{\pi}, \hat{\gamma})$) of genome $\hat{\Pi}$ (or equivalently, $\hat{\Gamma}$) from $G(\hat{\pi}, \hat{\gamma})$. Different cappings $\hat{\Gamma}$ correspond to different graphs $G(\hat{\Pi}, \hat{\Gamma})$. Graph $G(\hat{\Pi}, \hat{\Gamma})$ has $2N$ vertices corresponding to caps; gray edges incident on these vertices completely define the capping $\hat{\Gamma}$. Therefore, deleting these $2N$ gray edges transforms $G(\hat{\Pi}, \hat{\Gamma})$ into a graph $G(\Pi, \Gamma)$ which does not depend on capping $\hat{\Gamma}$ (Fig. 3a,b,c,d).

Graph $G(\Pi, \Gamma)$ contains $2N$ vertices of degree 1 corresponding to $2N$ caps of $\Pi$ (called $\Pi$-caps) and $2N$ vertices of degree 1 corresponding to $2N$ tails of $\Gamma$ (called $\Gamma$-tails). Therefore, $G(\Pi, \Gamma)$ is a collection of cycles and $2N$ paths, each path starting and ending with a black edge. A path is a $\Pi\Pi$-path ($\Gamma\Gamma$-path) if it starts and ends with $\Pi$-caps ($\Gamma$-tails) and a $\Pi\Gamma$-path if it starts with a $\Pi$-cap and ends with a $\Gamma$-tail. A vertex in $G(\Pi, \Gamma)$ is a $\Pi\Gamma$-vertex if it is a $\Pi$-cap in a $\Pi\Gamma$-path and a $\Pi\Pi$-vertex if it is a $\Pi$-cap in a $\Pi\Pi$-path. $\Gamma\Pi$- and $\Gamma\Gamma$-vertices are defined similarly (see Fig. 3d).

Every capping $\hat{\Gamma}$ corresponds to adding $2N$ gray edges to the graph $G(\Pi, \Gamma)$, each edge joining a $\Pi$-cap with a $\Gamma$-tail. These edges transform $G(\Pi, \Gamma)$ into the graph $G(\hat{\Pi}, \hat{\Gamma})$ corresponding to a capping $\hat{\Gamma}$ (Fig. 3e). Define $b(\Pi, \Gamma)$ as the number of black edges in $G(\Pi, \Gamma)$ and $c(\Pi, \Gamma)$ as the overall number of cycles and paths in $G(\Pi, \Gamma)$. The parameter $b(\Pi, \Gamma) = b(\hat{\Pi}, \hat{\Gamma})$ does not depend on capping $\hat{\Gamma}$. Clearly $c(\hat{\Pi}, \hat{\Gamma}) \le c(\Pi, \Gamma)$, with $c(\hat{\Pi}, \hat{\Gamma}) = c(\Pi, \Gamma)$ if every path in $G(\Pi, \Gamma)$ is "closed" by a gray edge in $G(\hat{\Pi}, \hat{\Gamma})$. An observation that every cycle in $G(\hat{\Pi}, \hat{\Gamma})$ containing a $\Pi\Pi$-path contains at least one more path leads to an inequality $c(\hat{\Pi}, \hat{\Gamma}) \le c(\Pi, \Gamma) - p(\Pi, \Gamma)$, where $p(\Pi, \Gamma)$ is the number of $\Pi\Pi$-paths (or equivalently, $\Gamma\Gamma$-paths) in $G(\Pi, \Gamma)$.

We define the notions of interleaving cycles/paths, oriented and unoriented components, etc. in the graph $G(\Pi, \Gamma)$ in the usual way (see Appendix) by making no distinction between cycles and paths in $G(\Pi, \Gamma)$. We say that a vertex $\pi_j$ is inside a component $U$ of $\pi$ if $j \in [U_{\min}, U_{\max}]$. An intrachromosomal component for genomes $\Pi$ and $\Gamma$ is called a real component if it has neither a $\Pi$-cap nor a $\Gamma$-tail inside.

For genomes $\Pi$ and $\Gamma$ define $RU(\Pi, \Gamma)$ as the set of real components and $IU(\Pi, \Gamma)$ as the set of intrachromosomal components (as defined by the graph $G(\Pi, \Gamma)$). Clearly $RU(\Pi, \Gamma) \subseteq IU(\Pi, \Gamma)$. Hurdles, superhurdles and fortresses for the set $RU(\Pi, \Gamma)$ are called real-knots, super-real-knots and fortresses-of-real-knots. Let $RK$ be the set of real-knots (i.e. hurdles for the set $RU(\Pi, \Gamma)$) and $K$ be the set of knots (i.e. hurdles for the set $IU(\Pi, \Gamma)$). A knot from the set $K \setminus RK$ is a semi-knot if it does not contain a $\Pi\Pi$- or $\Gamma\Gamma$-vertex inside. Clearly, every semi-knot contains a $\Pi\Gamma$-path (otherwise, it would be a real-knot). Denote the number of real-knots and semi-knots for genomes $\Pi$ and $\Gamma$ as $r(\Pi, \Gamma)$ and $s(\Pi, \Gamma)$, respectively. Clearly $k(\hat{\Pi}, \hat{\Gamma}) \ge r(\Pi, \Gamma)$, implying that

$$b(\hat{\Pi}, \hat{\Gamma}) - c(\hat{\Pi}, \hat{\Gamma}) + k(\hat{\Pi}, \hat{\Gamma}) \;\ge\; b(\Pi, \Gamma) - c(\Pi, \Gamma) + p(\Pi, \Gamma) + r(\Pi, \Gamma)$$

However, this bound is not tight since it assumes that there exists a capping $\hat{\Gamma}$ which simultaneously maximizes $c(\hat{\Pi}, \hat{\Gamma})$ and minimizes $k(\hat{\Pi}, \hat{\Gamma})$. Taking $s(\Pi, \Gamma)$ into account leads to a better bound for genomic distance which is at most 1 rearrangement apart from the genomic distance (next section).
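The capping construction of this section can be illustrated with a small Python sketch. The helper names (`cap_genome`, `tails`) and the choice of caps as the integers immediately above the largest gene identifier are assumptions made for this example only; the paper merely requires $2N$ distinct positive integers different from the genes. The sketch builds $\hat{\Pi}$ from $\Pi$ and checks that its tail set is $\bigcup_{i} (-1)^i cap_i$; it does not enumerate the $(2N)!$ cappings of $\Gamma$.

```python
from math import factorial


def cap_genome(genome, n_target):
    """Pad the genome with empty chromosomes up to n_target chromosomes,
    then add cap_{2(i-1)} and cap_{2(i-1)+1} to the ends of the i-th one."""
    padded = [list(c) for c in genome]
    padded += [[] for _ in range(n_target - len(genome))]
    max_gene = max((abs(g) for c in genome for g in c), default=0)
    # Assumption: caps are the 2N integers following the largest gene id.
    caps = [max_gene + 1 + t for t in range(2 * n_target)]
    capped = [[caps[2 * i]] + c + [caps[2 * i + 1]]
              for i, c in enumerate(padded)]
    return capped, caps


def tails(genome):
    """Tail set T: +first gene and -last gene of every non-empty chromosome."""
    result = set()
    for chrom in genome:
        if chrom:
            result.add(chrom[0])
            result.add(-chrom[-1])
    return result


def number_of_cappings(n_chromosomes):
    """There are (2N)! ways to distribute the 2N caps over Gamma."""
    return factorial(2 * n_chromosomes)
```

Already for $N = 3$ there are $6! = 720$ cappings, which is why the algorithm based on Lemma 3 is exponential in the number of chromosomes.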

**5 Caps and tails**

Genomes $\Pi$ and $\Gamma$ are correlated if all the real-knots in $G(\Pi, \Gamma)$ are located on the same chromosome and non-correlated otherwise. In this section we restrict our analysis to non-correlated genomes (it turns out that the analysis of correlated genomes involves some additional technical difficulties) and prove a tight bound for $d(\Pi, \Gamma)$ (this bound provides an intuition for a rather complicated potential function used in the proof of the duality theorem):

$$b(\Pi, \Gamma) - c(\Pi, \Gamma) + p(\Pi, \Gamma) + r(\Pi, \Gamma) + \left\lceil \frac{s(\Pi, \Gamma)}{2} \right\rceil \;\le\; d(\Pi, \Gamma) \;\le\; b(\Pi, \Gamma) - c(\Pi, \Gamma) + p(\Pi, \Gamma) + r(\Pi, \Gamma) + \left\lceil \frac{s(\Pi, \Gamma)}{2} \right\rceil + 1$$

The following lemmas suggest a way to connect some paths in G( ;) by oriented edges. Lemma 4 For every -path and ;;-path in G( ;) there exists either an interchromosomal or an oriented gray edge which joins these paths into a ;-path. Lemma 5 For every two unoriented ;-paths located on the same chromosome there exists an oriented gray edge which joins these paths into a ;-path. ^ In a search for an optimal capping we rst ignore the term f ( ^ ;) in lemma 3 and nd a capping ^ whose genomic distance d( ^ ;) is within 1 from the optimal. The following theorem suggests a way ^ to nd such an \almost optimal" capping ;. ^ ^ ^ Theorem 3 min;2; b( ^ ;) ; c( ^ ;) + k( ^ ;) = b( ;) ; c( ;) + p( ;) + r( ;) + d s( 2 ;) e. ^ ^ ^ Proof Every capping ; de nes a transformation of G( ;) into G( ^ ;) by consecutively adding 2N g1 g2 g2N ^ gray edges to G( ;): G( ;) = G0 ! G1 ! : : : ! G2N = G( ^ ;): For a graph Gi the parameters bi = b(Gi), ci = c(Gi ), pi = p(Gi), ri = r(Gi) and si = s(Gi ) are de ned in the same way as for the graph G0 = G( ;). For a parameter de ne s i as i ; i;1 , i.e. ci = ci ; ci;1, etc. Denote s ; i = (ci ; pi ; ri ; d 2i e) ; (ci;1 ; pi;1 ; ri;1 ; d i2 1 e). Below we prove that i 0 for 1 i 2N , i.e. adding a gray edge does not increase the parameter c( ;) ; p( ;) ; r( ;) ; d s( 2 ;) e. For a xed i we ignore index i, i.e. denote = i , etc. Depending on the edge gi the following cases are possible (the analysis below assumes that and ; are non-correlated): Case 1: edge gi \closes" a ;-path (i.e. gi connects a ;-vertex with a ; -vertex within the same ;-path). If this vertex is the only ;-vertex in a semi-knot, then c = 0 p = 0 r = 1 s = ;1 (note that this might not be true for correlated genomes). Otherwise c = 0 p = 0 r = 0 s = 0. In both cases 0. Case 2: edge gi connects a ;-vertex with a ; -vertex in a di erent ;-path. This edge 0. \destroys" at most two semi-knots and c = ;1 p = 0 r = 0 s ;2. Therefore 7

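Case 3: edge $g_i$ connects a $\Pi\Gamma$-vertex with a $\Gamma\Gamma$-vertex (or a $\Gamma\Pi$-vertex with a $\Pi\Pi$-vertex). This edge "destroys" at most one semi-knot and $\Delta c = -1$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s > -2$. It implies $\Delta \le 0$.

Case 4: edge $g_i$ connects a $\Pi\Pi$-vertex with a $\Gamma\Gamma$-vertex. This edge cannot destroy any semi-knots and $\Delta c = -1$, $\Delta p = -1$, $\Delta r = 0$, $\Delta s \ge 0$. It implies $\Delta \le 0$.

Note that $b_{2N} = b(\hat\Pi,\hat\Gamma) = b(\Pi,\Gamma) = b_0$, $c_{2N} = c(\hat\Pi,\hat\Gamma)$, $p_{2N} = 0$, $s_{2N} = 0$ and $r_{2N} = k(\hat\Pi,\hat\Gamma)$. Therefore

$$b(\hat\Pi,\hat\Gamma) - c(\hat\Pi,\hat\Gamma) + k(\hat\Pi,\hat\Gamma) = b_{2N} - c_{2N} + p_{2N} + r_{2N} + \left\lceil \frac{s_{2N}}{2} \right\rceil \ge b_0 - c_0 + p_0 + r_0 + \left\lceil \frac{s_0}{2} \right\rceil = b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \left\lceil \frac{s(\Pi,\Gamma)}{2} \right\rceil.$$

We now prove that there exists a capping $\hat\Gamma$ such that $b(\hat\Pi,\hat\Gamma) - c(\hat\Pi,\hat\Gamma) + k(\hat\Pi,\hat\Gamma) = b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \lceil s(\Pi,\Gamma)/2 \rceil$ by constructing a sequence of $2N$ gray edges $g_1, \ldots, g_{2N}$ connecting $\Pi$-caps with $\Gamma$-tails in $G(\Pi,\Gamma)$ such that $\Delta_i = 0$ for all $1 \le i \le 2N$. Assume that the first $i - 1$ such edges are already found and let $G_{i-1}$ be the result of adding these $i - 1$ edges to $G(\Pi,\Gamma)$.

If $G_{i-1}$ has a $\Pi\Pi$-path then it has a $\Gamma\Gamma$-path as well and, by lemma 4, there exists an interchromosomal or oriented gray edge joining these paths into an oriented $\Pi\Gamma$-path. Clearly $\Delta c = -1$, $\Delta p = -1$, $\Delta r = 0$, $\Delta s = 0$ for this edge, implying $\Delta = 0$.

If $G_{i-1}$ has at least two semi-knots (i.e. $s_{i-1} > 1$) let $v_1$ and $v_2$ be a $\Pi\Gamma$- and a $\Gamma\Pi$-vertex in different semi-knots. If $v_1$ and $v_2$ are in different chromosomes of $\Pi$ then the gray edge $g_i = (v_1, v_2)$ "destroys" both semi-knots. Therefore $\Delta c = -1$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = -2$ and $\Delta = 0$. If $v_1$ and $v_2$ belong to the same chromosome then by lemma 5 there exists an oriented gray edge joining these paths into an oriented $\Pi\Gamma$-path. This gray edge destroys two semi-knots. Therefore $\Delta = 0$ in this case also.

If $G_{i-1}$ has only one semi-knot, let $P_1$ be a $\Pi\Gamma$-path in this semi-knot. If it is the only $\Pi\Gamma$-path in the semi-knot then for an edge $g_i$ "closing" this path, $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 1$, $\Delta s = -1$, implying that $\Delta = 0$. Otherwise, $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = 0$, implying that $\Delta = 0$.

If $G_{i-1}$ has neither a $\Pi\Pi$-path nor a semi-knot then let $g_i$ be an edge closing an arbitrary $\Pi\Gamma$-path in $G_{i-1}$. Since $g_i$ does not belong to a semi-knot, $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = 0$ and $\Delta = 0$.

Therefore, the constructed sequence of edges $g_1, \ldots, g_{2N}$ transforms $G(\Pi,\Gamma)$ into $G(\hat\Pi,\hat\Gamma)$ such that $b(\hat\Pi,\hat\Gamma) - c(\hat\Pi,\hat\Gamma) + k(\hat\Pi,\hat\Gamma) = b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \lceil s(\Pi,\Gamma)/2 \rceil$.

Since $0 \le f(\hat\Pi,\hat\Gamma) \le 1$, lemma 3 and theorem 3 imply that $b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \lceil s(\Pi,\Gamma)/2 \rceil$ is within one rearrangement from the genomic distance $d(\Pi,\Gamma)$ for non-correlated genomes. In the following section we close the gap between $b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \lceil s(\Pi,\Gamma)/2 \rceil$ and $d(\Pi,\Gamma)$ for arbitrary genomes.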
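The case analysis in the proof of Theorem 3 can be checked numerically. The sketch below is illustrative only: the function names and the finite ranges of $\Delta s$ tried per case are assumptions made for the check, not part of the paper. It evaluates the change in $c - p - r - \lceil s/2 \rceil$ for each case and reports the largest change found.

```python
from math import ceil

def potential(c, p, r, s):
    """The quantity c - p - r - ceil(s/2) tracked in the proof of Theorem 3."""
    return c - p - r - ceil(s / 2)

def delta(dc, dp, dr, ds, s_before):
    """Change in the potential when (c, p, r, s) shifts by (dc, dp, dr, ds)."""
    # c, p and r enter linearly, so only the starting value of s matters.
    return potential(dc, dp, dr, s_before + ds) - potential(0, 0, 0, s_before)

# (dc, dp, dr, candidate ds values) per case; the ds ranges are assumptions
# chosen to respect the bounds stated in the proof.
CASES = {
    "case 1, only vertex in a semi-knot": (0, 0, 1, [-1]),
    "case 1, otherwise": (0, 0, 0, [0]),
    "case 2": (-1, 0, 0, [-2, -1, 0]),
    "case 3": (-1, 0, 0, [-1, 0]),
    "case 4": (-1, -1, 0, [0, 1, 2]),
}

def worst_delta(max_s=20):
    """Largest Delta observed per case over s = 0..max_s."""
    worst = {}
    for name, (dc, dp, dr, ds_values) in CASES.items():
        worst[name] = max(
            delta(dc, dp, dr, ds, s)
            for ds in ds_values
            for s in range(max_s + 1)
            if s + ds >= 0  # the number of semi-knots cannot go negative
        )
    return worst

if __name__ == "__main__":
    for name, value in worst_delta().items():
        print(f"{name}: max Delta = {value}")
```

Under these assumptions no case yields a positive $\Delta$, which is the inequality the first half of the proof relies on.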

**6 Duality theorem for genomic distance**

The major difficulty in closing the gap between $b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + r(\Pi,\Gamma) + \lceil s(\Pi,\Gamma)/2 \rceil$ and $d(\Pi,\Gamma)$ is to "uncover" the remaining "obstacles" in the duality theorem. It turns out that the duality theorem involves 7 (!) parameters, which makes it very hard to explain the intuition behind it. Theorem 3 provides such an intuition for the first five parameters. Two more parameters are defined below.

A component in $G(\Pi,\Gamma)$ containing a $\Pi\Gamma$-path is simple if it is not a semi-knot.

Lemma 6 There exists an optimal capping $\hat\Gamma$ which closes all $\Pi\Gamma$-paths in simple components.

Let $\bar G$ be a graph obtained from $G(\Pi,\Gamma)$ by closing all $\Pi\Gamma$-paths in simple components. Without confusion we can use the terms real-knots, super-real-knots and fortress-of-real-knots in $\bar G$ and define $\mathit{rr}(\Pi,\Gamma)$ as the number of real-knots in $\bar G$. Note that $\mathit{rr}(\Pi,\Gamma)$ does not necessarily coincide with $r(\Pi,\Gamma)$. Correlated genomes $\Pi$ and $\Gamma$ form a weak-fortress-of-real-knots if (i) they have an odd number of real-knots in $\bar G$, (ii) one of the real-knots is the greatest real-knot in $\bar G$, (iii) every real-knot but the greatest one is a super-real-knot in $\bar G$ and (iv) $s(\Pi,\Gamma) > 0$. Notice that a weak-fortress-of-real-knots can be transformed into a fortress-of-real-knots by closing $\Pi\Gamma$-paths contained in one of the semi-knots. Define two more parameters as follows:

$$\mathit{fr}(\Pi,\Gamma) = \begin{cases} 1 & \text{if } \Pi \text{ and } \Gamma \text{ form a fortress-of-real-knots or a weak-fortress-of-real-knots in } \bar G \\ 0 & \text{otherwise} \end{cases}$$

$$\mathit{gr}(\Pi,\Gamma) = \begin{cases} 1 & \text{if there exists the greatest real-knot in } \bar G \text{ and } s(\Pi,\Gamma) > 0 \\ 0 & \text{otherwise} \end{cases}$$

The following duality theorem proves that the algorithm Genomic Sort (Fig. 4) solves the genomic sorting problem. The running time of Genomic Sort (dominated by the running time of sorting signed permutations by reversals; see footnote 3) is $O(n^4)$, where $n$ is the overall number of genes (Hannenhalli and Pevzner, 1995).

Theorem 4

$$d(\Pi,\Gamma) = b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + \mathit{rr}(\Pi,\Gamma) + \left\lceil \frac{s(\Pi,\Gamma) - \mathit{gr}(\Pi,\Gamma) + \mathit{fr}(\Pi,\Gamma)}{2} \right\rceil$$

Proof Let $t$ be the number of $\Pi\Gamma$-paths in simple components of $G(\Pi,\Gamma)$ and let $\hat\Gamma$ be an optimal capping which closes all these $\Pi\Gamma$-paths in simple components (lemma 6). Similar to the proof of theorem 3 we consider a transformation of $G(\Pi,\Gamma)$ into $G(\hat\Pi,\hat\Gamma)$ defined by $2N$ gray edges:

$$G(\Pi,\Gamma) = G_0 \xrightarrow{g_1} G_1 \xrightarrow{g_2} \cdots \xrightarrow{g_t} G_t = \bar G \xrightarrow{g_{t+1}} \cdots \xrightarrow{g_{2N}} G_{2N} = G(\hat\Pi,\hat\Gamma)$$

and assume that the first $t$ edges in this transformation close $\Pi\Gamma$-paths in simple components. The parameters $b_i$, $c_i$, $p_i$, $r_i$, $\mathit{gr}_i$ and $\mathit{fr}_i$ are defined in the same way as in theorem 3. Denote

$$\Delta_i = \left( c_i - p_i - r_i - \left\lceil \frac{s_i - \mathit{gr}_i + \mathit{fr}_i}{2} \right\rceil \right) - \left( c_{i-1} - p_{i-1} - r_{i-1} - \left\lceil \frac{s_{i-1} - \mathit{gr}_{i-1} + \mathit{fr}_{i-1}}{2} \right\rceil \right).$$

Below we prove that $\Delta_i \le 0$ for $t + 1 \le i \le 2N$. For a fixed $i$ we ignore index $i$, i.e. denote $\Delta = \Delta_i$, etc. Depending on the edge $g_i$ the following cases are possible:

Case 1: edge $g_i$ closes a $\Pi\Gamma$-path $P$. If this path is the only $\Pi\Gamma$-path in a semi-knot $S$ then we consider two sub-cases: $\mathit{gr}_i = 1$ and $\mathit{gr}_i = 0$.

If $\mathit{gr}_i = 1$ then there exists the greatest real-knot in $G_i$ and $s_i > 0$. Therefore edge $g_i$ transforms $S$ into the greatest real-knot in $G_i$. It implies that all real-knots in $G_{i-1}$ (if any) are located on the same chromosome as $S$. Moreover, $r_{i-1} > 0$ since otherwise $S$ is the only real-knot in $G_i$, a contradiction to $S$ being the greatest real-knot in $G_i$ (see the definition of the greatest hurdle). Since $s_i > 0$, $S$ is not a semi-knot (since $S$ is not a hurdle in IU), a contradiction.

If $\mathit{gr}_i = 0$ then either $\mathit{gr}_{i-1} = 0$ or $\mathit{gr}_{i-1} = 1$. If $\mathit{gr}_{i-1} = 0$ then $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 1$, $\Delta s = -1$, $\Delta\mathit{gr} = 0$, $\Delta\mathit{fr} \ge -1$, implying that $\Delta \le 0$. If $\mathit{gr}_{i-1} = 1$ then $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = -1$, $\Delta\mathit{gr} = -1$. If $\mathit{fr}_{i-1} = 1$, $G_{i-1}$ forms a fortress-of-real-knots or a weak-fortress-of-real-knots. One can see that in this case $G_i$ is a fortress-of-real-knots and $\mathit{fr}_i = 1$. Therefore $\Delta\mathit{fr} = 0$, implying $\Delta \le 0$. If $\mathit{fr}_{i-1} = 0$ then $\Delta\mathit{fr} \ge 0$ and $\Delta \le 0$.

If the path $P$ is the only $\Pi\Gamma$-path in a simple component then after closing the path $P$, either $\Delta\mathit{gr} = 0$ (in this case $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = 0$, $\Delta\mathit{fr} \ge 0$) or $\Delta\mathit{gr} = 1$ (in this case $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 1$, $\Delta s = 0$, $\Delta\mathit{fr} \ge -1$). In both cases $\Delta \le 0$.

If the path $P$ is not the only $\Pi\Gamma$-path in its component then closing $P$ does not destroy any semi-knots. Therefore $\Delta c = 0$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = 0$, $\Delta\mathit{gr} = 0$ and $\Delta\mathit{fr} = 0$, implying $\Delta = 0$.

Case 2: edge $g_i$ connects a $\Pi\Gamma$-vertex with a $\Gamma\Pi$-vertex in a different $\Pi\Gamma$-path. This edge "destroys" at most two semi-knots. If it destroys fewer than two semi-knots then $\Delta c = -1$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s \ge -1$. Since $\Delta\mathit{gr} \le 0$ and $\Delta\mathit{fr} \ge -1$, $\Delta \le 0$. If it does destroy two semi-knots, $\Delta c = -1$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s = -2$. Clearly $\Delta\mathit{gr} \le 0$ in this case. If $\Delta\mathit{gr} = -1$ then $\Delta s - \Delta\mathit{gr} + \Delta\mathit{fr} \ge -2$ and $\Delta \le 0$. If $\Delta\mathit{gr} = 0$ and $\Delta\mathit{fr} \ge 0$ then $\Delta \le 0$. If $\Delta\mathit{gr} = 0$ and $\Delta\mathit{fr} = -1$ then $\mathit{fr}_{i-1} = 1$ and $\mathit{fr}_i = 0$. It implies that $G_{i-1}$ is a weak-fortress-of-real-knots and $s_{i-1} = 2$, $s_i = 0$. It implies $\mathit{gr}_{i-1} = 1$, $\mathit{gr}_i = 0$ and $\Delta\mathit{gr} = -1$, a contradiction.

Case 3: edge $g_i$ connects a $\Pi\Gamma$-vertex with a $\Gamma\Gamma$-vertex (or a $\Gamma\Pi$-vertex with a $\Pi\Pi$-vertex). This edge "destroys" at most one semi-knot and $\Delta c = -1$, $\Delta p = 0$, $\Delta r = 0$, $\Delta s \ge -1$. Since $\Delta\mathit{gr} \le 0$ and $\Delta\mathit{fr} \ge -1$, $\Delta \le 0$.

Footnote 3: Recently Berman and Hannenhalli further improved the running time for sorting signed permutations by reversals.

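Case 4: edge $g_i$ connects a $\Pi\Pi$-vertex with a $\Gamma\Gamma$-vertex. This edge cannot destroy any semi-knots and $\Delta c = -1$, $\Delta p = -1$, $\Delta r = 0$, $\Delta s \ge 0$, $\Delta\mathit{gr} \ge 0$, $\Delta\mathit{fr} \ge 0$. Note that if $\Delta\mathit{gr} = 1$ then both $G_{i-1}$ and $G_i$ have the greatest real-knots, $s_{i-1} = 0$ and $s_i = 1$. It implies $\Delta s = 1$ and $\Delta \le 0$.

Note that $b_{2N} = b(\hat\Pi,\hat\Gamma) = b(\Pi,\Gamma)$, $c_{2N} = c(\hat\Pi,\hat\Gamma)$, $p_{2N} = 0$, $r_{2N} = k(\hat\Pi,\hat\Gamma)$, $s_{2N} = 0$, $\mathit{gr}_{2N} = 0$ and $\mathit{fr}_{2N} = f(\hat\Pi,\hat\Gamma)$. Also $b_t = b(\Pi,\Gamma)$, $c_t = c(\Pi,\Gamma)$, $p_t = p(\Pi,\Gamma)$, $r_t = \mathit{rr}(\Pi,\Gamma)$, $s_t = s(\Pi,\Gamma)$, $\mathit{gr}_t = \mathit{gr}(\Pi,\Gamma)$ and $\mathit{fr}_t = \mathit{fr}(\Pi,\Gamma)$. Therefore for an optimal capping $\hat\Gamma$:

$$d(\hat\Pi,\hat\Gamma) = b(\hat\Pi,\hat\Gamma) - c(\hat\Pi,\hat\Gamma) + k(\hat\Pi,\hat\Gamma) + f(\hat\Pi,\hat\Gamma) = b(\hat\Pi,\hat\Gamma) - c_{2N} + p_{2N} + r_{2N} + \left\lceil \frac{s_{2N} - \mathit{gr}_{2N} + \mathit{fr}_{2N}}{2} \right\rceil$$

$$\ge b_t - c_t + p_t + r_t + \left\lceil \frac{s_t - \mathit{gr}_t + \mathit{fr}_t}{2} \right\rceil = b(\Pi,\Gamma) - c(\Pi,\Gamma) + p(\Pi,\Gamma) + \mathit{rr}(\Pi,\Gamma) + \left\lceil \frac{s(\Pi,\Gamma) - \mathit{gr}(\Pi,\Gamma) + \mathit{fr}(\Pi,\Gamma)}{2} \right\rceil$$

We now prove that there exists a capping $\hat\Gamma$ such that

$$c_{2N} - p_{2N} - r_{2N} - \left\lceil \frac{s_{2N} - \mathit{gr}_{2N} + \mathit{fr}_{2N}}{2} \right\rceil = c_t - p_t - r_t - \left\lceil \frac{s_t - \mathit{gr}_t + \mathit{fr}_t}{2} \right\rceil$$

by building a sequence of $2N - t$ gray edges $g_{t+1}, \ldots, g_{2N}$ connecting $\Pi$-caps with $\Gamma$-tails in $\bar G$ such that $\Delta_i = 0$ for all $t + 1 \le i \le 2N$. The algorithm Genomic Sort building this sequence of edges is shown in Fig. 4.

Closing a $\Pi\Gamma$-path inside a component having more than one $\Pi\Gamma$-path inside it (line 3) does not affect any of the parameters and hence $\Delta = 0$ for the gray edge closing such a path. Connecting a $\Pi\Pi$-path with a $\Gamma\Gamma$-path via an interchromosomal or oriented edge (line 6) affects only two parameters ($\Delta c = -1$, $\Delta p = -1$) and $\Delta = 0$. When the number of semi-knots is greater than 2, for an edge "destroying" 2 semi-knots (line 8), $\Delta c = -1$ and $\Delta s = -2$. Other parameters do not change and hence $\Delta = 0$.

When the number of semi-knots is 2 and $\mathit{gr}_{i-1} = 1$ then for the edge closing the $\Pi\Gamma$-path in one of the semi-knots (line 11), $\Delta c = \Delta r = \Delta p = 0$ and $\Delta s = -1$. Clearly, $\mathit{gr}_i = 0$, hence $\Delta\mathit{gr} = -1$. Moreover $\mathit{fr}_{i-1} = \mathit{fr}_i$, hence $\Delta\mathit{fr} = 0$. Thus $\Delta = 0$. If the number of semi-knots is 2 and $\mathit{gr}_{i-1} = 0$ then for an edge "destroying" the two semi-knots (line 13), $\Delta c = -1$ and $\Delta s = -2$. Other parameters do not change, hence $\Delta = 0$.

For the edge closing the $\Pi\Gamma$-path in the only semi-knot (line 15), if $\mathit{gr}_{i-1} = 1$ then $\Delta c = \Delta p = \Delta r = 0$, $\Delta s = -1$, $\Delta\mathit{gr} = -1$, $\Delta\mathit{fr} = 0$, hence $\Delta = 0$. Else if $\mathit{gr}_{i-1} = 0$, $\Delta c = \Delta p = 0$, $\Delta r = 1$, $\Delta s = -1$, $\Delta\mathit{gr} = 0$, $\mathit{fr}_i = 0$. It can be verified that $\Delta = 0$ in this case. Closing any other $\Pi\Gamma$-path (line 17) does not affect any parameters and hence $\Delta = 0$.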
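Once the seven parameters are known, the formula of Theorem 4 is a one-line computation. The following sketch assumes the parameters have already been extracted from the graph; the function name and argument order are illustrative choices, not the paper's. The sample values are those reported in the caption of Figure 3, with 0 used for the real-knot term as in that caption.

```python
def genomic_distance(b, c, p, rr, s, gr, fr):
    """Evaluate b - c + p + rr + ceil((s - gr + fr) / 2) from Theorem 4.

    All arguments are non-negative integers; gr and fr are 0/1 indicators.
    Computing the parameters themselves is outside the scope of this sketch.
    """
    if gr not in (0, 1) or fr not in (0, 1):
        raise ValueError("gr and fr must be 0 or 1")
    numerator = s - gr + fr
    half_rounded_up = -(-numerator // 2)  # integer ceiling of numerator / 2
    return b - c + p + rr + half_rounded_up

if __name__ == "__main__":
    # Values from the caption of Figure 3.
    print(genomic_distance(b=15, c=10, p=1, rr=0, s=1, gr=0, fr=0))
```

Integer arithmetic is used for the ceiling to avoid any floating-point rounding.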

**7 Applications and Open Problems**

To derive human and mouse gene orders we used comparative mapping data from the Mouse Genome Database (Jackson Laboratory). Deriving gene orders is a non-trivial task since the map accuracy in human is significantly lower than in mouse (mice are much easier to breed!) and for some closely located genes in human the relative ordering is still unknown. Moreover, despite the fact that the average number of genes in a human-mouse "conserved block" is about 10, some of the blocks consist of a single gene, thus making it impossible to infer a sign of these blocks. These problems forced us to make a number of arbitrary decisions while deriving the order of synteny groups in human and mouse. The rapid progress in human-mouse comparative mapping leaves no doubt that in a few years complete and unambiguous information about human-mouse synteny groups will be obtained.

Centromeres represent another difficulty in analyzing chromosomal rearrangements. We have chosen to ignore the positions of centromeres since the molecular structure and evolution of centromeres are very poorly understood. In particular it is unclear whether a transformation of an inactive centromere into an active one and vice versa is a frequent evolutionary event. From this perspective, ignoring centromeres might be the most reasonable approach at the moment. We also ignore transpositions since they are extremely rare in chromosome evolution.

Under all these limitations we derived a human-mouse gene order consisting of 138 conserved gene blocks. For this (tentative) gene order, a most parsimonious scenario of human-mouse evolution involves 131 reversals/translocations/fusions/fissions, thus "improving" the Nadeau and Taylor, 1984 and more recent Copeland et al., 1993 estimates. Note that our estimate is constructive, unlike all the previous estimates. At the same time this estimate should be taken with caution until a more reliable gene order is produced by experts in human-mouse comparative mapping.

Of course, gene orders for just two genomes are hardly sufficient to delineate a correct rearrangement scenario. Comparative gene mapping has made possible the generation of comparative maps for 28 species representing different mammalian orders (O'Brien and Graves, 1991). However, the resolution of these maps is significantly lower than the resolution of the human-mouse map. Since conventional comparative physical mapping is very laborious and time consuming, one can hardly expect that the tremendous efforts involved in obtaining the human-mouse map will be repeated for other mammalian genomes. However, a newly developed experimental technique called chromosome painting allows one to derive gene order without actually building an accurate "gene-based" map! In the past, applications of chromosome painting were limited to primates (Jauch et al., 1992), and attempts to extend this approach to other mammals were not successful because of DNA sequence diversity between distantly related species. Very recently, an improved version of chromosome painting called ZOO-FISH, capable of detecting homologous chromosome fragments in distant mammalian species, was developed (Scherthan et al., 1994). In April 1995, Rettenberger et al. reported successful completion of the human-pig chromosome painting project. In a relatively inexpensive experiment, Rettenberger et al., 1995 identified 47 conserved blocks common to human and pig and used these data for analyzing human-pig evolution. The success of the human-pig chromosome painting project indicates that gene orders of many mammalian species can be generated with ZOO-FISH inexpensively. This provides an invaluable new source of data for attacking the 100-year-old problem of mammalian evolution with a new constructive approach, in contrast to previous ones based on the statistics of point mutations. This paper makes the first step in this direction, but the problem of analyzing genome rearrangements in multiple genomes remains open.

**8 Acknowledgments**

We are indebted to Joe Nadeau for many helpful insights on the biology of genome rearrangements and comparative human-mouse physical mapping. We are also grateful to Vineet Bafna and Webb Miller for many helpful comments and to Jannan Eppig for her help with the Mouse Genome Database.

References

[1] M. Aigner and D. B. West. Sorting by insertion of leading element. Journal of Combinatorial Theory, 45:306-309, 1987.

[2] V. Bafna and P. Pevzner. Genome rearrangements and sorting by reversals. In 34th Annual IEEE Symposium on Foundations of Computer Science, pages 148-157, 1993. (to appear in SIAM J. Computing).

[3] V. Bafna and P. Pevzner. Sorting by reversals: Genome rearrangements in plant organelles and evolutionary history of X chromosome. Molecular Biology and Evolution, 12:239-246, 1995a.

[4] V. Bafna and P. Pevzner. Sorting by transpositions. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 614-623, 1995b.

[5] D. Cohen and M. Blum. Improved bounds for sorting pancakes under a conjecture. 1993 (manuscript).

[6] N. G. Copeland, N. A. Jenkins, D. J. Gilbert, J. T. Eppig, L. J. Maltais, J. C. Miller, W. F. Dietrich, A. Weaver, S. E. Lincoln, R. G. Steen, L. D. Stein, J. H. Nadeau, and E. S. Lander. A genetic linkage map of the mouse: Current applications and future prospects. Science, 262:57-65, 1993.

[7] T. Dobzhansky and A. H. Sturtevant. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics, 23:28-64, 1938.

[8] S. Even and O. Goldreich. The minimum-length generator sequence problem is NP-hard. Journal of Algorithms, 2:311-313, 1981.

[9] W. H. Gates and C. H. Papadimitriou. Bounds for sorting by prefix reversals. Discrete Mathematics, 27:47-57, 1979.

[10] S. Hannenhalli. Polynomial algorithm for computing translocation distance between genomes. In Combinatorial Pattern Matching, Proc. 6th Annual Symposium (CPM'95), Lecture Notes in Computer Science, pages 162-176. Springer-Verlag, Berlin, 1995.

[11] S. Hannenhalli, C. Chappey, E. Koonin, and P. Pevzner. Genome sequence comparison and scenarios for gene rearrangements: A test case. In Genomics, 1995. (to appear).

[12] S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In Proc. 27th Annual ACM Symposium on the Theory of Computing, pages 178-189, 1995a.

[13] M. Heydari and I. H. Sudborough. On sorting by prefix reversals and the diameter of pancake networks. 1993 (manuscript).

[14] A. Jauch, J. Wienberg, Stanyon, N. Arnold, S. Tofanelli, T. Ishida, and T. Cremer. Reconstruction of genomic rearrangements in great apes and gibbons by chromosome painting. Proc. Natl. Acad. Sci., 89:8611-8615, 1992.

[15] M. Jerrum. The complexity of finding minimum-length generator sequences. Theoretical Computer Science, 36:265-289, 1985.

[16] J. Kececioglu and D. Gusfield. Reconstructing a history of recombinations from a set of sequences. In 5th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 471-480, 1994.

[17] J. Kececioglu and R. Ravi. Of mice and men: Evolutionary distances between genomes under translocation. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 604-613, 1995.

[18] J. Kececioglu and D. Sankoff. Exact and approximation algorithms for the inversion distance between two permutations. In Combinatorial Pattern Matching, Proc. 4th Annual Symposium (CPM'93), volume 684 of Lecture Notes in Computer Science, pages 87-105. Springer-Verlag, Berlin, 1993. (Extended version has appeared in Algorithmica, 13:180-210, 1995.)

[19] J. Kececioglu and D. Sankoff. Efficient bounds for oriented chromosome inversion distance. In Combinatorial Pattern Matching, Proc. 5th Annual Symposium (CPM'94), volume 807 of Lecture Notes in Computer Science, pages 307-325. Springer-Verlag, Berlin, 1994.

[20] J. H. Nadeau and B. A. Taylor. Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. USA, 81:814-818, 1984.

[21] S. O'Brien and J. Graves. Report of the committee on comparative gene mapping in mammals. Cytogenet. Cell Genet., 58:1124-1151, 1991.

[22] P. A. Pevzner and M. S. Waterman. Open combinatorial problems in computational molecular biology. In 3rd Israel Symposium on Theory of Computing and Systems, pages 158-163. IEEE Computer Society Press, 1995.

[23] G. Rettenberger, C. Klett, U. Zechner, J. Kunz, W. Vogel, and H. Hameister. Visualization of the conservation of synteny between humans and pigs by heterologous chromosomal painting. Genomics, 26:372-378, 1995.

[24] D. Sankoff, G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA, 89:6575-6579, 1992.

[25] H. Scherthan, T. Cremer, U. Arnason, H. Weier, A. Lima de Faria, and L. Fronicke. Comparative chromosomal painting discloses homologous segments in distantly related mammals. Nature Genetics, 6:342-347, April 1994.

Figure 1: Evolution of genome $\Pi$ into genome $\Gamma$ (steps shown: reversal, translocation, fusion, fission).

Panel (c) data: signed permutation (+3 -5 +8 -6 +4 -7 +9 +2 +1 +10 -11); unsigned permutation (0 5 6 10 9 15 16 12 11 7 8 14 13 17 18 3 4 1 2 19 20 22 21 23).

Figure 2: (a) Optimal sorting of a permutation $\pi = (3\ 5\ 8\ 6\ 4\ 7\ 9\ 2\ 1\ 10\ 11)$ by 5 reversals. (b) Breakpoint graph $G(\pi)$. (c) Transformation of a signed permutation into an unsigned permutation $\pi$ and the breakpoint graph $G(\pi)$. Gray edges (8, 9) and (22, 23) are oriented while gray edges (4, 5) and (18, 19) are unoriented. Cycles C and F are oriented while cycles A, B, D and E are unoriented. Gray edges (6, 7) and (12, 13) are interleaving while gray edges (6, 7) and (4, 5) are non-interleaving. (d) Interleaving graph $H_\pi$ with two oriented and one unoriented component.
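The transformation of a signed permutation into an unsigned one in Figure 2(c) can be sketched as follows. The rule used here, mapping $+x$ to the pair $(2x-1, 2x)$ and $-x$ to $(2x, 2x-1)$ and framing the result with $0$ and $2n+1$, is inferred from the numbers shown in the figure; the function name is illustrative.

```python
def unsigned_image(signed):
    """Map a signed permutation of 1..n to an unsigned permutation of 0..2n+1.

    +x becomes (2x-1, 2x) and -x becomes (2x, 2x-1); 0 and 2n+1 frame the result.
    """
    n = len(signed)
    image = [0]
    for x in signed:
        if x > 0:
            image += [2 * x - 1, 2 * x]
        else:
            image += [2 * -x, 2 * -x - 1]
    image.append(2 * n + 1)
    return image

if __name__ == "__main__":
    # Signed permutation read from panel (c) of Figure 2.
    print(unsigned_image([3, -5, 8, -6, 4, -7, 9, 2, 1, 10, -11]))
```

For the signed permutation of panel (c) this reproduces the vertex sequence 0 5 6 10 9 ... 22 21 23 shown in the figure.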


Panel (a) data:

Genomes: $\Pi$ = (-3 -2)(-1 +4 +5 +6 +7 +12)(+10 +9 +11 +8); $\Gamma$ = (+1 +2 +3 +4)(+5 +6 +7 +8)(+9 +10 +11 +12)

Cappings: $\hat\Pi$ = (+13 -3 -2 +14)(+15 -1 +4 +5 +6 +7 +12 +16)(+17 +10 +9 +11 +8 +18); $\hat\Gamma$ = (+13 +1 +2 +3 +4 +14)(+15 +5 +6 +7 +8 +16)(+17 +9 +10 +11 +12 +18)

Concatenates: $\hat\pi$ = +13 -3 -2 +14 +15 -1 +4 +5 +6 +7 +12 +16 +17 +10 +9 +11 +8 +18; $\hat\gamma$ = +13 +1 +2 +3 +4 +14 +15 +5 +6 +7 +8 +16 +17 +9 +10 +11 +12 +18

Panel (f) data:

$\Pi$ = (-3 -2)(-1 +4 +5 +6 +7 +12)(+10 +9 +11 +8)
translocation: (-3 -2)(-1 +4 +5 +6 +7 +8)(+10 +9 +11 +12)
fusion: (-3 -2 -1 +4 +5 +6 +7 +8)(+10 +9 +11 +12)
fission: (-3 -2 -1 +4)(+5 +6 +7 +8)(+10 +9 +11 +12)
reversal: (+1 +2 +3 +4)(+5 +6 +7 +8)(+10 +9 +11 +12)
reversal: (+1 +2 +3 +4)(+5 +6 +7 +8)(-9 -10 +11 +12)
reversal: (+1 +2 +3 +4)(+5 +6 +7 +8)(+9 -10 +11 +12)
reversal: (+1 +2 +3 +4)(+5 +6 +7 +8)(+9 +10 +11 +12) = $\Gamma$

Figure 3: (a) Genomes $\Pi$ and $\Gamma$, cappings $\hat\Pi$ and $\hat\Gamma$ and concatenates $\hat\pi$ and $\hat\gamma$. (b) Graph $G(\hat\pi,\hat\gamma)$. Tails are shown as white boxes. (c) Graph $G(\hat\Pi,\hat\Gamma)$ is obtained from $G(\hat\pi,\hat\gamma)$ by deleting the tails. Caps are shown as white circles. (d) Graph $G(\Pi,\Gamma)$ with 4 cycles and 6 paths ($c(\Pi,\Gamma) = 10$). $\Pi$-caps are shown as boxes while $\Gamma$-tails are shown by diamonds. For genomes $\Pi$ and $\Gamma$, $b(\Pi,\Gamma) = 15$, $r(\Pi,\Gamma) = 0$, $p(\Pi,\Gamma) = 1$, $s(\Pi,\Gamma) = 1$, $\mathit{gr}(\Pi,\Gamma) = \mathit{fr}(\Pi,\Gamma) = 0$. Therefore $d(\Pi,\Gamma) = 15 - 10 + 1 + 0 + \lceil (1 - 0 + 0)/2 \rceil = 7$. (e) Graph $G(\hat\Pi,\hat\Gamma)$ corresponding to an optimal capping $\hat\Gamma$ = (+13 +1 +2 +3 +4 -15)(-14 +5 +6 +7 +8 +18)(+17 +9 +10 +11 +12 +16). Added gray edges are shown by thick dashed lines. (f) Optimal sorting of $\Pi$ into $\Gamma$ with seven rearrangements.
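The seven-step scenario of Figure 3(f) can be replayed mechanically. The sketch below is a minimal illustration: chromosomes are modeled as Python lists of signed integers, the four helper functions are simplified stand-ins for the rearrangement operations, and the order of the steps is an assumption read off the figure panel.

```python
def reversal(chrom, i, j):
    """Reverse the segment chrom[i:j] and flip the sign of every gene in it."""
    return chrom[:i] + [-g for g in reversed(chrom[i:j])] + chrom[j:]

def translocation(a, b, i, j):
    """Exchange the suffix a[i:] with the suffix b[j:]."""
    return a[:i] + b[j:], b[:j] + a[i:]

def fusion(a, b):
    """Concatenate two chromosomes into one."""
    return a + b

def fission(chrom, i):
    """Split one chromosome into chrom[:i] and chrom[i:]."""
    return chrom[:i], chrom[i:]

def replay_figure_3f():
    """Apply the seven rearrangements of Figure 3(f) to genome Pi."""
    c1, c2, c3 = [-3, -2], [-1, 4, 5, 6, 7, 12], [10, 9, 11, 8]
    c2, c3 = translocation(c2, c3, 5, 3)  # exchange +12 and +8
    fused = fusion(c1, c2)                # (-3 -2 -1 +4 +5 +6 +7 +8)
    c1, c2 = fission(fused, 4)            # split after +4
    c1 = reversal(c1, 0, 3)               # -3 -2 -1  ->  +1 +2 +3
    c3 = reversal(c3, 0, 2)               # +10 +9    ->  -9 -10
    c3 = reversal(c3, 0, 1)               # -9        ->  +9
    c3 = reversal(c3, 1, 2)               # -10       ->  +10
    return [c1, c2, c3]

if __name__ == "__main__":
    print(replay_figure_3f())
```

The number of operations applied (one translocation, one fusion, one fission and four reversals) matches the distance of 7 given by the formula in the caption.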

Algorithm Genomic Sort($\Pi$, $\Gamma$)

1. Construct the graph $G = G(\Pi,\Gamma)$
2. Close all $\Pi\Gamma$-paths in simple components of $G(\Pi,\Gamma)$ (lemma 6)
3. Close all but one $\Pi\Gamma$-path in components having more than one $\Pi\Gamma$-path inside them
4. while $G$ contains a path
5.     if there exists a $\Pi\Pi$-path in $G$
6.         find an interchromosomal or an oriented edge $g$ joining this $\Pi\Pi$-path with a $\Gamma\Gamma$-path (lemma 4)
7.     elseif $G$ has more than 2 semi-knots
8.         find an interchromosomal or an oriented edge $g$ joining $\Pi\Gamma$-paths in any two semi-knots (lemma 5)
9.     elseif $G$ has 2 semi-knots
10.         if $G$ has the greatest real-knot
11.             find an edge $g$ closing the $\Pi\Gamma$-path in one of these semi-knots
12.         else
13.             find an interchromosomal or an oriented edge $g$ joining $\Pi\Gamma$-paths in these semi-knots (lemma 5)
14.     elseif $G$ has 1 semi-knot
15.         find edge $g$ closing the $\Pi\Gamma$-path in this semi-knot
16.     else
17.         find edge $g$ closing arbitrary $\Pi\Gamma$-path
18.     add edge $g$ to the graph $G$, i.e. $G \leftarrow G + \{g\}$
19. find a capping $\hat\Gamma$ defined by the graph $G = G(\hat\Pi,\hat\Gamma)$
20. sort genome $\hat\Pi$ into $\hat\Gamma$ (theorem 3, Hannenhalli and Pevzner, 1995)
21. sorting of $\hat\Pi$ into $\hat\Gamma$ mimics sorting of $\Pi$ into $\Gamma$

Figure 4: Algorithm Genomic Sort
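The branching inside the while loop of Genomic Sort depends on only three facts about the current graph. The sketch below isolates that decision; it does not build or search the graph. The enum, the function and its three arguments are hypothetical names introduced for illustration, and the enum values are the line numbers of Figure 4 cited in the proof of Theorem 4.

```python
from enum import Enum

class Step(Enum):
    """Edge-selection rules of Genomic Sort, keyed by their line in Fig. 4."""
    JOIN_PP_WITH_GG = 6
    JOIN_TWO_OF_MANY_SEMI_KNOTS = 8
    CLOSE_ONE_OF_TWO_SEMI_KNOTS = 11
    JOIN_LAST_TWO_SEMI_KNOTS = 13
    CLOSE_ONLY_SEMI_KNOT = 15
    CLOSE_ARBITRARY_PATH = 17

def choose_step(has_pp_path, semi_knots, has_greatest_real_knot):
    """Pick the rule used for the next gray edge.

    has_pp_path: whether the graph still contains a PiPi-path.
    semi_knots: current number of semi-knots.
    has_greatest_real_knot: whether the graph has the greatest real-knot.
    """
    if has_pp_path:
        return Step.JOIN_PP_WITH_GG
    if semi_knots > 2:
        return Step.JOIN_TWO_OF_MANY_SEMI_KNOTS
    if semi_knots == 2:
        if has_greatest_real_knot:
            return Step.CLOSE_ONE_OF_TWO_SEMI_KNOTS
        return Step.JOIN_LAST_TWO_SEMI_KNOTS
    if semi_knots == 1:
        return Step.CLOSE_ONLY_SEMI_KNOT
    return Step.CLOSE_ARBITRARY_PATH

if __name__ == "__main__":
    print(choose_step(False, 2, True).name)
```

Finding the actual edge in each branch relies on lemmas 4 and 5 and is not attempted here.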
