
IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 12, NO. 2, JUNE 1993

Maximum Likelihood, Least Squares, and Penalized Least Squares for PET
Linda Kaufman
Abstract- The EM algorithm is the basic approach used to maximize the log likelihood objective function for the reconstruction problem in PET. The EM algorithm is a scaled steepest ascent algorithm that elegantly handles the nonnegativity constraints of the problem. We show that the same scaled steepest descent algorithm can be applied to the least squares merit function, and that it can be accelerated using the conjugate gradient approach. Our experiments suggest that one can cut the computation by about a factor of 3 by using this technique. Our results also apply to various penalized least squares functions which might be used to produce a smoother image.

I. INTRODUCTION
POSITRON emission tomography (PET) is used to study blood flow and metabolism of a particular organ. The patient is given a tagged substance (such as glucose for brain study) which emits positrons. Each positron annihilates with an electron and emits two photons in opposite directions. The patient is surrounded by a ring of detectors, which are wired so that whenever any pair of detectors senses a photon within a very small time interval, the size of which is system-dependent, the count for that pair is incremented. In a matter of minutes, several million photon pairs may be detected. The reconstruction problem in PET is to determine a map of the annihilations, and hence a map of the blood flow, given the data gathered by the ring of detectors. There are two main approaches given in the literature: convolution backprojection [28], which was originally devised for CAT, and the probability matrix approach, which better captures the physics of the positron annihilation but in practice has not been as popular as convolution backprojection. There are two main arguments usually leveled against the probability approach: in the first place, the images are often speckled, and secondly, they can be expensive to produce. There have been various proposals for different merit functions (maximum likelihood (ML) [31], least squares (LS), maximum a posteriori) intended to give a better image. Various smoothers have been proposed, which tend to consider nearest neighbor interactions (see Green [9], Hebert and Leahy [11], Lange [19], Geman and McClure [7], and Levitan and Herman [22]). These smoothing techniques also choose a particular solution when there is no unique one with the ML or LS approaches. A disadvantage is that they have parameters which must be determined. Herman and Odhner [12] have arrested some of the controversy over the desirability of some
Manuscript received September 19, 1991; revised July 23, 1992. The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974. IEEE Log Number 9208162.


of the merit functions by showing that the suitability of an approach depends on the medical application, and sometimes the speckling is inconsequential. The EM algorithm proposed in [29] and [21] is the basic approach for solving the ML problem. Techniques have been suggested for speeding up each iteration of the EM algorithm by taking advantage of the fact that the algorithm is well suited for parallel computation [4], [24], [13], and by decreasing the number of unknowns by multigridding and adaptive gridding [27], [26]. Various people have suggested treating the steps of the EM algorithm as a direction, and then using an inexact line search to speed up convergence. The EM algorithm is a scaled steepest ascent algorithm. The scaling is a very good way to incorporate the nonnegativity constraints. However, as a steepest ascent technique, it is a linearly convergent scheme. Steepest ascent algorithms are notorious for going across steep canyons rather than along canyons, and for taking very small steps whenever the level curves are ellipsoidal. Using a line search usually improves the rate of convergence, but the algorithm is still linearly convergent. However, one can create a superlinearly convergent scheme using the ideas of the conjugate gradient algorithm. The conjugate gradient algorithm uses a linear combination of the current step and the previous one to create directions which are A-orthogonal, where A is an approximation to the Hessian matrix. It tends to go along canyons. Tsui et al. [30] have used the conjugate gradient algorithm with the least squares objective function, and showed that in one setting, LS-CG was ten times faster than ML-EM. In Section II, we develop the EM algorithm for both the maximum likelihood and least squares merit functions, and we show that EM for ML is equivalent to applying EM to a continually reweighted least squares problem. The Kuhn-Tucker conditions which are used to develop the EM algorithm elegantly incorporate the nonnegativity constraints. Most algorithms that have been used for the LS problem either do not incorporate the constraints (see [30]), include them more as an afterthought (see [16]), or force one to decide whether a variable is small or 0 (see [2] and [17]). Moreover, the same techniques apply to various smoothing penalty functions. Using the same type of technique with various merit functions eliminates some of the factors that tend to obscure the issue of determining which, if any, is the best merit function. In Section III, we discuss ways of accelerating the EM algorithm for least squares computation. Our techniques are similar to those discussed in [15] for the maximum likelihood


function. Like [30], we turn to the conjugate gradient algorithm, but we suggest a preconditioned conjugate gradient algorithm (PCG) based on the scaled steepest descent algorithm in order to take into consideration the nonnegativity constraints that they ignore. Our algorithm is similar to the one proposed by Kawata and Nalcioglu [17], but we give a bit more freedom in choosing a diagonal scaling matrix. We also suggest that the scaling in the EM algorithm might not be optimal. Applying our PCG algorithm to the least squares function tends to reduce the number of iterations by about a factor of 3 over the EM-ML algorithm. Moreover, acceleration techniques based on function differences should be much more acceptable to the medical imaging community than these same techniques were when applied to the maximum likelihood function in [15], since differences in function values can be computed more easily for LS than for ML. We show that with little modification, the algorithms can be used with a merit function with a smoothing penalty term, such as a squared difference as in [22] or an ln(cosh), as in Green [9]. Adding a smoothing penalty term decreases the amount of speckling; however, the appropriate weighting parameters in these penalty approaches depend on the total number of annihilations counted and the shape of the image, and adjusting them might not be an easy task. In Section IV, numerical results are given.

When comparing various merit functions, the algorithm used to optimize a particular merit function must be considered. Just because an algorithm is producing iterates that appear to have converged does not mean that the optimum of the function has been obtained. An algorithm that stops when the gradient is small may terminate prematurely in a region that is "almost" flat; iterates could be bouncing back and forth between the sides of a steep canyon, and thus might appear to be converging. In general, for an underdetermined system, when the solution is not unique or there are multiple local optima, various algorithms will approach different optima. Different algorithms may take different paths to the solution, and using even the same stopping criteria may produce different results. For example, if roundoff error is not considered, the conjugate gradient algorithm is guaranteed, with an initial guess of 0, to find the least squares solution that has minimum norm; in general, it will determine the solution that is closest to the starting point. The initial guess also tends to be a big contributing factor to the appearance of an image. Images obtained using different algorithms or starting guesses to optimize the same objective function might be radically different.

As in the case with the EM algorithm for ML, the images become snowier as the algorithms converge. Some researchers suggest terminating the EM algorithm before the speckling obscures the image, while others suggest some type of smoothing. However, as the results in Kaufman [15] indicate, there is little difference between the images produced using EM for least squares and EM for ML. The main features of the image appear early in the sequence of pictures produced by the EM algorithm applied to LS and in those produced by the EM-based PCG algorithm, and the PCG approach is still just as effective, especially if it is used in a multigrid setting.

II. EM APPLIED TO LS AND ML AND PENALIZED LIKELIHOOD

In the discrete reconstruction problem, one has data \eta, where \eta_t represents the number of photon pairs detected in tube t. One would like to determine x(z), i.e., the number of photon pairs emitted at z. In general, this is not computationally feasible, but one can impose a grid of B boxes on the affected organ and try to compute as unknowns x_b, the number emitted in box b. We assume that a matrix P can be constructed such that p_{b,t} represents the probability that a photon pair emitted in box b will be detected in tube t. We would like the x's to be nonnegative, and it would be nice if the sum of the emitted pairs equals the sum of the detected pairs.

There are various mathematical approaches to determine the map of the annihilations. One approach, suggested by Shepp and Vardi [29], is based on the assumption that the emissions occur according to a spatial Poisson point process in a certain region. If the \eta_t's are assumed to be independent Poisson variables with means \lambda_t = \sum_{b=1}^{B} x_b p_{b,t}, Shepp and Vardi show that x can be found by maximizing the likelihood function

    L(x) = \prod_{t=1}^{T} e^{-\lambda_t} \lambda_t^{\eta_t} / \eta_t!

where T is the number of detecting tubes. The vector x which maximizes L(x) also maximizes l(x) = log(L(x)), whose gradient is much simpler to compute than that of L(x). The gradient of l(x) is given by

    \partial l(x)/\partial x_b = -\sum_{t=1}^{T} p_{b,t} + \sum_{t=1}^{T} \eta_t p_{b,t} / ( \sum_{b'=1}^{B} x_{b'} p_{b',t} ).        (2.2)

The Kuhn-Tucker conditions (see [8] or [23]) for maximizing l subject to nonnegativity constraints are

    x_b \, \partial l(x)/\partial x_b = 0   for b = 1, ..., B

and

    \partial l(x)/\partial x_b \le 0   if x_b = 0.        (2.4)

The Kuhn-Tucker conditions along with the formula given in (2.2) lead to the EM algorithm of Dempster et al. [5], proposed for PET by Shepp and Vardi [29] and Lange and Carson [21]. Assuming that \sum_{t=1}^{T} p_{b,t} = 1, the update is

    x_b^{(new)} = x_b^{(old)} \sum_{t=1}^{T} \eta_t p_{b,t} / ( \sum_{b'=1}^{B} x_{b'}^{(old)} p_{b',t} ).        (2.5)
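For concreteness, the multiplicative EM-ML update above can be sketched as follows (illustrative Python only, not from the original paper; the names P, eta, and x are assumptions, with P stored as a B x T array of detection probabilities whose rows sum to 1):

import numpy as np

def em_ml_step(x, P, eta, eps=1e-12):
    # One EM-ML update: x_b <- x_b * sum_t eta_t * p_bt / (sum_b' x_b' * p_b't)
    u = P.T @ x                       # projected tube means lambda_t
    psi = eta / np.maximum(u, eps)    # ratio of observed to projected counts
    return x * (P @ psi)              # back-project and rescale; rows of P sum to 1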

Another way of rewriting the EM algorithm is

    x^{(new)} = x^{(old)} + X^{(old)} \nabla l(x^{(old)}),        (2.11)

where X^{(old)} is the diagonal matrix containing the current iterate, so that the EM algorithm might be thought of as a scaled steepest ascent algorithm with the distance of each element to the nonnegativity constraint used as the scale factor. In the rest of the paper, any scaled steepest descent or ascent method using this scale factor we will call an EM-like algorithm.

Another approach for determining the annihilation map is the least squares approach in which, given the tube counts \eta, one minimizes

    f(x) = (1/2) \| P^T x - \eta \|^2        (2.7)

subject to nonnegativity constraints. The gradient of (2.7) is \nabla f = P (P^T x - \eta). We note that for a rather small problem, one may impose a 128 x 128 grid leading to 16 384 unknowns, and that there might be 128 detectors, or 128 x 127/2 columns in P. P would then have 125 million elements, of which only about 1.6 million are nonzero. Because of its size and density structure, using a factorization of the P matrix to solve (2.7) is ill advised.

Using the Kuhn-Tucker conditions for minimizing a function subject to nonnegativity constraints, one can derive an EM-like algorithm for the least squares function, namely,

    x_b^{(k+1)} = x_b^{(k)} - x_b^{(k)} z_b   for b = 1, 2, ..., B,  where z = P (P^T x^{(k)} - \eta),

which is the EM-LS algorithm. In matrix form, it can be written as follows.

EM-LS: For k = 1, 2, ..., until convergence:
1) Set z = P (P^T x^{(k)} - \eta)
2) Set x^{(k+1)} = x^{(k)} - X^{(k)} z.

The EM-LS algorithm and the EM algorithm for the maximum likelihood function, which we will call EM-ML, are rather similar. This becomes more apparent when EM-ML is written in matrix form as follows.

EM-ML: For k = 1, 2, ..., until convergence:
1) Set u = P^T x^{(k)}
2) Set \psi_t = \eta_t / u_t  for t = 1, ..., T
3) Set y = P \psi
4) Set x^{(k+1)} = X^{(k)} y.

Both EM-LS and EM-ML require matrix-vector multiplications with matrices P and P^T, and thus require roughly the same amount of work per iteration.

Another way to look at the correspondence between the two methods is to consider minimizing

    w(x) = (1/2) \| D (P^T x - \eta) \|^2        (2.12)

where D is a diagonal weighting matrix. The gradient of w(x) is simply \nabla w = P D^2 (P^T x - \eta), and, as we did for f(x), we can derive an EM-like algorithm for w(x). Let e represent the vector containing all 1's. Because all the row sums of P are 1, Pe = e. Thus, if u_t = \sum_{b'} x_{b'} p_{b',t} and D were allowed to change each iteration with d_{tt} = (1/u_t)^{1/2}, then

    P D^2 (P^T x - \eta) = P (e - \psi) = e - P\psi = e - y,

so that the EM-like step x^{(k+1)} = x^{(k)} - X^{(k)} (e - y) = X^{(k)} y is exactly the EM-ML step. It should then be obvious that the iterates obtained from the EM-ML algorithm would be exactly those obtained from applying an EM-like algorithm to a continually reweighted least squares problem.

Our development of the EM algorithm can also be extended to merit functions that include a penalized potential function. In the likelihood situation, these can take the form l(x) - U(x), where U(x) is designed to penalize large differences in estimated parameters for neighboring boxes and has the general form

    U(x) = \gamma \sum_j \sum_{i \in N_j} v(x_j, x_i)        (2.14)

where \gamma is a positive constant and N_j are usually the eight or so nearest neighbors. Various suggestions have been given for v, including the squared difference

    v(x_j, x_i) = (x_j - x_i)^2

suggested in [22], Green's suggestion in [9] of

    v(x_j, x_i) = ln(cosh((x_j - x_i)/\delta)),        (2.16)

where \delta is another parameter to be set, and the nonlinear function of Hebert and Leahy [11],

    v(x_j, x_i) = ln(1 + (x_j - x_i)^2 / S),        (2.17)
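As a rough illustration of these neighbor penalties, the sketch below evaluates U(x) on a 2-D image array with an 8-neighbor stencil; the function and variable names are assumptions for the example and do not come from the paper.

import numpy as np

def neighbor_diffs(img):
    # Differences between each pixel and its E, S, SE, SW neighbors (each pair counted once)
    d = [img[:, 1:] - img[:, :-1],
         img[1:, :] - img[:-1, :],
         img[1:, 1:] - img[:-1, :-1],
         img[1:, :-1] - img[:-1, 1:]]
    return np.concatenate([a.ravel() for a in d])

def penalty(img, gamma=1.0, delta=1.0, S=1.0, kind="quad"):
    # U(x) = gamma * sum over neighboring pairs of v(x_j, x_i) for the three choices of v above
    d = neighbor_diffs(img)
    if kind == "quad":                     # squared difference of [22]
        v = d ** 2
    elif kind == "lncosh":                 # Green's term (2.16); large d/delta can overflow cosh
        v = np.log(np.cosh(d / delta))
    else:                                  # Hebert-Leahy term (2.17)
        v = np.log1p(d ** 2 / S)
    return gamma * float(v.sum())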

2 .16) and is easier to evaluate and seems to have fewer numerical considerations. For the least squares problem.la) subject to (3.la) (3. Thus. if for some 6. and others suggested in [ 111 and [7]. the objective method is changed to reflect the constraints. However. One travels along s until some approximation to f has been sufficiently decreased or a variable becomes negative. but as suggested by various authors. which is often the case in tomography. nonnegative. as in the well-known active set techniques sometimes used for such problems. lb).16).KAUFMAN: ML. as in the EM algorithm and in barrier methods. and then a bent line approach can be applied.1). General Minimization Algorithm: 1) Determine dl). . If there are initially many variables that are positive. The third approach involves the explicit incorporation of the constraints into the search direction s. (3. The approach assumes that it is important to determine whether variables are 0. one has to be "close enough' to get convergence. and + + . Secondly. These. (3. Included in this category is the EM algorithm. Nonnegativity is maintained by restricting a. the EM-ML algorithm follows the general outline given above with a set to 1. it may be fine. Many algorithms for minimizing (3. For EM-LS. it is better to do an inexact line search that stops when f has been sufficiently decreased than to do an exact line search. One can obviously add V(z) to f(z)and form a penalized least squares function and apply the EM algorithm as given above. Often. Thus. The form of (2.16) be appropriate to the problem. The tests in 12) indicate that when there are more than a few variables at bound. whenever a component of z as becomes 0. For some methods. one might consider sb = 0. A. described in Section 11. s can be determined without considering (3. the constraints can be used in initially forming s as in the third approach. but it is critical that IS in (2. and as one travels along s. and the reader is referred to a recent paper by Beierlaire et al. it is not that important whether a variable is 0 or just close to 0. After the step is taken. MAP will denote a penalized least squares function using (2. AND PENALIZED LS FOR PET 203 which also has an additional parameter. the constraints can be explicitly used while forming s. at each iteration. many small steps might be required. the EM-LS algorithm. LC using the log(cosh) function in (2. As shown by Lange [ 191. ALTERNATIVE Y S WA THE LEASTSQUARES FOR SOLVING PROBLEM so that there is a preservation of tube counts in the tomography problem. In the active set procedures. We also point out that extending our results to merit functions involving a penalized smoothing term as in (2. The problem of small steps is partially overcome in the bent line approach. Moreover. such that z 2 0. For others. Moreover. The parameter a in step 2b) is often used to obtain a sufficient decrease in f along s and to maintain feasibility. are all easy to differentiate. The direction s is chosen to minimize some approximation to f in the space of the free Variables. In the remainder of this paper. we will discuss several general approaches for minimizing f(z)= +llpTz all.2) Step 1) of the above algorithm should not be treated lightly. Various algorithms have been suggested for determining a.Allowing a to vary in the EM-like algorithms gives much greater flexibility. negative elements are set to 0. one would have There are four main approaches to handling the constraints in the general algorithm given above. 
s might be considered a bent line. there may be a large number of variables at 0. and LN using the penalty term in (2.14) is easy. 111. As our data indicate in the next section. the iterates will always be nonnegative. in which elements that are small and elements that are zero might be displayed by the same color.17). then for a > 8. . it is set to 1 . until convergence: ( ' a) Determine a search direction s) b) Determine a step size a ( k ) c) Set z ( k + l ) z(k) +a(k)s(k). it penalizes high deviations between neighboring boxes excessively. Here. starting at z = 0 spells disaster. LS. s would bend. which sets a = 1. does not guarantee that the tube counts will be preserved or that f(z("')) 5 f(z('1). and even. which does not guarantee feasibility. a can be larger than in the active set approach so that some of the elements of x may become initially negative. Because the ultimate use of the variables in tomography is a picture. which will eventually be driven to zero. if the EM-ML algorithm were modified to include a step size parameter. . which a priori were thought to be positive.15) is very conducive to a least squares situation. the ln(cosh) function has many desirable properties. and as proved in [31].lb) have the following general form. Finally. [2]. and works well when one knows a priori almost all the variables that will be at bound. with a line search and various interior points methods given in [ 181 and other recent papers. the constraints would determine the breakpoints in the bent line. the active set procedures are rarely cost effective for the tomography problem. like the EM-LS algorithm of Section 11. The term suggested by Hebert and Leahy seems to have most of the advantages of (2.lb) Our aim in this discussion is to determine ways to accelerate the EM-LS algorithm given in Section 11.13. Ensuring Nonnegativity In this section. Thus. one separates the elements of x into those which should be kept at 0 and the variables that are free to vary. The bent line approaches assume that ( ' the task of reevaluating f at the projections of z ) for various values of cr is much less than computing a new direction s. certain algorithms produce much better pictures when the initial guess is uniform. In the first place. as long as cy(') = 1. xb & s b is 0. Thirdly. For certain algorithms for solving (3. 2) For k = 1 .
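The step-size bound that keeps an iterate nonnegative, used by the bounded and bent line searches discussed above, is the breakpoint alpha_max = min over {b : s_b < 0} of x_b / (-s_b). A small helper (illustrative Python, not from the paper; it is reused in the later sketches):

import numpy as np

def max_feasible_step(x, s):
    # Largest alpha with x + alpha*s >= 0, i.e. min over s_b < 0 of x_b / (-s_b)
    neg = s < 0
    return np.inf if not np.any(neg) else float(np.min(x[neg] / -s[neg]))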

then the standard nonlinear conjugate gradient algorithm will produce the same sequence of iterates as the linear conjugate gradient method applied to the system P P T z = Pq. If the function is quadratic. Computational tests suggest that this strategy makes little progress in tomography problems. originally proposed by Hestenes and Stiefel in 1952 [14].until convergence: a) Set s = -Wg. (See [3]. - If one had a penalized least squares function of the form f(z) U ( z ) where U ( z ) might be something like (2.) The linear conjugate gradient algorithm. the minimum of f along s d) Set Q = min(0. 12. If PT has m positive distinct singular values. as in (3. the new search direction b) Set U = P T s c) Set 0 = y/uTu. then one would subtract a U ( z ) / a zfor g in the bounded line search EM-LS algorithm and change the formula for 0 in (5c) accordingly. one only goes as far as the constraint and restarts at the top of the algorithm. 5) For IC = 2 . not an easy task. Thus. steep-sided valley. 2.) In our experience. the conjugate gradient algorithm is guaranteed to converge in at most m steps. an iterative method is used to approximately solve the system. It is probably a bad strategy in general. the conjugate gradient method is optimal over a class of procedures that is easy to implement. large steps in the "freer" variables are tolerated. 4) Set y = g T W g . (See [8]. The search directions tend to go along steep valleys.(z(~+') z * ) ~ Q ( I QRI. Assume x* is the solution to the problem P P T z = P q so that + + --* = ( I + Q R ~ ( Q ) ) ( ~ ( O ) -z*). but rarely as quickly as those that involve the distance to the constraints in the determination of s. is used to iteratively solve a symmetric positive semi-definite linear system when it is easy to do a matrix-vector multiplication with the coefficient matrix. + Let E(z(k+l)) l(z(k+l) z * ) T Q ( z ( k + l ) . one quickly gets somewhat close to the solution. The advantage of incorporating the constraints into s is that. = - + B. VOL.la).(Q)is a polynomial of degree IC. When the level curves of a function are very elongated hyperellipsoids and minimization is tantamount to finding the lowest point in a flat. their algorithm involves solving a huge linear system involving P . Q ( I + QRI. . good pictures are obtained in significantly fewer less than m steps. The bounded line search EM algorithm. As we shall see in Section IV. Determining s and Acceleration Schemes In (3. One can use an active set strategy in the space of free variables. of degree IC. all of which are equivalent in infinite precision arithmetic. proposed first by Fletcher and Reeves in 1964 [6]. However. and there are no constraints. . The variant that seems least sensitive to roundoff error in finite precision arithmetic is LSQR [25]. rather than across them. ~ i i r i ~ ~ < ~ ( . there are a number of ways LSQR can be modified to take into consideration nonnegativity constraints. The search directions s satisfy the condition that s.(Q))(z(O) z*). If the approximation is linear. In particular.corresponding to elements close to 0.q).PPTs3 = 0 for i # j. and for the least squares problem. 3) Let W be the diagonal matrix containing dl). The strength of the conjugate gradient method is captured in the following theorem recast from Luenberger [23]. Bounded Line Search EM-LS: 1) Determine dl).+')) . The conjugate gradient algorithm is an easy to use algorithm which generates gradients that are mutually orthogonal. 
The directions generated are too similar and information is not gathered in subspaces orthogonal to these directions.~ ~ I . Several altematives exist which remove or decrease the effect of ellipsoidal level curves.204 IEEE TRANSACTIONS ON MEDICAL IMAGING. and again restart whenever one goes past the first bend.z*)T = min Rk .16).2). but then little progress is made. elements of s. in many fewer steps than the bounded line search EM-LS algorithm. + The nonlinear conjugate gradient algorithm.z* 1 implying E(z(I. and consider the class of procedures given by . a steepest descent algorithm. the direction s is chosen to approximately minimize f or a modification of f to include constraint information. The point z('+l) generated by the conjugate gradient method satisfies qz("+l)) . every step of the conjugate gradient method is at least as good as the steepest descent step would be from the same point. . as in the EM algorithm.' ) / ~ ~ ) ) = dk-') as e) Set f) Reset W to the diagonal matrix containing z' () g) Set z = P u h) Set g = g a z i) Set y = g t W g . and whenever taking a step of Q to minimize f . follows the third approach. 2) Set g = p ( P z ( l ). tends to traverse across the valley rather than going along the valley. 3 . such as the EM algorithm. each involving a matrix by vector multiplication to determine s. . One can do a bent line search approach. Theorem: Assume do)is the initial guess of an iteration procedure and Q = P P T . JUNE 1993 a standard unconstrained method is used to determine s of the new function. is used in function minimization. One can use a quadratic approximation as in the method suggested by Han et al.(z(k+') . NO. are kept small. The above theorem states that in one sense. violates a nonnegativity constraint. There are various ways of stating the conjugate gradient algorithm for minimizing a quadratic function.(k+l) = Rlc(Q)Vf(z(O') where RI.(Q))(JO)z*) where the minimum is taken with respect to all possible polynomials RI. usually. As explained in Section 111-A in general terms. given below. [lo]. The active set methods and the bent line methods eventually reach this state. or an approximation thereof. dramatic decreases in function values can be obtained very quickly. Often.) The efficacy of this . (See [2].
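A minimal sketch of the bounded line search EM-LS iteration described above (illustrative only, not the paper's implementation; it assumes NumPy arrays and the max_feasible_step helper from the earlier sketch): the direction is the EM scaling s = -Wg with W = diag(x), the exact minimizer of the quadratic merit along s is (g^T W g)/(u^T u) with u = P^T s, and the step is capped at the nonnegativity breakpoint.

def bounded_em_ls(x, P, eta, iters=32):
    # Scaled steepest descent for f(x) = 0.5*||P^T x - eta||^2 with a bounded line search
    for _ in range(iters):
        g = P @ (P.T @ x - eta)                 # gradient of f
        s = -x * g                              # EM scaling of the steepest descent direction
        u = P.T @ s
        theta = (g @ (x * g)) / (u @ u)         # exact minimizer of f along s
        alpha = min(theta, max_feasible_step(x, s))
        x = x + alpha * s
    return x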

The traditional linear conjugate gradient algorithm is variable becomes 0. + given by Kawata and Nalcioglu [17] (KN). miny(b)<Ox i k ) / s i " ) Wcorresponding to nonzero values of z are set to 1. if there is a restart every iteration. one might consider a conjugate gradient EM algorithm with a line search EM step is taken. all the elements that eventually will be 0 are determined.5) guarantee that if xJ = 0. set a = min(6.y$('+1) = T $ ( k ) . is positive w 1 / 2 p p T w 1 / 2 2 = w1/2pv (3. one never sees the power of the PCG approach nonnegativity. accelerated in PCG. most of the work is involved in multiplication by P and P T . reset w3/.increment k by 1.6) that is updated each time there is a number of steps. was that for each iteration. If xJ/has the where W1/'2 = z. c) Set c = ~ ( ~ ) / p ( ' T . applied to (3. there is no PPTz= Pr) (3. that the KN algorithm overcomes. although the designed to solve the system algorithm will terminate in a finite number of steps.Jk+l) -wg .) If it is assumed that W is a of the inner PCG algorithm guarantees convergence in a finite constant matrix in (3. allowing only one element to change better conditioned than the old one and the algorithm will and checking the sign of the gradient only after termination converge faster. and i> Set p('"+l) = -. miny(b. y = (yTWy)l/'. S') it would be nice not to scale the search direction by T . reported in [ 151. would use s = -Wg. set a = max(6. The smoothness in which the f) Set y = Pu . Of course.S().>O xjk)/sz(") be made in that iteration. given above.S s ( k ) . and S = O/p(') When s. the whole algorithm is just the standard linear conjugate gradient algorithm applied to the preconditioned system. and and the algorithm is restarted with a new preconditioner. (a1 < (61. j) Set s+ (' ') = W g . operations that must be done with the standard EM algorithm. it will never increase. In practice. One problem with the above algorithm is that once a (3. This returns us to the old a) Set d = P T y . then little progress can If 6 < 0.6) is as follows. restart the algorithm. k to 1.. and also depends on how the bent line search is implemented.p y nonnegativity constraint is handled in the EM algorithm does g> Set y = ( y T ~ y ) 1 / 2 not appear in the KN algorithm. then the linear conjugate gradient algorithm. One can be very lucky.$('). if z. there is one problem with the EM algorithm. Let W be the diagonal matrix with wii = xi. = approach initially so that one momentarily permits infeasibility 7) Until convergence iterate: and then sets negative elements to 0. in Section PCG: IV. The matrix W is usually called a largest negative gradient for all z's that are 0. which can be numerically e')If ( a (< (61. Set six iterations. In our examples. A third possibility involves using the constraints more explicitly. and if not. and after a few iterations. Because the objective function is quadratic. Although the algorithm does not seem simple. AND PENALIZED LS FOR PET 205 approach is problem-dependent. The line search EM algorithm. One way around this Solving (3. however. Thus. If the algorithm never hits a constraint.KAUFMAN: ML. la). (3. f(z)could not be further decreased by letting z3 become positive again. +(k) = . To accelerate the EM algorithm in the same way that the conjugate gradient algorithm accelerates the steepest descent algorithm. Thus. LS.5) is also equivalent to solving problem is to check on termination whether d f /ax. a restart was necessary so that 4) Set y = Pu. In the KN algorithm.x. a restart. < 0. 
this can be overcome by instituting a bent line 6) Set dk) W g . be accelerated just as the EM algorithm for LS is containing dk). because a constraint was hit. whenever a constraint is hit. its variants like PCG above. It also reweights the problem so that information corresponding to nonnegligible x's is considered more important. > 0 or there is no fear of hitting the boundary. In theory. 6 = ry.= p / p ( ' ) ) The PCG algorithm given above is similar to the algorithm d) Set p ( k ) = (p(k))2 @2)1/2. The EM algorithm is very sensitive to the initial guess. and p ( k ) = y. go back to step 2 ) difficult and rather unnecessary. .yu problem. where g is the gradient of f ( z ) . retaining for zero x's in the inner loop and restarts immediately. 2 ) Set d = r) . h) Set 9 = Y/Y However. the role of the preconditioner is to ensure that the distance to the nearest constraint is large enough so that progress is not hindered. In algorithm PCG. the elements of d')If 6 > 0. The EM algorithm for maximum likelihood can. and one gets the superlinear convergence associated with the conjugate gradient technique. (See [8].Jl preconditioner because it is assumed that the new system is to max. one is forced to determine e> Set z ( k + l ) = z ( k ) + as(k) whether an element is small or 0.6) for all zJ = 0. of trying to determine which elements are b) Set p = lldll2 and U = d / P "0" rather than letting the EM algorithm itself do it. as in the EM algorithm.PTz(')).4) LSQR algorithm as given in [25] and algorithm PCG given Let us derive such an algorithm for the quadratic problem above are steps d') and e') and the inclusion of W . occurred about every 1) Determine dl). the conjugate gradient inner iteration was never activated. whether it be in the . termination is assured. Moreover. The problem we encountered in practice 3) Let = I(dllz and u = d/$('). is small but nonzero and s. and starting with a random start is disastrous. W to the diagonal matrix Set in principle. one usually checks the gradient restart. As 5 ) Let g = y / ( ~ ~ ~ y ) l / ~ . In practice. but the differences and 6 = q $ ( k ) ) / p ( k ) are very important. The main differences between the standard . Notice that the search direction generated in step j) is just a linear combination of the EM step and the previous direction.
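The preconditioned conjugate gradient idea can be illustrated with the hedged sketch below. It is not the paper's LSQR-based PCG; it simply runs conjugate gradients on the normal equations P P^T x = P eta with the diagonal preconditioner W = diag(x) held fixed between restarts, and it restarts with a truncated step whenever the nonnegativity breakpoint is reached. It reuses max_feasible_step from the earlier sketch.

import numpy as np

def pcg_em(x, P, eta, outer=8, inner=16, eps=1e-12):
    for _ in range(outer):
        W = np.maximum(x, eps)            # preconditioner built from the current iterate
        r = P @ (eta - P.T @ x)           # residual of the normal equations
        g = W * r                         # preconditioned residual
        s = g.copy()
        for _ in range(inner):
            q = P @ (P.T @ s)
            alpha = (r @ g) / (s @ q)
            amax = max_feasible_step(x, s)
            if alpha >= amax:             # constraint hit: truncate step and restart outer loop
                x = x + amax * s
                break
            x = x + alpha * s
            r_new = r - alpha * q
            g_new = W * r_new
            beta = (r_new @ g_new) / (r @ g)
            s = g_new + beta * s
            r, g = r_new, g_new
    return x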

l /and 1 5 K 5 Xo/X1 where ~) Xo. we give some of our numerical results using a computer simulation of a PET scan. NUMERICAL RESULTS In this section. wherea = ( l . JUNE 1993 ML setting or the LS setting. ( ~ ) c) Set z ( k + l ) = z ( k ) + . the rate of convergence not only depends on the condition of the system. As pointed out by Herman and Odhner [ 121. then one could take advantage of this property. the convergence is superlinear [8]. However. A good discussion of penalty functions and the rate of convergence of restarted conjugate gradient algorithms is found in Luenberger [23]. mins(k)<o .5. There is another change which one may wish to make with the W matrix in the PCG algorithm. When the conjugate gradient algorithm is applied to a nonlinear function. and minimizing penalized least square functions. and other factors which would.206 IEEE TRANSACTIONS ON MEDICAL IMAGING. of course. the coarser grid tends to bleed through and give unnecessary artifacts . With the EM approach. Set 2) Set d = 1 .i is set to 1 if yi > 0. which were chosen to imitate the brain's typical tracer concentration. one is concerned with the ability of any approach to produce an z which. The phantom that we will use to generate the data and that we want to reconstruct is given in Picture 1 . This is admittedly a small problem. NO. First of all.a P d g) Let g = y . 12. When the problem is underdetermined. say of the form y z T z . Moreover. Even in our situation with the preconditioner changing. we assume we have a single detector ring of 128 equally spaced detectors mounted on it. If initially w.)Is(k) d) If la1 < 1 1 increment IC by 1 . but also on the distribution of the eigenvalues. If the nonsingular portion of Q is well conditioned. Because our algorithms all involve matrix-vector multiplications. A few large eigenvalues. The purpose of our experiments is to determine whether there is any real difference between the pictures produced by plotting the z's produced by optimizing the maximum likelihood function.VU(z). containing ~ ( ' 1 . 6 ) Set s = Wg. VOL. these can be added. minimizing the least squares function. 2. On the other hand. However. We will also determine whether it matters how the least squares function is optimized. when turned into a picture. if one is not very careful. if one has a consistent system. go back to step 2) 6. especially if y is less than XI. e) Set d = P T s f ) Set y = y . and the suitability of any test depends on the ultimate medical application. in a roundoff-errorfree environment. theoretically. PCG-Penalized: 1) Determine dl). it is not difficult to include a penalty term. the tube data for our problems have a total emission count of 10 million photon pairs. be considered in a production run situation are not included.~ : ~ ) / s . There are several ways for measuring the success of a technique. one would be increasing K .~ .P T d k ) )Set W to the diagonal matrix . the number of + . if the eigenvalues are shifted far enough. An initial homogeneous guess is a good starting point for W. but should be relatively large.VU(z) h) Set y = (gTWg) i) Set 6 = y / p j) Set s('+l) = Wg 6 d k ) k) Set p = y. IV. In a multigrid or an adaptive grid situation. the minimum of j ( d k ) a ( ) s') U(&) as(") b) Set Q = min(d. Attenuation. and convergence is assured within a finite number of iterations. 4) Let g = y . ( ' ) 7) Until convergence iterate: a) Determine 6. In order to take advantage of a penalty smoothing term as in (2. which may arise from say a penalty term.l / ~ ) / ( l + ~ . 
they do not affect the rate of convergence. the iterates will always lie in the range space of PT if the starting guess is 0. detector efficiency. The picture is particularly speckled.14). As discussed by Axelsson [l]. Unless otherwise stated. finding d requires only a few operations. 3) Set y = P d . which are the small regions in Picture 1 with rates of 1. In our simulation. We will always be working with the 128 x 128 square grid that the detector ring circumscribes so that the type of grid will not be a factor in the problem. the conjugate gradient algorithm seeks out the solution with the least norm if the initial guess is 0. one may want to use as an initial guess information that might come from a coarser grid. if one adds a penalty term. which could be the case in tomography. then the resulting problem may be even more amenable to the conjugate gradient approach. If the penalty term is quadratic. IC to 1. If there are any zero eigenvalues of Q. starting with an initial guess of 0 is not an option given in our current PCG scheme with the W matrix defined above. then adding a penalty term to f may make the resulting system ill conditioned. 5 ) Set p = gTWg. these approaches can lead to conflicting results. do not necessarily hinder convergence. one will have to evaluate f ( z ) U ( z ) or its gradient. It is made up of eight ellipses. and each element is computed using the angle of view mechanism of [29]. the rate of convergence of the conjugate gradient algorithm is not necessarily adversely affected by the fact that Q may be singular. If it is not quadratic. The picture should also not have other regions that may be + + + + Obviously. A1 are the extreme positive eigenvalues of Q. iterations of the conjugate gradient algorithm required to reach a relative accuracy of E is at most . Our P matrix will only consider the detector ring geometry.i = xi. the PCG algorithm may be reworked as follows. Little improvement is made in those components which are very small initially. In fact. People often suggest a penalty regularization term to a least squares problem to handle the case when the matrix Q = P P T is semi-definite. allows an observer to visually determine regions of increased activity.
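The convergence estimate quoted in the garbled passage above appears to be the standard conjugate gradient bound (see Axelsson [1]); in the notation of this section, with \kappa = \lambda_0/\lambda_1 the ratio of the extreme positive eigenvalues of Q, a common statement of it is (reconstructed here, so the exact constants should be treated as hedged):

    k \le \tfrac{1}{2}\sqrt{\kappa}\,\ln\frac{2}{\epsilon} + 1,
    \qquad
    E(x^{(k)}) \le 4\left(\frac{1-1/\sqrt{\kappa}}{1+1/\sqrt{\kappa}}\right)^{2k} E(x^{(0)}),

so the number of iterations grows only with the square root of the condition number of the nonsingular part of Q.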

Picture 1. The phantom used in the computer simulation.
when it really is not very close to the actual solution. Since the cost of doing a crude line search usually is so outweighted by its potential benefits. Note that no attempt was made to minimize the least squares value during the EM-ML algorithm. a step size of 0. as shown in [15]. Thus. Pnmrom 1. Moreover. 1 (a). Unless stated otherwise. After 32 iterations.KAUFMAN: ML. the features of the phantom for the EM-ML and EM-LS computations were quite well defined. the least squares value of (2. there was no surprise that the PCG algorithm. . a step size of 1. The graph shows the difference that a preconditioner makes.06. AND PENALIZED LS FOR PET 207 101 I I .0051 on the first iteration with LS produced a negative iterate. However. I I . Such numerical information is sometimes missing in the tomography literature. the algorithms were started with the same approximation which is uniform within the inscribed circle of the grid and 0 outside that circle. one should be concerned with the rate of convergence-how much computational time is required before the tumors appear.0 PEr le& 0 IO 20 30 0 IO 20 30 Fig. it is much easier to determine relative intensities rather than absolute intensities. Our computations were done on the SGI 4D/240. There is no such guarantee with the least squares (LS) merit function or the penalized LS merit function and the EM algorithm. Function Plots Fig. . By going as far as 32 iterations. The guess is scaled so that the sum of the unknowns equals the computed tube count. Thus. the bounded line search EM-LS algorithm. In 1151. but speckling had not really occurred. In 32 iterations. doing a bounded line search is highly recommended.4 -:::I 02 I led7 -0. I(x) 0. Thus. like function values. The least squares values were just computed and printed.
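The remark above that function evaluations in a bounded line search are cheap for LS can be made concrete: along a fixed direction s the least squares merit is a one-dimensional quadratic, so after one extra multiplication by P^T every trial step size costs only scalar work. A sketch (illustrative Python; the names are assumed):

def ls_profile(x, P, eta, s):
    # Returns phi(alpha) = 0.5*||P^T(x + alpha*s) - eta||^2 as a cheap scalar function of alpha
    r = P.T @ x - eta
    u = P.T @ s                               # the one matrix-vector product per direction
    f0, g1, h = 0.5 * (r @ r), r @ u, u @ u
    return lambda alpha: f0 + alpha * g1 + 0.5 * h * alpha ** 2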

3 shows the efficacy of the preconditioned conjugate gradient approach on a problem with a ring of 256 detectors which circumscribe a 128 x 128 square grid.14). Fig. many iterations with small changes can often add up to a large change.~ ~ l l z / l ~ p < l. It strongly suggests that those advocating a smoothing term consider using the conjugated gradient approach as an accelerator. and the image produced by PCG on the MAP function with y = 0 . In Fig. 1. so that the overall differences were quite small. In fact. but not recommended in [2]. with the constraints handled in a variety of ways. 12. determining that this is the case with a process like EM-LS where function values and projected gradients decrease gradually is much more difficult than with a procedure like PCG where there is an initial rapid decrease in these values. Function values versus iteration for problem wlth 256 detectors. The RESTRICTED LSQR algorithm was the algorithm mentioned. the algorithm takes a projected gradient step which would obtain a sufficient decrease that would satisfy the Goldstein-Armijo criteria for convergence. The differences are largest where there are large transitions between regions in the phantom (e. really seems to be quite similar to choosing when to stop the nonpenalized PCG iteration. Choosing the penalty parameter byd ismap-0 1 IO U) 30 0 IO 20 30 Fig. the curve tracing f(s) with the MAP function using the preconditioned conjugate gradient algorithm is initially similar to the curve for that algorithm applied to the nonpenalized function. Assuming that there are iterates that have captured the signal with little noise. Fig.0 and 0. which is a cross between the restricted LSQR algorithm and a bent line search projected gradient algorithm. The CONTINUOUS PCG algorithm is similar to the PCG algorithm. Recall that in a bent line search algorithm for the nonnegativity constraint case. we have considered a problem which even with the nonnegativity constraints is underdetermined. em-ml 0 IO 20 lterauons 30 Fig. 2(a) shows the power of the conjugate gradient approach over the scaled gradient approach with the penalized function MAP. there are slight differences.) Thus. and I O million tube counts. 2. the preconditioned algorithms outperform the algorithms that do not use a preconditioner.. (b) Iterations. There are two significant features of these graphs. The PROJGR algorithm used the inexact bent line search algorithm given in [2].0071. In the next section. Although small changes in the function value is often used as a criterion for convergence. Thus far. not the iterate at the last restart. VOL. but before the image has been affected greatly by the noise in the data. JUNE 1993 detector efficiency might be more appropriate. near the skull). These extra evaluations are not counted in our graph. the preconditioned conjugate gradient algorithm is applied to problems using the penalizing terms LC and LN with y = S = 100 for both cases. we compared the PCG image. 2. In the first place. which is the same algorithm as PCG with nonzero elements of W always set to 1. Again. 1 for a problem with 1 million tube counts. zp at iteration 8. Our data support the methodology of the comparison in [ 121. 4 gives the least squares values for each iteration for some of the same algorithms used in Fig. 2(a) considers the penalized smoothing approach with the penalty term MAP with two values of y in (2. an element of the iterate that becomes negative is reset to 0 before the merit function is evaluated. 
This accounts for the wavy nature of the PROJGR curve. . and 11sp . 1 . where we give some line plots of early iterations. those curves with the penalized term which use the PCG approach tend to follow the nonpenalized curve for a while and then flatten out. 2(b). a more common situation.. The graph shows the difference that a preconditioner makes. Function value versus iteration for 1 million tube counts. several function evaluations might be required on each restart. (See [8]. the . Fig. e. the function value is that of the nonpenalized term. not the sum of f ( z ) and U ( z ) . similar function values do not guarantee similar iterates. (a). 6. In Fig. at ~ ~ l 2 the same iteration. Again. NO.g. a restart was necessary immediately almost every time the restricted LSQR algorithm was begun. Of course. 4.208 IEEE TRANSACTIONS ON MEDICAL IMAGING. 128 x 128 gnd. There were only four restarts for 32 iterations. Function value versus iteration for I O million tube counts using penalty term.em-Is "1 10 20 iterations 30 Fig. Only the projected gradient steps produced a reasonable descent. Fig. Fig. 3. In our example. Secondly. Because the scale factor in the search direction is changed with each iteration. and the smoothing penalty function tends to smooth out these transitions. Again. but changes W each iteration so that its diagonal reflects the current iterate. It is often suggested that one stop the EM algorithm before convergence when one has obtained a good fit of the signal in the data. all the algorithms are some form of LSQR. In the PROJGR algorithm. one would expect fewer restarts than with the PCG algorithm. whenever the restricted LSQR algorithm is restarted and the first step violates the nonnegativity constraints. For the penalized function.g.1. l(b).

In many iterative processes. Particularly with LC. Iteration 16. Figs. 11-16 look at the Y coordinate of the phantom at X = 0 for various algorithms for the 1 million tube count problem. 5. Iteration 8. y and 6 were set 100. but the absolute intensity is still lagging. All the difficulties seem enlarged in this case. Fig. Since all the algorithms and merit functions require about the same amount of work per iteration. and one is now seeing the raggedness that usually characterizes the EM-ML algorithm at late iterations. if there is a change in the tube counts. 5.1 with the true histogram for I O million tube counts. Figs. Notice that the intensity of the tumor is just about correct by the eighth iteration for the PCG algorithm. the least squares function also has this unpleasant feature. the answers before the iterates have converged bare little resemblance to the final solution. using a penalty function is a viable approach. as our next problem suggests. One could obviously have stopped the PCG algorithm at iteration 8. Figs. a tumor may be missed. Iteration 16. Ten million tube count. By showing plots for reconstructed images stopped well before convergence. A graph of the CONTINUOUS PCG algorithm is almost indistinguishable from that of the PCG algorithm. 1. Ten million count. one must stop before convergence. AND PENALIZED LS FOR PET 209 3 m I - 3000 em-is 2000 1000 -1 -0. The added effort of the PROJGR is simply not worth the cost.5 I 0 I 0. but the tumor is ill defined. 9 and 10 consider different penalizing terms with different parameters.5 1 Fig. 6. y was set to 10 and 6 to 100. and if the penalty parameter is not adjusted.5 0 r I 1 0. 6 compares the eighth iteration for the PCG algorithm and the PCG algorithm on the penalized function with penalty term MAP and y set to 0. Line Plots Let us now consider comparing the algorithms visually. The tumor is well defined for the EM-LS algorithm. The tumor is well defined by the eighth iteration for the PCG algorithm. one has to be careful with 6 to avoid overflow while evaluating the cosh function during line searches. Iterating further did not improve the situation. and hence is not given. the least squares function would be changed. I I - em-1s I usto to gram I -1 4.5 I ’ 1 - PmJP -r -1 1 -0. Ir coordinate of phantom at X = 0 Ten million tube count B. we have indicated that this is not the case in our studies. our images taken at early iterations indicate their quality if one is forced to stop the processes at a specific time for whatever reason. Fig 8 I -1 4 s I I 0 I 05 I 1 PCG algorithm is faster than the standard EM algorithm. Ten million count. like budgetary or computer failure. Assuming that the penalty parameters do not have to be reset often. 5-10 are line plots at z = 0 using the problem with 10 million tube counts and 128 detectors. LS.5 I Fig. as one might in an application. the produced images can often given useful information. If for some reason. l rcoordinate of phantom at S = 0 . below the point “0.5 0 0. The tumor defined by the EM-ML algorithm is better defined than in the eighth iteration.coordinate of phantom at S = 0. The penalty approaches did give a rather smooth picture with a well-defined tumor. We begin with line plots that are important if absolute intensities are important. Figs.KAUFMAN: ML. The main signal in the data seems to emerge quickly. and the plots for the unpenalized function and the penalized one are almost identical. 7. With MAP when y was set to I . In Fig. 
The parameters for the penalized functions were changed to give a more realistic picture. For LC and LN. Iteration 8. Determining suitable parameters was a rather tedious process. Thus. However. lr coordinate of phantom at S = 0 . we compare the line plot for the EM-ML and EM-LS algorithms at the eighth iteration with a histogram of the true emission count that generated the data. an unexpected result. as did the unpenalized function at iteration 8. Fig. the tumor was slightly damped. In Picture I . For the penalty term in LN. and further iteration tends just to produce artifacts . The curve tends to be smoother for the PROJGR algorithm. The tumor is hardly defined by the EM-ML algorithm. This will give us an idea whether low function values actually correspond to useful information. 7 and 8 compare the algorithms for the I O million count problem at the 16th iteration. 0” there is a tumor at which the emission count should be almost at 2000.

The penalty parameters used in the previous example with 10 million counts for MAP and LN were completely wrong. the EM-LS algorithm seems to capture the intensity of the tumor better than the EM algorithm applied to the maximum likelihood function.. Ten million count.. Y coordinate of phantom at X count problem.5 1 Fig. With LN. and finally gave a most satisfying result.5 0 0. Y coordinate of phantom at X = 0. Y coordinate of phantom at X = 0.. I - In100. . Reducing S to 10 caused overflow problems during the line search procedure.5 1 -1 4.5 0 0. Fig.:: . The extra parameter in LN was useful... 2. 14.. as pointed by Herman and Odhner [12]..100 CCI rmu) L100. . Iteration 8.0 1 . Pictures When comparing objective functions and algorithms designed to optimize these functions. With MAP...l hislapu . Iteration 16. Iteration 16..100 In l0..... 13. = 0. C. One million tube that may be mistaken for tumors. Iteration 16.5 0 0. a larger parameter was needed to smooth the plot. the utility of a picture depends on the ultimate application: a pleasing picture that does not give sufficient information is useless. :: : . However.. 3oo . However.. . Y coordinate of phantom at X c .. 1 -1 I I I I I: L I - em-ls p j op 100 I I I I -0. I I I I I I I I * ' I I -1 -0.. 16..5 1 I I I 1 I -1 -0.. Y coordinate of phantom at X = 0.5 1 Fig.5 0 0. NO.. and at worst hazardous. 10.5 1 Fig. 9. smoothing by adjusting a penalty parameter is at best annoying.lW I I I I I 1 -1 4. I would rather apply smoothing afterwards to the unpenalized function rather than trying to guess beforehand what to do..100 :..: . 15.. reducing both y and 6 in LC is appropriate. 2000- 1000- 0. Ten million count. Y coordinate of phantom at X count problem. and a bespeckled picture that . Reducing both to 20 gave a smoother picture. VOL. .0 mapl. One million tube - I-010 n1. Iteration 16. - In100. In general. the previous parameters smoothed out the tumor.. Fig.. = 0...5 0 05 1 Fig._ I I I 1 I -0. = 0..5 0 0. Iteration 16. .... Fig. ) map0. the pictures produced are the most important measure.5 0 0.5 1 -1 -0. 12.. adjusting the parameters is not a task one wishes to repeat often. At least for this problem. : -1 ... Iteration 16.. ) ."%.. Because this problem has fewer counts.. 200- - un-ls loo0I '. Fig. Iteration 8. 11.. 12.r. Y coordinate of phantom at X = 0. .. . but the differences are really minimal.210 IEEE TRANSACTIONS ON MEDICAL IMAGING. Y coordinate of phantom at X = 0. JUNE 1993 3000 ) ..5 0 05 I I 1 I I I I I -1 -0..

1. In pictures 2 4 . For MAP. but the small tumors appear more distinct earlier than in the nonaccelerated algorithms. this may be important. Picture 4. and adjusting the parameters so that convergence would be obtained when f(z)was about that given by iteration 10 or 14 using the preconditioned conjugate gradient algorithm on the nonpenalized function. the pictures show the methods at comparable points in the computation. looking at the function values produced. However. There appears to be little difference between the solution obtained using PCG and PCGO. The different orientations of the two lower tumors seem to be evident by iteration 16 for PCG and iteration 32 for EM-LS. Images after iteration 8 with the 10 million tube count problem. Theoretically. a penalty parameter of 1. both y and b were set to 100. there does not appear to be that much difference between iteration 8 of the PCG algorithm for the nonpenalized case and the converged penalized pictures. but are not discernible for the EM-ML approach. With this problem. The speckling that five years ago caused concern about the EM algorithm applied to the maximum likelihood function also appears when the EM algorithm and its variants are applied to the least squares merit function.” If the zth component of the gradient is positive. we see the sensitivity of the parameter to the tube count. Picture 6 treats the problem with 256 detectors and 10 million tube counts. Thus. captures exactly the right information is useful. With all the penalty functions. the penalty parameters for all of the penalty functions were determined by first guessing one parameter. for the problem with 1 million tube counts. However. in the absence of roundoff error. LS. Moreover. there is no question whether there is a unique answer since the number of boxes . one could choose parameters which on convergence gave a much more pleasing picture than the converged nonpenalty function case. Picture 5 gives the 32nd iteration for various penalized least squares functions obtained using the preconditioned conjugate gradient algorithm outlined in Section 111. Pictures 2 4 give iterations 8. AND PENALIZED LS FOR PET 211 Picture 2. then uiii is initially set to 0. For some medical applications.0 is better than that obtained with y = 0. when the least squares function is not unique. and 32 for the 10 million count problem with 128 detectors. Images after iteration 16 with the 10 million tube count problem. Hopefully. Picture 3. Since the major portion of work of each iteration for each method is the two matrix multiplications with the P matrix. PCGO indicates starting with an initial guess of “0. Picture 5. otherwise. a picture that would be good for one application might not be good for another. Images after iteration 32 with the 10 million tube count problem with various penalizing terms. Images after iteration 32 with the 10 million tube count problem. 16. Thus. PCGO will choose the one with minimum norm. For LN and LC. later in picture 10. in a production situation. the speckling appears sooner.KAUFMAN: ML. it is initially set to 1. In truth. tube counts will not differ by a factor of 10 or else one really needs a trained person manipulating penalty parameters.0 smoothed out the tumor too much. the solution will lie entirely in the range space of P T . the better setting in the 10 million tube count case. For the PCG algorithm. the picture produced with y set to 1.

Pictures 7-10 consider the 1 million tube count problem. In Picture 8, the speckling is much more of a hindrance than in the previous problem, but this might not be important for a particular application. The preconditioned conjugate gradient algorithm converged very quickly to a bespeckled picture, and the EM algorithm applied to the maximum likelihood function gives a clearer picture. Adding a penalty term produced a less snowy picture. Smoothing using a penalty function approach does improve the pictures, but one has to be careful not to smooth out the tumors and to adjust the penalty parameter to account for the fact that f(x) will be smaller and that small changes are more significant than in the previous case. The reconstructions in Picture 10 obtained using the preconditioned conjugate gradient algorithm applied to the various penalty functions are all rather strikingly different from each other. For the PCG algorithm, the LC penalty term with γ = δ = 20 removed much more of the annoying speckling than the parameter setting used in the 10 million tube count problem that is shown in Picture 9. The picture tends to corroborate our previous results.

The most striking comparison occurs in Picture 11, which compares the line search EM-LS algorithm with one in which the scaled gradient direction is modified slightly. The same random starting guess was used with both algorithms. Certainly, starting with a random guess is a bad idea for the EM approach: values with low numbers will not recover with an EM-like algorithm that scales an element of the search direction by the numerical value of the variable, and starting with an initial guess of 0 does not affect the situation. The modified algorithm is a compromise between the EM algorithm and a standard active constraint approach. Whenever a component of the gradient is negative, implying that the corresponding variable should be enlarged, that component of the gradient is multiplied by the maximum element of x when determining the search direction; whenever a component of the gradient is positive, that component of the search direction is handled as in the standard algorithm (a short sketch of this rule follows the picture captions below).

Picture 6. Images after iteration 16 with the 10 million tube count problem with 256 detectors.
Picture 7. Images after iteration 8 with the 1 million tube count problem.
Picture 8. Images after iteration 16 with the 1 million tube count problem.
Picture 9. Images after iteration 32 with the 1 million tube count problem.
Picture 10. Images after iteration 32 with the 1 million tube count problem with various penalizing terms.
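As a concrete sketch of the modified direction just described (my reading of the rule, not code from the paper; the array names and the least squares gradient P^T(Px - y) are used purely for illustration), the scaling below keeps the EM-like factor x_i for components whose gradient is positive and switches to the largest element of x for components whose gradient is negative.

import numpy as np

def modified_scaled_direction(x, grad):
    # Components with a positive gradient should shrink: scale by the
    # variable itself, as in the EM-like scaled steepest descent step,
    # so they approach the nonnegativity bound smoothly.
    # Components with a negative gradient should be enlarged: scale by
    # the largest element of x, so a variable that has drifted close to
    # zero can still recover.
    scale = np.where(grad > 0, x, x.max())
    return -scale * grad

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    P = rng.random((10, 5))                    # stand-in probability matrix
    y = rng.random(10)                         # stand-in tube counts
    x = np.array([1e-6, 0.5, 2.0, 1.0, 3.0])   # one variable nearly stuck at zero
    grad = P.T @ (P @ x - y)                   # gradient of 0.5*||Px - y||^2
    print(modified_scaled_direction(x, grad))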

The modified algorithm is not as sensitive to the initial guess. For Picture 12, the inner square was begun with numbers 3/7 less than the outer portion, and the same algorithms which were used in Picture 11 were compared. Although the higher starting values do not seem to give a worse or different picture within the square, there is an undershoot-overshoot problem just at the boundary of the square with the standard EM-LS algorithm. The modified algorithm is much less sensitive to the initial guess. This suggests that any algorithm which initially uses a coarse grid to obtain an initial approximation to the solution in an efficient manner and then tries to refine the grid must use care when refining the grid to avoid a shadow of the coarse grid appearing during the fine grid iteration.

In Picture 13, we see the iterates produced using the same random starting guess and the MAP merit function using the EM-LS algorithm. The smoothing helped a bit, but not as much as applying the modified algorithm to the nonpenalized function, which is much better than applying the unmodified algorithm to the smoothing MAP function shown in Picture 13. Thus, it seems that, in general, appending a penalty term is less effective than modifying the scaling matrix when the initial guess is not uniform.

V. CONCLUSIONS

Our results show that a preconditioned conjugate gradient approach, which marries the EM algorithm to the conjugate gradient method, applied to the least squares function or to a penalized least squares function, produces satisfactory images much faster than a line search EM-ML approach. The preconditioner takes into consideration the nonnegativity constraints as easily as is done in the EM algorithm, and one is not forced to determine whether an element is small or 0. Our choice of preconditioner also permits much more progress in the nonzero variables than an approach that does not use a preconditioner.

In general, the pictures do not conclusively indicate the best merit function. The initial guess, the parameter in the penalty function, the stopping criteria, and the ability of the algorithm to handle the nonnegativity constraints all affect the pictures. Our results suggest that one must use care in choosing an algorithm to optimize whatever merit function one uses. As many others have suggested, there are very good theoretical statistical reasons for considering a penalized approach, but penalty functions usually entail determining a penalty parameter. Perhaps there is a penalty parameter setting and a penalty term that are not as sensitive as our results suggest.

Picture 11. These images all used the same random starting guess. The left-hand images used the standard EM-LS algorithm; in the right-hand images, whenever the gradient was negative, it was scaled by the largest x_i to determine the search direction.
Picture 12. These images all used the same starting guess, with the inner square having smaller values. The left-hand images used the standard EM-LS algorithm; in the right-hand images, whenever the gradient was negative, it was scaled by the largest x_i to determine the search direction.
Picture 13. The images are those of iterations 8 and 16 of the EM-LS algorithm applied to the MAP function with γ = 0.1. The left-hand images used the random starting guess of Picture 11, and the right-hand images used the starting guess of Picture 12.

ACKNOWLEDGMENT

The author wishes to thank E. Grosse, W. Coughran, and T. Duff for writing the local software which produced the pictures.

REFERENCES

[1] O. Axelsson, "Conjugate gradient type methods for unsymmetric and inconsistent systems of linear equations," Linear Algebra Appl., vol. 29, pp. 1-16, 1980.
[2] M. Bierlaire, Ph. Toint, and D. Tuyttens, "On iterative algorithms for linear least squares problems with bound constraints," Linear Algebra Appl., vol. 143, pp. 111-143, 1991.
[3] A. Bjorck, "Methods for sparse linear least squares problems," in Sparse Matrix Computations, J. Bunch and D. Rose, Eds. New York: Academic, 1976, pp. 177-200.
[4] C. Chen, S.-Y. Lee, and Z. Cho, "Parallelization of the EM algorithm for 3-D PET image reconstruction," IEEE Trans. Med. Imaging, vol. 10, pp. 513-522, 1991.
[5] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1-38, 1977.
[6] R. Fletcher and C. Reeves, "Function minimization by conjugate gradients," Comput. J., vol. 7, pp. 149-154, 1964.
[7] S. Geman and D. McClure, "Bayesian image analysis: An application to single photon emission tomography," in Proc. Statist. Comput. Sect., Amer. Statist. Assn., 1985, pp. 12-18.
[8] P. Gill, W. Murray, and M. Wright, Practical Optimization. New York: Academic, 1981.
[9] P. Green, "Bayesian reconstructions from emission tomography using a modified EM algorithm," IEEE Trans. Med. Imaging, vol. 9, pp. 84-93, 1990.
[10] C. Han, P. Pardalos, and Y. Ye, "Computational aspects of an interior point algorithm for quadratic programming problems with box constraints," in Large Scale Numerical Optimization, T. Coleman and Y. Li, Eds. Philadelphia, PA: SIAM, 1990, pp. 92-112.
[11] T. Hebert and R. Leahy, "A generalized EM algorithm for the 3-D Bayesian reconstruction from Poisson data using Gibbs priors," IEEE Trans. Med. Imaging, vol. 8, pp. 194-202, 1989.
[12] G. T. Herman and D. Odhner, "Evaluation and optimization of iterative reconstruction techniques," IEEE Trans. Med. Imaging, vol. 10, pp. 336-346, 1991.
[13] G. T. Herman, D. Odhner, K. D. Toennies, and S. A. Zenios, "A parallelized algorithm for image reconstruction from noisy projections," in Proc. Workshop Large Scale Optimization, T. Coleman and Y. Li, Eds. Philadelphia, PA: SIAM, 1990.
[14] M. Hestenes and E. Stiefel, "Method of conjugate gradients for solving linear systems," J. Res. Nat. Bur. Standards, vol. 49, pp. 409-436, 1952.
[15] L. Kaufman, "Implementing and accelerating the EM algorithm for positron emission tomography," IEEE Trans. Med. Imaging, vol. MI-6, pp. 37-51, 1987.
[16] L. Kaufman, "Solving emission tomography problems on vector machines," Annals Oper. Res., vol. 22, pp. 325-353, 1990.
[17] S. Kawata and O. Nalcioglu, "Constrained reconstruction by the conjugate gradient method," IEEE Trans. Med. Imaging, vol. MI-4, pp. 65-71, 1985.
[18] J. Lagarias and M. Todd, Eds., Mathematical Developments Arising from Linear Programming (Contemporary Mathematics, vol. 114). Providence, RI: Amer. Math. Soc., 1990.
[19] K. Lange, "Convergence of EM reconstruction with Gibbs smoothing," IEEE Trans. Med. Imaging, vol. 9, pp. 439-446, 1990.
[20] K. Lange, "Corrections to convergence of EM reconstruction with Gibbs smoothing," IEEE Trans. Med. Imaging, vol. 10, p. 228, 1991.
[21] K. Lange and R. Carson, "EM reconstruction algorithms for emission and transmission tomography," J. Comput. Assisted Tomography, vol. 8, pp. 306-316, 1984.
[22] E. Levitan and G. T. Herman, "A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography," IEEE Trans. Med. Imaging, vol. MI-6, pp. 185-192, 1987.
[23] D. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1973.
[24] A. McCarthy and M. Miller, "Maximum likelihood SPECT in clinical computation times using mesh-connected parallel computers," IEEE Trans. Med. Imaging, vol. 10, pp. 426-436, 1991.
[25] C. Paige and M. Saunders, "LSQR: An algorithm for sparse linear equations and sparse least squares," ACM Trans. Math. Software, vol. 8, pp. 43-71, 1982.
[26] T. Pan and A. Yagle, "Numerical study of multigrid implementations of some iterative reconstruction algorithms," IEEE Trans. Med. Imaging, vol. 10, pp. 572-588, 1991.
[27] M. Ranganath, A. Dhawan, and N. Mullani, "A multigrid expectation maximization reconstruction algorithm for positron tomography," IEEE Trans. Med. Imaging, vol. 7, pp. 273-278, 1988.
[28] L. Shepp and B. Logan, "The Fourier reconstruction of a head section," IEEE Trans. Nucl. Sci., vol. NS-21, pp. 21-43, 1974.
[29] L. Shepp and Y. Vardi, "Maximum likelihood reconstruction in positron emission tomography," IEEE Trans. Med. Imaging, vol. MI-1, pp. 113-122, 1982.
[30] B. Tsui, X. Zhao, E. Frey, and G. Gullberg, "Comparison between ML-EM and WLS-CG algorithms for SPECT image reconstruction," IEEE Trans. Nucl. Sci., vol. 38, pp. 1766-1772, 1991.
[31] Y. Vardi, L. Shepp, and L. Kaufman, "A statistical model for positron emission tomography," J. Amer. Statist. Assn., vol. 80, pp. 8-20, 1985.