
Review and Enhancement
Optimization Methods in Image Registration


Luong Vu Ngoc Duy

Department of Computing
Imperial College London

A thesis submitted for the degree of

MEng. Computing
June 2009

To my loving parents, my sister and brother...


Acknowledgements

Without the help of some of the kindest, smartest and most enthusiastic people, I would never have been able to produce work as good as this project.
First of all, I thank my supervisors, Prof. Berc Rustem and Prof. Daniel Rueckert, for the unlimited ideas and support they have given me. I thank Prof. Duncan Gillies for being my personal tutor and second marker, Dr. Daniel Kuhn for patiently listening to me, and Dr. George Tzallas-Regas for his discussions and suggestions at any time.
I also owe a debt to Prof. Rasmus Larsen for sending me a very useful tutorial and to Dr. Stefan Klein for his advice and explanations.
Finally, my thanks go to all of my friends, who have made my time at Imperial an exciting one.


Abstract

Image Registration is a popular technique used in applications such as medical image processing, face recognition, object flow and tracking. The objective is to minimize the difference between two images and produce the best transformation to match a deformed image to a reference image. To find this transformation, optimization is needed. In this paper, we analyse and present a framework for two optimization approaches: non-linear deterministic optimization and stochastic approximation.
For non-linear deterministic optimization, we examine the most widely used algorithms, such as Gauss-Newton, Levenberg-Marquardt, Quasi-Newton and Nonlinear Conjugate Gradient, and present two modifications, the Recursive Subsampling technique and the Weighted technique, that enhance the rate of convergence for particular types of applications.
In addition, we propose a novel approach to Stochastic Approximation that is based on Difference Sampling. This technique avoids bias in the case where there is only a small distortion in local parts of the image. It therefore reduces the variance in approximating the solution compared to Uniform Random Sampling. The stochastic optimization method analysed is Robbins-Monro.
The results show better convergence of the Weighted/Recursive Subsampling technique and the Difference Sampling technique compared to traditional methods.
Special attention is paid to nonrigid 2D monomodal image registration.

Contents

1 Introduction

2 Image Registration
2.1 Rigid and Affine Transformation
2.2 Local Deformation
2.3 Registration framework
2.3.1 Cost Function F
2.3.2 Gradient g
2.3.3 Transformation W(p)
2.3.4 Optimization
2.3.5 Pre-condition

3 Deterministic Optimization
3.1 Gauss-Newton (GN)
3.2 Levenberg-Marquardt (LM)
3.3 Quasi-Newton (QN)
3.4 Nonlinear Conjugate Gradient (NCG)
3.5 Step-size
3.6 Experiments and Results

4 Recursive Subsampling and Weighted approach for Deterministic Optimizations
4.1 Recursive Subsampling for Random Deformed Images
4.2 Weighted approach based on Difference Image for Local Deformation
4.3 Experiments and Results

5 Stochastic Approximation
5.1 Stochastic Approximation
5.1.1 Robbins-Monro and the derivative
5.1.2 Decaying sequence
5.2 Difference Sampling
5.3 Sampling Strategy
5.3.1 Deterministic Sampling
5.3.2 Stochastic Sampling
5.4 Experiments and Results

6 MATLAB Implementation: Vreg
6.1 Introduction
6.2 Implementation
6.2.1 Cost Function F
6.2.2 Image Gradient ∇I
6.2.3 Transformation W(p) and Jacobian ∂W/∂p
6.2.4 Other evaluations
6.3 User Guide

7 Conclusion

A Performance of SubSampling methods on different images




List of Tables

3.1 Complexity of the registration framework
3.2 Deterministic Optimization Performance Evaluation
4.1 Summary of results for Subsampling methods

List of Figures

1.1 Pictures of Lena
1.2 Transformation Grid for Fig 1.1(c)
1.3 Pictures of Lena after registration
2.1 Type of deformation. Edited from (34).
2.2 Example of Affine Transformation
2.3 Example of Local Deformation
2.4 Traditional Registration Framework
3.1 Deterministic Convergence Rate
3.2 Deterministic Optimization Boxplot
4.1 Example of Random Deformed Input
4.2 Recursive Subsampling Registration Model
4.3 Example of Local Deformed Input
4.4 Convergence rate of single and subsampling methods
4.5 SubSampling Convergence Rate
4.6 Lena: Performance of single and subsampling methods
4.7 Ankle: Performance of single and subsampling methods
4.8 Knee: Convergence of different grid size
4.9 Knee: Average Performance with different grid size
4.10 Lena: Convergence of different levels of deformation
4.11 Lena: Average Performance of different levels of deformation
4.12 Lena: Convergence of UnWeight and Weight methods
4.13 Average performance of UnWeight and Weight methods
5.1 Random Uniform Samples of local deformed images
5.2 Random Samples based on Difference Sampling: Deterministic strategy and Stochastic strategy
5.3 Example of 5 levels of probabilities
5.4 Random and Local Deform Images for Stochastic
5.5 Convergence of RUD by Deterministic and Stochastic methods
5.6 Convergence of RLD by Deterministic and Stochastic methods
5.7 Average Performance of Stochastic Approximation methods
5.8 Performance of Combined and Stochastic methods
A.1 Different type of images used for SubSampling Methods Test
A.2 Ankle: Average Performance of SubSampling methods
A.3 Brain: Average Performance of SubSampling methods
A.4 Knee: Average Performance of SubSampling methods
A.5 Lena: Average Performance of SubSampling methods
A.6 Lung: Average Performance of SubSampling methods



Chapter 1
Introduction
Consider the two images 1.1(a) and 1.1(c). The two images are not identical, and we want to find out whether image 1.1(c) represents the person in image 1.1(a). The grid lines in each image show the original pixel alignment. Image 1.1(b) shows the difference between the two images. If we can find a transformation grid such as the one in Figure 1.2, we can apply the transformation to image 1.1(c) and see whether the two images represent the same person (within some acceptable threshold).

Figure 1.1: Pictures of Lena. (a) Original, (b) Difference, (c) Deformed

We aim to find a transformation W that spatially aligns the two images, such that the deformed image 1.1(c) can be warped back to the reference image 1.1(a). The difference image in the examples above is obtained by taking the absolute intensity difference between every pair of corresponding pixels of the input images. The range of the difference, 0 . . . 1, represents pixel intensities from black to white. The difference image can be summarised by its Sum of Squared Differences (SSD) value. The SSD value of 1.1(b) is 595. After registration, the new difference image 1.3(b) has SSD = 26.9.

Figure 1.2: Transformation Grid for Fig 1.1(c)

Figure 1.3: Pictures of Lena after registration. (a) Original, (b) Difference, (c) Deformed
The above process is Image Registration. It is a common task in various types of applications: object recognition, optical flow, medical image processing. The process of finding an optimal transformation can be very expensive, which is a disadvantage for many applications. Some clinical processes, such as brain shift estimation based on intra-operatively acquired ultrasound (36), require almost real-time registration. Therefore, fast registration is desired. In this paper, we pay attention to intensity-based registration for 2D images.
The availability of various medical imaging modalities allows us to obtain more details of the human body's functioning and anatomy. For instance, Magnetic Resonance Imaging (MRI) systems give a detailed description of brain anatomy, while Positron Emission Tomography (PET) techniques depict the functioning and metabolic activity of the brain. In general, it is beneficial to visually align images from different modalities (multimodal image processing (45)) to obtain all available information simultaneously. Some classic image registration techniques (30) include:
Spatial registration, where all dimensions are spatial. Most applications focus on 3D-3D registration of two images that can be represented by two tomographic datasets, or the registration of a single tomographic image to any spatially defined information, e.g. a vector obtained from EEG data. 2D-2D registration may apply to separate slices from tomographic data, or to intrinsically 2D images such as portal images. 2D-3D registration is also possible but more complicated, e.g. registering a pre-operative CT image to an intra-operative X-ray image.
Time series registration, which can be used to monitor the progress of medical treatment, such as monitoring of tumor growth (medium interval), post-operative monitoring of healing (short interval), or observing the passing of an injected bolus through a vessel tree (ultra-short interval).
Intrinsic/Extrinsic registration methods: Intrinsic methods are based on image-generated content only, such as a set of identified salient points (landmarks) on object surfaces. In contrast, Extrinsic registration is based on foreign objects introduced into the imaged space. A commonly used fiducial object is a stereotactic frame (12) screwed rigidly to the patient's outer skull table. Such frames are used for localization and guidance purposes in
In this paper, image registration refers to an application for 2D-2D images that come from an identical source (monomodal), where one image is a randomly deformed version of the other. An image is usually displayed by assigning varying levels of brightness, known as gray levels or intensities, to each point in the image space. Our interest in geometrical shapes and their interrelationships requires us to impose a coordinate system on each participating image space. The points in the image space are specified by the usual Cartesian coordinates, i.e. as distances from the orthogonal coordinate system axes. Image registration can now be defined as the process of finding the one-to-one mapping between the coordinates in the image spaces of interest such that the points so transformed correspond to the same anatomical point. We also emphasize the use of the difference image, which represents the error between two images by taking the absolute difference between the two images' pixel intensities at the same coordinates.
In the next Chapter, we briefly describe the Image Registration framework. We show that the Image Registration problem can be formulated as an Optimization problem, and we therefore analyse the performance of different Optimization approaches. No single Optimization method produces the best performance for all types of applications; the choice of Optimization method depends on the particular application. We therefore review the Optimization methods based on different types of image deformation.
In Chapter 3, Deterministic Optimization methods are evaluated on randomly and uniformly deformed images. The Quasi-Newton (QN) method shows the best convergence rate; however, it suffers from outliers in some test cases. Besides Quasi-Newton, the Gauss-Newton (GN) and Levenberg-Marquardt (LM) methods produce a consistent convergence rate, and GN is slightly better than LM. The Nonlinear Conjugate Gradient (NCG) results show that the method is very dependent on the type of input images.
In Chapter 4, we present a framework for Recursive Subsampling registration. Subsampling techniques have been presented before; however, most of those concepts differ from ours. The test results show that the Recursive Subsampling technique outperforms all of the plain deterministic optimization methods. In addition, we introduce a Weighted Subsampling approach, which is inspired by the Difference Sampling of Chapter 5. Weighted Subsampling methods demonstrate even more attractive results than Subsampling methods for locally deformed images.
Finally, the main contribution of this thesis is in Chapter 5, where we propose a novel approach based on Stochastic Approximation. The type of registration that suits this method is local deformation in local parts of images, which is a very likely type of application in medical image processing. This approach employs a random non-uniform sampling method that we call Difference Sampling. The results show that, for images deformed in local parts, Stochastic Approximation with Difference Sampling produces a better convergence rate than Stochastic Approximation with Random Sampling or the Deterministic Optimization methods.
Chapter 6 provides guidance on the MATLAB implementation of the entire thesis. Chapter 7 is the conclusion. More test results are shown in Appendix A.

Chapter 2
Image Registration
Image Registration has been an active research area over the last few decades. In general, different registration techniques are used depending on the type of application. Image registration methods are primarily classified as follows:
Dimension: 2D-2D, 3D-3D, 2D-3D
Nature of Deformation: Rigid, Affine, Local Deform
Optimization procedure
Modality: monomodal, multimodal
Manual, Semi-automatic, Automatic registration
An overview of classical registration methods can be found in (30). All methods that we discuss in this paper emphasize the 2D-2D monomodal automatic registration application, with the potential to extend to 3D-3D multimodal applications.
The type of deformation (39) of an image determines the complexity of the registration problem and affects the choice of suitable registration methods. Typically, the deformation of an image is categorised by its number of parameters, or degrees of freedom: rigid transformation, affine transformation and local deformation.

2.1 Rigid and Affine Transformation

Figure 2.1: Type of deformation. Edited from (34).

The simplest transformation, the rigid transformation (up to 4 parameters), consists of translation, rotation and scale. It preserves distances, lines and angles:

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & 0 \\ p_{21} & p_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}

An Affine Transformation model is a bit more complex and has up to 8 parameters. In addition to the rigid model, the affine model compensates for global size changes and shears. It preserves parallel lines but not angles:

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
Readers should refer to (29) for an extensive review of rigid image registration methods. In (2), S. Baker and I. Matthews provided a very comprehensive analysis and extension of gradient descent image registration methods for rigid and affine transformation. They showed that Gauss-Newton (35) and Levenberg-Marquardt (27; 31) produce better performance than Newton, Steepest Descent and Diagonal Hessian.

Figure 2.2: Example of Affine Transformation

2.2 Local Deformation

The above transformation models are often regarded as global transformations. In order to account for more complex deformation, for instance the deformation of the heart muscle due to respiration in MRI, a higher-order transformation has to be used (46). One of the earliest methods was introduced by Goshtasby (16), which uses a modification of the original least-squares method. Another method, based on fluid registration (5), causes changes in intensity during the registration process. However, the use of radial basis functions as mapping transformations has shown a big advantage over other techniques. Typical radial basis function methods are the thin-plate spline registration method (33) and the B-spline model of local deformation. In this paper, we use the B-spline transformation model introduced by D. Rueckert et al. (38) because of its robustness and fast convergence for large-scale problems. For the nonrigid deformation model, we define a combined transformation consisting of a global and a local component:
T(\mathbf{x}) = T_{local}(T_{global}(\mathbf{x}))    (2.1)

where \mathbf{x} = (x, y), T_{global} is an affine transformation matrix and T_{local} is a local transformation matrix. In this paper we only examine the local deformation, therefore T_{global} is absorbed in 2.1. Following Rueckert's formulation (38), we derive a 2D model for our problem:

T_{local}(\mathbf{x}) = \mathbf{x} + \sum_{m=0}^{3} \sum_{n=0}^{3} B_m(u) B_n(v) \, p_{i+m,j+n}    (2.2)

Figure 2.3: Example of Local Deformation

where p_{i,j} is a control point of the p_x \times p_y grid with uniform spacing, B_m is the m-th cubic B-spline basis function (26), and:

i = \lfloor x/n_x \rfloor - 1
j = \lfloor y/n_y \rfloor - 1
u = x/n_x - \lfloor x/n_x \rfloor
v = y/n_y - \lfloor y/n_y \rfloor
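The local model 2.2 can be sketched in a few lines of Python. This is a minimal illustration rather than the Vreg code: the function names, the tuple layout of the control grid and the one-point border offset (so that index i = -1 is valid) are assumptions made here, and the basis polynomials are the standard uniform cubic B-spline ones.

```python
import math

def bspline_basis(u):
    """Return the four uniform cubic B-spline basis values B_0(u)..B_3(u), 0 <= u < 1."""
    return [
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    ]

def t_local(x, y, ctrl, nx, ny):
    """Evaluate T_local(x) = x + sum_m sum_n B_m(u) B_n(v) p_{i+m, j+n}.

    ctrl[a][b] holds the (dx, dy) displacement of one control point.
    Assumption: the grid is stored with one extra border point, so control
    point p_{i,j} lives at ctrl[i + 1][j + 1] and i = -1 is a valid index.
    """
    i = math.floor(x / nx) - 1
    j = math.floor(y / ny) - 1
    u = x / nx - math.floor(x / nx)
    v = y / ny - math.floor(y / ny)
    bu, bv = bspline_basis(u), bspline_basis(v)
    dx = dy = 0.0
    for m in range(4):
        for n in range(4):
            px, py = ctrl[i + m + 1][j + n + 1]
            weight = bu[m] * bv[n]
            dx += weight * px
            dy += weight * py
    return x + dx, y + dy
```

Because the four basis values sum to one, a grid whose control points all carry the same displacement shifts every pixel by exactly that displacement, and a zero grid leaves the image unchanged.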
One of the attractive features is that the basis functions have local support, i.e. if we change a control point p_{i,j}, it only affects its local neighbourhood. The mesh of control points P acts as the parameters of the transformation, and its resolution p_x × p_y decides the degrees of freedom (number of parameters) of the registration problem. A mesh grid with large spacing is less expensive to solve than a fine mesh grid, but the trade-off is that it cannot model a small local deformation. For example, a mesh grid of 5 × 5 control points yields a 50-parameter problem, which should not produce as good a match as a mesh grid of 9 × 9 control points (a 162-parameter problem); however, the 50-parameter problem is less expensive than the 162-parameter problem. The choice of mesh grid resolution is up to the user and depends on the requirements of the specific application. Another benefit of the B-spline blending function is its constant derivative with respect to the objective parameters (section 2.3.3).

2.3 Registration framework

There are various algorithms for image registration, such as difference decomposition (15) or linear regression (7); however, the gradient-based framework first proposed by Lucas and Kanade (28) is still the most widely used technique.
Given a deformed image I and a reference image T, the registration process aims to find a spatial transformation W(p), where p^T = (p_1, . . . , p_n) is a set of parameters, that matches the two images: I(W(p)) ≈ T. The Lucas-Kanade algorithm iteratively generates transformations W(p_k) that reduce the difference between the two images. Generating the warp parameters p_k at the k-th iteration requires the gradient of the cost function, g_k, and an appropriate descent parameter, a_k, that ensures the descent property of the cost function:

p_{k+1} = p_k + a_k g_k    (2.3)

The gradient of the cost function is derived in the next section. The descent parameter depends on the optimization scheme (Chapter 3 and later chapters).

2.3.1 Cost Function F

In an automatic framework, we perform registration based on the dissimilarity in intensities between images. Success in applications requires a quantitative measure of the goodness of the registration. There are various intensity difference measures, such as Mutual Information (45), Cross-correlation (13) and Histogram entropy (6). For simplicity, and because our focus is on enhancing the optimization of monomodal image registration, we use the Sum of Squared Differences (SSD) as the cost function throughout the paper:

F = \frac{1}{2} \sum_x [I(W(p)) - T]^2    (2.4)
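As a small illustration of 2.4, the cost can be evaluated directly on two equally sized images. The sketch below assumes the warped image I(W(p)) has already been computed and that both images are stored as nested lists of intensities; the function name `ssd` is chosen here for illustration.

```python
def ssd(warped, reference):
    """F = 1/2 * sum over pixels of (I(W(p)) - T)^2 for equally sized 2D lists."""
    total = 0.0
    for row_w, row_t in zip(warped, reference):
        for iw, it in zip(row_w, row_t):
            diff = iw - it
            total += diff * diff
    return 0.5 * total
```

Identical images give a cost of zero, so the value can be used directly in a termination test on F.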



We aim to match I(W(p)) to T, so the objective becomes minimizing 2.4 with respect to p. Assuming we know the current estimate of p, the problem becomes iteratively solving for the increment Δp; thus 2.4 can be written as:

F = \frac{1}{2} \sum_x [I(W(p + \Delta p)) - T]^2    (2.5)

where Δp = a_k g_k (2.3). The minimization runs until F or Δp converges, so the termination criteria are normally F ≤ ε or ‖Δp‖ ≤ ε. In order to find Δp, we apply a first-order Taylor expansion to I(W(p + Δp)), which transforms 2.5 into:

F \approx \frac{1}{2} \sum_x \left[ I(W(p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T \right]^2    (2.6)

where \nabla I = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right) is the gradient of image I evaluated at W(p) and \frac{\partial W}{\partial p} is the Jacobian of the transformation matrix (2.9). Next, differentiating 2.6 with respect to Δp yields:

\sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ I(W(p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T \right]

Setting this expression to zero and solving for Δp gives (35):

\Delta p = -H^{-1} \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T [I(W(p)) - T]    (2.7)

where H is the Hessian matrix of the objective function and a part of the descent parameter in Equation 2.3. The second term on the right-hand side is the gradient of the objective function (2.8). Expression 2.7 is only used with Deterministic Optimization methods; for Stochastic methods, we use different techniques (Chapter 5). The choice of Hessian evaluation is one of the main factors that affect the performance of the application.


2.3.2 Gradient g

Differentiating F (2.4) with respect to p yields:

g = \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T [I(W(p)) - T]    (2.8)



2.3.3 Transformation W(p)

The transformation matrix is defined by the B-spline tensor model 2.2, W(p) = T_{local}. Hence, the derivative of the deformation field with respect to the control points p is:

\frac{\partial W}{\partial p} = \sum_{m=0}^{3} \sum_{n=0}^{3} B_m(u) B_n(v)    (2.9)

For any given input images, we can always compute the Jacobian ∂W/∂p at the beginning of the procedure and do not need to recompute it during the registration process. This is a big advantage in reducing the computational cost.
During the transformation process, an interpolation procedure is essential. Images are discrete, with pixel values at integer coordinates. However, after applying the transformation, the coordinates are likely to be fractional numbers. Therefore we must be able to evaluate the image at arbitrary coordinates. This is achieved by interpolation. Different methods of interpolation, such as linear, bilinear and trilinear, can be used. In this paper, we use linear interpolation because it gives reasonably good quality and is less expensive than higher-order interpolation.
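To make the interpolation step concrete, the sketch below samples a 2D image at fractional coordinates using bilinear interpolation, which is the two-dimensional form of linear interpolation. The function name, the row-major `image[row][col]` layout and the clamping of coordinates to the image border are assumptions of this illustration, not details taken from the thesis implementation.

```python
import math

def bilinear(image, x, y):
    """Sample image (a list of rows, image[row][col]) at fractional (x, y).

    x is the column coordinate and y is the row coordinate. Coordinates
    outside the image are clamped to the border (a simplifying assumption).
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom
```

At integer coordinates the function returns the stored pixel value, so warping with an identity transformation reproduces the input image.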



2.3.4 Optimization

In the registration framework, we have to iteratively minimize the cost function (2.4) and update the parameters (2.3). This is an optimization process.
Because Image Registration is an optimization problem, it benefits from the vast literature on one of the most studied subjects in mathematics. Popular optimization methods include gradient descent (35), conjugate gradient (10), Newton-type methods (quasi-Newton (11), Gauss-Newton (35), etc.), stochastic approximation (37) and evolutionary strategy (18). However, every benefit has its price. The wide availability of optimization methods raises two problems: the choice of optimization method and the parameter settings for the optimization problem. Unfortunately, the current literature does not provide a definite answer to these problems. There has been much research on this topic, which offers some limited guidance on the choice of optimization method as well as on additional constraints.

In this paper, we provide an extensive analysis of the performance of Gauss-Newton, Levenberg-Marquardt, Quasi-Newton and Nonlinear Conjugate Gradient (Chapter 3) on randomly deformed images. In addition, we propose some extensions to these methods in Chapter 4. Chapter 5 presents an extension to the Robbins-Monro stochastic method for local deformation in local parts of images.



2.3.5 Pre-condition

In practice, the cost function 2.4 is not reliable unless we impose some pre-conditions. The first condition is that the input images must come from the same source, to assure monomodality. The second condition is that the deformation fields must not be folded. One way to relax the second condition is to add a regularisation term to the cost function (17). However, we do not include this term; instead, we make sure that no folding is possible for randomly generated inputs by ensuring that the Jacobian of the transformation field is non-negative.
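One way to check the no-folding condition on a sampled deformation field is to evaluate the determinant of its Jacobian by finite differences. The sketch below is an illustration under stated assumptions: the displacement layout `dx[row][col]`, the unit grid spacing, the forward differences and the function names are chosen here, and the test flags any determinant that is not strictly positive, which is slightly stricter than the non-negativity condition stated above.

```python
def jacobian_determinants(dx, dy):
    """Forward-difference determinant of the Jacobian of T(x, y) = (x + dx, y + dy).

    dx[r][c] and dy[r][c] are displacements at pixel (column c, row r) on a
    unit-spaced grid. One determinant is returned per grid cell.
    """
    rows, cols = len(dx), len(dx[0])
    dets = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            dtx_dx = 1.0 + dx[r][c + 1] - dx[r][c]
            dtx_dy = dx[r + 1][c] - dx[r][c]
            dty_dx = dy[r][c + 1] - dy[r][c]
            dty_dy = 1.0 + dy[r + 1][c] - dy[r][c]
            dets.append(dtx_dx * dty_dy - dtx_dy * dty_dx)
    return dets

def has_folding(dx, dy):
    """Report a fold when any local determinant is not strictly positive."""
    return any(d <= 0.0 for d in jacobian_determinants(dx, dy))
```

A randomly generated deformation can then be rejected and redrawn whenever `has_folding` returns True.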

Figure 2.4: Traditional Registration Framework


Chapter 3
Deterministic Optimization
A standard formula for deterministic optimization methods follows from the derivation of 2.3 and 2.7:

p_{k+1} = p_k - \alpha H_k^{-1} g_k    (3.1)

where α is an additional step-size parameter that guarantees the descent property of the cost function and g_k is defined as in 2.8. All optimization algorithms described in this chapter follow the form of 3.3, with a slight difference in the NCG method. The complexity of the procedure 3.3 is shown in Table 3.1.











Table 3.1: Complexity of the registration framework, where N is the total number of pixels of an input image and n is the number of control points of the mesh grid

Total Complexity = N n^2 + N^2 n + n^3 + O(H)


The following sections show that the choice of optimization method affects the performance, because each method contributes a different complexity to the evaluation of Δp. The step-size complexity is often negligible because, in medical image processing applications, the initial deformation is often not very large (no folding allowed), so the approximate descent direction itself often satisfies the descent property.


The optimization procedure can be outlined as follows:

p^* = \min_p F = \frac{1}{2} \sum_x [I(W(p)) - T]^2

Precompute:
(1) Evaluate ∂W/∂p by 2.9

Iterate:
(2) Evaluate W(p) by 2.2
(3) Warp I by W(p) to obtain I = I(W(p))
(4) Evaluate ∇I
(5) Evaluate ∇I ∂W/∂p
(6) Compute the gradient g = \sum_x [\nabla I \frac{\partial W}{\partial p}]^T [I(W(p)) - T] (2.8)
(7) Compute the Hessian H by an Optimization class
(8) Find step-size α (section 3.5)
(9) Compute Δp using 2.7
(10) Update p_{k+1} = p_k + Δp (3.1)

Until:
‖Δp‖ ≤ ε_p
‖F‖ ≤ ε_F
‖g‖ ≤ ε_g
Exceeding maximum number of iterations

For the NCG method: (7) is not used, and in (9) Δp is computed using 3.8.

For the LM method: a Trust Region is used instead of (8).

Steps (1)-(6) and (10) are essential for every method, and step (8) is negligible for a good approximation. Therefore the comparable complexity now relies on steps (7) and (9).
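The loop above can be illustrated on a toy problem. The sketch below is not the thesis implementation: it registers a 1-D Gaussian signal under a pure translation W(x; p) = x + p, uses the Gauss-Newton Hessian for step (7), omits the step-size search of step (8), and evaluates the image and its derivative analytically instead of interpolating. All function names and constants are assumptions made for this example.

```python
import math

def gaussian(x, centre, sigma=5.0):
    """Smooth 1-D test signal."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

def gaussian_dx(x, centre, sigma=5.0):
    """Analytic derivative of the test signal with respect to x."""
    return -(x - centre) / sigma ** 2 * gaussian(x, centre, sigma)

def register_shift(true_shift, max_iter=20, eps=1e-10):
    """Recover p such that I(x + p) matches T(x), where I is T shifted by true_shift."""
    xs = range(64)
    centre = 32.0
    template = [gaussian(x, centre) for x in xs]          # reference image T
    p = 0.0
    for _ in range(max_iter):
        # Steps (2)-(3): warp I by W(x; p) = x + p and form the residual I(W(p)) - T.
        residual = [gaussian(x + p, centre + true_shift) - t
                    for x, t in zip(xs, template)]
        # Steps (4)-(5): image gradient times Jacobian (dW/dp = 1 for a translation).
        sd = [gaussian_dx(x + p, centre + true_shift) for x in xs]
        g = sum(s * r for s, r in zip(sd, residual))      # step (6), gradient 2.8
        h = sum(s * s for s in sd)                        # step (7), Gauss-Newton Hessian
        dp = -g / h                                       # step (9), expression 2.7
        p += dp                                           # step (10)
        if abs(dp) <= eps:                                # termination on the increment
            break
    return p
```

For moderate shifts relative to the width of the signal, the estimate settles on the true shift within a handful of iterations, which mirrors the fast local convergence discussed in the next section.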



3.1 Gauss-Newton (GN)

The Gauss-Newton algorithm (35) is a method for nonlinear least-squares problems. The objective function 2.4 can be formulated as a least-squares problem:

F = \frac{1}{2} \sum_x f_x^2    (3.4)

where f_x = I_x(W(p)) - T_x. The Gauss-Newton method approximates the Hessian \nabla^2 F by:

\nabla^2 F \approx \sum_x \left( \frac{\partial f_x}{\partial p} \right)^T \left( \frac{\partial f_x}{\partial p} \right)    (3.5)

The convergence of Gauss-Newton to a local minimum can be rapid if the approximated Hessian term dominates the full Hessian evaluated by Newton's method (35, p256-257), e.g. when the approximation is good and close to a local minimum, and it can approach a quadratic rate (4, p341-342). The cost of evaluating 3.5 is O(N^2 n^2), thus the comparable complexity is O(N^2 n^2 + n^3).
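For reference, the standard least-squares identity behind this approximation is written out below. It is a textbook result rather than a derivation taken from the thesis: differentiating the least-squares objective twice gives a first-order term and a second-order term, and Gauss-Newton keeps only the first.

```latex
\nabla^2 F
  = \underbrace{\sum_x \left(\frac{\partial f_x}{\partial p}\right)^{T}
                       \left(\frac{\partial f_x}{\partial p}\right)}_{\text{kept by Gauss-Newton}}
  + \underbrace{\sum_x f_x \, \nabla^2 f_x}_{\text{dropped}},
\qquad
\frac{\partial f_x}{\partial p} = \nabla I \, \frac{\partial W}{\partial p}
```

The dropped term is weighted by the residuals f_x, so it becomes small when the images are nearly aligned, which is the situation in which the first term dominates.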


3.2 Levenberg-Marquardt (LM)

Levenberg-Marquardt can overcome the weakness of Gauss-Newton: it takes into account whether the error in approximating the Hessian gets better or worse after each iteration. The implementation of Levenberg-Marquardt for the registration problem is as follows (2, p39-40):

H_{LM} = \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ \nabla I \frac{\partial W}{\partial p} \right] + \delta \sum_x \mathrm{diag}\left( \left( \nabla I \frac{\partial W}{\partial p_1} \right)^2, \ldots, \left( \nabla I \frac{\partial W}{\partial p_n} \right)^2 \right)    (3.6)

In the Levenberg-Marquardt method, instead of a step-size line search, we use the trust-region approach, the second term in 3.6, to guarantee the descent property of the objective function. Starting with a small δ = 0.01, we adjust δ depending on the error in approximating the Hessian. For instance, if the error decreases, i.e. F approaches a local minimum, δ is reduced, say δ → δ/10, and the LM method is then approximately the GN method. If the error increases, δ is increased, δ → 10δ. The rate of convergence of the LM method is similar to that of the GN method, and the comparable cost is the same: O(N^2 n^2 + n^3).
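The damping schedule can be illustrated on a one-parameter least-squares problem. The sketch below is a toy under stated assumptions, not the registration code: it fits the rate a in y = exp(a t), so the Hessian is a scalar and its diagonal equals the Hessian itself, and the accept/reject rule and factor of 10 follow the description above. The function name, the symbol `delta` for the damping parameter and the data are hypothetical.

```python
import math

def lm_fit_rate(ts, ys, a=0.0, delta=0.01, max_iter=50, eps=1e-12):
    """Fit y = exp(a * t) by Levenberg-Marquardt with a single parameter a."""
    def cost(a_):
        return 0.5 * sum((math.exp(a_ * t) - y) ** 2 for t, y in zip(ts, ys))

    f_old = cost(a)
    for _ in range(max_iter):
        jac = [t * math.exp(a * t) for t in ts]             # d f_i / d a
        res = [math.exp(a * t) - y for t, y in zip(ts, ys)]  # residuals f_i
        g = sum(j * r for j, r in zip(jac, res))
        h = sum(j * j for j in jac)                          # Gauss-Newton Hessian
        if abs(g) <= eps:
            break
        step = -g / (h + delta * h)      # damped step; diag(H) = H for one parameter
        f_new = cost(a + step)
        if f_new < f_old:                # error decreased: accept and relax damping
            a, f_old = a + step, f_new
            delta /= 10.0
        else:                            # error increased: reject and damp harder
            delta *= 10.0
    return a
```

As the damping shrinks, the step tends to the Gauss-Newton step, which matches the remark that LM approaches GN near a local minimum.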


3.3 Quasi-Newton (QN)

Quasi-Newton methods, like GN or LM, only require the first-order derivative of the objective function at each iteration. Furthermore, the Hessian evaluation does not involve the image gradient and the Jacobian of the warp matrix, which contain the expensive N × N calculation. Indeed, the Hessian evaluation in the Quasi-Newton method only requires calculations with matrices of dimensions n × n, much smaller than N × N. There are various ways to construct H^{-1}, such as SR1, DFP and BFGS (35); however, numerical experiments have shown that the Broyden class (BFGS) is more efficient (22; 35). We use a Broyden update approximation to H (40):

H_k \approx H_{k-1} + \frac{[g_k - g_{k-1} - H_{k-1}(p_k - p_{k-1})][p_k - p_{k-1}]^T}{\|p_k - p_{k-1}\|_2^2}    (3.7)

Given certain conditions, QN methods can be shown to converge superlinearly (11):

\lim_{k \to \infty} \frac{\|p_{k+1} - p^*\|}{\|p_k - p^*\|} = 0

The computational cost of computing 3.7 is O(n^3), thus the comparable complexity is reduced to O(n^3).
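The rank-one update 3.7 is short enough to write out directly. The sketch below uses plain Python lists for the matrix and vectors; the function name and argument order are chosen for illustration. After the update, the new matrix satisfies the secant condition H_k (p_k - p_{k-1}) = g_k - g_{k-1}, which is a convenient check.

```python
def broyden_update(h_prev, p_prev, p_curr, g_prev, g_curr):
    """Return H_k = H_{k-1} + [y - H_{k-1} s] s^T / ||s||^2.

    s = p_k - p_{k-1} is the parameter change and y = g_k - g_{k-1} is the
    gradient change; h_prev is a square matrix stored as a list of rows.
    """
    n = len(p_curr)
    s = [p_curr[i] - p_prev[i] for i in range(n)]
    y = [g_curr[i] - g_prev[i] for i in range(n)]
    hs = [sum(h_prev[i][j] * s[j] for j in range(n)) for i in range(n)]
    ss = sum(v * v for v in s)
    if ss == 0.0:
        # No movement between iterations: keep the previous approximation.
        return [row[:] for row in h_prev]
    return [[h_prev[i][j] + (y[i] - hs[i]) * s[j] / ss for j in range(n)]
            for i in range(n)]
```

Only the n × n matrix and a few length-n vectors are touched, which is why this step is independent of the number of pixels N.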


3.4 Nonlinear Conjugate Gradient (NCG)

Starting from the linear Conjugate Gradient method for minimizing a convex quadratic function (35, p102), Fletcher and Reeves (14) extended the method to nonlinear functions. In NCG methods, Δp is defined as a linear combination of the gradient g_k and the previous search direction d_{k-1} = Δp_{k-1} = p_k - p_{k-1}:

\Delta p = d_k = -g_k + \beta_k d_{k-1}    (3.8)

Several expressions for the scalar β_k have been proposed in (10), including:

Dai-Yuan: \beta_k^{DY} = \frac{g_k^T g_k}{d_{k-1}^T (g_k - g_{k-1})}

Hestenes-Stiefel: \beta_k^{HS} = \frac{g_k^T (g_k - g_{k-1})}{d_{k-1}^T (g_k - g_{k-1})}

The choice of β_k has a big influence on the convergence properties of the method. In this study, we adopt a hybrid version (9) as in (22):

\beta_k = \max(0, \min(\beta_k^{DY}, \beta_k^{HS}))

One practical implementation of NCG methods restarts the iteration every m steps by setting β = 0; this property is handled heuristically in the hybrid method above. Readers can refer to (9) and (8) for an extensive review of the convergence properties of NCG. The comparable cost of evaluating the search direction is O(n^2).
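The hybrid choice of β and the resulting search direction can be sketched as follows. The helper names are chosen for illustration, and returning β = 0 when the shared denominator vanishes is an assumption of this sketch (it amounts to a restart along the steepest descent direction).

```python
def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def hybrid_beta(g_curr, g_prev, d_prev):
    """beta = max(0, min(beta_DY, beta_HS)) from the Dai-Yuan and Hestenes-Stiefel formulas."""
    y = [gc - gp for gc, gp in zip(g_curr, g_prev)]
    denom = dot(d_prev, y)
    if denom == 0.0:
        return 0.0                      # assumption: restart when the formulas are undefined
    beta_dy = dot(g_curr, g_curr) / denom
    beta_hs = dot(g_curr, y) / denom
    return max(0.0, min(beta_dy, beta_hs))

def ncg_direction(g_curr, g_prev, d_prev):
    """Search direction d_k = -g_k + beta_k * d_{k-1}."""
    beta = hybrid_beta(g_curr, g_prev, d_prev)
    return [-gc + beta * dp for gc, dp in zip(g_curr, d_prev)]
```

Whenever the Hestenes-Stiefel value is negative the hybrid rule returns zero, so the method falls back to a steepest descent step without an explicit restart counter.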



3.5 Step-size

Many step-size strategies are available to date (35, Chapter 3), and the choice of an optimal step-size strategy remains one of the most difficult tasks in optimization. A traditional line search method, the Armijo rule (1), guarantees global convergence, and for certain types of problems we could apply the Modified Armijo method (42), which shows better performance. For our comparative study, we implement an Armijo-like method with a fixed maximum number of iterations:
k 1
= min kgk+1 k2 kgk k2 gk p

where (0, 1); k = 0,1,2,. . . and 0,
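For orientation, a textbook Armijo backtracking search is sketched below. It is not the Armijo-like variant used in this study, whose exact condition is given by the formula above; the sufficient-decrease test on the cost F, the parameter names `beta` and `sigma`, and their default values are assumptions of this sketch.

```python
def armijo_step(f, x, direction, grad, beta=0.5, sigma=1e-4, max_iter=30):
    """Backtrack alpha = 1, beta, beta^2, ... until the Armijo condition holds.

    The condition is f(x + alpha * d) <= f(x) + sigma * alpha * grad^T d.
    The last trial alpha is returned if max_iter is exhausted.
    """
    slope = sum(g * d for g, d in zip(grad, direction))
    f0 = f(x)
    alpha = 1.0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, direction)]
        if f(trial) <= f0 + sigma * alpha * slope:
            return alpha
        alpha *= beta
    return alpha
```

For example, for f(x) = x^2 at x = 1 with the steepest descent direction -2, the full step overshoots to x = -1 and is rejected, while the halved step lands on the minimum and is accepted.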


3.6 Experiments and Results

We investigate which of the 4 deterministic optimization methods is the best choice. The input data are generated by randomly displacing the control points according to the normal distribution N(0, 10) to obtain the deformed images. We generate 100 such random deformations. The average initial SSD for the Lena images is 606.26, and for the Ankle images it is 1562.15. Figure 3.1 shows the convergence rate of 3 particular random tests (the graph is plotted after the first
We also look at the average performance over the 100 random inputs by examining the boxplots of the SSD and time measures in Figure 3.2.
GN appears to be more consistent and to perform better than Levenberg-Marquardt: it produces a good convergent result within a shorter time. This pattern is to be expected because the initial deformation is reasonably small, so the approximate Hessian without the diagonal matrix calculation is sufficient. The NCG convergence rate depends on how different the original images are and on how good the approximate descent direction is at each iteration. According to the convergence plots in Figure 3.1, NCG converges very slowly as it approaches the optimal solution. In the tests, we use the same stopping criteria for all methods for fairness; however, the average time for NCG could be reduced by altering its termination criteria so that it stops as it gets close to a local minimum. A closer study of the test results shows that although QN produces a worse final result on average over the 100 tests, its rate of convergence is always faster than that of the other methods. The slightly higher optimal value produced by QN is due to its occasional premature termination at local minima. However, since the time taken by QN is very small compared to the other methods, one could reapply QN registration starting from the resulting warp parameters to obtain a better optimal value without exceeding the time taken by the other methods. We will see this in the next Chapter, Recursive Subsampling.


Table 3.2: Deterministic Optimization Performance Evaluation. N is the total number of pixels, n is the number of control points.
Column headings: Comparable, Convergence, Lena Test (SSD, Time), Ankle Test (SSD, Time). Surviving entries: $N^2 n^2 + n^3$, $N^2 n^2 + n^3$, Very Fast.


(a) Lena

(b) Ankle

Figure 3.1: Convergence rate for 3 random tests. Each panel in the figure shows one random test. F is the SSD measure, T is the CPU runtime measure. GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

(a) Lena

(b) Ankle

Figure 3.2: Average performance on 100 random tests. Difference is the SSD measure, Time is the CPU runtime measure. GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

Chapter 4
Recursive Subsampling and
Weighted approach for
Deterministic Optimizations

4.1 Recursive Subsampling for Random Deformed Images


Figure 4.1: Example of Random Deformed Input

In the literature, subsampling techniques refer either to a random sample subset of pixels (22; 45) or to a single subset of pixels (24; 32). The random subsampling approach is reviewed in Chapter 5, and the use of a single subset of pixels cannot be guaranteed to converge. Our approach is broadly similar to the paper by Sun and Guo (44); however, our proposed method is more general.
As we know, for our gradient descent optimizer, a better initial estimate of the objective parameters enhances the rate of convergence. This suggests finding a good estimate of the warp parameters p to supply to the full resolution image registration. For instance, given images of resolution $N_x \times N_y$ with a total of N pixels, we could shrink them by half to $\frac{N_x}{2} \times \frac{N_y}{2}$ with $\frac{N}{4}$ pixels, downscale again by half to $\frac{N_x}{4} \times \frac{N_y}{4}$ with $\frac{N}{16}$ pixels, and so on, until they are at a reasonable resolution to register. Obviously, registering $\frac{N}{16}$-pixel images is around 4 times faster than registering $\frac{N}{4}$-pixel images and 16 times faster than registering the full images. The idea of the recursive subsampling framework is to use the resulting warp parameters of a lower resolution registration phase as the initial estimate for the next, higher resolution registration phase. To do this, we add a recursive subsampling mechanism to our traditional registration framework 2.4. The number of recursive phases indicates how many times we downsample the images; stopping when the resolution becomes smaller than 50 × 50 could be a good heuristic criterion.

Figure 4.2: Recursive Subsampling Registration Model
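The coarse-to-fine control flow described above can be sketched in Python as follows. `register_phase` is a hypothetical callback standing in for one full registration run at a given resolution, and the warp parameters are assumed to be expressed in resolution-independent units so that they carry over between phases unchanged; neither detail is taken from the thesis code. The sketch also stops halving just before a side would fall below `min_side`, which is one possible reading of the 50 × 50 heuristic.

```python
def pyramid_shapes(nx, ny, min_side=50):
    """Resolutions from coarsest to finest, halving each side per level.

    Halving stops before either side would drop below min_side.
    """
    shapes = [(nx, ny)]
    while min(shapes[-1]) // 2 >= min_side:
        w, h = shapes[-1]
        shapes.append((w // 2, h // 2))
    return shapes[::-1]


def recursive_register(nx, ny, register_phase, p0, min_side=50):
    """Run one registration phase per resolution, coarsest first.

    register_phase(shape, p_init) -> p_result is supplied by the caller
    (hypothetical interface); each result seeds the next, finer phase.
    """
    p = p0
    for shape in pyramid_shapes(nx, ny, min_side):
        p = register_phase(shape, p)
    return p
```

For a 256 × 256 image this yields the phases 64 × 64, 128 × 128 and 256 × 256. The experiments in this chapter fix the number of phases explicitly instead of deriving it from a minimum size.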




The use of Recursive Subsampling also reinforces the good performance of the Quasi-Newton method compared to the others, because with a good initial estimate the Quasi-Newton method is less likely to terminate prematurely at a local minimum.


4.2 Weighted approach based on Difference Image for Local Deformation

Figure 4.3: Example of Local Deformed Input

From this section on, we refer to Random Uniform Deformation of the whole image as Random Deformation, and to Random Deformation at local parts of the image as Local Deformation.
The idea of associating weights with the registration process comes from the Difference Sampling approach (Chapter 5). For local deformation inputs, which are very likely to occur in medical image processing, one can manually identify the deformed parts before starting the registration process. If we could find a way to identify the locally deformed parts automatically, and then emphasize these regions during the transformation, the performance could be enhanced. We propose to implement this idea by using a weighted transformation which is proportional to the current difference image. The difference image can be defined as:
$\Delta_x = \| I_x - T_x \|$
where x is a pixel of an image and $\Delta_x \in [0, 1]$. We want to assign weights that are proportional to $\Delta_x$. Let us define the weight corresponding to pixel x to be:
$w(x) = \lambda + \log(1 + \Delta_x)$
where the constant $\lambda \in (0.5, 1)$ is to be decided based on the initial local deformation.
We know for certain that $\log(1 + \Delta_x) \in [0, 0.7)$. In my study, I suggest setting $\lambda$ to be small when the local deformation is quite large and to be large when the local deformation is quite small. This heuristic prevents the method from slow convergence and premature termination. The transformation matrix associated with the Weighted Difference is therefore:
$\widehat{W}_x = w(x) W_x$ for every pixel x
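The weight defined above, a constant plus the logarithm of one plus the pixel difference, can be computed per pixel as in the following Python sketch. Images are taken to be nested lists of intensities already scaled to [0, 1]; the function name and the parameter name `lam` for the constant are assumptions of this sketch.

```python
import math


def difference_weights(I, T, lam=0.75):
    """Per-pixel weights lam + log(1 + |I - T|) for two equally sized images.

    I, T: lists of rows of intensities in [0, 1].
    lam: the constant offset, expected to lie in (0.5, 1).
    """
    weights = []
    for row_i, row_t in zip(I, T):
        weights.append([lam + math.log(1.0 + abs(a - b))
                        for a, b in zip(row_i, row_t)])
    return weights
```

Pixels that already agree receive exactly the constant, while the most misaligned pixels receive the constant plus log 2 (about 0.69), so the weighting emphasizes the deformed regions without ignoring the rest of the image.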


4.3 Experiments and Results

All difference measures are Sum of Squared Differences (SSD); all time measures are CPU runtime in seconds.
Recursive Subsampling methods.
First, we examine the rate of convergence of the Recursive Subsampling methods compared to the traditional methods. In our tests, we downscale the images by 3 levels, thus there are 4 phases with the number of pixels equal to $\frac{N}{64}$, $\frac{N}{16}$, $\frac{N}{4}$ and N. For a random test, we use the same set of random input images of the Lena and Ankle pictures, apply both methods and compare the rate of convergence. Figure 4.4 shows the test results.
The figure shows that the Subsampling technique is generally faster and produces similar or better results compared to the single (normal) methods. We now compare the rate of convergence of the different subsampling optimizations. Figure 4.5 shows the convergence rate of four Subsampling optimizations: Subsampling Gauss-Newton (SGN), Subsampling Levenberg-Marquardt (SLM), Subsampling Quasi-Newton (SQN) and Subsampling Nonlinear Conjugate Gradient (SNCG). Each panel shows the convergence rate of one recursive phase: panels 1, 2, 3 and 4 correspond to the $\frac{N}{64}$-, $\frac{N}{16}$-, $\frac{N}{4}$- and N-pixel registration phases respectively. We can see that the low resolution phases cost a fraction of the time of the high resolution phases. The convergence test indicates that the Subsampling Quasi-Newton method converges much faster while giving good results.
In the next experiments, we will see that SQN is the best choice. Let us look at the overall average performance over 100 randomly generated tests of the Lena picture (Figure 4.6) and the Ankle picture (Figure 4.7). Recall from Chapter 3 that the single QN method sometimes suffers from outliers; the SQN method reduces this problem dramatically since it uses a better starting estimate of the warp parameters at every recursion. The test results illustrate that the Subsampling Quasi-Newton method produces results as good as the other methods in much less time. Please see Appendix A for the average performance over 100 tests for different types of pictures: Lena, Ankle, Brain, Knee, Lung; Table 4.1 summarizes the results of the Subsampling methods on all types of images. From now on, we only use Subsampling methods for testing purposes.
Next, we look at the effect of using different resolution grids of control points. As we know, the performance of the registration process is influenced by the number of control points (Table 3.2). We expect that a finer grid will reduce the cost function but take longer, because the number of parameters increases. The graph in Figure 4.8 compares the rate of convergence when using a 5 × 5 grid and a 7 × 7 grid on the Knee image. The average initial difference (SSD) between the two input Knee images is 2526.72. Figure 4.9 shows the average performance on 100 random tests. It confirms that the finer the grid, the better the convergence, although the computational cost is higher.
Finally, we will see what happens if the moving image is deformed more heavily. We generate random test images by increasing the variance of the deformation from 10 to 15 and 20. We predict that the larger the deformation, the more likely the optimizations are to stop at a local minimum, and hence the worse the convergence to the optimal solution. The graph in Figure 4.10 shows the convergence for different deformation levels on the Lena pictures. The boxplot in Figure 4.11 confirms the pattern over 100 random tests at each deformation level.
Weighted Recursive Subsampling methods.
To generate a local deformation image, we randomly choose one control point and randomly displace it according to a Normal distribution with a given variance. We then apply the Recursive Subsampling methods with and without weights. Figure 4.12 compares the convergence rate of the two approaches. Clearly, for local deformation, the Weighted Transformation converges faster than the UnWeighted Transformation. One disadvantage of the Weight approach is that it seems to stop prematurely when it gets close to the local minimum. However, by the same argument as for the single Quasi-Newton method, we benefit from the fast convergence and could therefore reapply the method starting from the current results. In general, the optimal values obtained by the Weight methods are almost the same as those of the UnWeight methods, while the time consumed is much lower (Figure 4.13).



Table 4.1: Summary of results for Subsampling methods on Random Deformation (100 tests per image). The suffix after an image name is the resolution. Subsampling GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).
Columns: Average Result SSD, Average CPU runtime. Rows: Ankle 300x336, Brain 354x353, Knee 353x343, Lena 256x256, Lung 394x378.

(a) Lena

(b) Ankle

Figure 4.4: Convergence rate of single and subsampling methods. F is the SSD measure, T is the CPU runtime. The lower red dots in the subsampling methods show the convergence of the shrunken phases.

(a) Lena

(b) Ankle

Figure 4.5: Subsampling convergence rate. F is the SSD value, T is the runtime. Each panel shows the convergence rate of one recursive phase. The S- prefix denotes Subsampling.

(a) Difference Measure

(b) Time Measure

Figure 4.6: Lena: Performance of single and subsampling methods. Difference is the SSD value, Time is the runtime. Single is the traditional method. GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

(a) Difference Measure

(b) Time Measure

Figure 4.7: Ankle: Performance of single and subsampling methods. Difference is the SSD value, Time is the runtime. Single is the traditional method. GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

Figure 4.8: Knee: Convergence of 5 × 5 and 7 × 7 grids of control points. F is the SSD value, T is the runtime. The data is drawn from the final recursive phase at full resolution. Subsampling GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

(a) Difference Measure

(b) Time Measure

Figure 4.9: Knee: Average performance with different grid sizes. Difference is the SSD measure, Time is the runtime. Subsampling GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

Figure 4.10: Lena: Convergence for different levels of deformation. F is the SSD value, T is the runtime. The var- suffix indicates the level of deformation: the higher the suffix, the larger the deformation. Data is drawn from the final recursive phase. SGN, SQN, SLM, SNCG are Subsampling Optimization methods.

(a) Difference Measure

(b) Time Measure

Figure 4.11: Lena: Average performance for different levels of deformation. Difference is the SSD measure, Time is the runtime. The var- suffix indicates the level of deformation: the higher the suffix, the larger the deformation. SGN, SQN, SLM, SNCG are Subsampling Optimization methods.

Figure 4.12: Lena: Convergence of UnWeight and Weight methods for local deformation. Difference is the SSD measure, Time is the CPU runtime. Subsampling GN (Gauss-Newton), LM (Levenberg-Marquardt), QN (Quasi-Newton), NCG (Nonlinear Conjugate Gradient).

(a) Difference Measure

(b) Time Measure

Figure 4.13: Lena: Average performance on 100 tests of UnWeight and Weight methods for local deformation. Difference is the SSD, Time is the runtime. SGN, SQN, SLM, SNCG are Subsampling Optimization methods.

Chapter 5
Stochastic Approximation

5.1 Stochastic Approximation

Image Registration is a large scale optimization problem. In addition to the deterministic optimization algorithms reviewed in Chapter 3, stochastic gradient descent methods (23) are also widely investigated in current research. The approximation framework follows the same scheme as 2.3, where:
$p_{k+1} = p_k + a_k \tilde{g}_k$   (5.1)
The distinctions of the Stochastic Method (SM) are that the derivative of the cost function, $g(p_k)$, is replaced by an approximation $\tilde{g}_k$, and that a decaying sequence $\{a_k\}$ is used. SM aims to find the unknown solution by successively reducing the inaccuracy in its estimates. Such methods have been successfully applied in many applications and have been evaluated in the Image Registration field (22).
The speed and accuracy of SM depend on the quality of the gradient estimate obtained by random sampling. In general, Random Uniform Sampling (RUS) is often used for both monomodal and multimodal stochastic image registration (21). In this chapter, we present a novel random Difference Sampling (DS) approach which uses either a Deterministic Sampling strategy or a Stochastic Sampling strategy. We argue that when the deformation of the input images is very localised, RUS results in too few samples at that specific location. One solution could be to allow more iterations, to ensure that, in the end, enough samples have been drawn from the locally deformed regions. However, the immediate effect of using more iterations is more computational time. If we can reliably detect the misaligned regions, we could greatly accelerate the registration. Our Difference Sampling Stochastic method aims to detect the deformed regions based on the difference image $\Delta_x = \| I_x - T_x \|$, and randomly picks a subset of pixels from those regions based on a defined non-uniform probability distribution.
When the image is largely deformed, the difference image is no longer a reliable indicator of misalignment. Ideally, Difference Sampling will converge to Random Uniform Sampling as its non-uniform probabilities become almost uniform.
The registration procedure for Stochastic Approximation methods is similar to 3.3, except that we replace steps (7), (8), (9) and (10) by 5.1, and the termination criterion is replaced by the convergence of $\{p_k\}$:
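Precompute: (1) Evaluate $\frac{\partial W}{\partial p}$ by 2.9
Iterate:    (2) Evaluate W(p) by 2.2
            (3) Warp I by W(p) to obtain $\tilde{I} = I(W(p))$
            (4) Evaluate $\tilde{I}$ at $x \in \tilde{\Omega}$
            (5) Evaluate $\nabla \tilde{I}$ at $x \in \tilde{\Omega}$
            (6) Compute $\tilde{g} = \sum_{x \in \tilde{\Omega}} \left[ \nabla \tilde{I} \frac{\partial W}{\partial p} \right]^T [I(W(p)) - T]$ (2.8)
            (7) Compute the decaying sequence $a_k$
            (8) Update $p_{k+1} = p_k + a_k \tilde{g}$
Until:      i. $\| E\{p_{k+1}\} - E\{p_k\} \| \le \epsilon$
            ii. the maximum number of iterations is exceeded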

where $\tilde{\Omega}$ is a subset of random pixels. The study of complexity for Deterministic Optimization (3.2) shows that steps (5), (6), (7) and (9) are the most expensive and cost at least $N^2 n + n^3$ (SQN and SNCG). In the Stochastic methods, steps (5) and (6) cost $S^2 n$ instead of $N^2 n$, where $S \ll N$ is the size of $\tilde{\Omega}$, and steps (7) and (8) cost n. This gives a competitive cost of $S^2 n + n$, much smaller than $N^2 n + n^3$.



In the next sections we learn how to pick random pixels and how to approximate using those random data.


Robbins-Monro and the derivative

A well-known Stochastic Approximation method was introduced by Robbins and Monro (RM) (37) in 1951. The method assumes that an approximation of the derivative of the cost function is available:
$\tilde{g}_k = g(p_k) + \varepsilon_k$
The Robbins-Monro method guarantees convergence to the solution $\hat{p}$ if the bias of the approximation error $\varepsilon_k$ goes to zero:
$E(\tilde{g}_k) \to g(p_k)$, as $k \to \infty$
where E(.) denotes expectation. The approximated gradient $\tilde{g}_k$ does not necessarily vanish close to the solution $\hat{p}$, which satisfies $g(\hat{p}) = 0$; therefore the convergence of $p_k$ (5.1) must be forced by ensuring $a_k \to 0$ as $k \to \infty$. This leads to a study of the decaying sequence $a_k$.


Decaying sequence

In most theoretical work on Stochastic Approximation algorithms, the decaying sequence $a_k$, designed to guarantee convergence of the optimizer, is a non-increasing non-zero sequence $a_k$, $k \in \mathbb{N}$, such that $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$. Clearly, many sequences satisfy these conditions. In practical medical image processing problems, the following expression is often used (43):
$a_k = \alpha / (A + Q)^{\gamma}$   (5.5)
Different adaptive step-sizes are described in (20) and (21). For simplicity, and following (3), we employ the step-size sequence implementation of (20). The algorithm relies on the observation that the more rapidly $p_k$ oscillates about the stationary point $\hat{p}$, the closer $p_k$ is to its optimum. At the same time, the decaying sequence $a_k$ should approach zero.


At iteration k, $p_k$ denotes the transformation parameters with elements $p_k^i$, i = 1, 2, .... The oscillations of $p_k$ can be measured by the rate of change in the sign of $(p_k^i - \hat{p}^i) - (p_{k-1}^i - \hat{p}^i) = p_k^i - p_{k-1}^i$. The step-size $a_k^i$ associated with $p_k^i$ is therefore inversely proportional to the number of sign changes of $p_k^i - p_{k-1}^i$. Our adaptive step-size strategy modifies 5.5 for the i-th component of $a_k$ as:
$a_k^i = \alpha / (A + Q_k^i)^{\gamma}$, with $\gamma = 1$
where $Q_k^i$ is the number of sign changes in $p_m^i - p_{m-1}^i$, m = 2, ..., k, and $Q_1^i = 0$. A and $\alpha$ are chosen heuristically depending on the application. In our experiments, I set $\alpha = 150$ and A = 15.
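The per-component step-size rule can be sketched in Python as follows: for each parameter, count how often its successive increments have changed sign so far, and divide a fixed numerator by A plus that count. The function and argument names are assumptions of this sketch, as is the choice to ignore zero increments when counting sign changes; the defaults 150 and 15 are the values quoted above.

```python
def sign_changes(values):
    """Count sign changes in the successive differences of a scalar sequence.

    Zero differences are skipped (an assumption of this sketch).
    """
    count = 0
    last_sign = 0
    for prev, cur in zip(values, values[1:]):
        diff = cur - prev
        sign = (diff > 0) - (diff < 0)
        if sign != 0:
            if last_sign != 0 and sign != last_sign:
                count += 1
            last_sign = sign
    return count


def adaptive_steps(history, scale=150.0, A=15.0):
    """Step size scale / (A + Q_i) for every parameter component i.

    history: list of parameter vectors p_1, ..., p_k seen so far.
    Q_i is the number of sign changes in the increments of component i.
    """
    n = len(history[0])
    return [scale / (A + sign_changes([p[i] for p in history]))
            for i in range(n)]
```

A component that keeps moving in one direction retains the initial step 150 / 15 = 10, while a component that oscillates gets a step that shrinks with every reversal.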


5.2 Difference Sampling

Given two input images whose misalignment is very localised, a Random Uniform Sampling (RUS) method might cause a bias in estimating the difference, as in Figure 5.1, because it does not provide sufficient samples at the deformed parts. Difference Sampling (DS), in contrast, is a non-uniform sampling approach which could reduce the variance of the approximation, as in Figure 5.2. DS takes into account a probability distribution based on the current difference between the two images.
The idea of DS follows common sense: if the deformed image differs from the reference image only at small local parts, we need not pay much attention to the parts that are already identical, but only to the small parts that differ. In addition, the larger the error between a pair of pixels (at the same coordinates in the two images), the more likely those pixels are to be picked. Interestingly, a few non-uniform sampling approaches have been proposed before by Bhagalia (3) and Sabuncu (41). However, their sampling methods emphasize image edges, which is quite different from our approach and also causes a bias in the case of localised deformation.
To study the variance reduction achieved by DS, we briefly explain how a non-uniform random distribution brings advantages in certain problems. We want to sample a subset of pixels that indicates the current misalignment. Recall that the error (2.4) can be written as:
$F = \frac{1}{2} \sum_{x} [I(W(p)) - T]^2$ for every pixel x   (5.7)


We can approximate F by evaluating 5.7 on a subset of random pixels using the RUS method:
uni = F
[Ix (W (p)) Tx ]2 PU

= f (X)


where $P_U$ is a Uniform Distribution with constant probability. Let us define another approximation of F using the DS method:
dif = F
[Ix (W (p)) Tx ]2 PD (x), where PD (x) =

f (X)


The expectation and variance of the above estimations can be written as:
uni = E(F ) =

var(uni ) = var(f (X))

f (X)
var(dif f ) = var


The above equations show that the expectations of the RUS and DS methods are similar. Therefore the use of DS is only advantageous if we can formulate a distribution $X \sim P_D$ that ensures $\mathrm{var}\left(\frac{f(X)}{w(X)}\right) < \mathrm{var}(f(X))$. This is possible by setting a larger weight w(x) at pixels that have more influence on f(X). How to set up the difference distribution and how to sample the data are discussed next.
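The variance argument can be checked numerically with the generic importance-sampling identity. The Python sketch below is an illustration of that standard technique rather than the thesis's exact estimator: it draws one pixel X from a distribution P and estimates the total squared error by f(X) / P(X), then computes the exact mean and variance of that estimator for a uniform P and for a P that grows with the pixel difference. The five-pixel example and the floor value 0.1 are made up for illustration.

```python
def estimator_moments(f, prob):
    """Exact mean and variance of the one-sample estimator f(X) / P(X), X ~ P."""
    mean = sum(p * (fx / p) for fx, p in zip(f, prob) if p > 0)
    second = sum(p * (fx / p) ** 2 for fx, p in zip(f, prob) if p > 0)
    return mean, second - mean ** 2


# Squared errors of five pixels: only two of them are misaligned.
f = [0.0, 0.0, 0.0, 9.0, 1.0]
n = len(f)

# Random Uniform Sampling: every pixel has the same probability.
uniform = [1.0 / n] * n

# Difference-based sampling: probability proportional to the absolute
# difference plus a small floor, so no pixel has zero probability.
raw = [fx ** 0.5 + 0.1 for fx in f]
total = sum(raw)
difference = [r / total for r in raw]

mean_uni, var_uni = estimator_moments(f, uniform)
mean_dif, var_dif = estimator_moments(f, difference)
```

Both estimators have the same expectation (the true total, 10), but the variance drops from about 310 under uniform sampling to about 22 under the difference-based distribution, because samples are concentrated where f is large.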



Figure 5.1: Random Uniform Samples of local deformed images


5.3 Sampling Strategy

Unlike Random Uniform Sampling, which uses a constant probability, the Difference Sampling technique takes into account the intensities of the difference image. The simplest way to derive the probability of the i-th pixel is:
$P(i) = \frac{\| I_i - T_i \|}{\theta} + \epsilon_i$   (5.13)
where $\theta$ and $\epsilon_i$ are constants to be defined. We explain how these constants are set for each sampling strategy below. The idea of our sampling strategies is inspired by similar techniques in (41).


Deterministic Sampling

Deterministic Sampling is based on stratified sampling (19). In equation 5.13, we set $\theta = 1$ and $\epsilon_i = 0$ for all i. The probability P(i) is now simply the intensity value of the i-th pixel of the difference image. We classify the pixels into different levels based on their probability. A group at a higher level has a larger probability and its pixels are more likely to be sampled. Figure 5.3 illustrates an example of probability classification.

Figure 5.2: Random Samples based on Difference Sampling: Deterministic strategy and Stochastic strategy


Stochastic Sampling

We formulate the Stochastic Sampling technique as a decision making process. We walk through every pixel in the difference image and, at the i-th pixel, we decide whether to sample it based on whether its probability (5.13) is greater than a threshold, where $\theta \in (0, 1)$ and $\epsilon_i = \mathrm{rand}(1, 1)$. The value of $\theta$ is inversely proportional to the number of samples, so a larger deformation (more samples needed) requires a smaller $\theta$ to obtain a sufficient number of samples. A good guideline is to choose $\theta$ between 0.2 (large local deformation) and 0.8 (small local deformation). While the first term, $\frac{\| I_i - T_i \|}{\theta}$, provides the non-uniform component, the value of $\epsilon_i$ provides the random component and establishes an equal chance for every pixel to be sampled.
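The decision process can be sketched in Python as below. Each pixel's score is its difference divided by a constant, plus a uniform random term, and the pixel is kept when the score exceeds a threshold. The text does not state the threshold value, so `threshold=1.0` is purely an assumption of this sketch, as are the function and parameter names; the random term is drawn uniformly from [0, 1), mirroring what MATLAB's rand(1,1) returns.

```python
import random


def stochastic_sample(diff, theta, threshold=1.0, rng=None):
    """Indices of pixels whose score diff/theta + eps exceeds threshold.

    diff: flat list of difference-image values in [0, 1].
    theta: divisor in (0, 1); smaller values yield more samples.
    threshold: cut-off for the score (value assumed, not from the thesis).
    rng: optional random.Random instance for reproducible draws.
    """
    rng = rng or random.Random()
    picked = []
    for i, d in enumerate(diff):
        eps = rng.random()              # uniform in [0, 1)
        if d / theta + eps > threshold:
            picked.append(i)
    return picked
```

With the same random draws, lowering `theta` can only add pixels to the sample, which matches the guideline that larger deformations call for a smaller constant.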



Figure 5.3: Example of 5 levels of probabilities


5.4 Experiments and Results

We examine the rate of convergence of the Subsampling Deterministic Optimization algorithms and of Robbins-Monro Stochastic Optimization using two types of sampling techniques (Difference Sampling, with the Deterministic and Stochastic strategies, and Random Uniform Sampling). For fairness of comparison, we take a subset of 2% of the total number of pixels for all stochastic methods. Registration is applied to MRI pictures of the Knee with 512x512 pixels. Since the main part of the MRI picture is located at the center and the side parts have uniform intensity, Random-Uniform Deformation (RUD) of the image results in a large misalignment at the center of the image, and Random-Local Deformation (RLD) results in a small change at one part of the center of the image (Figure 5.4).
The Random uniform Sampling Stochastic method (RSS) is re-sampled at every iteration to avoid bias and to obtain sufficient samples from every region of the pictures. The Deterministic Sampling Stochastic method (DSS) samples only once, since it concentrates on the regions of difference and also takes neighbouring regions into account. The Stochastic Sampling Stochastic method (SSS) samples the pixels based on the current difference between the two images; therefore, theoretically, it needs to be re-sampled at every iteration. However, Stochastic re-sampling at every iteration is very costly and the change in the difference after one iteration is small, so we could instead re-sample using the Stochastic Sampling approach once every fixed number of iterations. In general, I would recommend re-sampling the images using the Stochastic Sampling approach about 5-20 times during the whole registration process. In our test cases, since the maximum number of iterations is set to 100, we compare the rate of convergence for SSS with re-sampling every 20, 15, 10 and 5 iterations. We denote these SSS-20, SSS-15, SSS-10 and SSS-5 respectively.

Figure 5.4: MRI Knee: We generate 100 random uniform deformation (RUD) images by randomly perturbing every control point and 100 random local deformation (RLD) images by randomly perturbing one control point.
Figures 5.5 and 5.6 compare the convergence rate of the Subsampling Deterministic Optimizations and of Stochastic Approximation with different sampling methods for the RUD and RLD types. The initial average differences (SSD) of the RUD and RLD inputs are 2060 and 780 respectively. Given the large size of the image, we use a 7x7 control point grid for all methods. For the SSS method, I re-sample every 5 iterations, and I set the Stochastic Sampling constant of 5.13 to $\theta = 0.25$ for RUD inputs and $\theta = 0.75$ for RLD inputs. The graphs show that in both experiments the Stochastic Approximation methods always outperform the Deterministic Optimization methods (the deterministic methods take, on average, more than 200 seconds to converge for RUD and more than 75 seconds for RLD, compared to 80+ seconds for RUD and 40+ seconds for RLD for the stochastic methods; and the final SSD of the deterministic methods is always higher than that of the stochastic methods). We will not examine the Deterministic methods further here. Within the Stochastic Approximation class, RSS is more likely to stop at a local minimum and its convergence rate is worse than that of the Difference Sampling methods. Within the Difference Sampling methods, DSS converges more slowly than SSS-5. This pattern is expected because the Stochastic Sampling method obtains a new set of samples that fits the currently differing parts during registration, and therefore follows the misalignment regions
In the next experiment, we review the average performance of the Stochastic methods and also conclude that the more frequently the SSS method re-samples, the better the convergence rate. We examine 100 random tests of the RUD input type and of the RLD input type. Figure 5.7 shows the results. In general, we can see that RSS often stops at a local minimum with a higher SSD compared to the Difference Sampling methods. DSS performs better than RSS, although not as consistently as SSS. Among the SSS methods, SSS-5 produces the best result in the lowest time; the experiment illustrates that the more frequent the re-sampling, the better the result SSS would
The final experiment was carried out in the hope of finding a better approach than any of the above methods. It turned out not to be successful, but I include it for completeness. We want to analyze the effect of applying Deterministic Optimization after a good approximation by the Stochastic methods. We call this the Combined method, with two phases: in the first phase, SSS-5 is used for 100 iterations; in the second phase, we apply the Deterministic methods, starting from the current approximation, for a maximum of 100 iterations. For comparison, we let SSS-5 run for 200 iterations. The experimental results are shown in Figure 5.8. As we can see, the Combined methods, which use Deterministic Optimization in the second phase, do not show any advantage over the Stochastic methods alone. Perhaps if we let the Stochastic method run for more iterations, e.g. 200-300 iterations, until it cannot produce any better result, then we apply Deterministic Optimization for further


(a) RandomUniformDeform: convergence by UnWeight Det. methods. Weight methods are not applicable here because of the large deformation.

(b) RandomUniformDeform: convergence by Stochastic Approximation methods.

Figure 5.5: Convergence of RUD by Deterministic and Stochastic methods. Data for Det. Opt. is drawn from the last phase of Recursive Subsampling. SGN, SLM, SQN, SNCG are Subsampling Det. methods. D/R/S-5 SS are the Deterministic/ Random/ Stochastic (re-sampled every 5 iterations) Sampling Stochastic methods.

(a) RandomLocalDeform: convergence by Weight and UnWeight Det. Opt. methods.

(b) RandomLocalDeform: convergence by Stochastic Approximation methods.

Figure 5.6: Convergence of RLD by Deterministic and Stochastic methods. Data for Det. Opt. is drawn from the last phase of Recursive Subsampling. SGN, SLM, SQN, SNCG are Subsampling Det. methods. DSS/RSS/SSS-5 are the Deterministic/ Random/ Stochastic (re-sampled every 5 iterations) Sampling Stochastic methods.

(a) RandomUniformDeform performance on 100 tests. Difference is SSD, Time is runtime.

(b) RandomLocalDeform performance on 100 tests. Difference is SSD, Time is runtime.

Figure 5.7: MRI Knee: Average performance of Stochastic Approximation methods for two types of deformation: RUD and RLD. DSS/RSS/SSS-suffix are the Deterministic/ Random/ Stochastic (suffix = re-sampling frequency) Sampling Stochastic methods.

(a) Convergence rate of Combined and Stochastic methods. F is SSD, Time is runtime.

(b) Average performance on 100 tests of Combined and Stochastic methods. F is SSD, Time is runtime.

Figure 5.8: Comparison between Combined Stoc-Det and Stochastic methods. DSS-SQN, RSS-SQN, SSS-SQN are combined methods. DSS, RSS, SSS-5 are Stochastic methods.

Chapter 6
MATLAB Implementation: Vreg


The code implementing all the algorithms described in this paper was written by hand in MATLAB version 7 R14 SP3 with the Spline toolbox. Some coding conventions are taken from (2), and the B-spline implementation follows a tutorial by R. Larsen (25). I outline some important evaluations in the registration framework 3.3.




Cost Function F

The error image can be computed as a vector: e = abs(I(:)-T(:)). Hence the cost function is F = 0.5*e'*e.


Image Gradient $\nabla I$

The image gradient can be obtained using gradient. We evaluate the gradient of the image with respect to x and y:
[dIx,dIy] = gradient(I)


6.2 Implementation


Transformation W(p) and Jacobian


We construct the transformation based on the B-spline model (2.2). The tensor B-spline is defined by two sets of control points (knots), one for the row direction and one for the column direction. The knots are placed at uniform spacing. To handle the displacement of the image boundaries, we need to add 3 extra knots on top of the boundary knots. For instance, a set of knots on one row can be constructed by:
k = augknt(0:space:rowlength,3)
Each 2D basis function is the tensor product of a row and a column B-spline basis function. Let the row functions be $b_{xi}(x)$, i = 1...m, and the column functions be $b_{yj}(y)$, j = 1...n. Then the displacement $W = (W_x, W_y)$ becomes:
$W_x = \sum_{i=1}^{m} \sum_{j=1}^{n} b_{xi}(x) b_{yj}(y) p_{xij}$
$W_y = \sum_{i=1}^{m} \sum_{j=1}^{n} b_{xi}(x) b_{yj}(y) p_{yij}$
Use the MATLAB function spmak to construct the basis functions from the knot sequence. We can collect the row and column B-spline functions into two sets:
$Q_{x,i} = b_{xi}(x)$
$Q_{y,j} = b_{yj}(y)$
Using the Kronecker product, Q = kron(speye(2),kron(Qx,Qy)), we obtain:
$W = (I_2 \otimes Q_x \otimes Q_y)\, p = Qp$
where $I_2$ is a sparse identity matrix of size 2 × 2. It is easy to see that $\frac{\partial W}{\partial p} = Q$.


Other evaluations

Having W(p), we can compute I(W(p)) by using linear interpolation:

Other expressions can be computed using simple matrix operations.
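The Kronecker construction of the transformation can be checked on a toy example. The Python sketch below (Python is used here only for illustration; the thesis code is MATLAB) builds small made-up basis matrices, forms the block matrix Q, and confirms that Qp reproduces the double sum over basis functions. The sample values, the flattening order of p (j varying fastest) and the pixel ordering are assumptions of this sketch and may differ from the ordering used in Vreg.

```python
def kron(A, B):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy basis values: Qx[r][i] = b_xi at the r-th x sample,
# Qy[c][j] = b_yj at the c-th y sample (made-up numbers).
Qx = [[1.0, 0.0], [0.5, 0.5]]                 # m = 2 row functions
Qy = [[1.0, 0.0, 0.0], [0.25, 0.5, 0.25]]     # n = 3 column functions
px = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]           # p_xij, j varying fastest
py = [0.0] * 6                                # p_yij, all zero here

I2 = [[1.0, 0.0], [0.0, 1.0]]
Q = kron(I2, kron(Qx, Qy))                    # block-diagonal: x part, y part
W = matvec(Q, px + py)                        # stacked (W_x, W_y)

# The same x displacement from the double sum, pixel by pixel.
Wx_sum = [sum(Qx[r][i] * Qy[c][j] * px[i * 3 + j]
              for i in range(2) for j in range(3))
          for r in range(2) for c in range(2)]
```

Because W is linear in p, the matrix Q is also the Jacobian of W with respect to p, so it can be computed once and reused at every iteration.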

6.3 User Guide

Using Vreg is easy. The user runs the application in MATLAB and sets the paths to the VReg package and all of its subfolders. Once this is done, call the register function with the desired parameters and let the machine work it out. The registration process for inputs of resolution less than 512 × 512 should not take more than one minute with an appropriate algorithm.
Given any two 2D images T and I, where I is a deformed version of T, we start the registration process by calling:
[F,p] =

Only the first four parameters are always required, although the user is encouraged to specify algo. The remaining parameters are optional; however, you need to put the empty notation [] in place of any parameter that you do not want to set. List of parameters:
T,I are the filenames of the input images. I is the deformed image.
p1,p2 indicate the grid of control points p1 × p2, e.g. 7 × 7.
algo is the choice of algorithm:
    The deterministic methods Gauss-Newton, Levenberg-Marquardt, Quasi-Newton and Nonlinear Conjugate Gradient, which can alternatively be written as GN, LM, QN, NCG. Users who want to use the Weight methods have to add the arguments weight,[] at the end of the call, i.e. after [fig]. Default = 0.75.
    RandomSamplingStochastic, DeterministicSamplingStochastic or, alternatively, RSS, DSS. Users of these methods can add the arguments [%],[],[A] after [fig], where [%] indicates how many pixels should be sampled, e.g. 0.02 indicates 2% of the total pixels. Default values: %=0.02, =150, A=15.
    StochasticSamplingStochastic[-resample] or SSS[-resample], where [-resample] indicates the frequency of resampling, e.g. SSS-5 means resampling after every 5 iterations. Users of this method can add the arguments [],[],[A] at the end. Note: the first argument should lie in (0.2, 0.8); a larger value gives fewer samples (5.3.2). Default values: =0.7, =150, A=15.



recur.phase is the number of recursive phases. A single registration has 0 recursive phases; a Subsampling method has 1 or more recursive phases. Note: do not reduce the image too much, otherwise it cannot be registered; recur.phase should be less than 4. Default values: 3 for Deterministic methods and always 0 for Stochastic methods.
init.warp is the initial estimate of the warp parameters; it should be the result of a previous registration using this application. Default value is
max.iter indicates the maximum number of iterations allowed. Default value: 100.
show, fig: show=1 displays the registration process on figure(fig). This is suitable for demos, but it slows down the registration. Default value:
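To make the stochastic options in the parameter list more concrete, the following Python sketch shows how a method name such as SSS-5 and a sampling fraction such as 0.02 might be interpreted. It is purely illustrative and is not part of VReg: the functions `parse_stochastic_algo` and `sample_pixels`, the rounding rule and the behaviour when no resample suffix is given are all assumptions.

```python
import random

# Long names and their short aliases, as listed in the parameter description.
ALIASES = {
    "RandomSamplingStochastic": "RSS",
    "DeterministicSamplingStochastic": "DSS",
    "StochasticSamplingStochastic": "SSS",
}


def parse_stochastic_algo(algo):
    """Split e.g. 'SSS-5' into ('SSS', 5); resample is None without a suffix."""
    name, sep, suffix = algo.partition("-")
    short = ALIASES.get(name, name)
    if short not in ("RSS", "DSS", "SSS"):
        raise ValueError("not a stochastic method: " + algo)
    if sep and short != "SSS":
        raise ValueError("only SSS takes a resample suffix")
    resample = int(suffix) if sep else None
    return short, resample


def sample_pixels(height, width, fraction=0.02, seed=None):
    """Pick round(fraction * height * width) distinct pixel indices uniformly."""
    total = height * width
    count = max(1, round(fraction * total))
    rng = random.Random(seed)
    return rng.sample(range(total), count)


print(parse_stochastic_algo("SSS-5"))
print(len(sample_pixels(512, 512, 0.02, seed=0)))
```

Uniform sampling as shown here corresponds to the random sampling option only; the difference-based sampling strategies would replace the uniform draw with one guided by the difference image.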


Chapter 7
In this paper, we have discussed the performance of different types of optimization methods through theoretical review and practical experiments. The choice of optimization method depends mainly on the size of the input images and the type of deformation (random deformation or very localised deformation). We have shown that the newly proposed approaches, which are based on detecting the misaligned parts of the input images, can accelerate registration and produce better results.
We classify optimization methods as either Deterministic or Stochastic. For each approach, we construct a unified registration framework. The Deterministic approach is suitable for small and medium-sized images, while the Stochastic approach is more suitable for large images.
In the Deterministic approach, our extensive study shows that Quasi-Newton is a better choice than the Gauss-Newton, Levenberg-Marquardt and Nonlinear Conjugate Gradient methods. In addition, the Recursive Subsampling methods always outperform the methods without Subsampling. We also examine the effect of applying a Weight (based on the difference image) to the transformation matrix. The results show that the Weight Deterministic methods produce a better convergence rate than the UnWeight methods for input images with very localised deformation.
In the Stochastic approach, we have demonstrated that, for localised deformation, Difference Sampling with either the Deterministic strategy or the Stochastic strategy produces a better convergence rate than Random Uniform Sampling. In addition, the Stochastic Sampling strategy performs slightly better than the Deterministic strategy in most of the localised deformation experiments; Random Uniform Sampling is suitable for random deformation.
The limitations of this project are its restriction to 2D monomodal images and the lack of regularisation. The proposed methods in this paper are all based on the difference image, which is an obstacle for multimodal input images. Although extending to 3D input images and adding regularisation should not be difficult, extending to multimodal input images would require some additional functions to compensate for bias. Future work can also look for better parameter settings for the Stochastic methods to enhance the convergence speed.


Appendix A
Performance of SubSampling methods on different images
Figure A.1 shows the images used for registration. Lena 256x256 indicates that the image Lena has a resolution of 256x256.
The box plots below show the average performance over 100 tests.


(a) Ankle 300x336
(b) Brain 354x353
(c) Knee 353x343
(d) Lena 256x256
(e) Lung 394x378

Figure A.1: Different types of images used for the SubSampling methods test


Figure A.2: Ankle: Average Performance of SubSampling methods

Figure A.3: Brain: Average Performance of SubSampling methods


Figure A.4: Knee: Average Performance of SubSampling methods

Figure A.5: Lena: Average Performance of SubSampling methods


Figure A.6: Lung: Average Performance of SubSampling methods


[1] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. 1966.
[2] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56:221-255, 2004.
[3] Roshni R. Bhagalia.
of Michigan, 2008.
[4] Åke Björck. Numerical methods for least squares problems. SIAM, 1996.
[5] M. Bro-Nielsen and C. Gramkow. Fast fluid registration of medical images. In Proceedings Visualization in Biomedical Computing, 1996.
[6] T. Buzug and J. Weese. Improving DSA images with an automatic algorithm based on template matching and an entropy measure. Computer assisted radiology of Excerpta medica - international congress series, 1124:145-150, 1996.
[7] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. In Proceedings of the European Conference on Computer Vision, 1998.
[8] Y.H. Dai. Convergence of nonlinear conjugate gradient methods. Journal of Computational Mathematics 19 (5), 539-548, 2001.



[9] Y.H. Dai. An efficient hybrid conjugate gradient method for unconstrained optimization. Ann. Oper. Res., vol. 103, pp. 33-47, 2001.
[10] Y.H. Dai. A family of hybrid conjugate gradient methods for unconstrained optimization. Math. Comput., vol. 72, pp. 1317-1328, 2003.
[11] J. E. Dennis, Jr., and J. J. Moré. Quasi-Newton methods, motivation and theory. SIAM Rev, vol. 19, pp. 46-89, 1977.
[12] Vandermeulen et al. Multi-modality image registration within covira. In Medical imaging: analysis of multimodality 2D/3D images, Vol. 19 of Studies in health, technology and informatics, pp. 29-42, 1995.
[13] A. C. Evans, D. L. Collins, P. Neelin, and T. S. Marrett. Correlative analysis of three-dimensional brain images. Computer-integrated surgery, Technology and clinical applications, 1996.
[14] R. Fletcher and C. M. Reeves. Function minimization by conjugate gradients. Computer Journal 7, 1964.
[15] M. Gleicher. Projective registration with difference decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[16] A. Goshtasby. Image registration by local approximation methods. Image and Vision Computing 6, 1988.
[17] Eldad Haber and Jan Modersitzki. Image registration with guaranteed displacement regularity. Int. J. Comput. Vision, 71(3):361-372, 2007.
[18] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., vol. 9, no. 2, pp. 159-195, 2001.
[19] Neville Hunt and Sidney Tyrrell. nhunt/meths/strati.html.
[20] H. Kesten. Accelerated stochastic approximation. Ann. Math. Stat., 29:41-59, 1958.



[21] Stefan Klein, Josien P. W. Pluim, Marius Staring, and Max A. Viergever. Adaptive stochastic gradient descent optimisation for image registration. International Journal of Computer Vision, 81:227-239, 2009.
[22] Stefan Klein, Marius Staring, and Josien P.W. Pluim. Evaluation of optimization methods for nonrigid medical image registration using mutual information and B-splines. IEEE Transactions on Image Processing, 2007.
[23] H.J. Kushner and G.G. Yin. Stochastic Approximation and recursive algorithms and applications. Springer-Verlag, New York, 2003.
[24] J. Kybic and M. Unser. Fast parametric elastic image registration. IEEE Trans. Image Process., vol. 12, no. 11, pp. 1427-1442, 2003.
[25] Rasmus Larsen. Medical image analysis non-linear B-spline based image registration, part 2. Tutorial, 2007.
[26] Seungyong Lee, George Wolberg, and Sung Yong Shin. Scattered data interpolation with multilevel B-splines. IEEE Transactions on Visualization and Computer Graphics, 3:228-244, 1997.
[27] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. The Quarterly of Applied Mathematics, 2:164-168, 1944.
[28] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[29] F. Maes, D. Vandermeulen, and P. Suetens. Comparative evaluation of multiresolution optimization strategies for multimodality image registration by maximization of mutual information. Med. Image Anal., 3:373-386, 1999.
[30] J. B. Antoine Maintz and Max A. Viergever. A survey of medical image registration, 1997.
[31] Donald Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11:431-441, 1963.



[32] D. Mattes, D. R. Haynor, H. Vesselle, T. K. Lewellen, and W. Eubank. PET-CT image registration in the chest using free-form deformations. IEEE Trans. Med. Imag., vol. 22, no. 1, pp. 120-128, 2003.
[33] C. R. Meyer et al. Demonstration of accuracy and clinical versatility of mutual information for automatic multimodality image fusion using affine and thin-plate spline warped geometric deformations. Medical Image Analysis, 1997.
[34] Bryan S. Morse. Image registration, Lucas-Kanade algorithm. CS 650: Computer Vision lecture notes.
[35] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 2006.
[36] X. Pennec, P. Cachier, and N. Ayache. Tracking brain deformations in time sequences of 3D US images. Pattern Recognit. Lett., 24:801-813, 2003.
[37] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400-407, 1951.
[38] D. Rueckert et al. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transaction on Medical Imaging, 18, 1999.
[39] Daniel Rueckert. Tutorial on image registration. Tutorial.
[40] Berc Rustem. Algorithms for equilibria, games and systems of nonlinear equations. Lecture Notes, 2005.
[41] M. R. Sabuncu and P. J. Ramadge. Gradient based nonuniform sampling for information theoretic alignment methods. Proc. Intl. Conf. IEEE Engr. in Med. and Biol. Soc., 3:1683-1686, 2004.
[42] Z. J. Shi and J. Shen. New inexact line search method for unconstrained optimization. 2005.



[43] J.C. Spall. Implementation of the simultaneous perturbation method for stochastic optimization. IEEE Trans. Aerosp. Electron. Syst., 34:817-823, 1998.
[44] Shaoyan Sun and Chonghui Guo. Medical image registration by maximizing a hybrid normalized mutual information. Bioinformatics and Biomedical Engineering, 2007.
[45] P. Viola and W. M. Wells III. Alignment by maximization of mutual information. International conference on computer vision, 1995.
[46] Barbara Zitova and Jan Flusser. Image registration methods: a survey. Image and Vision Computing, 21, 2003.