2 Pixel registration
Zitová and Flusser [97] also identify the four steps present in the majority of
registration methods: feature detection, feature matching, transform model
estimation, and image resampling and transformation.
Numerous techniques have been developed over the years addressing each of
the above steps, with varying degrees of success in different applications;
the interested reader is therefore invited to sample the 200+ references
surveyed by Zitová and Flusser [97] for detailed descriptions of these methods.
However, the methods relevant to this thesis are addressed briefly in the
following sections: correlation-like and Fourier-based methods, and
combinations of affine and elastic transforms. These methods are more relevant
to this study because they are suitable for more complicated geometric
deformations, rather than for matching distinct features between images.
2.2.1 Correlation-like methods
Figure 2-11 Template matching in search window
The cross-correlation between the feature window and a candidate patch is defined as

CC(u, v) = \sum_{(X,Y)\in z} I(X, Y)\, f(X - u, Y - v) \qquad (2.17)

where, with reference to Figure 2-11, the summation is done over (X, Y) ∈ z; f is the
feature window to be matched in image I inside the search window w, and (u, v) is the
origin of the corresponding candidate patch z. The size of the search window w, which
limits the scope of (u, v), can be adjusted to optimize computational overheads. If the
approximate location of the feature window f is known, then the search window w can be
centred around that point and made only large enough to capture the likely true location
of the feature. When f = z, perfect correlation is obtained.
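This search can be sketched directly in a few lines of numpy; a minimal brute-force illustration (the function name and toy image are my own, not from the thesis):

```python
import numpy as np

def simple_cross_correlation(I, f):
    """Slide feature window f over image I, summing products at each offset."""
    H, W = I.shape
    h, w = f.shape
    cc = np.zeros((H - h + 1, W - w + 1))
    for u in range(cc.shape[0]):
        for v in range(cc.shape[1]):
            cc[u, v] = np.sum(I[u:u + h, v:v + w] * f)
    return cc

# Toy example: plant the feature in an otherwise dark image.
f = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.zeros((8, 8))
I[3:5, 2:4] = f
cc = simple_cross_correlation(I, f)
peak = np.unravel_index(np.argmax(cc), cc.shape)  # (3, 2) here
```

The double loop is O(HW·hw); in practice the correlation would be computed via the FFT, but the brute-force form mirrors the summation directly.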
However, the simple cross-correlation term in equation 2.17 suffers from a major
drawback: it is sensitive to changes in image energy inside the search window. According to
this measure, any bright spot can score higher than an exact feature match.
The template image f in Figure 2-11 has been cross-correlated for all values of (u,v)
with the full image I. Figure 2-12(a) shows the cross-correlation map obtained using
equation 2.17 and Figure 2-13(a) shows the same map as a 3D mesh. It is evident from
both figures that there is no distinct location that scores particularly high for similarity;
in fact, many locations other than the correct one score higher. This problem is
overcome by making the cross-correlation invariant to changes in image intensity, and
therefore to changes in energy. This invariance is achieved first by unbiasing the
image and feature vectors, then by normalizing them. Unbiasing is simply the
process of subtracting the expected (mean) values from the vectors, and
normalization is the process of scaling them to unit length. This gives the following
normalized cross-correlation formula, with the same summation interval:
CC(u, v) = \frac{\sum_{(X,Y)\in z}\left[I(X, Y) - E(I_z)\right]\left[f(X - u, Y - v) - E(f)\right]}{\sqrt{\sum_{(X,Y)\in z}\left[I(X, Y) - E(I_z)\right]^{2}\;\sum_{(X,Y)\in z}\left[f(X - u, Y - v) - E(f)\right]^{2}}} \qquad (2.18)
When the same feature image f is cross-correlated against the full image, the
cross-correlation map exhibits a single peak at the correct location of the feature image;
this can be easily seen in Figure 2-12(b) and Figure 2-13(b) as a bright dot and a sharp
peak, respectively.
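The unbias-and-normalize recipe of equation 2.18 translates directly into code; a minimal sketch (the function name is mine), which scores exactly 1.0 only at a perfect match and is unaffected by a brightness offset:

```python
import numpy as np

def normalized_cross_correlation(I, f):
    """Zero-mean, unit-norm correlation of feature f over image I (eq. 2.18)."""
    H, W = I.shape
    h, w = f.shape
    f0 = f - f.mean()
    fnorm = np.linalg.norm(f0)
    cc = np.zeros((H - h + 1, W - w + 1))
    for u in range(cc.shape[0]):
        for v in range(cc.shape[1]):
            z = I[u:u + h, v:v + w]
            z0 = z - z.mean()                 # unbias the candidate patch
            denom = np.linalg.norm(z0) * fnorm
            cc[u, v] = np.sum(z0 * f0) / denom if denom > 0 else 0.0
    return cc

# A bright spot no longer outscores the true match, even with an intensity offset.
rng = np.random.default_rng(1)
I = rng.random((32, 32))
f = I[10:18, 5:13].copy() + 0.5               # feature with a brightness offset
cc = normalized_cross_correlation(I, f)
peak = np.unravel_index(np.argmax(cc), cc.shape)  # (10, 5): the true location
```

Because the constant offset is removed by the unbiasing step, the coefficient at the true location is unity despite the intensity change.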
No noise was included in this process, and the feature image was an
immaculate portion of the target image; a cross-correlation coefficient of exactly
unity was therefore obtained at the correct location using the normalised
cross-correlation formula.
However, the shortcomings of the simple correlation term are evident even in this ideal
case. It must be noted that normalised cross-correlation can have its own shortfalls,
especially when there are significant geometric distortions between the feature and
target images, and when the feature size is too small. The former prevents the
occurrence of a perfect match, resulting in lower and wider peaks, whereas the latter
can falsely identify similar but unrelated features as a match, resulting in multiple
peaks. These problems can be mitigated to some extent by limiting the search window
to around the expected locations of the features rather than using the full image.
Figure 2-12 Gray scale plot of simple cross-correlation (a) and normalised cross-correlation (b).
Bright areas mean greater correlation.
Figure 2-13 3D mesh plot of simple cross-correlation (a) and normalised cross-correlation (b).
Peaks mean greater correlation.
However, the ease of implementation of correlation-like registration methods
using simple arithmetic operators makes them ideal for massively parallel
implementations, such as on FPGAs (Field Programmable Gate Arrays). Therefore,
despite their shortcomings, correlation-like methods are still favoured in real-time
applications, particularly on streamed data.
where x and y are the spatial shifts, M and N the height and width of the
images, and F and G the FFTs of the original and circularly shifted
images, respectively. Factoring out the phase difference is simply a matter of
calculating the normalized cross-power spectrum:
P(u, v) = \frac{F G^{*}}{\left|F G^{*}\right|} = \frac{F F^{*}\, e^{\,j2\pi\left(\frac{u y}{M} + \frac{v x}{N}\right)}}{\left|F F^{*}\right|} \qquad (2.21)

P(u, v) = e^{\,j2\pi\left(\frac{u y}{M} + \frac{v x}{N}\right)} \qquad (2.22)
The phase correlation method has the advantage over cross-correlation of
working in a single step, without the need for a sliding window. However, it
requires the Fourier transforms of the target and feature images to be of the
same size. Most real-life problems are not about finding the shift between two
ideally wrapped-around images, but about locating a much smaller feature in a
target image, so the feature image (64 × 64) and the target image (512 × 512)
must be brought up to a common larger size. Simply padding the feature up to the
size of the target (512 × 512) would not account for the correlation spread, so
the final padded size should be the sum of both sizes minus 1 in each direction,
i.e. (512+64−1) × (512+64−1) = 575 × 575, as shown in Figure 2-14.
Figure 2-14 (a) Target image f and (b) padded feature image g.
The feature could equally be centred in the padding with no real effect on the result;
the extra offset would simply have to be subtracted from the result later on. To
prevent artefacts arising from the edges of the images from showing up in the phase
correlation map, both target and feature images can be tapered at the edges as shown in
Figure 2-15; in this case a cosine transition of 10 pixels was used.
Figure 2-15 (a) Target image f and (b) padded feature image g with 10 pixels cosine taper on edges
of both images.
Figure 2-16(a) shows the inverse of the cross-power spectrum between the
untapered target and feature images, and Figure 2-16(b) that of the tapered pair;
in both plots a single, very distinct peak appears at the location where the feature
has a perfect match. However, it is evident that the phase correlation map for the
tapered pair is freer of artefacts and slightly steeper. The artefacts at the edges of
Figure 2-16(a) are simply due to the discontinuities introduced by the padding and are
easily avoided by tapering the edges of the images.
Figure 2-16 Inverse of the cross-power spectra (phase correlation maps): (a) without cosine
taper and (b) with cosine taper. Notice the lower peak in (a) and the noise in the side-bands.
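The padding-and-whitening procedure above can be sketched as follows; a minimal numpy illustration of phase correlation (the function name and sizes are my own, and tapering is omitted for brevity):

```python
import numpy as np

def phase_correlate(target, feature):
    """Locate feature in target via the normalized cross-power spectrum."""
    H, W = target.shape
    h, w = feature.shape
    ph, pw = H + h - 1, W + w - 1          # padded size: sum of sizes minus 1
    ft = np.zeros((ph, pw)); ft[:H, :W] = target
    g = np.zeros((ph, pw)); g[:h, :w] = feature
    F = np.fft.fft2(ft)
    G = np.fft.fft2(g)
    R = F * np.conj(G)
    R = R / np.maximum(np.abs(R), 1e-12)   # keep only the phase
    pc = np.fft.ifft2(R).real              # phase correlation map
    return np.unravel_index(np.argmax(pc), pc.shape)

# The peak lands at the feature's top-left corner in the target.
rng = np.random.default_rng(2)
feature = rng.random((16, 16))
target = np.zeros((64, 64))
target[21:37, 9:25] = feature
shift = phase_correlate(target, feature)   # (21, 9)
```

In a real application the edges of both padded images would first be given a cosine taper, as in Figure 2-15, to suppress the discontinuity artefacts discussed above.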
image acquisition conditions. More recently, Fraser et al. [40] proposed a more
noise-robust region-of-interest (ROI) approach, using a Wiener filter to determine
the shiftmaps and, as an added feature of the method, also obtaining the
position-dependent PSF.
The ROI approach becomes a necessity when dealing with differently distorted
feature and target images and constitutes the basis for hierarchical methods.
under given stretching forces. It is not uncommon to see elastic methods using optic
flow as a similarity metric: Periaswamy and Farid propose such a robust elastic method,
coupled with hierarchical affine transforms [71, 72]. Another affine technique, in which
an explicit spline representation of the images is used in conjunction with spline
processing, was proposed earlier by Thévenaz et al. [87].
In its simplest sense, optic flow is the visual motion of objects as perceived by
a moving observer. The term is reminiscent of the continuity equation of actual fluid
flow; in this case, however, the property that stays constant along a trajectory is the
pixel intensity (gray level). The underlying assumption is therefore that the gray level
I(x, y) of an infinitesimal pixel does not change as it moves along the optic flow (u, v),
where u and v are the perceived motion of the pixel in the x and y directions,
respectively, so that any change in intensity is with respect to time only. This
statement frames the following equation, known as the Optic Flow Constraint (OFC):
\frac{\partial I(x, y, t)}{\partial x}\, u + \frac{\partial I(x, y, t)}{\partial y}\, v + \frac{\partial I(x, y, t)}{\partial t} = 0 \qquad (2.23)
However, this is again an ill-posed problem, as a single scalar constraint is not sufficient
to determine the two-dimensional flow field (u, v). In practical terms, the only quantity
that can be measured in equation 2.23 is ∂I/∂t. Various solutions have been proposed to
make the problem well-posed, all involving the introduction of further constraints, such
as differentiation of the OFC [68], minimisation of a functional derived from the OFC
and a smoothness penalty term [51], or the assumption of constant or affine variation
of the optic flow [16, 17, 24]. Other methods, relying on the local constancy of the
optic flow using spatiotemporal filtering, have also been suggested [10, 30, 36, 49]. A
variation on these methods was introduced by Burns et al. [23] based on spatiotemporal
wavelet analysis as early as 1994. The regularization problem (circumventing the
ill-posedness) has been studied extensively; however, as the imaging process is
inherently noisy and the derivatives can only be approximated numerically, any two
independent equations will not necessarily give the best estimate of the true optic flow,
raising robustness concerns. Robustness has since been addressed by many
authors, and a thorough survey is presented by Bab-Hadiashar and Suter before their
improved take on the problem in [14], where they propose improvements over the
Weighted Total Least Squares (WTLS) and Least Median of Squares (LMedS)
methods. Elad and Feuer [33] introduce an assumption of temporal smoothness in
recursive image sequences that allows the incorporation of a Constrained Weighted
Least Squares (CWLS) estimator. Whilst the assumption of local constancy of the optic
flow is a convenient constraint, it is also the source of another problem, called time
aliasing, which is the failure to correctly detect large displacements. A multi-resolution
approach has been suggested by many authors [72], where the flow field is accumulated
from a coarse sampling grid down to the finest scale possible. This is very similar to
hierarchical shrinking-window methods [40].
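As an illustration of the local-constancy constraint, the OFC can be solved in closed form by least squares over a small window, in the spirit of the methods cited above (this is my own minimal sketch, not the implementation used in this thesis):

```python
import numpy as np

def local_flow(I1, I2, r0, r1, c0, c1):
    """Solve Ix*u + Iy*v = -It by least squares over the window [r0:r1, c0:c1]."""
    Ix, Iy = np.gradient(I1)               # spatial derivatives (axis 0 is x here)
    It = I2 - I1                           # temporal derivative between frames
    A = np.stack([Ix[r0:r1, c0:c1].ravel(),
                  Iy[r0:r1, c0:c1].ravel()], axis=1)
    b = -It[r0:r1, c0:c1].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic pair: the second frame is the first shifted by (0.4, 0.3).
x = np.arange(64.0)
X, Y = np.meshgrid(x, x, indexing="ij")
pattern = lambda X, Y: np.sin(0.3 * X) + np.sin(0.2 * Y) + 0.5 * np.sin(0.25 * (X + Y))
I1 = pattern(X, Y)
I2 = pattern(X - 0.4, Y - 0.3)
u, v = local_flow(I1, I2, 16, 48, 16, 48)  # close to (0.4, 0.3)
```

Note that the window must contain gradients in more than one direction for the 2 × 2 system to be well conditioned; this is exactly the aperture problem that makes the single-constraint formulation ill-posed.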
The shortcomings of the above methods make a strong case for the use of
wavelet analysis, which is a fundamental tool of multiscale filtering. Bernard offers a
successful implementation of wavelet analysis in his PhD thesis, “Wavelets and ill-posed
problems: optic flow and scattered data interpolation” [18].
This technique models the mapping between images as a locally affine but
globally smooth warp [71], and is described here in detail as it was used in all image
registration experiments throughout this thesis.
The local affine model with intensity variations is

m_7 f(x, y, t) + m_8 = f(m_1 x + m_2 y + m_5,\; m_3 x + m_4 y + m_6,\; t - 1) \qquad (2.24)

where m1, m2, m3, m4 form the 2 × 2 affine matrix, m5, m6 the translation vector, and m7,
m8 are parameters that embody the change in contrast and brightness, respectively. The
estimation of these parameters is done by defining a quadratic error function to be
minimized:
E(\mathbf{m}) = \sum_{x,y\in\Omega}\left[m_7 f(x, y, t) + m_8 - f(m_1 x + m_2 y + m_5,\; m_3 x + m_4 y + m_6,\; t - 1)\right]^{2} \qquad (2.25)
where m^T = (m1 … m8), and Ω denotes a small spatial neighbourhood. However, this
error function is non-linear in its unknowns; it is linearised by approximating f with a
first-order truncated Taylor series expansion:

E(\mathbf{m}) \approx \sum_{x,y\in\Omega}\left[m_7 f(x, y, t) + m_8 - \left(f(x, y, t) + (m_1 x + m_2 y + m_5 - x)\frac{\partial f(x, y, t)}{\partial x} + (m_3 x + m_4 y + m_6 - y)\frac{\partial f(x, y, t)}{\partial y} - \frac{\partial f(x, y, t)}{\partial t}\right)\right]^{2} \qquad (2.26)
This can be written compactly as

E(\mathbf{m}) = \sum_{x,y\in\Omega}\left(k - \mathbf{c}^{T}\mathbf{m}\right)^{2} \qquad (2.27)

where

k = \frac{\partial f}{\partial t} - f + x\frac{\partial f}{\partial x} + y\frac{\partial f}{\partial y}

and

\mathbf{c}^{T} = \left[\; x\frac{\partial f}{\partial x} \;\; y\frac{\partial f}{\partial x} \;\; x\frac{\partial f}{\partial y} \;\; y\frac{\partial f}{\partial y} \;\; \frac{\partial f}{\partial x} \;\; \frac{\partial f}{\partial y} \;\; -f \;\; -1 \;\right].
The derivatives are evaluated using a set of derivative filters specifically designed for
multi-dimensional differentiation [35]. Ordinarily, for discretely sampled images, these
derivatives are computed as differences between neighbouring sample values, which are
typically poor approximations and lead to substantial errors [72].
Differentiating the error function with respect to the unknowns and setting the result to
zero in order to minimize it gives:

\frac{dE(\mathbf{m})}{d\mathbf{m}} = \sum_{x,y\in\Omega} -2\mathbf{c}\left(k - \mathbf{c}^{T}\mathbf{m}\right) = 2\sum_{x,y\in\Omega}\mathbf{c}\mathbf{c}^{T}\mathbf{m} - 2\sum_{x,y\in\Omega}\mathbf{c}\,k = 0 \qquad (2.28)
\mathbf{m} = \left[\sum_{x,y\in\Omega}\mathbf{c}\mathbf{c}^{T}\right]^{-1}\left[\sum_{x,y\in\Omega}\mathbf{c}\,k\right] \qquad (2.29)
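Equation 2.29 is a small 8 × 8 linear system assembled directly from image derivatives; the sketch below (my own construction, using simple difference approximations rather than the matched derivative filters of [35], and summing over the whole frame rather than a neighbourhood) recovers a pure contrast/brightness change, for which the linearisation is exact:

```python
import numpy as np

def estimate_m(f1, f2):
    """Least-squares estimate of m (eq. 2.29); f1 is frame t, f2 is frame t-1."""
    fx, fy = np.gradient(f1)               # spatial derivatives (axis 0 is x)
    ft = f1 - f2                           # temporal derivative toward t-1
    x = np.arange(f1.shape[0])[:, None] * np.ones_like(f1)
    y = np.ones_like(f1) * np.arange(f1.shape[1])[None, :]
    k = ft - f1 + x * fx + y * fy
    c = np.stack([x * fx, y * fx, x * fy, y * fy,
                  fx, fy, -f1, -np.ones_like(f1)], axis=-1)
    m, *_ = np.linalg.lstsq(c.reshape(-1, 8), k.ravel(), rcond=None)
    return m

# Pure contrast/brightness change: m7 = 0.9, m8 = 0.1, geometry unchanged.
x = np.arange(48.0)
X, Y = np.meshgrid(x, x, indexing="ij")
f1 = np.sin(0.2 * X) + np.sin(0.15 * Y) + 0.3 * np.sin(0.1 * (X + 2 * Y))
f2 = 0.9 * f1 + 0.1                        # model: m7*f(t) + m8 = f(.., t-1)
m = estimate_m(f1, f2)                     # ~ [1, 0, 0, 1, 0, 0, 0.9, 0.1]
```

For geometric deformations the Taylor linearisation introduces errors, which is why the full method applies this estimate iteratively within a coarse-to-fine scheme.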
E_b(\mathbf{m}) = \left(k - \mathbf{c}^{T}\mathbf{m}\right)^{2} \qquad (2.31)
and the new quadratic error term that embodies the smoothness constraint:
E_s(\mathbf{m}) = \sum_{i=1}^{8}\lambda_i\left[\left(\frac{\partial m_i}{\partial x}\right)^{2} + \left(\frac{\partial m_i}{\partial y}\right)^{2}\right] \qquad (2.32)
where λi is a positive constant that controls the relative weight given to the smoothness
constraint on parameter mi. Minimizing the error function by differentiating it with
respect to the unknowns and setting the result to zero yields:
\mathbf{m} = \left[\mathbf{c}\mathbf{c}^{T} + L\right]^{-1}\left[\mathbf{c}\,k + L\,\bar{\mathbf{m}}\right] \qquad (2.33)

where L is a diagonal matrix containing the weights λi, and m̄ is the component-wise
average of m over the spatial neighbourhood Ω. One might think that equation
2.33 is as easily solvable as equation 2.29; however, the inclusion of m̄ causes it to grow
quickly into an enormous linear system that is intractable. Instead, equation 2.33 is
solved iteratively. The iterations are initiated by evaluating m for the first time using
the closed-form solution given by equation 2.29; a first estimate of m̄ is then calculated
and fed into equation 2.33 to produce the second estimate of m. This process is
repeated, with each iteration yielding an improved estimate of m from the
previous solution. In the present implementation, the iterations are run a
fixed number of times, although they could instead be stopped once m reaches a
prescribed levelling-off, which could be beneficial in cases where the images to be
registered require fewer or more iterations.
The problem of missing large motions, due to the finite support of the spatial
and temporal derivatives, is averted by adopting a coarse-to-fine scheme in which a
Gaussian pyramid is built for both images. A Gaussian pyramid is a succession of
low-pass filtered and down-sampled versions of the original image. At each level, a
global affine transformation is first identified and the corresponding shiftmaps are
obtained; the reference image at that level is then warped, and the fine-scale
registration iterations are started with a (2n+1) × (2n+1) window around each pixel.
The minimum value for n is 1, giving a 3 × 3 window, since it is not possible to
register a single pixel; in this implementation, a 5 × 5 window is used. Larger window
sizes can be envisaged, but they are likely to fail to capture smaller distortions. The
fine-scale process is also repeated, accumulating the shiftmaps until a maximum
number of iterations is reached. The shiftmaps from the previous coarse level are
re-sampled for the next finer level and new shiftmaps are added until the finest level
is reached. This algorithm proved to have outstanding sub-pixel precision and was
parallelized and used in all registration experiments in this thesis. My parallelization
of Periaswamy’s algorithm is discussed in greater detail in Appendix C.
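The Gaussian pyramid used in the coarse-to-fine scheme can be sketched with a separable binomial filter (my own minimal version; the thesis implementation may use a different kernel):

```python
import numpy as np

BINOMIAL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # 5-tap Gaussian-like kernel

def blur(img):
    """Separable low-pass filter: filter the rows, then the columns."""
    tmp = np.apply_along_axis(np.convolve, 1, img, BINOMIAL, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, BINOMIAL, mode="same")

def gaussian_pyramid(img, levels):
    """Successively low-pass filter and halve the image."""
    pyr = [img]
    for _ in range(levels - 1):
        img = blur(img)[::2, ::2]          # filter, then subsample by 2
        pyr.append(img)
    return pyr

pyr = gaussian_pyramid(np.ones((64, 64)), 3)
# Levels have shapes (64, 64), (32, 32), (16, 16); registration starts at the
# coarsest level and the shiftmaps are re-sampled down to the finest level.
```

Filtering before subsampling is what prevents the time aliasing discussed earlier: a large displacement at full resolution becomes a small, detectable displacement at the coarse level.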
2.3 Shiftmaps
i When referring to a shiftmap in singular form, unless its direction is specified such as XS or YS, it is convenient to
assume that it means the (XS, YS) pair of shiftmaps.