
2.2 Pixel registration

Pixel registration is a more specific form of image registration. In its broader sense, image registration is the process of overlaying two or more images of the same scene taken at different times, from different viewpoints, and/or by different sensors [97]. The aim is to geometrically align a set of sensed images with a reference image. In the context of remote sensing, the variations between the reference and sensed images are mostly limited to affine distortions, due to differences in viewpoint and camera focal length, and to content depth, due to sensor band. These limitations are the basis for extracting scenery change information over time. An affine transformation has six degrees of freedom; three points, with two degrees of freedom each, therefore constrain the six unknown affine coefficients. It is thus essentially sufficient to identify three coinciding feature points, as the corners of a large triangular patch, in order to infer the corresponding affine transformation [92] between the sensed image and the reference image.
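For concreteness, this counting argument can be written out. An affine map sends a point (x, y) to

\[ \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \]

so each point correspondence contributes two linear equations in the six unknowns (a_1, …, a_4, t_x, t_y), and three non-collinear correspondences determine them exactly. (The symbols a_i, t_x, t_y are generic notation introduced here for illustration, not taken from [92].)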
Zitova and Flusser, in their survey of image registration methods [97], classify the applications of image registration into four main groups:

Different viewpoints (multiview analysis)

Different times (multitemporal analysis)

Different sensors (multimodal analysis)

Scene to model registration.

In the context of imaging through turbulence, the application falls into two of these groups, multitemporal analysis and scene-to-model registration, for the following reason: the images are video frames registered either against one of the previous frames or against a prototype frame obtained from the previously acquired frames.

In the more specific sense, pixel registration aims at identifying one-to-one correspondence maps between the images. Unlike estimating an affine transform, this is essentially an ill-posed inverse problem, because single pixels cannot be tracked solely by their intensity values, which vary due to noise or simply illumination. Elastic transforms [15] help in the case of localised distortions that do not conform to any specific model, such as sensor distortion or point-of-view change.

Zitova and Flusser [97] also identify the four steps present in the majority of
registration methods as follows:

“Feature detection. Salient and distinctive objects (closed-boundary regions, edges, contours, line intersections, corners, etc.) are manually or, preferably, automatically detected. For further processing, these features can be represented by their point representatives (centres of gravity, line endings, distinctive points), which are called control points (CPs) in the literature.

Feature matching. In this step, the correspondence between the features detected in the sensed image and those detected in the reference image is established. Various feature descriptors and similarity measures along with spatial relationships among the features are used for this purpose.

Transform model estimation. The type and parameters of the so-called mapping functions, aligning the sensed image with the reference image, are estimated. The parameters of the mapping functions are computed by means of the established feature correspondence.

Image resampling and transformation. The sensed image is transformed by means of the mapping functions. Image values in non-integer coordinates are computed by the appropriate interpolation technique.”

Numerous techniques have been developed over the years addressing each of the above steps, with various degrees of success in various applications; the interested reader is therefore invited to sample the 200+ references surveyed by Zitova and Flusser [97] for detailed descriptions of these methods. The methods that are relevant to this thesis, however, are addressed briefly in the following sections. These are correlation-like and Fourier-based methods, and combinations of affine and elastic transforms. These methods are more relevant to this study because they are suitable for more complicated geometric deformations, rather than for matching distinct features between images.

2.2.1 Correlation-like methods

The cross-correlation (CC) method is a standard way of estimating the degree of similarity between two series. In the case of 2D data such as images, the two series are made of pixel intensities.
Figure 2-11 Template matching in search window: the feature window f is sought within the search window w of the X×Y image I; z denotes the candidate patch at origin (u, v).

Earlier template matching by cross-correlation simply used the cross-correlation term in the squared Euclidean distance between the two image patches to be compared [63]:

\[ cc(u,v) = \sum_{(X,Y)\in z} I(X,Y)\, f(X-u,\, Y-v) \tag{2.17} \]

where, with reference to Figure 2-11, the summation is done over (X, Y) ∈ z, f is the feature window to be matched in image I inside the search window w, and the point (u, v) is the origin of the corresponding candidate patch z. The size of the search window w limiting the scope of (u, v) can be adjusted to optimize computational overheads. If the approximate location of the feature window f is known, then the search window w can be centred around that point and made only large enough to capture the likely true location of the feature. When f = z, perfect correlation is obtained.

However, the simple cross-correlation term in 2.17 suffers from a major drawback: it is sensitive to changes in image energy inside the search window. According to this measure, any bright spot can score higher than an exact feature match. The template image f in Figure 2-11 has been cross-correlated for all values of (u, v) with the full image I. Figure 2-12(a) shows the cross-correlation map obtained using equation 2.17, and Figure 2-13(a) shows the same map as a 3D mesh. It is evident from both figures that there is no distinct location that scores particularly high for similarity; in fact, many locations other than the correct one score higher. This problem is overcome by making the cross-correlation invariant to changes in image intensity, and therefore to changes in energy. This invariance is achieved first by unbiasing the image and feature vectors, then by normalizing them. Unbiasing is simply the process of subtracting the expected (mean) values from the vectors, and normalization is the process of setting them to unit length. This gives the following normalized cross-correlation formula within the same summation interval:
\[ CC(u,v) = \frac{\displaystyle\sum_{(X,Y)\in z} \big[ I(X,Y) - E(\mathbf{I}_z) \big]\big[ f(X-u,\, Y-v) - E(\mathbf{f}) \big]}{\sqrt{\displaystyle\sum_{(X,Y)\in z} \big[ I(X,Y) - E(\mathbf{I}_z) \big]^2 \displaystyle\sum_{(X,Y)\in z} \big[ f(X-u,\, Y-v) - E(\mathbf{f}) \big]^2}} \tag{2.18} \]

By recognising the respective covariance (COV) and variance (VAR) expressions, 2.18 can be rewritten as:

\[ CC(u,v) = \frac{\mathrm{COV}(\mathbf{I}_z, \mathbf{f})\big|_{u,v}}{\sqrt{\mathrm{VAR}(\mathbf{I}_z)\,\mathrm{VAR}(\mathbf{f})}\,\Big|_{u,v}} \tag{2.19} \]

When the same feature image f is cross-correlated against the full image, the
cross-correlation map exhibits a single peak at the correct location of the feature image;
this can be easily seen in Figure 2-12(b) and Figure 2-13(b) as a bright dot and a sharp
peak, respectively.

No noise was included in this process, and the feature image was an immaculate portion of the target image; therefore, a cross-correlation coefficient of unity was obtained at the correct location using the normalised cross-correlation formula. The shortcomings of the simple correlation term, however, are evident even in this ideal case. It must be noted that the normalised cross-correlation can have its own shortfalls, especially when there are significant geometric distortions between the feature and target images, and when the feature size is too small. The former would prevent the occurrence of a perfect match, resulting in lower and wider peaks, whereas the latter could falsely identify similar but unrelated features as a match, resulting in multiple peaks. These problems can be mitigated to some extent by limiting the search window to around the expected locations of the features rather than using the full image.

Figure 2-12 Gray scale plot of simple cross-correlation (a) and normalised cross-correlation (b). Bright areas mean greater correlation.

Figure 2-13 3D mesh plot of simple cross-correlation (a) and normalised cross-correlation (b). Peaks mean greater correlation.

In addition to these shortcomings, it must be noted that the computation of both of these cross-correlation maps is extremely slow compared to the Fourier methods described in the next section. It took about 33.5 seconds to compute the simple cross-correlation map over the full 512×512 image, and 194.4 seconds for the normalised cross-correlation map, on a quad-core 2 GHz AMD workstation using only one of the processor cores with a MATLAB [5] implementation of the code.

However, the ease of implementation of correlation-like registration methods using simple arithmetic operators makes them ideal for massively parallel implementations such as on FPGAs (Field Programmable Gate Arrays). Therefore, despite their shortcomings, correlation-like methods are still favoured in real-time applications, particularly on streamed data.
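As a concrete illustration of equations 2.17 and 2.18, the normalised cross-correlation can be sketched directly in Python/NumPy (a minimal re-implementation for exposition, with illustrative function names; it is not the MATLAB code used for the timings above):

```python
import numpy as np

def ncc_map(I, f):
    """Normalised cross-correlation (eq. 2.18) of template f over image I.

    Returns a map with one coefficient per candidate origin (u, v);
    a value of 1 indicates a perfect match."""
    fh, fw = f.shape
    f0 = f - f.mean()                      # unbias the feature vector
    f_len = np.sqrt((f0 ** 2).sum())       # its length after unbiasing
    H, W = I.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            z = I[u:u + fh, v:v + fw]      # candidate patch z
            z0 = z - z.mean()              # unbias the image patch
            denom = np.sqrt((z0 ** 2).sum()) * f_len
            if denom > 0:                  # guard against flat patches
                out[u, v] = (z0 * f0).sum() / denom
    return out
```

The feature location is then simply np.unravel_index(np.argmax(cc_map), cc_map.shape); restricting I to a search window w around the expected location shrinks the loops accordingly.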

2.2.2 Fourier methods

Fourier methods offer substantial computational speed advantages, thanks to the fast Fourier transform (FFT), and greater accuracy than correlation-like methods when dealing with images corrupted by frequency-dependent noise or taken under varying conditions [97]. They operate on the Fourier representation of the images in the frequency domain. The Fourier Shift Theorem [21] is the basis of the phase correlation method, originally proposed for the registration of translated images. Assuming that two images f and g are circularly shifted (i.e. wrapped-around) versions of each other, the shift results in a phase difference in the Fourier domain such that:

\[ G(u,v) = F(u,v)\, e^{-j2\pi\left(\frac{u\,\Delta y}{M} + \frac{v\,\Delta x}{N}\right)} \tag{2.20} \]

where Δx and Δy are the spatial shifts, M and N are the height and width of the images, respectively, and F and G are the FFTs of the original and circularly shifted images, respectively. Factoring out the phase difference is simply a matter of calculating the normalized cross-power spectrum:

\[ P(u,v) = \frac{F\,G^{*}}{\left|F\,G^{*}\right|} = \frac{F\,F^{*}\, e^{\,j2\pi\left(\frac{u\,\Delta y}{M} + \frac{v\,\Delta x}{N}\right)}}{\left|F\,F^{*}\right|} = e^{\,j2\pi\left(\frac{u\,\Delta y}{M} + \frac{v\,\Delta x}{N}\right)} \tag{2.21} \]

Since the inverse Fourier transform of a complex exponential is a Kronecker delta:

\[ p(x,y) = \delta(x - \Delta x,\; y - \Delta y) \tag{2.22} \]

searching for the peak in p(x, y) will yield Δx and Δy.
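In code, the whole of equations 2.20–2.22 reduces to a few FFT calls. The following minimal sketch (assumed here for exposition, not taken from the thesis software) recovers the circular shift between two same-size images; the conjugation order is chosen so that the peak lands at (Δy, Δx), which differs from equation 2.21 only in sign convention:

```python
import numpy as np

def phase_correlation(f, g):
    """Shift of g relative to f via the normalised cross-power spectrum."""
    F = np.fft.fft2(f)
    G = np.fft.fft2(g)
    R = G * np.conj(F)                 # cross-power spectrum
    P = R / (np.abs(R) + 1e-12)        # keep only the phase term
    p = np.real(np.fft.ifft2(P))       # ideally a single delta spike
    dy, dx = np.unravel_index(np.argmax(p), p.shape)
    return dx, dy                      # shifts modulo the image size
```

Because the images are treated as circularly wrapped, shifts are only recovered modulo the image size; hence the padding discussed below.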

The phase correlation method has the advantage over the cross-correlation method of working in one single step, without the need for a sliding window. However, the phase correlation method requires that the Fourier transforms of both target and feature images be of the same size. As most real-life problems are not about finding the shift between two ideally wrapped-around images, but about locating a relatively much smaller feature in a target image, the size of the feature image (64×64) and the size of the target image (512×512) have to be brought up to a common larger size. A simple padding up to the size of the target image (512×512) could be done, but it would not account for spread, so the final padded size should be the sum of both minus 1 in each direction, i.e. (512+64−1)×(512+64−1), as shown in Figure 2-14.

Figure 2-14 (a) Target image f and (b) padded feature image g.

The feature could be centred in the padding with no real effect on the result; the extra offset would just have to be subtracted from the result later on. In order to prevent artefacts arising from the edges of the image from showing up in the phase correlation map, both target and feature images can be tapered at the edges, as shown in Figure 2-15; in this case a transition of 10 pixels was used.
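The taper itself can be built as a separable raised-cosine (Tukey-style) window. A sketch, assuming a roll-off of `width` pixels on each border (the 10-pixel transition above):

```python
import numpy as np

def cosine_taper(shape, width=10):
    """2-D edge taper: unity inside, raised-cosine roll-off to zero
    over `width` pixels at every border."""
    def ramp(n):
        w = np.ones(n)
        t = 0.5 * (1 - np.cos(np.pi * np.arange(width) / width))
        w[:width] = t                  # rising edge
        w[-width:] = t[::-1]           # falling edge
        return w
    return np.outer(ramp(shape[0]), ramp(shape[1]))

# tapered = image * cosine_taper(image.shape)   # applied to both images
```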

Figure 2-15 (a) Target image f and (b) padded feature image g, with a 10-pixel cosine taper on the edges of both images.

Figure 2-16(a) shows the inverse of the cross-power spectrum between the untapered target and feature images, and Figure 2-16(b) shows the same for the tapered pair; in both plots, a single and very distinct peak appears at the location where the feature has a perfect match. However, it is evident that the phase correlation map for the tapered pair is cleaner of artefacts and slightly steeper. The artefacts occurring at the edges of Figure 2-16(a) are simply due to the discontinuities introduced by the padding and are easily avoided by tapering the edges of the images.

Figure 2-16 Inverse of the cross-power spectra (phase correlation maps), (a) without cosine taper and (b) with cosine taper. Notice the lower peak in (a) and the noise in the side-bands.

In noiseless cases such artefacts can be easily ignored. However, more sophisticated filtering may be required in cases with extreme noise and/or varying image acquisition conditions. More recently, Fraser et al. [40] proposed a more noise-robust region of interest (ROI) approach that uses a Wiener filter to determine the shiftmaps and, as an added feature of the method, also obtains the position-dependent PSF.

The ROI approach becomes a necessity when dealing with differently distorted feature and target images, and constitutes the basis of hierarchical methods.

2.2.2.1 Hierarchical shrinking window registration

The registration of individual pixels is not possible due to intensity variations; therefore, larger features need to be registered. However, pixel and sub-pixel accuracies can be achieved by hierarchically shrinking and shifting the registration window. An automatic image-sequence registration algorithm was proposed by Thorpe et al. [89], aimed at restoring wide-area images of astronomical objects viewed through a turbulent atmosphere. Their method zooms in from a full-scale window that first identifies global shifts, then gradually moves to smaller windows to identify local deformations. Shiftmaps are updated as further displacements are identified at smaller scales. They go on to describe a method of restoring the deformations by registering each frame to an averaged prototype frame.
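In outline, such a shrinking-window scheme might be organised as below (a simplified sketch in the spirit of [89], not their implementation: the per-level warping of the frame, sub-pixel peak interpolation, and edge handling are omitted; phase_correlation is the routine sketched earlier):

```python
import numpy as np

def shrinking_window_register(frame, prototype, min_win=16):
    """Accumulate shiftmaps from a global window down to small local ones."""
    H, W = frame.shape
    xs, ys = np.zeros((H, W)), np.zeros((H, W))
    win = min(H, W)                    # start at full scale (global shift)
    while win >= min_win:
        for top in range(0, H - win + 1, win):
            for left in range(0, W - win + 1, win):
                a = prototype[top:top + win, left:left + win]
                b = frame[top:top + win, left:left + win]
                dx, dy = phase_correlation(a, b)
                xs[top:top + win, left:left + win] += dx
                ys[top:top + win, left:left + win] += dy
        win //= 2                      # zoom in to smaller windows
    return xs, ys
```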

2.2.3 Elastic methods

Images with complex and/or local distortions complicate the implementation and application of methods using parametric mapping functions. Such distortions are generally discounted as exceptions in parametric mappings, whereas in locally warped images they have to be dealt with as always occurring, which defeats the purpose of parametric methods. The alternative approach is not to use parametric transformation models, which will fail to match the localised distortions anyway, but to stretch the images into alignment [15]. This is done by viewing the images as rubber sheets and applying external forces to bring them into alignment with minimal bending and stretching. In this approach there is no feature matching as such, as feature correspondence is difficult to establish. Instead, registration is achieved by iteratively locating the minimum strain energy in the fictitious rubber sheet. Any similarity function can be used to establish a cost function for a given similarity level under given stretching forces. It is not uncommon to see elastic methods using optic flow as a similarity metric. Periaswamy and Farid propose such a robust elastic method coupled with hierarchical affine transforms [71, 72]. Another affine technique, in which an explicit spline representation of the images is used in conjunction with spline processing, was proposed earlier by Thévenaz et al. [87].

2.2.3.1 Optic flow

In its simplest meaning, optic flow is the visual motion of objects as perceived by a moving observer. The term optic flow is very much reminiscent of the continuity equation in actual fluid flow; however, in this case the property that stays constant along a trajectory is the pixel intensity (gray level). The underlying assumption is therefore that the gray level I(x, y) of an infinitesimal pixel does not change as it moves along the optic flow (u, v), where u and v are the perceived motion of the pixel in the x and y directions, respectively, and that any change in intensity is with respect to time. This statement frames the following equation, known as the Optic Flow Constraint (OFC):

\[ \frac{\partial I(x,y,t)}{\partial x}\, u + \frac{\partial I(x,y,t)}{\partial y}\, v + \frac{\partial I(x,y,t)}{\partial t} = 0 \tag{2.23} \]
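The OFC is just the chain rule applied to the constancy assumption: along a trajectory (x(t), y(t)) with dx/dt = u and dy/dt = v,

\[ \frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = \frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0. \]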

However, this is again an ill-posed problem, as a single scalar constraint is not sufficient to determine the two-dimensional flow field (u, v). In practical terms, the only quantity that can be measured in equation 2.23 is ∂I/∂t. Various solutions have been proposed to make this problem well-posed, all involving the introduction of further constraints, such as the differentiation of the OFC [68], the minimisation of a functional derived from the OFC and a smoothness penalty term [51], or the assumption of constant or affine variation in the optic flow [16, 17, 24]. Other methods relying on the local constancy of the optic flow using spatiotemporal filtering have also been suggested [10, 30, 36, 49]. A variation on these methods, based on spatiotemporal wavelet analysis, was introduced by Burns et al. [23] as early as 1994. The regularization problem (circumventing the ill-posedness) has been studied extensively; however, as the imaging process is inherently noisy and the derivatives can only be approximated numerically, any two independent equations will not necessarily give the best estimate of the true optic flow, raising robustness concerns. Robustness has since been addressed by many authors, and a thorough survey is presented by Bab-Hadiashar and Suter before their improved take on the problem in [14], where they propose improvements over the use of the Weighted Total Least Squares (WTLS) and Least Median of Squares (LMedS) methods. Elad and Feuer [33] introduce the assumption of temporal smoothness in recursive image sequences, which allows the incorporation of the Constrained Weighted Least Squares (CWLS) estimator. Whilst the assumption of local constancy of the optic flow is a convenient constraint, it is also the source of another problem, called time aliasing, which is the failure to correctly detect large displacements. A multi-resolution approach has been suggested by many authors [72], where the flow field is accumulated from coarse-grid sampling down to finer scales as far as possible. This is very similar to hierarchical shrinking window methods [40].
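As one concrete instance of the local-constancy constraint, a Lucas-Kanade style least-squares solve of equation 2.23 over a small window can be sketched as follows (simple difference gradients are assumed here in place of the filtering used by the cited methods):

```python
import numpy as np

def local_flow(I1, I2, x, y, half=2):
    """Solve eq. 2.23 for one pixel, assuming (u, v) constant in a
    (2*half+1)^2 window: stack Ix*u + Iy*v = -It and least-square it."""
    sl = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    Ix = np.gradient(I1, axis=1)[sl].ravel()
    Iy = np.gradient(I1, axis=0)[sl].ravel()
    It = (I2 - I1)[sl].ravel()         # forward difference in time
    A = np.stack([Ix, Iy], axis=1)
    uv, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return uv                          # (u, v) estimate at (x, y)
```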

All the shortcomings of the above methods are a perfect plea for the use of
wavelet analysis, which is a fundamental tool of multiscale filtering. Bernard offers a
successful implementation of wavelet analysis in his PhD thesis “Wavelets and ill posed
problems: optic flow and scattered data interpolation” [18].

2.2.3.2 Differential elastic image registration

This technique models the mapping between images as a locally affine but globally smooth warp [71]. It is described here in detail as it was used in all image registration experiments throughout this thesis.

Denoting the source and target images as f(x, y, t) and f(x̂, ŷ, t+1), respectively, Periaswamy models the motion between images locally as an affine transformation for each small spatial neighbourhood:

\[ m_7\, f(x,y,t) + m_8 = f(m_1 x + m_2 y + m_5,\; m_3 x + m_4 y + m_6,\; t+1) \tag{2.24} \]

where m1, m2, m3, m4 form the 2×2 affine matrix, m5, m6 the translation vector, and m7, m8 are parameters that embody the change in contrast and brightness, respectively. These parameters are estimated by defining a quadratic error function to be minimized:

\[ E(\mathbf{m}) = \sum_{(x,y)\in\Omega} \big[ m_7\, f(x,y,t) + m_8 - f(m_1 x + m_2 y + m_5,\; m_3 x + m_4 y + m_6,\; t+1) \big]^2 \tag{2.25} \]

where \( \mathbf{m}^T = (m_1 \; \ldots \; m_8) \) and Ω denotes a small spatial neighbourhood. However, this error function cannot be minimized analytically, as it is non-linear in its unknowns. This is overcome by approximating the error function using a first-order truncated Taylor series expansion:

\[ E(\mathbf{m}) \approx \sum_{(x,y)\in\Omega} \Big[ m_7\, f(x,y,t) + m_8 - \Big( f(x,y,t) + (m_1 x + m_2 y + m_5 - x)\,\frac{\partial f(x,y,t)}{\partial x} + (m_3 x + m_4 y + m_6 - y)\,\frac{\partial f(x,y,t)}{\partial y} + \frac{\partial f(x,y,t)}{\partial t} \Big) \Big]^2 \tag{2.26} \]

This can be written in more compact matrix form as:

\[ E(\mathbf{m}) = \sum_{(x,y)\in\Omega} \big[ k - \mathbf{c}^T \mathbf{m} \big]^2 \tag{2.27} \]

where

\[ k = f + \frac{\partial f}{\partial t} - x\,\frac{\partial f}{\partial x} - y\,\frac{\partial f}{\partial y} \]

and

\[ \mathbf{c}^T = \left( -x\,\frac{\partial f}{\partial x} \quad -y\,\frac{\partial f}{\partial x} \quad -x\,\frac{\partial f}{\partial y} \quad -y\,\frac{\partial f}{\partial y} \quad -\frac{\partial f}{\partial x} \quad -\frac{\partial f}{\partial y} \quad f \quad 1 \right). \]

The derivatives are evaluated using a set of derivative filters specifically designed for multi-dimensional differentiation [35]. Ordinarily, in the case of discretely sampled images, these derivatives are computed as differences between neighbouring sample values, which are typically poor approximations to derivatives and lead to substantial errors [72].

Differentiating the error function with respect to the unknowns and setting the result to zero in order to minimize it gives:

\[ \frac{dE(\mathbf{m})}{d\mathbf{m}} = \sum_{(x,y)\in\Omega} -2\,\mathbf{c}\,\big[ k - \mathbf{c}^T \mathbf{m} \big] = \sum_{(x,y)\in\Omega} 2\,\mathbf{c}\,\mathbf{c}^T \mathbf{m} - \sum_{(x,y)\in\Omega} 2\,\mathbf{c}\,k = 0 \tag{2.28} \]

Solving for m then yields:

\[ \mathbf{m} = \left[ \sum_{(x,y)\in\Omega} \mathbf{c}\,\mathbf{c}^T \right]^{-1} \left[ \sum_{(x,y)\in\Omega} \mathbf{c}\,k \right] \tag{2.29} \]
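In outline, the closed-form solve of equation 2.29 for one neighbourhood might be implemented as follows (a minimal sketch, with simple gradients standing in for the matched derivative filters of [35]; for two identical patches with no photometric change it returns m ≈ (1, 0, 0, 1, 0, 0, 1, 0)):

```python
import numpy as np

def solve_m(f1, f2):
    """Closed-form least-squares estimate of m (eq. 2.29) over one
    neighbourhood. f1, f2: the patches f(x, y, t) and f(x, y, t+1)."""
    h, w = f1.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    fx = np.gradient(f1, axis=1)
    fy = np.gradient(f1, axis=0)
    ft = f2 - f1                       # forward difference for df/dt
    x, y, f, fx, fy, ft = (a.ravel() for a in (x, y, f1, fx, fy, ft))
    # c and k as defined under eq. 2.27
    c = np.stack([-x*fx, -y*fx, -x*fy, -y*fy, -fx, -fy, f, np.ones_like(f)],
                 axis=1)
    k = f + ft - x*fx - y*fy
    return np.linalg.solve(c.T @ c, c.T @ k)   # eq. 2.29
```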

As stated earlier, Ω is a small spatial neighbourhood within which the affine parameters are stipulated to be constant. In addition, equation 2.24 also stipulates that the contrast and brightness parameters are constant within Ω. These stipulations are, however, at odds with each other: a larger Ω makes it more likely that equation 2.29 is solvable, which in turn makes it less likely that the constancy assumption will hold, and vice versa. Periaswamy solves this problem by introducing a smoothness assumption and augmenting the error function with an additional term:

\[ E(\mathbf{m}) = E_b(\mathbf{m}) + E_s(\mathbf{m}) \tag{2.30} \]

where E_b(m) is defined as in equation 2.27 without the summation:

\[ E_b(\mathbf{m}) = \big[ k - \mathbf{c}^T \mathbf{m} \big]^2 \tag{2.31} \]

and E_s(m) is a new quadratic error term that embodies the smoothness constraint:

\[ E_s(\mathbf{m}) = \sum_{i=1}^{8} \lambda_i \left[ \left( \frac{\partial m_i}{\partial x} \right)^2 + \left( \frac{\partial m_i}{\partial y} \right)^2 \right] \tag{2.32} \]

where λ_i is a positive constant that controls the relative weight given to the smoothness constraint on parameter m_i. Minimizing the error function by differentiating it with respect to the unknowns and setting the result to zero yields:

\[ \mathbf{m} = \big[ \mathbf{c}\,\mathbf{c}^T + \mathbf{L} \big]^{-1} \big[ \mathbf{c}\,k + \mathbf{L}\,\overline{\mathbf{m}} \big] \tag{2.33} \]

where L is an 8×8 diagonal matrix with diagonal elements λ_i, and m̄ is the component-wise average of m over the spatial neighbourhood Ω. One might think that equation 2.33 is as easily solvable as equation 2.29; however, the inclusion of m̄ causes it to grow quickly into an enormous linear system that is intractable. Instead, equation 2.33 is solved iteratively. The iterations are initiated by evaluating m for the first time using the closed-form solution given by equation 2.29; a first estimate of m̄ is then calculated and fed into equation 2.33 to compute the second estimate of m. This process is repeated, with an improved estimate of m evaluated from the previous solution at each iteration. In the present implementation, the iterations run a fixed number of times, whereas they could instead be stopped upon reaching a prescribed levelling-off of m, which could be beneficial where the images to be registered require fewer or more iterations.
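A deliberately direct (and slow) per-pixel sketch of this iteration might look as follows; the array layouts and the 3×3 averaging neighbourhood are assumptions made for exposition, not Periaswamy's implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def iterate_m(c, k, lam, m0, n_iter=10):
    """Iterative solve of eq. 2.33 at every pixel.

    c   : (H, W, 8) vector c at each pixel,  k : (H, W) scalar k,
    lam : 8 smoothness weights lambda_i,     m0 : (H, W, 8) initial m
          from the closed-form solution (eq. 2.29)."""
    L = np.diag(lam)
    m = m0.copy()
    H, W = k.shape
    for _ in range(n_iter):
        # component-wise average of m over each pixel's neighbourhood
        m_bar = uniform_filter(m, size=(3, 3, 1))
        for i in range(H):
            for j in range(W):
                cc = np.outer(c[i, j], c[i, j])
                rhs = c[i, j] * k[i, j] + L @ m_bar[i, j]
                m[i, j] = np.linalg.solve(cc + L, rhs)   # eq. 2.33
    return m
```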

The problem of missing large motions due to the finite support of the spatial and temporal derivatives is averted by adopting a coarse-to-fine scheme in which a Gaussian pyramid is built for both images. A Gaussian pyramid is a succession of low-pass filtered and downsampled versions of the original image. At each level, a global affine transformation is first identified and the corresponding shiftmaps are obtained; the reference image at that level is then warped, and the fine-scale registration iterations are started with a (2n+1)×(2n+1) window around each pixel. The minimum value for n is 1, giving a 3×3 window, since it is not possible to register a single pixel; in this implementation, a 5×5 window is used. Larger window sizes can be envisaged; however, they are likely to fail to capture smaller distortions. The fine-scale process is also repeated, accumulating the shiftmaps until a maximum number of iterations is reached. The shiftmaps from the previous coarse level are re-sampled for the next finer level, and new shiftmaps are added until the finest level is reached. This algorithm proved to have outstanding sub-pixel precision and was parallelized and used in all registration experiments in this thesis. My parallelization of Periaswamy's algorithm is discussed in greater detail in Appendix C.
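A skeleton of the coarse-to-fine driver, with hypothetical helper names and the global affine and fine-scale solves elided, might read:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels):
    """Succession of low-pass filtered, downsampled copies (finest first)."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def coarse_to_fine(src, dst, levels=4):
    """Accumulate shiftmaps from the coarsest pyramid level to the finest."""
    xs = ys = None
    for a, b in zip(gaussian_pyramid(src, levels)[::-1],
                    gaussian_pyramid(dst, levels)[::-1]):
        if xs is None:
            xs, ys = np.zeros(a.shape), np.zeros(a.shape)
        else:  # re-sample the coarser shiftmaps, doubling displacements
            factors = np.array(a.shape) / np.array(xs.shape)
            xs = 2 * zoom(xs, factors, order=1)
            ys = 2 * zoom(ys, factors, order=1)
        # ... the global affine step, warping, and 5x5 fine-scale solves
        # (eq. 2.33) would run here, adding their displacements to xs, ys.
    return xs, ys
```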

2.3 Shiftmaps

Shiftmapsⁱ, in digital image processing, are by definition maps of the shifts in the x and y directions describing complex geometric deformations between corresponding pixels of two images. Figure 2-17 shows the greyscale representation of the x and y direction shiftmaps for the simulated warped image discussed in section 1.3.5; the lighter areas

ⁱ When referring to a shiftmap in singular form, unless its direction is specified such as XS or YS, it is convenient to assume that it is the (XS, YS) pair of shiftmaps.
