
FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

Diplomarbeit in Informatik

GPU-Supported Recovery of Weakly Textured Surfaces from
Monocular Video Data

Taha Sabri Koltukluoglu

FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

Diplomarbeit in Informatik

GPU-gestützte Rekonstruktion schwach texturierter
Oberflächen aus monokularen Videodaten

GPU-Supported Recovery of Weakly Textured Surfaces from
Monocular Video Data

Author: Taha Sabri Koltukluoglu

Supervisor: Univ.-Prof. Dr.-Ing. Darius Burschka

Advisor: Dipl.-Inf. Univ. Oliver Ruepp

Date: December 9, 2011

I assure that I composed this diploma thesis single-handedly, supported only by the declared resources.

Munich, December 9, 2011. Taha Sabri Koltukluoglu

Acknowledgments

This research work would not have existed, or its results would have looked very different, without the contribution of several people, whom I cannot thank enough for helping me pursue my goals in an interesting and state-of-the-art area of computer vision research. Without their help, I would not have been able to contribute to the topical, but still not fully solved, problem of 3D reconstruction.

First of all, I would like to thank Prof. Darius Burschka, head of the Machine Vision and Perception Group, as well as Oliver Ruepp, for supervising my thesis and offering me an excellent working environment lacking none of the tools or hardware needed to achieve my goals. It has always been my dream to perform scientifically valuable research and to test or apply the results in interdisciplinary areas, such as the field of medical imaging. They made this possible by giving me the opportunity to investigate the behaviour of an information-theoretic approach to data fusion problems and its fitness for three-dimensional reconstruction from monocular images under varying lighting conditions (still an open research question). Additionally, Prof. Burschka's support for my participation in the International Supercomputing Conference (ISC '11) is much appreciated.

Oliver's readiness for unusually long discussion sessions about theoretical issues is not that common for a thesis and deserves many thanks. I could always rely on his extensive knowledge. At certain times, his additional full-time presence for progress checking as well as for testing was a great help in figuring out my failures. This gave me the opportunity to fix the theoretical problems and to improve my algorithms, which finally led to the breakthrough for a reconstructed scene. This would not have been possible without his countless corrections.

The proposed methods for data fusion problems were heavily based on Oliver's previous work. Most notably, the multi-scale optimization algorithms and both the coordinate-based and the intensity-based bundle adjustment algorithms were developed by Oliver. The presented results could not have been achieved without the technologies he developed.

During my thesis, I met many colleagues to whom I want to express my gratitude. I always enjoyed the soccer games after work. The atmosphere was very pleasant, and they really got along with me. Special thanks to these researchers: Juan Carlos Ramirez de la Cruz, Elmar Mair, Chavdar Papazov and Susanne Petsch.

Additionally, I want to thank my colleague Peter Maday, who supported me during my thesis. He was always ready to discuss both general and theoretical issues about my work and made suggestions. In that sense, he contributed a great deal to achieving my goals.


Furthermore, many special thanks to Dr. Joerg Woelfel as well as Dr. Rolf Matzner (from RPC Engineering) for their unforgettable support in tackling the financial problems of life in Munich during my entire thesis. Dr. Joerg Woelfel was always available to discuss the topics of my work in detail, even though this was not his area of research. He gave me great motivation at times when I was almost frustrated by certain issues.

Finally, and most of all, I have to express my infinite gratitude to my mother, Sevil Tokat, for her devotion during my entire studies, as she was the one who suffered most from my not being able to visit my hometown frequently during my studies in Germany. There were times when we could not see each other for years. Still, she stood by me in spirit every single day and supported me in whatever I decided to do.

This thesis as well as my entire studies are dedicated to Haci Faruk Koltukluoglu.


Abstract

The recovery of 3D structure from monocular image sequences is under extensive research. Solving the surface reconstruction problem usually incorporates the identification of point correspondences between the participating images. However, such correspondences are difficult to establish if the scene does not exhibit sufficient structure; one such example from the bio-medical area is video-endoscopy. To eliminate the need for finding the same locations in the images, one potential solution is the use of so-called image-based methods, which operate directly on image intensities.

In regular surface reconstruction algorithms, it is commonly assumed that image locations preserve their appearance over time. In many practical applications, however, such assumptions are not valid, for example in the case of changing illumination. The current work investigates the benefit of using information-theoretic similarity measures in conjunction with an image-based method to overcome such limitations.

The dense recovery of surface geometry relies on an intensity-based bundle adjustment method that estimates the surface geometry and the camera parameters of the individual images using a non-linear optimization method. The approach requires the estimation of marginal and joint probability densities and the computation of associated entropies over large sample sets, which is computationally very expensive. Nevertheless, some parts of these computations are parallelizable, and they have therefore been implemented on massively parallel GPU architectures using the OpenCL framework.

The proposed method has been tested on synthetic as well as real-world data sets. Its behaviour and robustness were evaluated under both static and time-varying lighting conditions.

Keywords: Monocular vision, dense 3D reconstruction, bundle adjustment, data fusion, mutual information, probability density estimation, surface modelling


Contents

Acknowledgements vii

Abstract ix

Contents xi

I. Introduction and Definitions 1

1. Introduction 3

1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2. Bundle Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3. Surface Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.1. B-Spline Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2. Overview and Problem Statement 7

2.1. Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1. Surface Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.2. Warping Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.3. Minimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3. Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4. Current Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

II. Theories and Methods 15

3. Similarity Measures 17

3.1. Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2. Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.1. Multiple Color Channels . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3. Normalized Variations of Mutual Information . . . . . . . . . . . . . . . . . 21

3.4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4. Probability Density Estimation 23

4.1. Definition and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2. Histogramming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.3. Kernel Density Estimator (Parzen-Rosenblatt Window) . . . . . . . . . . . 24

4.3.1. Selection of Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . 26


4.4. Summary and other References . . . . . . . . . . . . . . . . . . . . . . . . . 30

5. Continuous Estimation of Mutual Information 31

5.1. Parzen-Window Estimate of (Joint-)Entropy . . . . . . . . . . . . . . . . . . 31

5.2. Mutual Information and its Derivatives . . . . . . . . . . . . . . . . . . . . 34

5.2.1. The Gradient of Mutual Information . . . . . . . . . . . . . . . . . . 35

5.2.2. The Hessian of Mutual Information . . . . . . . . . . . . . . . . . . . 36

5.3. Normalized Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.3.1. Normalized Mutual Information . . . . . . . . . . . . . . . . . . . . 39

5.3.2. Relative Mutual Information (Symmetric Uncertainty) . . . . . . . . 40

5.4. An Alternative Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.4.1. The Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.4.2. The Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.5. Bandwidth Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

III. Implementation and Experiments 45

6. GPU and CPU Implementation 47

6.1. The Implementation of Mutual Information and its Derivatives . . . . . . . 47

6.2. GPU Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

7. Evaluation 49

7.1. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

7.2. Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

7.3. Reconstruction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

7.4. Running Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

8. Results 51

8.1. Synthetic Datasets under Static Lighting . . . . . . . . . . . . . . . . . . . . 51

8.2. Synthetic Datasets under Varying Lighting Conditions . . . . . . . . . . . . 54

8.3. Real Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

8.4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

IV. Conclusion 59

9. Discussion and Future Work 61

Appendix 65

A. Derivation of Kernel Density Estimator 65

B. Projective Issues and Camera Models 69

B.1. Pinhole Camera Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

B.2. CCD Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73


B.3. Finite Projective Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

C. Log-Sum-Exp Trick 75

D. B-Splines 77

D.1. Bézier Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

D.1.1. General Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

D.1.2. Examination of Cases and Constructing Bézier Curves . . . . . . . . 78

D.2. Spline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

D.3. B-Spline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Bibliography 83


Part I.

Introduction


1. Introduction

1.1. Motivation

This research work investigates the problem of dense three-dimensional reconstruction from weakly textured monocular image sequences under varying lighting conditions. Recovering the structure of a 3D model from 2D projections has been in the focus of attention and subject to extensive research for several decades. The problem can be regarded as the process of re-projecting a set of two-dimensional images back into the three-dimensional world, thus recovering the depth dimension that was lost during 2D projection. Yet no fully satisfactory method has been found so far.

In the area of 3D reconstruction, most attention has gone to feature-based recovery processes, which rely on robust feature-detection schemes such as SIFT or KLT. However, the assumption that reliable template points can be found in the participating images is not always reasonable, due to the lack of sufficiently distinctive features in application-specific cases. In the medical imaging area, for example, endoscopic as well as retinal images do not exhibit sufficient texture or feature points for conventional algorithms to accurately recover the surfaces and camera poses.

In this work, the focus is on dense reconstruction, estimating a depth for each pixel based directly on the intensity values in the captured images. This is well suited for surfaces that exhibit minimal texture. Such surfaces are mostly encountered in the field of bio-medical imaging, where this work could potentially be applied. The application deals with extracting the three-dimensional shape of a scene from the motion of the scene's two-dimensional projections in certain frames. The problem arising in this situation is also referred to as structure from motion; it is a data fusion problem that can be solved by casting it as a mathematical energy minimization problem.

For such a data fusion problem, the main focus of this work is on investigating the behaviour of an information-theoretic approach to the alignment of intensity values under varying lighting conditions. Another problem to be confronted at that point is probability density estimation for the intensity distributions of the participating images, especially in the multi-variate case.

Some definitions and an introduction to the applied methods are given in the next sections. Chapter 2 gives an overview of this work and presents the problem statement as well as the mathematical formulation of the optimization process. In chapter 3, the information-theoretic approach used as an objective function for image similarity measurement is introduced, and the problem of density estimation arising there is stated. Chapter 4 is dedicated to probability density estimation as well as the problems arising from it. Then, the continuous approximation of mutual information is given in chapter 5, and its gradient as well as its Hessian matrix are derived analytically. The last chapters show the results and experiments as well as the conclusions of this work.

1.2. Bundle Adjustment

Given an image sequence depicting some 3D scene from different points of view, bundle adjustment can be seen as a global minimization process that simultaneously refines a priori estimates of parameters. The parameters to be refined usually describe the 3D scene as well as the relative motion and some optical characteristics of the cameras involved, such as camera pose, extrinsic parameters and possibly radial distortion.

Given a set of homogeneous image coordinates x_j^i ∈ R^3 describing the position of the jth world point X_j ∈ R^4 as seen by the ith camera with a homogeneous camera projection matrix P^i ∈ R^{3×4}, the reconstruction problem consists in finding the 3D world points (4D in homogeneous representation) as well as the camera parameters, such that

    P^i X_j ∝ x_j^i.   (1.1)

Note that the homogeneous camera parameter matrix P = K [R|t] consists of the camera calibration matrix K ∈ R^{3×3} (usually known a priori in this work, since the camera is calibrated before capturing images) and the extrinsic parameter matrix [R|t] ∈ R^{3×4} with rotation and translation parameters.

If the image acquisitions are noisy, then equation 1.1 will not be satisfied exactly. Assuming the imaging error to be zero-mean Gaussian, bundle adjustment seeks the maximum likelihood (ML) solution (an introduction is given by L. Le Cam in [10]). The minimization problem can formally be stated as

    min_{P̂^i, X̂_j} Σ_i Σ_j I_ij d(P̂^i X̂_j, x_j^i)²,

where the I_ij denote binary variables that equal 1 if point j is visible in image i and 0 otherwise, and d(x, y) denotes the Euclidean distance between homogeneous points x and y. Finally, P̂^i X̂_j is the predicted projection of point X_j on the ith image frame, whereas x_j^i stands for the measured image point in the participating frames. This process is called bundle adjustment and involves the minimization of the reprojection error of the light rays originating from each 3D point and converging on each camera's optical center, which are adjusted optimally with respect to both the structure and viewing parameters.
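As a minimal illustrative sketch of the cost just described, the following pure-Python function evaluates Σ_ij I_ij d(P̂^i X̂_j, x_j^i)² for given projection matrices and homogeneous world points. The data layout (lists of 3×4 matrices, a visibility dictionary standing in for I_ij) is an assumption for illustration, not the implementation used in this work:

```python
def dehomogenize(x):
    """Map a homogeneous image point (x, y, w) to Cartesian (x/w, y/w)."""
    return (x[0] / x[2], x[1] / x[2])

def project(P, X):
    """Apply a 3x4 projection matrix P to a homogeneous 4-vector X."""
    return tuple(sum(P[r][c] * X[c] for c in range(4)) for r in range(3))

def reprojection_error(Ps, Xs, obs):
    """Sum of squared reprojection errors.

    Ps:  list of 3x4 camera matrices P^i
    Xs:  list of homogeneous world points X_j (4-vectors)
    obs: dict mapping (i, j) -> observed pixel (u, v); a missing key
         plays the role of the visibility indicator I_ij = 0.
    """
    total = 0.0
    for (i, j), (u, v) in obs.items():
        up, vp = dehomogenize(project(Ps[i], Xs[j]))
        total += (up - u) ** 2 + (vp - v) ** 2
    return total
```

Bundle adjustment then minimizes this scalar jointly over all entries of Ps and Xs with a non-linear solver; the sketch only shows the objective being minimized.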

Bundle adjustment is almost always used as the last step of feature-based 3D reconstruction algorithms. Hartley and Zisserman give an in-depth overview of the original bundle adjustment methods [22]. However, there also exist intensity-based bundle adjustment methods (also applied in this work), briefly explained in the paper by Triggs et al. [44].

On the other hand, as applied in this work, the number of 3D points together with the set of all participating image views leads to a reconstruction problem that requires minimization over 3·j + 7·i parameters (considering j 3D points and i sets of extrinsic camera parameters, with 3 for translation and 4 for a rotation quaternion, a total of 7 degrees of freedom per view), which becomes extremely costly. As a solution, reducing the number of 3D points as well as the number of views may help. In this case, not including all views at once, also referred to as a sliding-window approach, can be applied; a more recent paper evaluates the status of real-time bundle adjustment methods using sliding-window approaches [17].

1.3. Surface Representation

One of the important decisions in a three-dimensional reconstruction process is the choice of an appropriate object representation model, which is its own area of research, usually investigated in the field of computer graphics. Surfaces are one way to represent a model; other methods are solids and wireframes, such as lines and curves. Point clouds are also sometimes used as a temporary representation, with the goal of using the points to create one or more of the three permanent representations. An in-depth survey of surface representations by Andreas Hubeli and Markus Gross can be found in [24].

Point cloud representations are generally applied in traditional bundle adjustment algorithms and require the reliable identification of the points. Since this work takes into account intensity measurements only, it is not feasible to locate such points robustly, even if several frames are considered simultaneously. This observation disqualifies point clouds as a scene representation for intensity-based model-recovery algorithms; B-spline surfaces are applied instead.

1.3.1. B-Spline Surface

B-splines can be seen as a generalization of Bézier curves. Appendix D briefly explains the theory behind such curves in the two-dimensional case. Based on these definitions, a three-dimensional surface can be defined with such curves; this is referred to as a B-spline surface. Formally, it is given as

    c(u, v) = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} B_{i,p}(u) B_{j,q}(v) c_{ij},

where the c_{ij} stand for the control points, with i = 0, …, m−1 rows and j = 0, …, n−1 columns. B_{i,p}(u) and B_{j,q}(v) are B-spline basis functions of degree p and q, respectively, given a knot vector of k knots in the u-direction and a knot vector of l knots in the v-direction. Figure 1.1(a) gives a visual impression of such a B-spline surface, and Figure 1.1(b) shows the two-dimensional basis functions as wireframe surfaces. Note that the product of two one-dimensional B-spline basis functions, in the u and v directions, respectively, serves as the coefficient of some control point c_{ij}.
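The surface formula can be sketched directly using the Cox-de Boor recursion for the basis functions. This is an illustrative pure-Python sketch under assumed conventions (clamped knot vectors, control points as an m×n grid of 3D tuples), not code from this work:

```python
def bspline_basis(i, p, u, knots):
    """Cox-de Boor recursion for the basis function B_{i,p}(u)."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + p] != knots[i]:
        left = ((u - knots[i]) / (knots[i + p] - knots[i])
                * bspline_basis(i, p - 1, u, knots))
    right = 0.0
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, u, knots))
    return left + right

def surface_point(u, v, ctrl, ku, kv, p, q):
    """Evaluate c(u,v) = sum_i sum_j B_{i,p}(u) B_{j,q}(v) c_ij."""
    m, n = len(ctrl), len(ctrl[0])
    pt = [0.0, 0.0, 0.0]
    for i in range(m):
        bu = bspline_basis(i, p, u, ku)
        if bu == 0.0:
            continue  # basis functions have local support
        for j in range(n):
            w = bu * bspline_basis(j, q, v, kv)
            for k in range(3):
                pt[k] += w * ctrl[i][j][k]
    return tuple(pt)
```

Because the basis functions form a partition of unity inside the valid parameter range, a surface with identical control points collapses to that single point, which makes a convenient sanity check.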


Figure 1.1.: B-Spline Surface. (a) A B-spline surface defined by 6 rows and 6 columns of control points. (b) Basis functions of some control points, fixed in the u direction while changing in the v direction. Image courtesy of Dr. C.-K. Shene, http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/surface/bspline-construct.html


2. Overview and Problem Statement

2.1. Concept

For the description of the three-dimensional structure of the scene, specific surface models, B-spline surfaces of varying order and complexity (refer to subsection 1.3.1), are applied. They ultimately impose a smoothness constraint on the observed points, in the form of a per-pixel depth map model for a given reference image,

    D : R^k × R^2 → R.   (2.1)

The value D(d, u, v), where d ∈ R^k stands for a set of k surface parameters, is supposed to describe the depth of the pixel with coordinates (u, v). This concept is illustrated in Figure 2.1, and the depth map of a real face image is shown in Figure 2.2(b).

Figure 2.1.: The depth map concept. Schematic 2D view of a depth map. Image courtesy

of Oliver Ruepp

Observing a three-dimensional scene from two different viewpoints essentially yields two different view frames, which are related to each other in some way. Supposing that the extrinsic camera parameters as well as a perfect mathematical description of the surface are known, a coordinate warping function of the form R^2 → R^2 can be formulated, which maps a set of coordinates of one image into the other (see the next section, 2.2). This is reasonable if the coordinates to be mapped always stay within the image frames of both views. Therefore the depth map described above is not applied to the entire reference image, but to a user-chosen rectangular region of interest (ROI), which is illustrated in Figure 2.2. The arising problem is to find both the surface parameters describing the scene and the parameters of the participating cameras.

To achieve a model reconstruction from a sequence of monocular images, the system has to be initialized. First, a template image and an ROI (as in Figure 2.2(a)) are chosen for the surface to be reconstructed. Then, some correspondences between the template image and the acquired images {1, …, |V|}, for a given window size |V| (see sliding-window approaches), are computed for the first initialization of the system, based on a multi-scale optimization approach.

To achieve usable results and ensure that the algorithm is stable, one must also be careful which images to select for the optimization window V. Simply choosing consecutive images can be a bad idea: if the camera stops moving for some time, a number of images from the same position will be taken. If those images are placed inside the window, the optimization process becomes very unstable, since depth reconstruction is impossible without a certain minimal baseline. Therefore new images are accepted into the sliding window only if the baseline to the previous image is large enough. Whenever a new image is accepted, the oldest image is discarded.
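The selection rule just described can be sketched in a few lines. This is a minimal sketch under assumed conventions (camera centres as (x, y, z) tuples, baseline measured as Euclidean distance between consecutive centres); the function name and thresholds are hypothetical, not taken from this work:

```python
import math
from collections import deque

def try_accept(window, new_center, min_baseline, max_size):
    """Sliding-window view selection.

    window:     deque of accepted camera centres, oldest first
    new_center: candidate camera centre (x, y, z)

    Accept the candidate only if its baseline to the most recently
    accepted view is large enough; once the window is full, accepting
    a new view discards the oldest one.
    """
    if window and math.dist(new_center, window[-1]) < min_baseline:
        return False  # baseline too small: depth would be ill-conditioned
    window.append(new_center)
    if len(window) > max_size:
        window.popleft()
    return True
```

Rejecting near-duplicate views keeps the window at (roughly) |V| views with pairwise baselines above the threshold, which is exactly the stability condition motivated above.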

In the next section, the mathematical formulation and the problem statement are given in the form of a minimization problem, which then results in a process that optimizes an objective function describing the goodness of similarity of the aligned image intensity pairs, to account for image correspondences. For such a similarity measure, an information-theoretic quantity, based on a probabilistic notion of how well one image explains the other, is investigated and experimented with in this work. The motivation for investigating such an information-theoretic approach was the need for three-dimensional reconstruction under changing brightness or varying lighting conditions. This is usually complicated or difficult to achieve without the presence of certain features, as the application relies only on image intensity values, which yield less information. This problem, based on the approach described above, is pointed out in the work of Oliver Ruepp [33].


Figure 2.2.: The reconstruction of a region of interest and its respective depth map. (a) The template image and the chosen ROI (green rectangle). (b) The depth map for the reconstructed face. (c) 3D reconstruction, left side view. (d) 3D reconstruction, right side view.


2.2. Problem Statement

2.2.1. Surface Function

To describe a three-dimensional surface, a parameterized function is formally defined based on the characteristics of a pinhole camera model. Appendix B gives the necessary introduction to this camera model. Given a 3D Cartesian point X_S = (x, y, z)^T, the projection function is given as

    π(X_S) = (x f_x / z + p_x, y f_y / z + p_y)^T,   (2.2)

where f_x, f_y are the focal lengths in terms of pixel dimensions and p_x, p_y describe the camera center, as stated in Appendix B.

Each pixel can now be described in terms of a ray originating from the camera center and reaching the surface of the 3D scene at a certain depth λ. Assuming that the camera center is placed at the origin of the coordinate system, this yields a surface function

    r(u, v, λ) = λ ((u − p_x)/f_x, (v − p_y)/f_y, 1)^T,

as the ray corresponding to pixel coordinates (u, v), describing the three-dimensional surface.
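As a quick sanity check of these two definitions, projecting the ray r(u, v, λ) with π must return the original pixel (u, v) for every depth λ > 0, i.e. the depth is exactly the degree of freedom lost in projection. A small sketch with hypothetical intrinsics (illustrative only):

```python
def project(X, fx, fy, px, py):
    """pi(X_S) = (x*fx/z + px, y*fy/z + py), cf. eq. (2.2)."""
    x, y, z = X
    return (fx * x / z + px, fy * y / z + py)

def ray(u, v, lam, fx, fy, px, py):
    """r(u, v, lambda) = lambda * ((u - px)/fx, (v - py)/fy, 1)."""
    return (lam * (u - px) / fx, lam * (v - py) / fy, lam)
```

By construction, project(ray(u, v, λ, …), …) = (u, v) for any λ > 0, which is why a depth map D(d, u, v) is needed to single out one surface point per pixel.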

2.2.2. Warping Function

As described in the concept (section 2.1), a coordinate warping function can be formulated if the exact camera parameters as well as a perfect mathematical description of the 3D scene, in the form of certain surface parameters, are known. Figure 2.3 illustrates this. For such a warping function, the parameters that best explain the correspondences can be computed, based on a non-linear optimization process.

Figure 2.3.: Left, middle: Surface under two different camera positions. Right: Warping of surface coordinates from left to right image. Image courtesy of Oliver Ruepp


To formulate the optimization problem, a cost function has to be defined that describes this correspondence. More on that follows in the next subsection.

2.2.3. Minimization Problem

Now we are ready to state the actual problem, which is a mathematical optimization problem in which a certain objective function has to be minimized. Recall that the function D(d, u, v) in equation 2.1 denotes the depth map function for coordinates u and v, and that there is a sequence of monocular images, of which I_0 is denoted as the reference image, with I_0(u, v) referring to the pixel of the image at location (u, v). The participating images are numbered sequentially as I_0, I_1, …, I_{n−1}. Furthermore, let c_n denote the set of camera parameters of the nth camera.

Now, supposing that there exists a mechanism, called w_n, depending on the camera parameters c_n as well as on the k-dimensional surface parameters d, the following identity would be expected if the described correspondence were exact:

    I_0(u, v) = I_n(w_n(d, c_n, u, v)),

where w_n is referred to as the image coordinate warping function for a certain frame n and defines the correspondence of coordinates between the reference and the nth image. This equation would also hold for each coordinate location (u_i, v_i) if the correct camera parameters c_n and surface parameters d were known. This mechanism is mathematically formulated as

    w_n(d, c_n, u, v) ≡ π (T(c_n, r(u, v, D(d, u, v)))),

where D and r are as explained above and

    T(c_n, X_S) : R^3 × R^4 × R^3 → R^3

stands for the transformation mapping 3D spatial coordinates X_S to 3D coordinates in the camera frame described by c_n = (t_n, q_n), denoting the extrinsic camera parameters with t_n ∈ R^3 as translation and q_n ∈ R^4 as rotation quaternion. Furthermore, π is the projection of a 3D location to 2D image coordinates, as stated in equation 2.2.
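Pulling the pieces together, the composition w_n = π ∘ T ∘ r can be sketched in a few lines. This is an illustrative sketch under assumed conventions (quaternion order (w, x, y, z), a constant depth map standing in for D(d, u, v), hypothetical intrinsics), not the implementation of this work:

```python
def project(X, fx, fy, px, py):
    """pi: pinhole projection of a camera-frame 3D point (eq. 2.2)."""
    x, y, z = X
    return (fx * x / z + px, fy * y / z + py)

def backproject_ray(u, v, depth, fx, fy, px, py):
    """r(u, v, lambda): the point on the ray through pixel (u, v)."""
    return (depth * (u - px) / fx, depth * (v - py) / fy, depth)

def quat_rotate(q, p):
    """Rotate point p by the unit quaternion q = (w, x, y, z)."""
    w, qx, qy, qz = q
    # p' = p + 2 * qv x (qv x p + w * p), with qv = (qx, qy, qz)
    cx = qy * p[2] - qz * p[1] + w * p[0]
    cy = qz * p[0] - qx * p[2] + w * p[1]
    cz = qx * p[1] - qy * p[0] + w * p[2]
    return (p[0] + 2 * (qy * cz - qz * cy),
            p[1] + 2 * (qz * cx - qx * cz),
            p[2] + 2 * (qx * cy - qy * cx))

def warp(u, v, depth, cam, intr):
    """w_n(d, c_n, u, v) = pi(T(c_n, r(u, v, D(d, u, v)))) with a
    constant depth map D = depth; cam = (t_n, q_n), intr = (fx, fy, px, py)."""
    t, q = cam
    X = backproject_ray(u, v, depth, *intr)
    Xc = tuple(a + b for a, b in zip(quat_rotate(q, X), t))
    return project(Xc, *intr)
```

For the identity camera (zero translation, identity quaternion), the warp maps every pixel onto itself, which is the expected degenerate case of equal reference and current views.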

This situation motivates the definition of some similarity measurement SM on intensity values, which serves as an objective function O that is supposed to take its minimum value in case of correct camera and model parameters:

    O(d, c_n) = SM( (I_0(u_0, v_0), …, I_0(u_m, v_m))^T ; (I_n(w_n(d, c_n, u_0, v_0)), …, I_n(w_n(d, c_n, u_m, v_m)))^T ).


The problem of finding a warping function from the template image I_0 to the current image I_n is now formulated as the problem of minimizing some cost function with respect to the camera and depth map parameters. In Ruepp's work [33], an optimization method for this problem, based on a well-known technique, is suggested and implemented in an extensible, yet flexible way.

As also stated in section 1.2, such an optimization process, refining camera and model parameters based on intensity values (or coordinates), can be computationally quite expensive. Therefore, the set of involved pixels is reduced to a set of m chosen reference points. Furthermore, the participating views are reduced to a set of views of size |V| during the optimization process, and the entire application is handled as a sliding-window approach, where new views are added to the window whenever necessary, while older views are discarded accordingly. This allows a generalization of the objective function derived above:

    O′(d, c_{V_1}, c_{V_2}, …) = Σ_{j∈V} O(d, c_j),

with view indices V_1, V_2, … ∈ {1, 2, …, n}.

Thus, the new objective function O′ is set up as the sum of objective functions, where the optimization is performed simultaneously over the views held in a window of a certain size |V|.
certain size |V |.

2.3. Previous Work

Given the problem statement, one would expect to apply a suitable similarity measure to account for correspondences between image intensities, as required during the optimization process described above. Previous work by Oliver Ruepp focused on a cost function performing the evaluation with respect to the differences between intensity pairs. Good results were obtained in this way for certain scenes where the brightness does not change between image frames and the light source stays in a constant location.

However, the robustness of such a cost function was limited for some special scenes. For example, in medical video endoscopy, the light source moves along with the camera. As a result, the lighting conditions differ between frames. As became obvious, the cost function previously applied did not perform well under changing illumination, which in most cases is to be expected in the real world.

2.4. Current Approach

The main concentration of this work is on investigation of an information theoretic approach

for the required similarity measurement. The reason for such an approach was the potential

of its robustness against changes in illumination and brightness of participating frames.

The evaluation of the correspondence between intensity pairs is no longer performed based

on pixel diﬀerences as described in the previous work. Instead, the probability distributions

of image samples have been under consideration. If certain images are aligned on top of

each other, there also exist a joint probability distribution, which is of great importance.

Such joint distributions are used in the information theory to look for uncertainties.

This work aims to apply these theories for dense recovery of the scenes under changing

illumination. The idea behind this goal is to look for an order or disorder of joint distribu-

tions between intensity values of image samples. If the images are well aligned, one would

expect the amount of certainty of joint distributions to be similar in a way, independent

from the diﬀerences in brightness or illumination. The question under investigation is to

deﬁne how well one of the samples explains the other one.


Part II.

Theories and Methods


3. Similarity Measures

Image similarity measures are broadly used in computer vision. A similarity measure quan-

tiﬁes the degree of similarity between intensity patterns in two images [18]. Usually, there

is a ﬁxed pattern, X, also referred to as the reference image, and a changing pattern, Y ,

called the ﬂoating image, which is explained as a mapping of a set of some estimated param-

eters. These parameters are then adjusted during an image alignment process based on a

similarity measure, until some matching criterion is optimized, thus the similarity measure

reaches an optimum. The choice of such an objective function depends on the modality of

the images to be aligned.

Some common examples of image similarity measures include the sum of squared intensity differences (SSD), the sum of absolute intensity differences (SAD), and cross-correlation (CC), which are usually applied for registration within the same modality. On the other hand, for the dense recovery of objects from monocular images, the Pseudo-Huber cost function was applied as a similarity measure in our group; the work published at ACCV 2010 can be found in [33]. Another interesting measure is mutual information, which is applied mostly across different modalities (such as in medical procedures for registering CT and MRI images).

Mutual information is deﬁned in terms of entropy measures. The methods are explained

in the following sections. The results and experiments are shown in the following chapters.


3.1. Entropy

In information theory, the entropy of a random variable X is a measure of the average or expected information content of an event, whose distribution is determined by the marginal probability of the random variable. One such measure was introduced by Shannon in 1948 [38], and is defined as

H(X) = Σ_{x∈X} p(x) · log(1/p(x)),    (3.1)

where p(·) is the probability mass function (pmf) of the random variable X. Shannon entropy measures the degree of uncertainty of a random variable by scoring less likely outcomes higher than more likely ones. This is consistent with the notion that knowledge of an outcome that can be easily predicted is considered less valuable. An extension of the Shannon entropy to the domain of real numbers can be achieved by replacing the sum with an integral; this is considered later in Chapter 5.
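Equation 3.1 can be evaluated directly from empirical symbol frequencies. The following minimal sketch (illustrative only; the coin example and the choice of base 2 are not from the thesis) estimates p(x) by counting and plugs it into the sum:

```python
import math
from collections import Counter

def shannon_entropy(samples, base=2):
    """H(X) = sum_x p(x) * log(1/p(x)), with p estimated by relative frequency."""
    n = len(samples)
    counts = Counter(samples)
    # log(1/p(x)) = log(n / count), scored once per distinct outcome
    return sum((c / n) * math.log(n / c, base) for c in counts.values())

# A fair coin carries one bit of uncertainty; a constant carries none.
print(shannon_entropy([0, 1], base=2))   # 1.0
print(shannon_entropy([0, 0, 0, 0]))     # 0.0
```

Rare outcomes contribute large log(1/p) terms but with small weight p, which is exactly the trade-off the text describes.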

3.2. Mutual Information

In the imaging domain, MI [38] has been very successful as a similarity measure for automatic registration of multi-modal images [30]. MI is defined as the difference between the sum of the marginal entropies and the joint entropy, considering the joint probability densities of both images (entropy can be understood as the untidiness of an image).

Mathematically, MI measures the mutual dependence of two random variables X and Y [14, 29, 27]. It quantifies the dependence between the joint distribution of X and Y and is formally defined as:

MI(X; Y) = H(X) − H(X|Y)
         = H(X) + H(Y) − H(X, Y),

MI(X; Y) = ∫_y ∫_x p(x, y) log( p(x, y) / (p(x)p(y)) ) dx dy,

where p(x, y) is the joint probability density function of X and Y, and p(x), p(y) are the marginal probability density functions of X and Y respectively. Furthermore, H(X|Y) is the information content of the random variable X if Y is known, i.e. the conditional entropy measuring the uncertainty about X given the realization of Y. H(X, Y) is the joint entropy of the two random variables and is a measure of their combined information. Note that, in imaging, p usually denotes the probability of a certain image intensity value. In the case of bivariate densities, p(x, y) refers to the probability of the simultaneous occurrence of intensities x and y in the two images.


Considering two identical images under the assumption that they are perfectly aligned and there is complete similarity, the projection of the joint density function will lie on an identity line from the minimum to the maximum intensity, as seen in Figure 3.1.

Figure 3.1.: Joint histogram of two identical images. (Image courtesy of Prof. Navab; see lecture slides: http://campar.in.tum.de/Chair/TeachingWs11CAMPLecture)

This means that the entropy actually gives the amount of disorder. Since we are looking for a certain order (such as the straight line on the diagonal shown in Figure 3.1), the joint entropy has to be minimized, which means that the mutual information between the participating images has to be maximized. Figure 3.2 visualizes the joint entropies of some real images, where it can easily be seen that in the case of a bad match between images, the joint histogram of intensities gives the impression of intense disorder, whereas well aligned images give a well ordered visual impression.

3.2.1. Multiple Color Channels

When evaluating the mutual information between two given images, one is confronted with the problem of different color channels. One possible way to overcome this issue is to convert the color images to luminance images and to evaluate the mutual information between the converted images. On the other hand, in order not to lose any color information, the mutual information can be evaluated for each color channel of the images and the results can be summed up, which is the way followed in this work. Usually RGB channels are used, and the mutual information for such color images is evaluated as

MI(X; Y)_RGB = MI(X; Y)_R + MI(X; Y)_G + MI(X; Y)_B .
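The per-channel summation can be sketched as follows. Note that this illustration estimates each MI term from a simple joint histogram (the thesis itself uses Parzen-window estimates, introduced later); the bin count of 32 is an arbitrary choice for the sketch:

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y), estimated from a 2D intensity histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    h_xy = -np.sum(pxy[nz] * np.log2(pxy[nz]))
    h_x = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    h_y = -np.sum(py[py > 0] * np.log2(py[py > 0]))
    return h_x + h_y - h_xy

def mutual_information_rgb(img_x, img_y, bins=32):
    """MI_RGB = MI_R + MI_G + MI_B, summed over the last (channel) axis."""
    return sum(mutual_information(img_x[..., c], img_y[..., c], bins=bins)
               for c in range(3))

rng = np.random.default_rng(0)
a = rng.random((64, 64, 3))
b = rng.random((64, 64, 3))
# An image shares far more information with itself than with unrelated noise.
print(mutual_information_rgb(a, a) > mutual_information_rgb(a, b))  # True
```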


Figure 3.2.: Joint histograms. The images in the first and second columns are almost the same, whereas in the last column the difference is obvious. The joint histograms are given respectively.


3.3. Normalized Variations of Mutual Information

Besides the general form of MI, there also exist different normalized variations. According to [41], the normalized mutual information is calculated from three entropies as

NMI(X; Y) = (H(X) + H(Y)) / H(X, Y),    (3.2)

where H still refers to the Shannon entropy. This version was investigated by C. Studholme for its invariance to the size of the overlapping region in the field of 3D medical image registration.

The following variants are provided by the coefficients of constraint [13] or uncertainty coefficients [31],

C_XY = MI(X, Y)/H(Y)   and   C_YX = MI(X, Y)/H(X).

However, the two coefficients, also referred to as fidelity ratios and applied by J. W. Larson [26], are not necessarily equal. The absolute redundancy of information gives a more sophisticated and symmetrically scaled variant expressed as

AR = (H(X) + H(Y) − H(X, Y)) / (H(Y) + H(X)).

This value is normalized by R_max,

R_max = min(H(Y), H(X)) / (H(Y) + H(X)).

Thus, in the case of independence of the participating variables, the absolute redundancy attains a value of zero, whereas it attains R_max when one variable becomes completely redundant given knowledge of the other [32, 5].

Besides that, the relative mutual information RMI(X, Y) is the ratio between the mutual information MI(X, Y) and the average of the sum of the marginal entropy information transmitted by X and Y [42], and is formally given as

RMI(X, Y) = (H(X) + H(Y) − H(X, Y)) / ( (H(X) + H(Y)) / 2 ).    (3.3)

It represents a weighted average of the two uncertainty coeﬃcients [31]. This term is also

referred to as the symmetric uncertainty [48].

The following expressions, as investigated in [49, 40], give other variations of mutual information:

MI(X, Y)/H(X, Y),   MI(X, Y)/√(H(Y)H(X)),   MI(X, Y)/min(H(X), H(Y)).

3.4. Summary

The most important problem arising in the evaluation of mutual information is the probability density function p. Its values are typically not known, and an accurate estimation is needed in order to approximate the mutual information. This leads directly to the area of density estimation problems. The process of estimating such densities might then bring its own problems, including the estimation of certain other parameters. The level of complexity of such a process is higher than one might assume at the outset. Probability density estimation is still an unsolved problem, especially in the case of multivariate densities. This part has been the most important obstacle of this thesis. The next chapter addresses probability density estimation, gives the theories behind it, and shows the potential problems.


4. Probability Density Estimation

4.1. Deﬁnition and Motivation

Considering a continuous random variable, the probability density function (pdf) reveals

the distribution of that variable, allowing one to evaluate some statistical characteristics such as the mean and variance (if they exist). Besides that, the pdf also provides the probability that the variable will take on values in a certain interval.

In reality, the underlying pdf p(x) of an observable random variable X is not known.

Usually, there exist observations of X to some extent, from which the underlying pdf might

be estimated approximately. The computation of such an approximation is referred to

as probability density estimation (pde). Such estimators can be categorized as being

parametric, non-parametric and semi-parametric.

If some form of the pdf can be assumed, for application specific reasons, parametric methods can be used. This is, for example, often applied in ultrasound signal processing algorithms in the form of Rician and Rayleigh functions.

Non-parametric methods do not assume any kind of form of the underlying pdf, so there is

no guidance or constraints from theory. The simplest and most widely used non-parametric

estimation technique is the histogram. Due to some limitations and depending on the

needs of the application, other non-parametric methods have been developed such as kernel

density estimation (kde), also referred to as Parzen windowing, and artificial neural

networks.

Semi-parametric estimators can be seen as a combination of, or a compromise between, parametric and non-parametric methods, whereby a superposition of a number of parametric densities is used to approximate the underlying density, as in Gaussian mixture models.

Within this work, the estimation problem is approached without making any assumption about the structure of the underlying pdf, except for some unknown parameters that need to be estimated. For this reason, this work adopts non-parametric estimation, applying the kernel density estimator, which is derived and explained in the following sections; these give a summary of non-parametric estimation techniques based on Härdle [21]. A short description of the simple histogram approach, pointing out its drawbacks and unfitness for this application, is also given in the next section. Then the well known kernel density estimator is derived from the histogram technique.


4.2. Histogramming

The most commonly used and widely known method to estimate the underlying pdf is the histogram. It is a graphical representation of the distribution of samples drawn from a random variable. A histogram can be constructed by defining a set of intervals B_j, called the bins, of (usually) the same length h, referred to as the binwidth, and counting the number of observations falling into each bin. The density of each bin then corresponds to the frequency count divided by the sample size n and the binwidth h.

Given the samples X_1, X_2, · · · , X_n from some unknown continuous distribution, the histogram is mathematically defined as

p̂_h(x) = (1/(nh)) Σ_{i=1}^{n} Σ_j I(X_i ∈ B_j) I(x ∈ B_j),    (4.1)

where

B_j = [x_0 + (j − 1)h, x_0 + jh] , j ∈ ℤ,

with x_0 being a selected origin, and

I(X_i ∈ B_j) = 1 if X_i ∈ B_j, 0 otherwise.
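Equation 4.1 can be sketched directly: for a query point x, find the index j of the bin containing x and count how many samples share that bin. The data values below are made up for illustration:

```python
import numpy as np

def histogram_pdf(samples, x, h=0.5, x0=0.0):
    """Histogram density estimate p̂_h(x) of eq. (4.1): the count of samples
    falling into the bin B_j that contains x, divided by n*h."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    j = np.floor((x - x0) / h)                      # bin index containing x
    in_bin = np.floor((samples - x0) / h) == j      # I(X_i in B_j)
    return in_bin.sum() / (n * h)

data = [0.1, 0.2, 0.3, 1.4]
# Three of four samples fall into the bin [0, 0.5), so p̂ = 3 / (4 * 0.5).
print(histogram_pdf(data, 0.25, h=0.5))  # 1.5
```

Note that the returned value is piecewise constant in x, which is exactly the discontinuity problem discussed next.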

The shape of the histogram is essentially controlled by two parameters: the binwidth h and the origin x_0. There are some ways, such as average shifting, to free the histogram from the dependence on the choice of an origin. In total, denoting m_j as the center of the bin B_j, it can easily be seen that the histogram assigns each x in B_j the same estimate for p, namely p̂(m_j). This is not flexible enough: there are jumps at the bin boundaries, so the estimate is discontinuous. In this work, however, due to the need for first and sometimes also second order derivatives of the objective functions during the optimization process, the pdf has to be estimated continuously, which makes the simple histogram approach inapplicable and leads to the need for another, more flexible kind of estimation, described in the following section.

4.3. Kernel Density Estimator (Parzen-Rosenblatt Window)

It is clear that the histogram has some shortcomings. The estimation via histogram highly depends on the choice of an anchor point (small shifts of the origin might lead to totally different estimates than expected), is not smooth, thus not differentiable at particular points, and is very sensitive to the binwidth. Kernel density estimation (kde), a fundamental data smoothing process, is widely applied to overcome these problems to some extent. In Appendix A, the derivation of the kernel density estimator is briefly explained. It is based on the idea that the usual bin grids of the histogram approach are removed entirely, which eliminates the problem of choosing an origin, and the underlying probability density is estimated smoothly, so that the estimate is differentiable. However, the problem of choosing an appropriate bandwidth still remains, which is actually almost the most important part of this work. As described in Appendix A, a kernel density estimator is defined as

p̂_h(x) = (1/n) Σ_{i=1}^{n} K_h(x − X_i),

where

K_h(•) = (1/h) K(•/h).
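A minimal sketch of this estimator with a Gaussian kernel (the kernel choice is illustrative here; as discussed below, it matters far less than the bandwidth):

```python
import numpy as np

def kde(x, samples, h,
        kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """p̂_h(x) = (1/n) Σ K_h(x - X_i), with K_h(u) = K(u/h)/h."""
    samples = np.asarray(samples, dtype=float)
    u = (x - samples) / h
    return kernel(u).sum() / (samples.size * h)

samples = [-1.0, 0.0, 1.0]
print(kde(0.0, samples, h=1.0))  # average of three Gaussian bumps at x = 0
```

Unlike the histogram, this estimate is a smooth, differentiable function of x, which is exactly the property the optimization process requires.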

Visually, as shown in Figure 4.1, this procedure can be seen as a sum of bumps, i.e. small rescaled kernels, which together build the final structure of the estimate. Each of the rescaled kernels is centered at one of the observations. Furthermore, the graphical representation of the small kernels changes depending on the bandwidth; as a consequence, the final shape of the estimate p̂_h, the sum of the bumps, also changes.

Figure 4.1.: Graphical representation of kernel density estimation as a sum of bumps. (Image courtesy of Stefanie Scheid, http://compdiag.molgen.mpg.de/docs/talk_05_01_04_stefanie.pdf)


At this point, the most important remaining question is whether the probability density estimate depends more heavily on the choice of the kernel or on the bandwidth. For practical purposes, the choice of the kernel function is almost irrelevant for the efficiency of the estimate. The usual argument is to consider an estimate p̂_h using kernel K and bandwidth h, and to compare it with another estimate p̂_ĥ using kernel K_σ and bandwidth ĥ. It can easily be derived that

p̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h) = (1/(n ĥσ)) Σ_{i=1}^{n} K((x − X_i)/(ĥσ)) = p̂_ĥ(x),    (4.2)

if the relation ĥσ = h holds. This means that all rescaled versions K_σ of a kernel function K are equivalent if the bandwidth is adjusted accordingly. Therefore the final kernel density estimate depends far more on the choice of an appropriate bandwidth h than on the kernel itself.

The most widely used kernels, also shown in Appendix A, are for example the Epanechnikov and Gaussian kernels. Furthermore, the Laplacian and multivariate Student kernels are also widely applied. However, the same shape of an estimate obtained with one of these kernels can also be achieved with the other kernels using suitably adjusted bandwidths. Therefore the concentration should rather be on the choice of the bandwidth.
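The equivalence stated in equation 4.2 can be checked numerically. The sketch below (samples, σ = 2 and h = 0.8 are arbitrary illustration values) compares a Gaussian kernel at bandwidth h against its rescaled version K_σ(u) = K(u/σ)/σ at the compensating bandwidth ĥ = h/σ:

```python
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, h, kernel):
    samples = np.asarray(samples, dtype=float)
    return kernel((x - samples) / h).sum() / (samples.size * h)

sigma = 2.0
rescaled = lambda u: gaussian(u / sigma) / sigma   # K_sigma(u) = K(u/sigma)/sigma

samples = [0.3, 1.1, 2.7]
h = 0.8
# Identical estimates whenever ĥ·σ = h holds.
print(kde(1.0, samples, h, gaussian))
print(kde(1.0, samples, h / sigma, rescaled))
```

Both calls print the same density value, illustrating that only the effective bandwidth matters.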

4.3.1. Selection of Bandwidth

Choosing the bandwidth h is a crucial problem in non-parametric density estimation. Too

small values of h lead to undersmoothing whereas larger values lead to oversmoothing. The

motivation is to choose the bandwidth in such a way, that both the variance and the bias

are reduced, which are deﬁned as

Bias{p̂_h(x)} = E{p̂_h(x)} − p(x) = ∫ (1/h) K((x − u)/h) p(u) du − p(x)

and


Var{p̂_h(x)} = Var{ (1/n) Σ_{i=1}^{n} K_h(x − X_i) }
= (1/n²) Σ_{i=1}^{n} Var{K_h(x − X_i)}
= (1/n) Var{K_h(x − X)}
= (1/n) ( E{K_h²(x − X)} − [E{K_h(x − X)}]² )
= (1/n) ( (1/h²) ∫ K²((x − u)/h) p(u) du − [E{K_h(x − X)}]² ).

Using variable substitution, the symmetry of the kernel and a second-order Taylor expansion of p around x, it can be shown that

Bias{p̂_h(x)} = (h²/2) p''(x) ∫ s² K(s) ds + o(h²), as h → 0,    (4.3)

and

Var{p̂_h(x)} = (1/(nh)) { ∫ K²(s) ds } p(x) + o(1/(nh)), as nh → ∞.    (4.4)

Looking at formulas 4.3 and 4.4, it is easily seen that increasing the bandwidth h lowers the variance but raises the bias; decreasing the bandwidth does the opposite. The aim is surely to keep both the variance and the bias small. It can be concluded that a compromise has to be found, for example by minimizing the mean squared error (MSE), i.e. the sum of the variance and the squared bias. Figure 4.2 visualizes this trade-off.

MSE{p̂_h(x)} = E[ {p̂_h(x) − p(x)}² ] = Var{p̂_h(x)} + [Bias{p̂_h(x)}]²    (4.5)

However, inserting the formulas 4.3 and 4.4 into 4.5, it is observed that the MSE depends on both p and p''. In practice, those functions are unknown, hence this process is not applicable to derive the optimal bandwidth h_opt(x) directly. Therefore, suitable substitutes for the true density and its second order derivative need to be found. Furthermore, note that h_opt(x) depends on x, hence it is a local bandwidth.

In order to reduce the dimensionality and to rather obtain a global bandwidth estimate, it is reasonable to minimize an approximated formula for the mean integrated squared error, called AMISE, instead of the MSE. However, this still depends on p'', so the problem of having to deal with unknown quantities is still not completely solved by minimizing the AMISE.


Figure 4.2.: Squared bias part (thin solid), variance part (thin dashed) and MSE (thick solid) for a kernel density estimate. (Image courtesy of Wolfgang Härdle, Nonparametric and Semiparametric Models [21])

Besides that, Haifeng Chen gives an excellent review in the paper “Robust Computer Vision through Kernel Density Estimation” [12].

Note that there is no universally applicable way of choosing the best bandwidth. However, there are some suggestions which might be applied and would work in most cases for univariate problems. Multivariate density estimation (multivariate densities instead of one-dimensional ones) is a bit more complicated when it comes to the choice of a suitable bandwidth matrix (this problem is not considered within this work). The easiest and quickest method to choose the bandwidth is the plug-in method, also referred to as Silverman's rule of thumb. On the other hand, a more flexible but time-consuming method, called cross validation, is also widely applied to estimate an almost optimal bandwidth. This work considers both of these methods, which are briefly described in the following subsections.

Bandwidth choice via Silverman’s Rule of Thumb

Recall that the minimization of AMISE, an approximated formula for the mean integrated squared error, would theoretically yield an optimal and global bandwidth value h if the second derivative p'' of the true density were known. However, this is not the case in practice. Silverman proposed a plug-in method, which replaces the unknown quantity p'' of AMISE with an estimate. This estimation is done by assuming that the unknown density p belongs to the family of normal distributions. The plug-in method is defined as

ĥ = 1.06 min{ σ̂, R/1.34 } n^(−1/5),    (4.6)

where σ̂ stands for the standard deviation and R is the interquartile range (see [11]) of the observations.
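Equation 4.6 translates into a few lines. This sketch uses the sample standard deviation with ddof = 1 (an implementation choice not specified in the text):

```python
import numpy as np

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth ĥ = 1.06 · min(σ̂, R/1.34) · n^(-1/5),
    with σ̂ the sample standard deviation and R the interquartile range."""
    x = np.asarray(samples, dtype=float)
    sigma = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    return 1.06 * min(sigma, (q75 - q25) / 1.34) * x.size ** (-1 / 5)

rng = np.random.default_rng(1)
# The bandwidth shrinks like n^(-1/5) as more samples become available.
print(silverman_bandwidth(rng.normal(size=1000)))
```

For standard-normal data both σ̂ and R/1.34 estimate the same scale, so the min() hardly matters; for skewed or heavy-tailed data the interquartile term protects against an inflated σ̂.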

In practice, the distribution of the true density is not known. If it is normally distributed, then the plug-in method will yield an optimal bandwidth h. If the true density is not normally distributed but is uni-modal (thus having one peak and being fairly symmetric) and similar in shape, then this method will yield a bandwidth h that is not far from the optimum. The assumption that the true density resembles the normal distribution actually does not fit the intention of performing a non-parametric estimation; using this method rather amounts to a semi-parametric estimation. However, in the case of a multi-modal distribution, i.e. one with more than one peak, this rule-of-thumb method might yield considerably misleading bandwidth estimates. In this case, applying another method, called (least squares) cross validation, might be helpful, but at the cost of an additional optimization process, which might be expensive.

Leave-one-out estimation of Bandwidth (Cross Validation)

This approach is based on the minimization of a dissimilarity measure, called the integrated squared error (ISE), which is defined as

ISE{p̂_h} = ∫ {p̂_h(x) − p(x)}² dx
         = ∫ p̂_h²(x) dx − 2 ∫ {p̂_h p}(x) dx + ∫ p²(x) dx.    (4.7)

Since we are concerned with the minimization over h, the last term ∫ p²(x) dx can be removed from the formulation, as it does not contain h. The first term ∫ p̂_h²(x) dx can be calculated from the observations. The remaining issue is the second term, which depends on h and involves the unknown quantity p.

The term ∫ {p̂_h p}(x) dx is actually the expected value of p̂_h(X), where X is an independent random variable, with respect to which the expectation can be estimated. This estimation is also referred to as the leave-one-out process.

Looking at 4.7 again, replacing ∫ p̂_h²(x) dx with a term that employs sums rather than an integral, and estimating the second term ∫ {p̂_h p}(x) dx via the leave-one-out process gives the so-called cross-validation criterion:

CV(h) = (1/(n²h)) Σ_i Σ_j (K ∗ K)((X_j − X_i)/h) − (2/(n(n − 1))) Σ_i Σ_{j≠i} K_h(X_i − X_j),    (4.8)

where K ∗ K(u) is the convolution of K with itself, i.e. K ∗ K(u) = ∫ K(u − v)K(v) dv.
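For a standard Gaussian kernel the convolution K ∗ K is available in closed form (the density of N(0, 2)), so equation 4.8 can be evaluated on a bandwidth grid. The grid range and sample data below are arbitrary illustration choices:

```python
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def gaussian_selfconv(u):
    # K*K for a standard Gaussian kernel: an N(0, 2) density.
    return np.exp(-0.25 * u**2) / np.sqrt(4 * np.pi)

def cv_score(h, samples):
    """Least-squares cross-validation criterion CV(h) of eq. (4.8)."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    d = x[:, None] - x[None, :]                  # all pairwise differences
    term1 = gaussian_selfconv(d / h).sum() / (n**2 * h)
    off = ~np.eye(n, dtype=bool)                 # leave-one-out pairs j != i
    term2 = 2.0 * (gaussian(d[off] / h) / h).sum() / (n * (n - 1))
    return term1 - term2

rng = np.random.default_rng(2)
data = rng.normal(size=200)
grid = np.linspace(0.05, 1.5, 30)
h_best = grid[np.argmin([cv_score(h, data) for h in grid])]
print(h_best)
```

The O(n²) pairwise differences make the cost of this approach visible, matching the remark below that cross validation is expensive for large sample sizes.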

One of the nice features of cross validation is that the estimated bandwidth h automatically adapts to the smoothness of p. Furthermore, this process can also be applied to density estimation methods other than the kernel method. However, this process might be costly, as a large amount of training data has to be validated. There is also the need for an additional optimization process to find the optimal bandwidth h.

4.4. Summary and other References

Besides the suggestions described in this chapter, such as kernel density estimation, many other methods are available for probability density estimation. One example is finite Gaussian mixture models, which are applied in the article [16] using fast Gauss transforms. On the other hand, a maximum likelihood kernel density estimation is described in [25]. In the article [4], for example, a thorough and extensive experimental comparison is presented between two of the most popular methods, Parzen windows and finite Gaussian mixtures. However, there is no consensus in the literature about which one to use.


5. Continuous Estimation of Mutual Information

The underlying probability density p, needed for the computation of the entropy, has to be estimated for mutual information to serve as a similarity measure. This chapter gives an overview of a widely used way to perform such an approximation. In order to be applicable to an existing optimization process, based on the work [33], the gradient and the Hessian of mutual information will be given analytically. As will be seen later in this chapter, some bandwidth choice problems arise, which are still being researched and not completely solved yet. On the other hand, the Hessian of mutual information will be assessed, and the amount of massive numerical calculation needed to estimate the Hessian will be pointed out. Results with and without applying the Hessian, the quality of the alignment, and timing issues are shown later in the chapters on experimental results and evaluations. Furthermore, there are also possibilities to avoid such an estimation process in other ways, which will be briefly introduced at that point.

5.1. Parzen-Window Estimate of (Joint-)Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. A “must read” for an in-depth overview of both entropy and information theory is the book of Robert M. Gray [19]. Usually, and also in the context of this work, the name refers to the Shannon entropy [38], which is actually restricted to random variables taking discrete values, with p_mass being the probability mass function. An analogous term, or an extension of the Shannon entropy to the domain of real numbers, can be defined as the expectation of the negative logarithm of the probability density function p, which for a random variable Z is mathematically given as

H(Z) = −E_Z(log_b p(Z)) = −∫_{−∞}^{∞} p(z) log p(z) dz,    (5.1)

where b refers to the base and is usually taken to be 2, with bit being the unit of entropy, or the Euler number e (in which case H(Z) = −E_Z(ln p(Z))). It is important to note that in case p_i = 0 for some i, the corresponding value of 0 log_b 0 is taken to be 0, which is consistent with the limit

lim_{p→0⁺} p log p = 0.


Considering Viola's work [47], the statistical expectation can be approximated by the sample average over a sample B drawn from Z,

H(Z) = −E_Z(log_b p(Z)) ≈ −(1/N_B) Σ_{z_i∈B} log_b p(z_i).

Secondly, kernel density estimation as described in Section 4.3 is applied to approximate the underlying probability density p(z) by a superposition of Gaussian densities centered on the elements of another sample A drawn from Z,

p(z) ≈ (1/N_A) Σ_{z_j∈A} G_ψ(z − z_j),

where ψ is the covariance matrix and

G_ψ(z) ≡ (2π)^(−n/2) |ψ|^(−1/2) exp( −(1/2) zᵀ ψ^(−1) z ).

Note that the use of a Gaussian kernel is not mandatory. This was already stated in Section 4.3 and shown in formula 4.2. Any kernel could be used, such as the Cauchy density or the Epanechnikov kernel, and the quality of the estimate will not be affected as long as the bandwidth values contained in the covariance matrix are optimally estimated, which is the actual problem of the entire entropy estimation process being described. In the work of Rui Xu [34], the authors designed their own kernel function with certain properties and stated that it performs better in comparison to spline based kernels. However, there was no comparison against the Gaussian kernel, which we have decided to use.

Now, the estimate of the entropy can be derived approximately as

H(Z) ≈ −(1/N_B) Σ_{z_i∈B} log_b ( (1/N_A) Σ_{z_j∈A} G_ψ(z_i − z_j) ).    (5.2)
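A one-dimensional sketch of estimate 5.2 follows. Here ψ is a scalar variance and the two samples A and B are taken as the even- and odd-indexed elements of the data; both choices are illustration assumptions at this point in the text:

```python
import numpy as np

def parzen_entropy(z, psi=0.05, base=2):
    """Entropy estimate of eq. (5.2): a sample average over B of the log of a
    Parzen density built from Gaussian bumps G_psi centered on sample A."""
    z = np.asarray(z, dtype=float)
    a, b = z[0::2], z[1::2]            # disjoint sub-samples A and B
    d = b[:, None] - a[None, :]        # all differences z_i - z_j
    g = np.exp(-0.5 * d**2 / psi) / np.sqrt(2 * np.pi * psi)
    p_b = g.mean(axis=1)               # (1/N_A) Σ_j G_psi(z_i - z_j)
    return -np.mean(np.log(p_b) / np.log(base))

rng = np.random.default_rng(3)
# A widely spread sample scores higher entropy than a tightly clustered one.
print(parzen_entropy(rng.normal(scale=1.0, size=400)))
print(parzen_entropy(rng.normal(scale=0.1, size=400)))
```

The fixed ψ already hints at the central difficulty discussed in this chapter: the quality of the estimate stands or falls with the bandwidth.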

This kind of entropy estimation is called EMMA and is based on Viola's pioneering work [47]. Its derivative is given as

d/dT H(Z(T)) ≈ (1/N_B) Σ_{z_i∈B} Σ_{z_j∈A} W_Z(z_i, z_j) (z_i − z_j)ᵀ ψ^(−1) d/dT (z_i − z_j),    (5.3)

where

W_Z(z_i, z_j) ≡ G_ψ(z_i − z_j) / Σ_{z_k∈A} G_ψ(z_i − z_k),

and T, with respect to which the derivative is taken, denotes a set of parameters on which the density z might depend. Note that usually some transformation parameters are meant here. However, in our framework the transformations are separated from the similarity measures. On the other hand, Viola's approach is based on a stochastic optimization process, which is not applied within this work. In order to fit our needs, we will later derive the gradient and also the Hessian from the first order derivative given in formula 5.3 above with respect to the participating intensity values only, instead of with respect to some transformation parameters.

The term W_Z(z_i, z_j) serves as a weighting factor that takes on values between zero and one. If the participating samples z_i and z_j are significantly close to each other, the weighting factor approaches one, whereas if they are far away from each other, the factor tends towards zero. Distance is interpreted with respect to the squared Mahalanobis distance (see Duda and Hart, 1973 [15]).

Note that the entropy estimate in 5.2 and its derivative in 5.3 refer to both marginal and joint entropies. In the marginal case the term z is a one-dimensional value and the covariance ψ is a scalar, whereas in the case of joint probability estimation the term z is a two-dimensional value referring to the joint intensity pair, with ψ being a covariance matrix. And this is exactly where multivariate estimation comes in and makes things a bit complicated. Usually, in the one-dimensional case, kernel density estimation yields robust shape estimates in almost all cases. However, this is not the case in multivariate situations, where thorough research is currently being carried out. The covariance matrix plays a tremendous role: its eigenvalues and eigenvectors indicate the orientation of the data and its spread in different directions. This exceeds the scope of this work, and we simply apply diagonal covariance matrices in the bivariate case. Thus, the joint entropy is approximated via product kernels, which is also referred to as the multivariate product kernel estimator and can generally be given as

p̂(x) = (1/(n h_1 · · · h_d)) Σ_{i=1}^{n} Π_{j=1}^{d} K( (x_j − X_ij)/h_j ) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{d} K_{h_j}(x_j − X_ij),

where x = (x_1, · · · , x_d) is a random vector, the X_ij are independent observations sampled from a multivariate density p(x) of dimension d, and K_h(•) = (1/h) K(•/h).
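The product kernel estimator can be sketched for the bivariate case relevant to joint densities. A Gaussian K and the sample points below are illustration assumptions; using one bandwidth per dimension corresponds to the diagonal covariance matrices adopted above:

```python
import numpy as np

def product_kernel_pdf(x, samples, h):
    """p̂(x) = (1/n) Σ_i Π_j K_{h_j}(x_j - X_ij), with a Gaussian K applied
    independently per coordinate (equivalent to a diagonal covariance)."""
    samples = np.asarray(samples, dtype=float)   # shape (n, d)
    h = np.asarray(h, dtype=float)               # one bandwidth per dimension
    u = (x - samples) / h                        # broadcasts over dimensions
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi) # per-coordinate kernel values
    return (k.prod(axis=1) / h.prod()).mean()    # product over j, mean over i

pts = np.array([[0.0, 0.0], [1.0, 1.0]])
print(product_kernel_pdf(np.array([0.0, 0.0]), pts, h=[0.5, 0.5]))
```

The density peaks near the observations and decays with the per-axis Mahalanobis-like distance, exactly as the joint-entropy estimate requires.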


5.2. Mutual Information and its Derivatives

Based on the theories described above, we can now formulate the approximation of mutual information, its gradient and its Hessian. From now on we consider two random variables U and V, describing the images participating in an alignment. U refers to the fixed reference image, whereas V refers to the floating (or moving) images. The approximation of mutual information might be given as

MI(U, V) ≈ −(1/N_B) [ Σ_{x_i∈B} log_b ( (1/N_A) Σ_{x_j∈A} G_{ψ_u}(u_i − u_j) )
                    + Σ_{x_i∈B} log_b ( (1/N_A) Σ_{x_j∈A} G_{ψ_v}(v_i − v_j) )    (5.4)
                    − Σ_{x_i∈B} log_b ( (1/N_A) Σ_{x_j∈A} G_{ψ_uv}(w_i − w_j) ) ],

**where A and B are index sets containing the sampled coordinates x
**

i

and x

j

in form of

one-dimensional pixel locations i = 0 · · · |B|, j = 0 · · · |A|, and

u

i

≡ U(x

i

), v

i

≡ V (x

i

), and w

i

≡ [u

i

, v

i

]

T

. (5.5)

Note that $w_i$ stands for the overlapping pixel pair at the $i$th coordinate location from both images. The index sets A and B, containing the locations of the participating pixels with respect to which the estimation is evaluated, have to be suitably chosen. According to our experiments, the best results were obtained by choosing one of the index sets as the even and the other one as the odd numbers:
\[
A = \{0, 2, 4, \cdots\}, \qquad B = \{1, 3, 5, \cdots\}.
\]
This means that the Gaussian bumps are set on top of the pixels at even locations (the set $A$), and the resulting probability density, $p(x)$, is evaluated at the pixels at odd locations (the set $B$), in the one-dimensional case.
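This sampling scheme can be sketched as follows (a hedged illustration of equation 5.4 with one-dimensional Gaussian kernels and a diagonal product kernel for the joint term; `mi_estimate` and its interface are our own, not the thesis code):

```python
import numpy as np

def gauss(d, psi):
    # Gaussian kernel G_psi with variance psi, evaluated at difference d
    return np.exp(-0.5 * d * d / psi) / np.sqrt(2.0 * np.pi * psi)

def mi_estimate(U, V, psi_u, psi_v):
    """Sketch of the MI approximation (5.4): kernel bumps centred on the
    samples in A (even pixel indices), log-densities evaluated on the
    samples in B (odd pixel indices)."""
    A = np.arange(0, U.size, 2)          # even locations: kernel centres
    B = np.arange(1, U.size, 2)          # odd locations: evaluation points
    uA, uB, vA, vB = U[A], U[B], V[A], V[B]

    def entropy(zB, zA, psi):
        # H(Z) ~ -(1/N_B) sum_i log( (1/N_A) sum_j G_psi(z_i - z_j) )
        dens = gauss(zB[:, None] - zA[None, :], psi).mean(axis=1)
        return -np.mean(np.log(dens))

    def joint_entropy():
        # product kernel for the joint term (diagonal covariance)
        dens = (gauss(uB[:, None] - uA[None, :], psi_u)
                * gauss(vB[:, None] - vA[None, :], psi_v)).mean(axis=1)
        return -np.mean(np.log(dens))

    return entropy(uB, uA, psi_u) + entropy(vB, vA, psi_v) - joint_entropy()
```

As expected from the definition, identical images should score higher than unrelated ones under this estimate.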

Furthermore, note that in the implementation of formulas such as 5.4, underflow might occur in the evaluation of the exponential terms, which would lead to taking the logarithm of zero. This cannot be fully avoided, but there is a trick described in appendix C which might help to reduce the number of such outcomes.
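A generic version of this kind of trick is the usual log-sum-exp shift; the sketch below is our own illustration and not necessarily the exact variant of appendix C:

```python
import numpy as np

def log_mean_exp(exponents):
    """Compute log( (1/n) * sum_j exp(e_j) ) without underflowing to log(0).

    Shifting by the maximum exponent keeps at least one term equal to 1,
    so the sum inside the logarithm never vanishes.
    """
    m = np.max(exponents)
    return m + np.log(np.mean(np.exp(exponents - m)))
```

The naive evaluation `log(mean(exp(e)))` returns negative infinity for exponents around -2000, while the shifted form stays finite and exact.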


5.2.1. The Gradient of Mutual Information

Given the sample on which the approximation of MI is based, the gradient of MI has to be evaluated in terms of the changes of the image intensity values at the pixel locations of the sample. Let these locations be defined as $X_{locations} = \{x_0, x_1, \cdots, x_n\}$. Considering equation 5.3 and the two samples A and B drawn from $X_{locations}$, the derivative of MI is given as

\begin{align*}
\frac{d}{dT} MI(U, V) &= \frac{d}{dT} H(V) - \frac{d}{dT} H(U, V) \\
&= \frac{1}{N_B} \sum_{x_i \in B} \sum_{x_j \in A} (v_i - v_j)^T \left[ W_V(v_i, v_j)\,\psi_v^{-1} - W_{UV}(w_i, w_j)\,\psi_{vv}^{-1} \right] \frac{d}{dT}(v_i - v_j),
\end{align*}
where the parameters for $T$ correspond to $X_{locations}$, with respect to which $\frac{d}{dT}(v_i - v_j)$ has to be evaluated. Furthermore, given a covariance matrix
\[
\psi_{uv}^{-1} = \begin{bmatrix} \psi_{uu}^{-1} & 0 \\ 0 & \psi_{vv}^{-1} \end{bmatrix},
\]
the gradient of MI can be defined with respect to $x_a$, where $a \in \{0, 1, \cdots, n\}$, as

\begin{align*}
\frac{\partial}{\partial x_a} MI(U, V) = \frac{1}{N_B} \Bigg\{ & I(x_a \in B) \sum_{x_j \in A} (v_a - v_j)^T \left[ W_V(v_a, v_j)\,\psi_v^{-1} - W_{UV}(w_a, w_j)\,\psi_{vv}^{-1} \right] \\
- & I(x_a \in A) \sum_{x_i \in B} (v_i - v_a)^T \left[ W_V(v_i, v_a)\,\psi_v^{-1} - W_{UV}(w_i, w_a)\,\psi_{vv}^{-1} \right] \Bigg\},
\end{align*}
where
\[
W_V(v_i, v_j) \equiv \frac{G_{\psi_v}(v_i - v_j)}{\sum_{x_k \in A} G_{\psi_v}(v_i - v_k)} \quad \text{and} \quad W_{UV}(w_i, w_j) \equiv \frac{G_{\psi_{uv}}(w_i - w_j)}{\sum_{x_k \in A} G_{\psi_{uv}}(w_i - w_k)},
\]
and
\[
I(x \in A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases}
\]


5.2.2. The Hessian of Mutual Information

Now we are ready to give an analytical formula for the second order derivatives of MI, forming the Hessian matrix, based on the aforementioned gradient. Let us first define $WW(x_a, x_b)$ as
\[
WW(x_a, x_b) \equiv W_V(v_a, v_b)\,\psi_v^{-1} - W_{UV}(w_a, w_b)\,\psi_{vv}^{-1}.
\]

The elements on the diagonal of the Hessian matrix of MI can be written as

\begin{align*}
\frac{\partial^2}{\partial x_a^2} MI(U, V) = \frac{1}{N_B} \Bigg\{ & I(x_a \in B) \sum_{x_j \in A} \left[ (+1)\,WW(x_a, x_j) + (v_a - v_j)^T \frac{\partial}{\partial x_a} WW(x_a, x_j) \right] \\
- & I(x_a \in A) \sum_{x_i \in B} \left[ (-1)\,WW(x_i, x_a) + (v_i - v_a)^T \frac{\partial}{\partial x_a} WW(x_i, x_a) \right] \Bigg\},
\end{align*}

and the elements at all locations other than the diagonal are given as
\begin{align*}
\frac{\partial^2}{\partial x_a \partial x_b} MI(U, V) = \frac{1}{N_B} \Bigg\{ & I(x_a \in B)\, I(x_b \in A) \left[ (-1)\,WW(x_a, x_b) + (v_a - v_b)^T \frac{\partial}{\partial x_b} WW(x_a, x_b) \right] \\
- & I(x_a \in A)\, I(x_b \in B) \left[ (+1)\,WW(x_b, x_a) + (v_b - v_a)^T \frac{\partial}{\partial x_b} WW(x_b, x_a) \right] \Bigg\}.
\end{align*}

Note that $x_a \neq x_b$ and $WW(x_a, x_b)$ is not symmetric, thus
\[
\frac{\partial}{\partial x_b} WW(x_a, x_b) \neq \frac{\partial}{\partial x_b} WW(x_b, x_a).
\]

The most important terms within these formulas are the partial derivatives of $WW$, which contain a huge amount of exponential calculations (considering the Gaussians $G_\psi$ as kernel functions) and are analytically evaluated as


\begin{align*}
\frac{\partial}{\partial x_b} WW(x_a, x_b)
&= \frac{\partial}{\partial x_b} W_V(v_a, v_b)\,\psi_v^{-1} - \frac{\partial}{\partial x_b} W_{UV}(w_a, w_b)\,\psi_{vv}^{-1} \\
&= \frac{\frac{\partial}{\partial x_b} G_{\psi_v}(v_a - v_b) \sum_{x_k \in A} G_{\psi_v}(v_a - v_k) - G_{\psi_v}(v_a - v_b) \sum_{x_k \in A} \frac{\partial}{\partial x_b} G_{\psi_v}(v_a - v_k)}{\left[\sum_{x_k \in A} G_{\psi_v}(v_a - v_k)\right]^2}\,\psi_v^{-1} \\
&\quad - \frac{\frac{\partial}{\partial x_b} G_{\psi_{uv}}(w_a - w_b) \sum_{x_k \in A} G_{\psi_{uv}}(w_a - w_k) - G_{\psi_{uv}}(w_a - w_b) \sum_{x_k \in A} \frac{\partial}{\partial x_b} G_{\psi_{uv}}(w_a - w_k)}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_a - w_k)\right]^2}\,\psi_{vv}^{-1} \\
&= \frac{G_{\psi_v}(v_a - v_b)(v_a - v_b)^T \psi_v^{-1} \sum_{x_k \in A} G_{\psi_v}(v_a - v_k) - \left[G_{\psi_v}(v_a - v_b)\right]^2 (v_a - v_b)^T \psi_v^{-1}}{\left[\sum_{x_k \in A} G_{\psi_v}(v_a - v_k)\right]^2}\,\psi_v^{-1} \\
&\quad - \frac{G_{\psi_{uv}}(w_a - w_b)(w_a - w_b)^T \psi_{vv}^{-1} \sum_{x_k \in A} G_{\psi_{uv}}(w_a - w_k) - \left[G_{\psi_{uv}}(w_a - w_b)\right]^2 (w_a - w_b)^T \psi_{vv}^{-1}}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_a - w_k)\right]^2}\,\psi_{vv}^{-1} \\
&= \frac{G_{\psi_v}(v_a - v_b)(v_a - v_b)^T \left[\sum_{x_k \in A,\, x_k \neq x_b} G_{\psi_v}(v_a - v_k)\right]}{\left[\sum_{x_k \in A} G_{\psi_v}(v_a - v_k)\right]^2}\,\psi_v^{-2} \\
&\quad - \frac{G_{\psi_{uv}}(w_a - w_b)(w_a - w_b)^T \left[\sum_{x_k \in A,\, x_k \neq x_b} G_{\psi_{uv}}(w_a - w_k)\right]}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_a - w_k)\right]^2}\,\psi_{vv}^{-2}
\end{align*}

and


\begin{align*}
\frac{\partial}{\partial x_b} WW(x_b, x_a)
&= \frac{\partial}{\partial x_b} W_V(v_b, v_a)\,\psi_v^{-1} - \frac{\partial}{\partial x_b} W_{UV}(w_b, w_a)\,\psi_{vv}^{-1} \\
&= \frac{\frac{\partial}{\partial x_b} G_{\psi_v}(v_b - v_a) \sum_{x_k \in A} G_{\psi_v}(v_b - v_k) - G_{\psi_v}(v_b - v_a) \sum_{x_k \in A} \frac{\partial}{\partial x_b} G_{\psi_v}(v_b - v_k)}{\left[\sum_{x_k \in A} G_{\psi_v}(v_b - v_k)\right]^2}\,\psi_v^{-1} \\
&\quad - \frac{\frac{\partial}{\partial x_b} G_{\psi_{uv}}(w_b - w_a) \sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k) - G_{\psi_{uv}}(w_b - w_a) \sum_{x_k \in A} \frac{\partial}{\partial x_b} G_{\psi_{uv}}(w_b - w_k)}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)\right]^2}\,\psi_{vv}^{-1} \\
&= \frac{-G_{\psi_v}(v_b - v_a)(v_b - v_a)^T \psi_v^{-1} \sum_{x_k \in A} G_{\psi_v}(v_b - v_k) + G_{\psi_v}(v_b - v_a) \left[\sum_{x_k \in A} G_{\psi_v}(v_b - v_k)(v_b - v_k)^T \psi_v^{-1}\right]}{\left[\sum_{x_k \in A} G_{\psi_v}(v_b - v_k)\right]^2}\,\psi_v^{-1} \\
&\quad + \frac{G_{\psi_{uv}}(w_b - w_a)(w_b - w_a)^T \psi_{vv}^{-1} \sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k) - G_{\psi_{uv}}(w_b - w_a) \left[\sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)(w_b - w_k)^T \psi_{vv}^{-1}\right]}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)\right]^2}\,\psi_{vv}^{-1} \\
&= \frac{G_{\psi_v}(v_b - v_a) \left[\sum_{x_k \in A} G_{\psi_v}(v_b - v_k)(v_b - v_k)^T - (v_b - v_a)^T \sum_{x_k \in A} G_{\psi_v}(v_b - v_k)\right]}{\left[\sum_{x_k \in A} G_{\psi_v}(v_b - v_k)\right]^2}\,\psi_v^{-2} \\
&\quad - \frac{G_{\psi_{uv}}(w_b - w_a) \left[\sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)(w_b - w_k)^T - (w_b - w_a)^T \sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)\right]}{\left[\sum_{x_k \in A} G_{\psi_{uv}}(w_b - w_k)\right]^2}\,\psi_{vv}^{-2}.
\end{align*}


5.3. Normalized Variants

Besides the general form of mutual information, other variants have also been considered in this work. Normalized versions are given in section 3.3. Two normalized variants were implemented and compared against the other variants. The following sections give a detailed description of the applied methods. A more detailed description and comparison of different normalized variants can be found in the article of Xuan Vinh [46].

5.3.1. Normalized Mutual Information

One way to normalize the mutual information to the range of [0, 1] can be given as

\[
NMI(U; V) = \frac{H(U) + H(V)}{H(U, V)}.
\]

This corresponds to an estimation of the following form:
\[
NMI(U, V) \approx \frac{\left[ -\frac{1}{N_B} \sum_{x_i \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_u}(u_i - u_j) \right) \right] + \left[ -\frac{1}{N_B} \sum_{x_i \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_v}(v_i - v_j) \right) \right]}{-\frac{1}{N_B} \sum_{x_i \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_{uv}}(w_i - w_j) \right)},
\]

where the elements of the formula are defined in the same way as in the previous section.
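Given the three entropy estimates, the normalization itself is a simple ratio (a minimal sketch with our own function name; the behaviour at the two extremes follows directly from the definition):

```python
def nmi_from_entropies(h_u, h_v, h_uv):
    # NMI(U;V) = (H(U) + H(V)) / H(U,V); this equals 1 for independent
    # variables (H(U,V) = H(U) + H(V)) and 2 for identical ones
    # (H(U,V) = H(U) = H(V)).
    return (h_u + h_v) / h_uv
```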

The Gradient of Normalized Mutual Information

The first order derivative of this normalization is formally given as
\begin{align*}
\frac{d}{dT} NMI(U; V) &= \frac{\frac{d}{dT} H(V)\, H(U, V) - (H(U) + H(V))\, \frac{d}{dT} H(U, V)}{H(U, V)^2} \\
&= \frac{1}{H(U, V)} \Bigg[ \frac{1}{N_B} \sum_{x_i \in B} \sum_{x_j \in A} (v_i - v_j)^T W_V(v_i, v_j)\,\psi_v^{-1} \frac{d}{dT}(v_i - v_j) \\
&\quad - NMI(U; V)\, \frac{1}{N_B} \sum_{x_i \in B} \sum_{x_j \in A} (v_i - v_j)^T W_{UV}(w_i, w_j)\,\psi_{vv}^{-1} \frac{d}{dT}(v_i - v_j) \Bigg]
\end{align*}

This yields the following elements of the gradient:


\begin{align*}
\frac{\partial}{\partial x_a} NMI(U, V) = \frac{1}{H(U, V)\, N_B} \cdot \Bigg\{ & I(x_a \in B) \Bigg[ \sum_{x_j \in A} (v_a - v_j)^T W_V(v_a, v_j)\,\psi_v^{-1} \\
&\quad - NMI(U; V) \sum_{x_j \in A} (v_a - v_j)^T W_{UV}(w_a, w_j)\,\psi_{vv}^{-1} \Bigg] \\
- & I(x_a \in A) \Bigg[ \sum_{x_i \in B} (v_i - v_a)^T W_V(v_i, v_a)\,\psi_v^{-1} \\
&\quad - NMI(U; V) \sum_{x_i \in B} (v_i - v_a)^T W_{UV}(w_i, w_a)\,\psi_{vv}^{-1} \Bigg] \Bigg\}
\end{align*}

5.3.2. Relative Mutual Information (Symmetric Uncertainty)

The behaviour of relative mutual information [42], also referred to as the symmetric uncertainty, is also investigated in this work. Its mathematical definition is given in section 3.3 and its first order derivative can be formalized as

\[
\frac{d}{dT} RMI(U; V) = 2\, \frac{\frac{d}{dT} MI(U; V)\,(H(U) + H(V)) - MI(U; V)\, \frac{d}{dT} H(V)}{(H(U) + H(V))^2}. \tag{5.6}
\]

The exact analytical gradient is obtained by substituting the derivatives of the participating entropies in the same manner as in the previous section 5.3.1.


5.4. An Alternative Method

In the previously given approximation methods, the joint density estimation is performed using product kernels. That is, the joint entropy is not evaluated at every potential pixel pair. An alternative way is also investigated in this work. Our idea is to perform the estimation of both marginal and joint entropies for each pixel as well as for each pixel pair up to a certain resolution n, instead of performing the estimation at pixel points derived from certain sub-samples, such as the index sets A and B described in equation 5.4 in section 5.2.

This means that the set B, in this alternative method, refers to constant pixel values up to some resolution n, ranging from 0.0 to 1.0 (recall that the pixel values we are working on are always normalized into this range), whereas the set A this time refers to the pixel values of the entire (down-sampled) image. Considering this situation, we formulate the sum of marginal entropies, SME(U, V), for images U and V, mathematically as

\[
SME(U, V) \approx -\frac{1}{N_B} \left[ \sum_{x_i \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_u}(c_i - u_j) \right) + \sum_{x_i \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_v}(c_i - v_j) \right) \right]. \tag{5.7}
\]

This time both index sets are given as
\[
A = \{1, 2, 3, \cdots, m\}, \qquad B = \{1, 2, 3, \cdots, n\},
\]
where $u_i$ and $v_i$ are defined as in equations 5.5, and $c_i \equiv C(x_i)$ refers to the constant pixel values
\[
C(x_i) \in \left\{ \frac{0}{n}, \frac{1}{n}, \cdots, \frac{n}{n} \right\},
\]
where $n > 1$ gives the resolution of the constant intensities. Furthermore, $m$ is the number of available samples from the participating (down-sampled) images.

The remaining question is the estimation of the joint entropy $J_E$. In the previous section it was evaluated at the pixel pairs as they occur in the single images. Now, we want to estimate the joint entropy at each pixel pair $(x, y)$, where $x$ and $y$ both range from 0.0 to 1.0 at a certain resolution n. Formally, we define an estimation for the joint entropy, $J_E$, as


\[
J_E(U, V) \approx -\frac{1}{N_B N_B} \sum_{x_i \in B} \sum_{x_k \in B} \log_b\!\left( \frac{1}{N_A} \sum_{x_j \in A} G_{\psi_u}(c_i - u_j)\, G_{\psi_v}(c_k - v_j) \right),
\]
and the MI is finally defined as
\[
MI(U, V) = SME(U, V) - J_E(U, V). \tag{5.8}
\]

Note that this kind of approximation has one advantage and one drawback. The advantage lies in the exponentials of the Gaussians, which take the form
\[
G_{\psi}(c - z) = K_1 \exp\!\left( K_2\, (c - z)^T \psi^{-1} (c - z) \right),
\]

where $K_1$, $K_2$ and especially $c$ are constants. This makes the evaluation of the gradient and the Hessian of MI much easier. The drawback lies in the fact that the joint entropy is now estimated at each pixel pair (up to a certain resolution to be defined properly), which is computationally more expensive. Furthermore, an appropriate resolution n has to be chosen.
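The alternative estimator can be sketched as follows (a hedged illustration of the shape of equations 5.7 and 5.8, with Gaussian kernels evaluated on a constant intensity grid; `mi_constant_grid` and its interface are our own assumptions, not the thesis code):

```python
import numpy as np

def gauss(d, psi):
    # Gaussian kernel with variance psi at difference d
    return np.exp(-0.5 * d * d / psi) / np.sqrt(2.0 * np.pi * psi)

def mi_constant_grid(u, v, n, psi):
    """Sketch of the alternative estimator: marginal and joint densities
    are evaluated on a fixed grid of n+1 constant intensities c in [0, 1]
    (MI = SME - JE, equation 5.8). u, v are normalized pixel samples."""
    c = np.linspace(0.0, 1.0, n + 1)
    pu = gauss(c[:, None] - u[None, :], psi).mean(axis=1)  # density of U at grid
    pv = gauss(c[:, None] - v[None, :], psi).mean(axis=1)  # density of V at grid
    sme = -np.mean(np.log(pu)) - np.mean(np.log(pv))
    # joint density at every grid pair (c_i, c_k) via the product kernel
    pj = (gauss(c[:, None, None] - u[None, None, :], psi)
          * gauss(c[None, :, None] - v[None, None, :], psi)).mean(axis=2)
    je = -np.mean(np.log(pj))
    return sme - je
```

The grid-pair evaluation of the joint term makes the drawback visible in the code: `pj` is an $(n{+}1) \times (n{+}1)$ array whose every entry sums over all $m$ samples.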

5.4.1. The Gradient

In order to evaluate the gradient of the method described above, consider the Gaussians for both the reference and the floating image arranged as $n \times m$ matrices:
\[
G_{ref} = \begin{bmatrix} G_{\psi_u}(c_1, u) \\ G_{\psi_u}(c_2, u) \\ \vdots \\ G_{\psi_u}(c_n, u) \end{bmatrix}, \qquad G_{flo} = \begin{bmatrix} G_{\psi_v}(c_1, v) \\ G_{\psi_v}(c_2, v) \\ \vdots \\ G_{\psi_v}(c_n, v) \end{bmatrix},
\]

where $u = (u_1, \cdots, u_m)$ and $v = (v_1, \cdots, v_m)$. Looking at formula 5.8, it can be seen that in the first derivative of MI one is confronted with derivatives of Gaussians of the form
\[
\frac{d}{dv}\, G_{\psi_u}(c_i, u)^T\, G_{\psi_v}(c_i, v) = G_{\psi_u}(c_i, u)^T\, \frac{d}{dv}\, G_{\psi_v}(c_i, v).
\]
The first term above stands for a row vector, whereas the second term corresponds to the Jacobian, thus yielding an $n \times n$ matrix.


5.4.2. The Hessian

As one of the variables in the exponential terms is now eliminated by using constant intensity values instead of values based on input image samples, an easier evaluation of the Hessian can be applied. In general, the Hessian of the logarithm of a sum of exponentials is analytically defined as

\[
\nabla^2 f(x) = \frac{1}{(\mathbf{1}^T z)^2} \left( (\mathbf{1}^T z)\, \mathrm{diag}(z) - z z^T \right),
\]
where
\[
f(x) = \log \sum_{k=1}^{n} \exp(x_k) \quad \text{and} \quad z = (e^{x_1}, \cdots, e^{x_n}).
\]
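This closed form can be checked against finite differences in a few lines (a sketch; `lse_hessian` is our own name, and the shift by the maximum is an added numerical-stability step that leaves the result unchanged, since the expression is invariant to rescaling $z$):

```python
import numpy as np

def lse_hessian(x):
    """Hessian of f(x) = log(sum_k exp(x_k)) via the closed form
    (1/(1^T z)^2) * ((1^T z) * diag(z) - z z^T) with z = exp(x)."""
    z = np.exp(x - np.max(x))          # shift for numerical stability
    s = z.sum()                        # 1^T z
    return (s * np.diag(z) - np.outer(z, z)) / (s * s)
```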

These general deﬁnitions can easily be applied to the log-sum of exponentials for the MI

estimation in 5.8. Recall that this amounts to extensive exponential calculations, especially

for the Hessian of joint entropy.

5.5. Bandwidth Estimation

For all described methodologies for estimating the mutual information, one is confronted with the problem of bandwidth choice. This is the most difficult as well as the most important problem in kernel density estimation. Especially in the approximation of the joint entropy, thus in the case of such bivariate density estimation, an appropriate covariance matrix has to be chosen. This matrix contains the bandwidth values and exhibits a certain structure. Such multivariate density estimation is still under extensive research. Different structures of the covariance matrix directly influence the outcomes to a great extent.

In the article of Stephan R. Sain [36], the theoretical behaviour of several algorithms useful for the estimation of smoothing parameters is shown. In that paper, multivariate versions of the bootstrap method of Taylor [43], least squares cross-validation, and a biased cross-validation method are derived and compared for multivariate kernel approximation using the product kernel estimator.

In this work, the bivariate density estimation is also performed via product kernel estimation, thus diagonal covariance matrices are applied. In the one-dimensional cases (these are the marginal entropies), both Silverman's plug-in method and the least squares leave-one-out cross-validation estimator, described in 4.3.1, are applied to approximate the optimal bandwidth. Note that Silverman's method assumes the underlying structure to be Gaussian. This might, for example, not be the case in multi-modal images. Cross-validation delivers a more robust estimate in such cases, but the drawback is that it is computationally expensive and requires an additional optimization process.
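For the one-dimensional case, Silverman's plug-in rule reduces to a single formula. The sketch below uses the common textbook constant $1.06\,\sigma\,n^{-1/5}$; the exact variant applied in this work may differ:

```python
import numpy as np

def silverman_bandwidth(samples):
    # Silverman's rule of thumb for a Gaussian kernel:
    # h = 1.06 * sigma * n^(-1/5). It assumes roughly Gaussian data,
    # which is why cross-validation is preferred for multi-modal images.
    n = samples.size
    return 1.06 * np.std(samples) * n ** (-1.0 / 5.0)
```

The bandwidth shrinks slowly with the sample size, so large samples only moderately sharpen the kernels.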

The derived bandwidth is applied globally. However, this might be problematic in some cases. For example, global bandwidth estimation might yield very good estimates at some points but inappropriate estimates at other locations; this is usually the case if the underlying structure has several peaks. There also exist local bandwidth estimation methods, in which the bandwidth is chosen differently and independently for each intensity value at which the density has to be estimated. However, such adaptive processes are not considered in this work due to their complexity.


Part III.

Implementation and Experiments


6. GPU and CPU Implementation

Our reference implementation consists of several internal libraries (VecAD, SplineReconst, OpenCVTools and EigenTools) performing both the data preparation and the reconstruction itself. All programs have been written in ANSI C++, using libraries such as Boost [1], OpenCV [7] and the linear algebra toolkit Eigen [20].

The tests have been conducted mostly on Linux systems (Ubuntu distribution) as well as

on a Mac OS X machine. The CPU implementation of our algorithms has been mostly run

on a system with an Intel Core i7-820QM 1.73 GHz quad core CPU, using only one of the

CPU cores.

The evaluation of the similarity metrics, especially the gradients and the Hessian of different variations of mutual information, has also been ported to the GPU using NVIDIA's OpenCL [3] implementation for the CUDA architecture [2] (a hardware and software architecture for computing on the GPU). For that, a single NVIDIA graphics card of model GeForce GTX 480 has been used, containing 480 CUDA cores and around 1.5 GB of GPU memory.

6.1. The Implementation of Mutual Information and its Derivatives

The algorithms for estimating the probability densities used to approximate both the marginal and joint entropies, as well as the mutual information itself, have been implemented both on CPU and GPU (for the GPU version and the achieved speed-ups, refer to section 6.2), including the derivatives and the Hessian of mutual information. Our evaluations are based on kernel density estimation as described in different forms in chapter 5.

Not only the mutual information in its general form as in equation 5.4, but also different normalizations of MI were implemented, as given in subsection 5.3.1. Furthermore, the first and second order derivatives are calculated analytically. An evaluation of the first order derivative is given in Viola's work [47], whereas the second order derivative is not present there. However, our evaluation is different, since the derivatives in our work are based on changes in the input intensity values instead of the transformation parameters. Furthermore, we are not aware of any prior work evaluating and applying the gradient as well as (especially) the Hessian of mutual information analytically in the way described in sections 5.2.1 and 5.2.2. The correctness of our analytical calculations for the gradient as well as for the Hessian of MI has been verified by comparison against our reference implementation for finite differences.
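Such a finite-difference check can be sketched generically as follows (our own helper, not the thesis code; `check_gradient` compares an analytical gradient against central differences):

```python
import numpy as np

def check_gradient(f, grad, x, eps=1e-6, tol=1e-4):
    """Return True if the analytical gradient matches central finite
    differences of f at x to within tol."""
    g = grad(x)
    num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        num[i] = (f(x + e) - f(x - e)) / (2.0 * eps)  # central difference
    return np.max(np.abs(g - num)) < tol
```

The same pattern applies to the Hessian by differencing the gradient once more.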


Additionally, we have suggested and implemented an alternative estimation approach, as described in 5.4. The difference here is that the estimation is performed at constant pixel values up to a certain resolution, and the joint entropy is evaluated at each possible pixel pair within the given resolution. This, however, adds the additional complexity of choosing the resolution. Furthermore, the evaluation of the joint densities at each pixel pair is computationally quite expensive. Although we considered this method in our first tests and had some success on face reconstruction, we did not continue with it in further tests due to its computational cost.

6.2. GPU Version

Computation of mutual information is a highly demanding process due to the enormous

number of exponential evaluations. It needs to compute the entropy and the joint entropy

of two variables, which involves estimating the marginal and joint probabilities of each

sample pixel. It is therefore the bottleneck in many applications regarding data fusion,

where the speed up of computing mutual information becomes a critical issue.

However, we demonstrate that these computations are parallelizable and can be eﬃciently

ported onto the GPU architecture. We developed GPU algorithms to compute the mutual

information as well as both its ﬁrst and second order derivatives. We focused on the speed up

of approaches described in 6.1 for computing mutual information and its derivatives, where

the number of exponential computations is $n^2$ ($n$ is the size of the sample used to estimate the entropy of a variable).

Although Shams [37] has presented a CUDA implementation for computing mutual in-

formation, his work is based on histograms to compute both marginal and joint probability

densities. However, this is hard to extend to a GPU version of computing the derivatives of

mutual information, which is essential for our work. Besides that, we are not aware of any prior work porting the Hessian of mutual information to a GPU architecture.

Compared with the same CPU implementation, we reached a speed-up by a factor of 10 when computing the mutual information, its gradient and the Hessian all together for a sample size of around 4000. Increasing the sample size further exhausted the GPU memory, of which we had 1.5 GB available. This is because we perform the computations on each RGB channel as well as on each frame of the sliding window at once. Separate computations for different channels and frames decrease the performance due to the permanent data transfer between host and device memory.

Since this work is not primarily intended as a GPU thesis in its current form, our GPU implementation still needs improvement. The code can be better optimized at a later stage, and other GPU hardware (for example Tesla instead of GeForce), providing more GPU memory, could be used in multiple form (e.g. one GPU per RGB channel). An even more interesting option would be the application of GPU clusters. We intend to make the current implementation publicly available once it is documented.


7. Evaluation

7.1. Evaluation

In order to test the stability of the reconstruction, several tests have been performed on a set of artificially rendered image sequences as well as on real images. For rendered images with synthetic projections, ground truth depth maps of the surface in the scene are always available, whereas in the case of real scenes this information does not exist. However, sequences of real scenes have been used to demonstrate that the approach also works in the real world.

Our algorithms were explored from the viewpoint of image alignment. The following three measures were calculated as criteria: the Joint Entropy (JE) between two image samples, Mutual Information (MI) and Normalized Mutual Information (NMI). Since this work is treated as a minimization problem and MI as well as NMI is maximized by the best match, the signs were flipped for both similarity measures. For the optimization process, an SQP (Sequential Quadratic Programming) based implementation was applied, as described in [33]. For the first two measures, JE and MI, the gradient as well as the Hessian were evaluated, whereas for the latter, NMI, we only applied the gradient and set the Hessian to the identity matrix. Whenever the Hessian of the aforementioned methods was disabled or not applicable at all, the optimizer performed as a non-linear gradient descent approach.

The evaluation of the three similarity measures, JE, MI and NMI, was also compared against the performance of the Pseudo-Huber cost function (PHC) previously applied and described in the article [33]. As stated there, the PHC measure delivered very good results for synthetic data sets where the light source stays constant. However, it did not perform as well on tubular structures in real-world scenes, under changes in brightness, or on synthetic data sets with a moving light source. In the next chapter, the outcomes of applying mutual information (in three different forms) and the previously applied cost function are given and the quality of the reconstruction results is evaluated.

7.2. Test Cases

The tests were separated into three cases ranging from the simplest to the most complicated. The first and simplest tests were performed on synthetic data sets under static lighting. The objects considered in such scenes were shapes like a cone and a sphere. Later on, changes in brightness were added to these sequences.


Secondly, we ran tests on more complicated situations with artificially rendered image sequences, where the light source is moving. The latter case also included a simulation of endoscopic images in the form of tubular, donut-shaped structures with changing illumination.

Finally, we used images from real-world scenes. Different situations were demonstrated, such as face modelling as well as the recovery of real tubular structures with little texture (such as medical endoscopy sequences).

7.3. Reconstruction Accuracy

The knowledge of ground truth information for synthetic data sets allows a mathematical evaluation of the reconstruction quality. The comparison between the reconstructed surface and the ground truth depth map was performed by computing the normalized cross-correlation (NCC) between them. The NCC value ranges from $-1.0$ to $+1.0$, where 1 is considered to be the theoretical optimum.
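The NCC between two depth maps can be sketched as follows (our own minimal version; it standardizes both maps and averages their product):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two depth maps; 1.0 means a
    perfect match (up to affine brightness changes), -1.0 a perfect
    inversion."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))
```

Because both maps are standardized, the score is invariant to global offset and scale of the depth values.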

A visual impression of the results as well as the evaluation of the reconstruction quality is given in the next chapter in more detail. To summarize, for synthetic datasets we achieved excellent NCC ratios of 0.9999 using around 3000 samples; the accuracy decreased to 0.998 when using only 400 samples, which is still very good considering that the reconstruction was then performed in under one second. For more complicated scenes, where the light source moves during the sequence, we observed a decreased accuracy with NCC ratios between 0.916 and 0.982. Refer to chapter 8 for more details.

7.4. Running Times

The use of the Hessian was computationally demanding. The question of whether to activate the Hessian for the non-linear optimization process or to run the optimizer as a gradient descent approach is difficult for us to answer to some extent. The optimizer converges in fewer steps if the Hessian is activated, but the computations take more time in comparison to the gradient-only version (for example, by a factor of 10 for each evaluation of MI and its derivatives; this factor increases with the sample size). In our experiments, there was no urgent need to use the Hessian. Besides that, using the gradient only was already fast enough for our test cases, where one frame was generally processed in one to five seconds using 400 to 2000 samples. However, running times depend on the resolution of the surface model as well as on the sample size. Increasing the sample size beyond 2000 resulted in reconstruction times of more than 5 seconds, but usually under thirty seconds.


8. Results

8.1. Synthetic Datasets under Static Lighting

The following results demonstrate the reconstruction quality for artificial images generated by a renderer. The light source does not move. This case simplifies the process and is suitable for the first tests.

Cone

Figure 8.1.: The reconstruction of a cone under a constant light source: (a) one of the participating images, (b) the depth map, (c) 3D reconstruction, (d) original depth map, (e) reconstructed depth map, (f) error map (red = high error, green = low error).


Figure 8.1 depicts one of the participating 2D images of a cone together with its reconstructed 3D model, the original and reconstructed depth maps, and the error map of the recovery process. The cone is viewed from above, with its tip pointing towards the camera.

Depending on the sample size used for the reconstruction, all applied similarity measures (JE, MI, NMI and PHC) yielded excellent NCC values such as 0.9999 (where 1.0 is the best) with ca. 3000 samples, using 49 control points (7 in each direction) for the quadratic spline surface. Using 400 samples was also sufficient in this test case, with a decreased, but still very good, NCC value of 0.998.

Additionally, in order to test the robustness of the MI cost functions against different situations, we applied changes in brightness to the images in the sequence. The brightness was decreased frame by frame. This is illustrated in Figure 8.2, where the brightness difference between the first and last image is obvious. In this situation, the MI versions were still able to perform the reconstruction with very good accuracy (NCC values of around 0.9995), whereas PHC failed, as expected.

Figure 8.2.: The reconstruction of a cone with changing brightness in the participating images: (a) two of the participating images with different brightness, (b) the depth map, (c) 3D reconstruction.


Sphere

Other tests on artificially rendered images under constant lighting conditions were performed on a rotating sphere. The reconstruction accuracy was very similar to that of the previously described test cases (NCC ratios ranging from 0.998 to 0.9999, depending on the sample size). Figure 8.3 visualizes the outcomes achieved by the MI versions.

Figure 8.3.: The reconstruction of a rotating sphere under a constant light source: (a) original depth map, (b) one of the participating images, (c) reconstructed depth map, (d) error map (red = high error, green = low error), (e) and (f) 3D reconstructions.


8.2. Synthetic Datasets under Varying Lighting Conditions

More complicated scenes have also been demonstrated with rendered images in which the light source is moving. This is, e.g., the case in medical video endoscopy, where surfaces exhibit little texture and the light source moves along with the camera, which results in changes of illumination from frame to frame.

Tube (Simulated Endoscopy)

A donut-shaped tube was created artificially in order to test our methods on tubular structures, where the camera moves inside the tube and the light source moves with it. Figure 8.4 shows the results.

Figure 8.4.: The reconstruction of a donut-shaped tube under changing illumination: (a) one of the participating images, (b) the depth map, (c) 3D reconstruction, (d) original depth map, (e) reconstructed depth map, (f) error map (red = high error, green = low error).

For the tube, we achieved results with NCC ratios ranging from 0.891 with JE to 0.916 with MI and 0.9304 with NMI, whereas PHC again failed, as expected. This is not as good as in the previously described scenes with simple shapes and a constant light source. However, this is to be expected due to the discontinuity of the tubular shape in the image,


which cannot easily be modelled using a spline surface. As can be seen in the depth map, it is indeed blurry in the area around the discontinuity, and the error map shows that the biggest discrepancy lies there. Figure 8.5 shows the camera's point of view.

Figure 8.5.: Reconstructed donut-shaped tube visualized from inside.

Cone

For the image sequence of the cone structure with the light source moving around, we achieved an NCC value of 0.982 using 1815 samples with 3×5 spline control points. Here we observed that the largest error occurred around the tip of the cone. As can be seen in figure 8.6(c), the shape of the cone is recognized quite well, but compared to the previous tests (refer to figure 8.2), the tip of the cone was not high enough. We attribute this to the complexity of the shape of the (joint) distributions, which changes from frame to frame with a moving light source. This is difficult to estimate using globally applied smoothing parameters. We plan to make further improvements on the estimation of such parameters by applying locally adaptive methods at a later stage.

Figure 8.6.: Cone under changing illumination. (a) Template image; (b) depth map; (c) 3D reconstruction.


8.3. Real Images

Finally, our work is demonstrated on real scenes with data sequences captured directly by the camera. In the previous tests, synthetic image sequences were used to evaluate the reconstruction quality mathematically. In a real-world scenario, however, it is difficult to determine the ground truth. Still, it is important to show that our approach also works in reality.

Face Modelling under Changing Brightness

In figure 2.2, shown at the beginning, we already gave a visual impression of face recovery from image sequences (with constant brightness) captured by a monocular camera. Those results were achieved with mutual information (JE, MI and NMI) as well as with PHC. However, when the brightness changes between frames, PHC fails as expected. Mutual information still performed very well in these cases, as illustrated in figure 8.7.

Figure 8.7.: Face modelling under different brightness. (a) Template image; (b) one of the captured images, where the change in brightness compared to the template image is obvious; (c) depth map; (d) 3D reconstruction.


Rolled Newspaper

Figure 8.8 shows the outcomes of a tubular reconstruction, where the input image sequence was produced by a video-endoscope moving down a rolled-up newspaper. Figure 8.9 visualizes the 3D structure from inside, obtained from different runs of our reconstruction framework. The results were achieved using the NMI version with a sample size of 4225. Using this number of samples required around 5 seconds for one evaluation of the (normalized) mutual information and its gradient on the CPU, whereas the time decreased to 1.2 seconds in the GPU version. The optimization process converged in a couple of steps, and processing of one frame was achieved within one minute.

Figure 8.8.: Newspaper reconstruction. (a) Template image; (b) depth map; (c) 3D reconstruction.

Figure 8.9.: Reconstructed newspaper visualized from inside (camera's point of view).


Medical Endoscopy Images

Video-endoscopic data from an airway is shown in figure 8.10. The reconstruction quality is limited due to some outliers at the border of the reconstructed ROI. In addition, the data set used was extremely noisy and the camera movement generated almost no baseline. Although this makes the reconstruction quite difficult, the shape of a tubular structure is clearly recognizable. Results were achieved using the MI and NMI versions, whereas JE and PHC failed.

Figure 8.10.: Video-endoscopy. (a) Template image; (b) depth map; (c) 3D reconstruction.

8.4. Summary

Different cost functions have been compared against each other in different situations. PHC delivered excellent NCC ratios (also achieved with JE, MI and NMI) in simpler scenes, where the light stays in a constant location, whereas it failed as expected in situations with varying illumination. Our proposed methods, JE, MI and NMI, demonstrated much better behaviour than PHC. In certain cases, however, such as the tubular structures in the real-world scenes, JE did not perform well, whereas MI and NMI still delivered good results, of which the NMI version performed better and showed more consistent as well as more accurate behaviour.


Part IV.

Conclusion


9. Discussion and Future Work

The application of (normalized) mutual information to the dense recovery of weakly textured surfaces in monocular vision is a promising approach. It serves as the cost function in the mathematical minimization problem originally presented in [33] (as a novel sliding-window bundle adjustment algorithm that works directly with image intensities) and described in this work. Additionally, it is robust to changes in brightness and to variations of illumination.

Using normalized mutual information, we were able to recover weakly textured objects from synthetic datasets with a simulated light source moving from frame to frame. In addition, reconstruction was also achieved for surfaces exhibiting little texture in real scenes with changing brightness. Thus, we were able to demonstrate that our proposed methods also work in the real world.

The process involves density estimation for the marginal and joint probabilities of intensities in the participating images, where no assumption can be made about the shape of the distributions. Kernel density estimation (a non-parametric estimation method) is a better way to approximate the densities than classical parametric methods in cases where the structure of the underlying distribution is not known, or when it is hard to specify the right model. Kernel density estimation is also preferred over histogram techniques in certain applications, since it yields a smooth and continuous estimate of the density, which allows the derivatives of the objective function in question to be calculated analytically.

Non-parametric density estimation is an important tool in the statistical analysis of data. The approach relies on a common approximation process for the density distribution, which makes it depend on additional parameter-adjustment methodologies. Besides the promising theoretical results, there is no doubt that the problem of finding appropriate smoothing parameters in kernel density estimation grows in complexity as the dimensionality of the data increases. In practice, several methods can and should be applied, from which the best estimate can be chosen. Among the methods we would suggest are plug-in methods and leave-one-out cross-validation, as described in this work. However, there is no unique technique for choosing these auxiliary smoothing parameters, which was one of the most important problems in this work.
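The leave-one-out idea mentioned above can be sketched in a minimal one-dimensional form. This is an illustration only, not our actual implementation (which estimates bivariate densities on the GPU); the Gaussian kernel and the candidate-grid search are assumptions made for this example:

```python
import math

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a 1D Gaussian KDE for bandwidth h."""
    n = len(samples)
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, x in enumerate(samples):
        # Density at x estimated from all samples except x itself.
        p = sum(gauss((x - xj) / h) for j, xj in enumerate(samples) if j != i)
        total += math.log(p / ((n - 1) * h))
    return total

def best_bandwidth(samples, candidates):
    """Pick the candidate bandwidth maximizing the LOO log-likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(samples, h))
```

A bandwidth that is too small drives the left-out densities toward zero, and one that is too large oversmooths; the cross-validated likelihood penalizes both.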


The bandwidths for the density estimates were chosen globally by the aforementioned methods. One of the most complicated problems encountered was the estimation of the covariance matrix for the bivariate case (joint probability density estimation). Since we used global bandwidths, we set the covariance matrix to be a diagonal matrix with the same smoothing parameters. Although we achieved very good estimates in most cases, we came to the conclusion that global bandwidth estimation might not be the best choice for densities of complex shapes in certain scenes. This is a point where our work needs improvement. Applying locally adaptive smoothing parameters is more complicated than the global approaches, but it might show better behaviour in certain cases. A recent work presents adaptive kernel density estimation based on linear diffusion [6]. Other adaptive methods can be found in [8, 35].

The computation of the kernel density estimate of mutual information using Gaussian kernels was a demanding process due to the huge number of exponential evaluations. The computations of the gradient as well as the Hessian of mutual information can be seen as an additional bottleneck. As these computations are highly parallelizable, we have developed and implemented GPU algorithms to massively speed up the computation of the cost function and its derivatives. We achieved a speed-up by a factor of 10, which allowed us to perform our tests and to observe the outcomes much faster. Nevertheless, further improvements to our reference GPU implementation are necessary in order to increase the speed-up factor. In the future, several GPU devices in one PC system or a small GPU cluster could be employed.

Furthermore, another very interesting and promising approach, k-nearest-neighbour estimation of mutual information, could be applied in future work to approximate the marginal and joint probabilities without the need to estimate any smoothing parameters [39, 23, 9, 45, 28].


Appendix


A. Derivation of Kernel Density Estimator

As briefly described in section 4.2, the histogram approach to estimating some density p(x) is based on the intuitive idea that it is reasonable to count the frequency of observations falling into the same small interval as x itself:

\frac{1}{nh} \{\text{frequency count of observations within the interval CONTAINING } x\}.

The kernel density estimator is based on a similar but more flexible idea and does not suffer from the choice-of-an-origin problem. It can be formulated as

\frac{1}{nh} \{\text{frequency count of observations within the interval AROUND } x\}.

Note the important difference between "containing" and "around" compared to the construction of the histogram. This time the estimation relies on the frequency count of observations in an interval placed around x, and not on the interval containing x, which is placed around some bin center m_j depending on the choice of an origin x_0. Considering an interval of length 2h, i.e. of the form [x-h, x+h), the density estimate can be formulated as

\hat{p}_h(x) = \frac{1}{2hn} \#\{X_i \in [x-h, x+h)\}, \qquad (A.1)

where X_1, X_2, \dots, X_n are random samples drawn from an unknown distribution p. Using a weighting function K (such as a uniform kernel function), formula A.1 can be rewritten as

\hat{p}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) = \frac{1}{nh} \sum_{i=1}^{n} \frac{1}{2}\, I\left(\left|\frac{x - X_i}{h}\right| \le 1\right),

where K(u) = \frac{1}{2} I(|u| \le 1) is called the uniform kernel function, which assigns the weight 1/2 to each observation X_i whose distance from x, the point at which the pdf is to be estimated, is not bigger than h. Points far away from x receive zero weight by definition and do not contribute to the pdf estimate at x.


As can easily be noted, the uniform kernel function gives the same weight to each observation X_i, provided that it falls into the interval [x-h, x+h), no matter how close it is to x. This might not be flexible enough, because observations much closer to x should actually contribute more weight than those farther away. Other weighting functions take this situation into account. One of them is given by

\hat{p}_h(x) = \frac{1}{2nh} \sum_{i=1}^{n} \frac{3}{2} \left\{ 1 - \left(\frac{x - X_i}{h}\right)^2 \right\} I\left(\left|\frac{x - X_i}{h}\right| \le 1\right) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),

and is called the Epanechnikov kernel,

K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1).

Looking at this last kernel, it is clear that this is again a procedure of counting the observations falling into the interval around x, where contributions from X_i far away from x are weighted less than those closer to x. Some other kernels with the same property are given in Table A.1.

Table A.1.: Kernel functions

Kernel        K(u)
Uniform       \frac{1}{2} I(|u| \le 1)
Triangle      (1 - |u|) I(|u| \le 1)
Epanechnikov  \frac{3}{4}(1 - u^2) I(|u| \le 1)
Quartic       \frac{15}{16}(1 - u^2)^2 I(|u| \le 1)
Triweight     \frac{35}{32}(1 - u^2)^3 I(|u| \le 1)
Gaussian      \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}u^2)
Cosine        \frac{\pi}{4} \cos(\frac{\pi}{2}u) I(|u| \le 1)
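As a quick sanity check, the kernels of Table A.1 can be written down directly and verified to integrate to one, a property every kernel must satisfy; the crude midpoint quadrature below is only an illustrative choice, not part of the thesis's implementation:

```python
import math

# The kernels of Table A.1; each is zero outside |u| <= 1 where an
# indicator I(|u| <= 1) applies.
kernels = {
    "uniform":      lambda u: 0.5 if abs(u) <= 1 else 0.0,
    "triangle":     lambda u: (1.0 - abs(u)) if abs(u) <= 1 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0,
    "quartic":      lambda u: 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0,
    "triweight":    lambda u: 35.0 / 32.0 * (1.0 - u * u) ** 3 if abs(u) <= 1 else 0.0,
    "gaussian":     lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "cosine":       lambda u: math.pi / 4.0 * math.cos(math.pi / 2.0 * u) if abs(u) <= 1 else 0.0,
}

def integral(k, lo=-6.0, hi=6.0, steps=60000):
    """Midpoint-rule approximation of the integral of k over [lo, hi]."""
    w = (hi - lo) / steps
    return sum(k(lo + (i + 0.5) * w) for i in range(steps)) * w
```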


Thus, given the random observations X_1, X_2, \dots, X_n from a distribution p, a kernel density estimator can finally be formulated mathematically as

\hat{p}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \qquad (A.2)

where

K_h(\bullet) = \frac{1}{h} K(\bullet / h).

Note that the weighting function K(\bullet) refers to a kernel function such as the ones from Table A.1, whereas formula A.2 is the kernel density estimator.
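A minimal one-dimensional sketch of estimator A.2 with the Gaussian kernel, for illustration only (the reconstruction framework itself uses bivariate estimates and GPU code):

```python
import math

def kde(x, samples, h):
    """Kernel density estimate of formula A.2 with K_h(u) = K(u/h)/h
    and the Gaussian kernel from Table A.1."""
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(gauss((x - xi) / h) for xi in samples) / (len(samples) * h)
```

Because the Gaussian kernel is smooth, the resulting estimate is differentiable in x, which is what allows the analytic gradients used throughout this work.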


B. Projective Issues and Camera Models

Projection of vessel centerlines as a 3D model is performed by a camera, which can be described by a matrix that defines a mapping between the 3D world and a 2D image. The general projective camera is usually examined using the tools of projective geometry, and its matrix representation allows one to calculate geometric entities such as the center of projection and the image plane.

The most specialized and simplest camera model is the (basic) pinhole camera. Its

properties and generalizations can also be applied to monocular vision.

B.1. Pinhole Camera Model

The main issue is the central projection of points in space onto a plane. A pinhole camera model consists of a projection center (also called the camera center), which is where the camera is located, and an image plane (focal plane). The line from the camera center perpendicular to the image plane is called the principal axis, and the point P where the principal axis meets the image plane is called the principal point. The distance from the camera center to the image plane is called the focal length f. The projection of a point in space with coordinates X_S = (x, y, z)^T is a mapping to another point X_P on the image plane, where the line joining the point X_S to the center of projection meets the image plane. Assuming the camera center is at the origin of a Euclidean coordinate system, figure B.1 illustrates the geometry of a pinhole camera:

By simple geometric calculations, one quickly sees that the point X_S = (x, y, z)^T is mapped to the point X_P = (fx/z, fy/z, f)^T. That describes the central projection mapping from world space (Euclidean 3-space \mathbb{R}^3) to image coordinates (Euclidean 2-space \mathbb{R}^2):

(x, y, z)^T \mapsto (fx/z, fy/z)^T \qquad (B.1)

Considering the homogeneous representation of world and image points, (B.1) can be written as a matrix multiplication in the form of a linear mapping between homogeneous coordinates:

\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} fx \\ fy \\ z \end{pmatrix} = \begin{bmatrix} f & & & 0 \\ & f & & 0 \\ & & 1 & 0 \end{bmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (B.2)

Figure B.1.: Pinhole Camera Geometry: C is the center of projection and p is the principal point. The image plane is placed between the camera center and the world space.

Thus, X_S is represented by the homogeneous 4-vector (x, y, z, 1)^T and X_P is represented by the homogeneous 3-vector (fx, fy, z)^T. Furthermore, let P be a 3 × 4 homogeneous camera projection matrix. Then (B.2) can be compactly written as

X_P = P X_S,

which defines the camera matrix for the pinhole model of central projection as

P = \mathrm{diag}(f, f, 1)\,[I \mid 0],

where diag(f, f, 1) is a diagonal matrix and [I \mid 0] represents a 3 × 4 matrix whose 3 × 3 block is the identity matrix I and whose last column is the zero vector (0, 0, 0)^T.

Recall that in figure B.1 the image plane lies in the middle, between the camera center and the world space. For some systems, however, such as a medical C-arm device, the world space is placed between the X-ray source and the image detector. The geometrical properties stay the same in both cases. Figure B.2 illustrates this.

Figure B.2.: Pinhole Camera Geometry: World space is placed between the camera center and the image plane (as in the C-arm). Recall that in figure B.1 the image plane was placed between the camera center and the world space. In both cases, the geometry and all properties stay the same.

Up to now, it was assumed that the image plane has its origin at the principal point. This need not always be the case in practice, so there is a more general mapping defined as

(x, y, z)^T \mapsto (fx/z + p_x,\; fy/z + p_y)^T

where (p_x, p_y)^T are the coordinates of the principal point (see figure B.3), and (B.2) can now be written as


Figure B.3.: Image (x, y) and camera (x_{cam}, y_{cam}) coordinate systems.

\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} fx + z p_x \\ fy + z p_y \\ z \end{pmatrix} = \begin{bmatrix} f & & p_x & 0 \\ & f & p_y & 0 \\ & & 1 & 0 \end{bmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (B.3)

and in compact form as

X_P = K[I \mid 0]\, X_S \qquad (B.4)

where

K = \begin{bmatrix} f & & p_x \\ & f & p_y \\ & & 1 \end{bmatrix} \qquad (B.5)

is called the camera calibration matrix. Recall that up to now the camera center has been assumed to be at the origin of a Euclidean coordinate system with the principal axis of the camera pointing straight down the Z-axis, and the point X_S is expressed in this coordinate system, which is called the camera coordinate frame.

In general, X_S will be expressed in terms of a different Euclidean coordinate frame called the world coordinate frame. The two coordinate frames are related via a rotation and a translation (see figure B.4). Now, let \tilde{X}_S be an inhomogeneous 3-vector representing the coordinates of the same point X_S, but this time in the world coordinate frame. Then the relation between these two points can be expressed as X_S = R(\tilde{X}_S - \tilde{C}), where \tilde{C} represents the coordinates of the camera center in the world coordinate frame, and R is a 3 × 3 rotation matrix representing the orientation of the camera coordinate frame. This equation may be written in homogeneous coordinates as


Figure B.4.: The Euclidean transformation between the world and camera coordinate frames

X_S = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \tilde{X}_S. \qquad (B.6)

Putting it all together with (B.4) leads to

X_P = KR[I \mid -\tilde{C}]\, \tilde{X}_S \qquad (B.7)

This is the general mapping from the world coordinate frame (the point \tilde{X}_S) to image coordinates (the point X_P) given by a pinhole camera. As can be seen, a general pinhole camera, P = KR[I \mid -\tilde{C}], has 9 degrees of freedom: 3 parameters for the elements f, p_x, p_y of the calibration matrix K, 3 for the rotation R and 3 for the translation \tilde{C}. The parameters contained in K are called intrinsic camera parameters, whereas those for the rotation and translation are called extrinsic camera parameters.
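A minimal sketch of the mapping in (B.3) for a camera at the origin; the rotation and translation of (B.7) are left out for brevity, and the function name is an illustrative choice:

```python
def pinhole_project(f, px, py, point):
    """Map a 3D point (x, y, z) in the camera frame to image coordinates
    (f*x/z + p_x, f*y/z + p_y), i.e. the inhomogeneous form of (B.3)."""
    x, y, z = point
    return (f * x / z + px, f * y / z + py)
```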

B.2. CCD Cameras

The general pinhole camera model just derived assumes that the image plane has Euclidean coordinates with equal scales in both axial directions. In contrast, CCD cameras may have non-square pixels, which has the extra effect of introducing unequal scale factors in each direction. In such a case, (B.5) has to be multiplied on the left by an extra factor diag(m_x, m_y, 1), where m_x and m_y are the number of pixels per unit distance in the x and y directions of the image coordinates. This results in


K = \begin{bmatrix} \alpha_x & & x_0 \\ & \alpha_y & y_0 \\ & & 1 \end{bmatrix} \qquad (B.8)

as the calibration matrix of a CCD camera, where \alpha_x = f m_x and \alpha_y = f m_y represent the focal length of the camera in terms of pixel dimensions in the x and y directions, respectively. Furthermore, (x_0, y_0) is the principal point, also in terms of pixel dimensions, with coordinates x_0 = m_x p_x and y_0 = m_y p_y. Thus, a CCD camera has 10 degrees of freedom.

B.3. Finite Projective Camera

In some unusual instances, the calibration matrix has 11 degrees of freedom through an additional, so-called skew parameter s in K. Such models are referred to as finite projective cameras, and their calibration matrix is of the form

K = \begin{bmatrix} \alpha_x & s & x_0 \\ & \alpha_y & y_0 \\ & & 1 \end{bmatrix}. \qquad (B.9)

However, for most normal cameras the skew parameter s will be zero. A non-zero skew parameter can be interpreted as a skewing of the pixel elements, so that the x- and y-axes are not perpendicular. This type of model is not considered any further in this report.


C. Log-Sum-Exp Trick

The functional form known as the logarithm of a sum of exponentials may, for example, be encountered in non-parametric probability estimation for mutual information. It arises when the density p within the Shannon entropy has to be continuously approximated via kernel density estimation, where the particular application uses the Gaussian as a kernel, which contains the exponential function. As known from the entropy formula, of the form p \log(p), it can easily be seen that applying the Gaussian kernel to estimate p results in a formula consisting of the logarithm of a sum of exponentials:

\log\left(\frac{1}{N} \sum_x G(x)\right) = \log\left(\frac{1}{N}\right) + \log\left(\sum_x c_1 \exp(c_2 x)\right) = C + \log\left(\exp(x_1) + \exp(x_2) + \cdots + \exp(x_n)\right).

This expression has to be evaluated numerically. In compiled languages, such as C (or similar), special care is required when calculating this expression in order to avoid numerical problems. The vector-valued parameter x might have very large or very small components, for which the function still needs to work. Components that are too large can cause overflow due to the exponentiation. On the other hand, for very small values the exponential term might vanish, and the logarithm of a very small value can result in underflow.

To make the explanation easier, consider the term

\exp(a) + \exp(b) = [\exp(a - c) + \exp(b - c)] \exp(c),

for any c, which has to be chosen appropriately. Taking the logarithm yields

\log([\exp(a - c) + \exp(b - c)] \exp(c)) = \log(\exp(a - c) + \exp(b - c)) + c.

The shifting value c has to be chosen in such a way that the possibility of an overflow is reduced. Note that underflow is also possible, since \log(x) \to -\infty as x \to 0. Considering the vector x, one reasonable choice for the shift c is the element of x that is largest in absolute value.
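The shifting just described can be sketched as follows; here c is chosen as the maximum component, a common variant of the rule above that guarantees no argument of exp is positive:

```python
import math

def logsumexp(xs):
    """Numerically stable log(exp(x_1) + ... + exp(x_n))."""
    c = max(xs)  # shift so the largest exponent becomes 0
    return c + math.log(sum(math.exp(x - c) for x in xs))
```

A naive evaluation of log(exp(1000) + exp(1000)) overflows in double precision, whereas the shifted version returns 1000 + log 2.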

Note that this is still not completely robust; it remains prone to over- or underflow. In particular, in the case of density estimation for mutual information via Parzen windows, the joint entropy contains the joint probability, which is estimated (within this work) by multiplying the single probabilities, resulting in a product of exponentials. If both of these exponentials are adjusted as described above in order to prevent an underflow caused by taking the logarithm of a very small value, thereby yielding larger values, the multiplication of two such large values might then yield +∞.


D. B-Splines

D.1. Bézier Curve

A Bézier curve is a parametric curve that defines a polynomial curve of a certain degree. Such curves are defined by control points that consist of two Bézier points (BP), the start and end points, and of a set of Bézier curve points (BCP) affecting the bends of the curve.

Figure D.1.: A cubic Bézier curve. P_0 and P_3 are the BPs, whereas P_1 and P_2 are the BCPs.

D.1.1. General Definition

Given n + 1 control points P_0, P_1, \dots, P_n, the Bézier curve of degree n is generally defined as

B(t) = \sum_{i=0}^{n} b_{i,n}(t)\, P_i \qquad (D.1)

where t \in [0, 1] and the polynomials

b_{i,n}(t) = \binom{n}{i} t^i (1 - t)^{n-i}, \quad i = 0, \dots, n \qquad (D.2)


are known as the degree-n Bernstein basis polynomials, which form a basis of the associated vector space and can be defined recursively as

b_{i,n}(t) = (1 - t)\, b_{i,n-1}(t) + t\, b_{i-1,n-1}(t) \qquad (D.3)

where b_{0,0} := 1 and b_{i,n} := 0 for i < 0 or i > n. Thus, the degree-n Bézier curve is a linear interpolation between two degree-(n-1) Bézier curves.

D.1.2. Examination of Cases and Constructing Bézier Curves

Linear Bézier Curves

A linear Bézier curve is simply a straight line between two control points P_0 and P_1 and is equivalent to linear interpolation:

B(t) = P_0 + t(P_1 - P_0) = (1 - t)\, P_0 + t\, P_1, \quad t \in [0, 1] \qquad (D.4)

where t describes how far along the way from P_0 to P_1 the point B(t) lies.

Figure D.2.: A linear Bézier curve, t ∈ [0, 1]


Quadratic Bézier Curves

Given the control points P_0, P_1 and P_2, a quadratic Bézier curve is the path traced by departing from P_0 in the direction of P_1 and bending to arrive at P_2:

B(t) = (1 - t)^2 P_0 + 2(1 - t)\, t\, P_1 + t^2 P_2, \quad t \in [0, 1] \qquad (D.5)

Figure D.3.: Construction of a quadratic Bézier curve, t ∈ [0, 1]


Cubic Bézier Curves

Bézier showed that four points P_0, P_1, P_2 and P_3 are enough to define a cubic curve in the plane or in three-dimensional space. The curve starts from P_0 going toward P_1, and arrives at P_3 coming from the direction of P_2. It is defined as:

B(t) = (1 - t)^3 P_0 + 3(1 - t)^2 t\, P_1 + 3(1 - t)\, t^2 P_2 + t^3 P_3, \quad t \in [0, 1] \qquad (D.6)

The BCPs P_1 and P_2 provide directional information, and the distance between P_0 and P_1 determines how long the curve moves toward P_1 before turning towards P_3.

Figure D.4.: Construction of a cubic Bézier curve, t ∈ [0, 1]

As can be seen, one can construct intermediate points Q_0, Q_1 and Q_2 that describe linear Bézier curves, and points R_0, R_1 that describe quadratic Bézier curves.


D.2. Spline

A spline of degree n can be seen as a piecewise-defined polynomial Bézier curve, i.e. as a collection of Bézier curves connected end to end, where the polynomials all have degree at most n. A further constraint is that the spline is n - 1 times continuously differentiable, i.e. C^{n-1}, at the points where two polynomial pieces join. The spline is called linear if the polynomial pieces are linear; quadratic and cubic splines are defined accordingly.

Let S be the spline function that maps values from an interval to \mathbb{R}, considering the univariate case:

S: [a, b] \to \mathbb{R}

The spline function is defined in the form of piecewise polynomials P_i, where

P_i: [t_i, t_{i+1}] \to \mathbb{R} \quad \text{with} \quad a = t_0 \le t_1 \le \dots \le t_{k-1} \le t_k = b

and S is defined on the subintervals of [a, b] as follows:

S(t) = P_0(t), \quad t_0 \le t < t_1
S(t) = P_1(t), \quad t_1 \le t < t_2
\vdots
S(t) = P_{k-1}(t), \quad t_{k-1} \le t < t_k \qquad (D.7)

The given k + 1 points are called the knots, and the associated vector t = (t_0, \dots, t_k) is called a knot vector of the spline. Multiple knots at any point result in a loss of continuity at that point. An extended knot vector can be defined as

(t_0, t_1, \dots, t_1, t_2, \dots, t_2, t_3, \dots, t_{k-2}, t_{k-1}, \dots, t_{k-1}, t_k)

where t_i is repeated j_i times for i = 1, \dots, k - 1.

A spline curve is defined as a parametric curve on the interval [a, b],

G(t) = (X(t), Y(t)), \quad t \in [a, b],

where both X and Y are spline functions of the same degree with the same extended knot vectors on that interval.


D.3. B-Spline

A B-spline curve (basis spline) defines a sequence of degree-n Bézier curves that automatically have C^{n-1} continuity at the joints, regardless of where the control points are placed. Thus, a generalisation of a Bézier curve can be seen as a spline curve parametrized by spline functions that are expressed as linear combinations of B-splines.

A fundamental theorem of Carl de Boor states that every spline function of a given degree, smoothness, and domain partition can be represented as a linear combination of B-splines of that same degree and smoothness over that same partition [?]:

S(t) = \sum_{i=0}^{m-n-2} P_i\, b_{i,n}(t), \quad t \in [t_n, t_{m-n-1}] \qquad (D.8)

with m given knots t_0 \le t_1 \le \dots \le t_{m-1}, where the P_i are the m - n - 1 control points forming a convex hull, the b_{i,n} are the B-splines of degree n forming the basis, and S: [t_0, t_{m-1}] \to \mathbb{R}^2 is the parametric curve (spline) of degree n.

The space of piecewise polynomials is a vector space with a basis. The choice of the basis is critical with regard to potential rounding errors. A recursive method, known as the de Boor algorithm, is a numerically stable way to define these basis functions:

b_{j,0}(t) := \begin{cases} 1, & \text{if } t_j \le t < t_{j+1} \\ 0, & \text{otherwise} \end{cases} \qquad (D.9)

b_{j,n}(t) := \frac{t - t_j}{t_{j+n} - t_j}\, b_{j,n-1}(t) + \frac{t_{j+n+1} - t}{t_{j+n+1} - t_{j+1}}\, b_{j+1,n-1}(t)

When the knots are equidistant, the B-spline is said to be uniform, otherwise non-uniform. Note that j + n + 1 cannot exceed m - 1, which limits both j and n.
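The recursion (D.9) translates directly into code; the guards against a zero knot difference reflect the usual convention of dropping terms with a vanishing denominator:

```python
def bspline_basis(j, n, t, knots):
    """Degree-n B-spline basis function b_{j,n}(t) via the de Boor
    (Cox-de Boor) recursion of (D.9); knots is non-decreasing."""
    if n == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    left = right = 0.0
    d1 = knots[j + n] - knots[j]
    if d1 > 0.0:
        left = (t - knots[j]) / d1 * bspline_basis(j, n - 1, t, knots)
    d2 = knots[j + n + 1] - knots[j + 1]
    if d2 > 0.0:
        right = (knots[j + n + 1] - t) / d2 * bspline_basis(j + 1, n - 1, t, knots)
    return left + right
```

On a uniform knot vector, the basis functions of one degree sum to one inside the valid parameter range (partition of unity).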


Bibliography

[1] The boost library. http://www.boost.org/, 2011.

[2] Cuda architecture. http://developer.nvidia.com/category/zone/cuda-zone, 2011.

[3] Open computing language. http://www.khronos.org/opencl/, 2011.

[4] Cédric Archambeau, M. Valle, A. Assenza, and Michel Verleysen. Assessment of probability density estimation methods: Parzen window and finite Gaussian mixtures. In ISCAS. IEEE, 2006.

[5] Benjamin Auffarth, Maite López, and Jesus Cerquides. Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. Artificial Intelligence, pages 248–262, 2010.

[6] Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916–2957, 2010.

[7] G. Bradski. The opencv library. http://opencv.willowgarage.com/wiki/, 2011.

[8] Thomas Brox, Bodo Rosenhahn, Daniel Cremers, and Hans-Peter Seidel. Nonparamet-

ric density estimation with adaptive anisotropic kernels for human motion tracking. In

Ahmed Elgammal, Bodo Rosenhahn, and Reinhard Klette, editors, 2nd Workshop on

Human Motion, volume 4814 of Lecture Notes in Computer Science, pages 152–165,

Rio de Janeiro, Brazil, 2007. Springer.

[9] Haixiao Cai, Sanjeev R. Kulkarni, and Sergio Verdú. Universal entropy estimation via block sorting. IEEE Transactions on Information Theory, 50(7):1551–1561, 2004.

[10] L. Le Cam. Maximum likelihood: An introduction. Intl. Stat. Rev, 58:153–171, 1990.

[11] Clark Carter. Interquartile range. In Encyclopedia of Statistics in Behavioral Science. October 2005.

[12] Haifeng Chen and Peter Meer. Robust computer vision through kernel density estima-

tion. In In 7th European Conf. on Computer Vision, pages 236–250, 2002.

[13] C.H. Coombs, R.M. Dawes, and A. Tversky. Mathematical psychology: an elementary

introduction. Prentice-Hall series in mathematical psychology. Prentice-Hall, 1970.

[14] T.M. Cover and J.A. Thomas. Elements of information theory. Wiley Series in Telecom-

munications and Signal Processing. Wiley-Interscience, 2006.

[15] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.

[16] Ahmed Elgammal, Ramani Duraiswami, and Larry S. Davis. Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1499–1504, 2003.

[17] Chris Engels, Henrik Stewénius, and David Nistér. Bundle adjustment rules. In Photogrammetric Computer Vision, 2006.

[18] Ardy Goshtasby. 2-D and 3-D Image Registration for Medical, Remote Sensing, and Industrial Applications. Wiley Press, 2005.

[19] Robert M. Gray. Entropy and information theory. Springer-Verlag New York, Inc.,

New York, NY, USA, 1990.

[20] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2011.

[21] W. Härdle. Nonparametric and semiparametric models. Springer Series in Statistics. Springer, 2004.

[22] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge

University Press, 2003.

[23] Vladimir Hnizdo, Eva Darian, Adam Fedorowicz, Eugene Demchuk, Shengqiao Li, and Harshinder Singh. Nearest-neighbor nonparametric method for estimating the configurational entropy of complex molecules. Journal of Computational Chemistry, 28(3):655–668, 2007.

[24] Andreas Hubeli and Markus Gross. A survey of surface representations for geometric modeling. Technical report, 2000.

[25] M. C. Jones and D. A. Henderson. Maximum likelihood kernel density estimation: On the potential of convolution sieves. Computational Statistics & Data Analysis, 53(10):3726–3733, 2009.

[26] J.W. Larson. Information-theoretic strategies for quantifying variability and model-reality comparison in the climate system. 2009.

[27] Michael London, Matthew E. Larkum, and Michael Häusser. Predicting the synaptic information efficacy in cortical layer 5 pyramidal neurons using a minimal integrate-and-fire model. Biological Cybernetics, 99(4-5):393–401, 2008.

[28] D. Pál, B. Póczos, and Cs. Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs (extended version). In NIPS, December 2010.

[29] Liam Paninski. Estimation of entropy and mutual information. Neural Comput., 15:1191–1253, June 2003.

[30] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever. Mutual-information-based registration of medical images: a survey. IEEE Transactions on Medical Imaging, 22(8):986–1004, July 2003.

[31] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1988.

[32] Fazlollah M. Reza. An introduction to information theory. McGraw-Hill, New York, 1961.

[33] Oliver Ruepp and Darius Burschka. Fast recovery of weakly textured surfaces from monocular image sequences. In Proceedings of the 10th Asian Conference on Computer Vision - Volume Part IV, ACCV'10, pages 474–485, Berlin, Heidelberg, 2011. Springer-Verlag.

[34] Rui Xu, Yen-Wei Chen, Song-Yuan Tang, Shigehiro Morikawa, and Yoshimasa Kurumi. Parzen-Window Based Normalized Mutual Information for Medical Image Registration. IEICE Transactions on Information and Systems, E91-D(1):132–144, January 2008.

[35] Stephan R. Sain. Adaptive kernel density estimation, 1994.

[36] Stephan R. Sain, Keith A. Baggerly, and David W. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89:807–817, 1992.

[37] R. Shams, P. Sadeghi, R. A. Kennedy, and R. Hartley. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer Methods and Programs in Biomedicine, 99(2):133–146, August 2010.

[38] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.

[39] H. Singh. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23(3–4):301–321, 2003.

[40] Alexander Strehl, Joydeep Ghosh, and Claire Cardie. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.

[41] C. Studholme, D. L. G. Hill, and D. J. Hawkes. An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition, 32(1):71–86, 1999.

[42] Janusz Szczepanski, M. Arnold, Elek Wajnryb, José M. Amigó, and Maria V. Sanchez-Vives. Mutual information and redundancy in spontaneous communication between cortical neurons. Biological Cybernetics, 104(3):161–174, 2011.

[43] Charles C. Taylor. Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika, 76(4):705–712, 1989.

[44] Bill Triggs, Philip F. Mclauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle Adjustment – A Modern Synthesis, volume 1883. January 2000.

[45] Martin Vejmelka and Katerina Hlaváčková-Schindler. Mutual information estimation in higher dimensions: A speed-up of a k-nearest neighbor based estimator. In Bartlomiej Beliczynski, Andrzej Dzielinski, Marcin Iwanowski, and Bernardete Ribeiro, editors, ICANNGA (1), volume 4431 of Lecture Notes in Computer Science, pages 790–797. Springer, 2007.

[46] Nguyen X. Vinh, Julien Epps, and James Bailey. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research.

[47] Paul A. Viola and William M. Wells III. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137–154, 1997.

[48] I.H. Witten, E. Frank, and M.A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann series in data management systems. Elsevier Science & Technology, 2011.

[49] Y. Y. Yao. Information-theoretic measures for knowledge discovery and data mining. In Karmeshu, editor, Entropy Measures, Maximum Entropy Principle and Emerging Applications, pages 115–136, Berlin, 2003. Springer.
