
TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI

IMAGE PROCESSING IN MECHATRONICS


Machine Vision

Lecturer: Dr. Nguyễn Thành Hùng


Unit: Department of Mechatronics, School of Mechanical Engineering

Hà Nội, 2021 1
Chapter 7. Object Recognition

❖1. Introduction

❖2. Pattern Matching

❖3. Feature-based Methods

❖4. Artificial Neural Networks

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 2
1. Introduction
▪ Object recognition: localizing and classifying objects in an image.
▪ General concept:
➢ training datasets contain images with known, labelled objects;
➢ depending on the chosen algorithm, different types of information (colours, edges, geometric forms) are extracted from these images;
➢ for any new image, the same information is gathered and compared to the training dataset to find the most suitable classification (see the sketch after this slide).

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 3
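The train-then-compare workflow above can be sketched in a few lines of Python. This is only a hypothetical illustration: the grayscale-histogram feature and the nearest-neighbour comparison are assumptions of this sketch, not something prescribed by the slides.

    import numpy as np

    def extract_features(image):
        # One simple choice of "information" to extract: a normalized intensity histogram.
        hist, _ = np.histogram(image, bins=32, range=(0, 256))
        return hist / hist.sum()

    def classify(new_image, train_images, train_labels):
        # Gather the same information for the new image and compare it to every
        # labelled training image; return the label of the most similar one.
        query = extract_features(new_image)
        distances = [np.linalg.norm(query - extract_features(img)) for img in train_images]
        return train_labels[int(np.argmin(distances))]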
1. Introduction
▪ Applications:
➢ robots in industrial environments,
➢ face or handwriting recognition,
➢ autonomous systems such as modern cars, which use object recognition for pedestrian detection, emergency brake assistance, and so on,
➢ …

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 4
1. Introduction

▪ General Object Recognition Strategies


➢ Appearance-based method
➢ Feature-based method
➢ Interpretation Tree
➢ Pattern Matching
➢ Artificial Neural Networks

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 5
1. Introduction
▪ General Object Recognition Strategies: Appearance-based method
➢ Face or handwriting recognition
➢ Reference training images
➢ This dataset is compressed to obtain a lower-dimensional subspace, also called the eigenspace.
➢ Parts of the new input images are projected onto the eigenspace and correspondence is then examined.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 6
1. Introduction
▪ General Object Recognition Strategies: Feature-based Method
➢ Characteristic for each object
➢ Colours, contour lines, geometric forms or edges
➢ The basic concept of feature-based object recognition strategies is as follows:
• every input image is searched for a specific type of feature;
• this feature is then compared to a database containing models of the objects, in order to verify whether any objects are recognised.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 7
1. Introduction

▪ General Object Recognition Strategies: Feature-based method


➢ Features and their descriptors can be either found considering the whole image (global
feature) or after observing just small parts of the image (local feature).
➢ A histogram of pixel intensities or colours is a simple example of a global feature.
➢ It is not always reasonable to compare the whole image, as even slight changes in illumination, position (occlusion) or rotation lead to significant differences, and a correct recognition is no longer possible.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 8
1. Introduction

▪ General Object Recognition Strategies: Feature-based method


➢ Descriptors of local features are more robust against these problems and therefore
algorithms with local features often outperform global feature-based methods.

Two patches from different images are cut out and compared; they are considered a match if the error between the patches is below a certain threshold.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 9
1. Introduction

▪ General Object Recognition Strategies: Interpretation Tree


➢ An interpretation tree is a depth-first search algorithm for model matching.
➢ Algorithms based on this approach often try to recognise n-dimensional geometric objects; therefore, a database containing models with known features is necessary.
➢ The feature set might consist of distance, angle and direction constraints between points on the surface of the objects.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 10
1. Introduction

▪ General Object Recognition Strategies: Interpretation Tree

Procedure of an interpretation tree algorithm

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 11
1. Introduction

▪ General Object Recognition Strategies: Pattern Matching


➢ Methods of pattern matching, sometimes called template matching, are often used because of their simplicity.
➢ Template matching is a technique for finding small parts of an image which match a
template image.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 12
1. Introduction

▪ General Object Recognition Strategies: Pattern Matching


➢ One well-known application of template matching is traffic sign recognition: small parts of the input image are matched against a database of different traffic sign images.
➢ This approach has many disadvantages, such as problems with occlusion, rotation, scaling and different illumination.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 13
1. Introduction

▪ General Object Recognition Strategies: Artificial neural networks


➢ A model consists of several layers, each of which is composed of a certain number of neurons.

A neural network containing one input layer, two hidden layers and one output layer.
Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 14
1. Introduction

▪ General Object Recognition Strategies: Artificial neural networks


➢ An input layer and an output layer are the minimum number of layers a network can have, but normally hidden layers are included so that more complex tasks such as object recognition can be learned.
➢ All neurons of one layer are connected to all neurons of the next layer, which creates a huge network with millions of parameters.
➢ All of these connections have a weight which is updated during the learning phase. A neuron is activated if the sum of its input signals is above a certain threshold, and an activation function triggers the output.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 15
1. Introduction

▪ General Object Recognition Strategies: Artificial neural networks

A neural network containing one input layer, two hidden layers and one output layer.
Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 16
1. Introduction

▪ General Object Recognition Strategies: Artificial neural networks


➢ There are different types of networks, such as feed-forward and recurrent networks, with different numbers and types of hidden layers, while the input layer (e.g. the number of pixels) and the output layer (the number of classes) are fixed.
➢ Convolutional neural networks and their hidden layers are explained in more detail in Section 4. New inputs pass through the network in the same way: some neurons are activated based on the trained weights, and this finally leads to the most suitable classification.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 17
1. Introduction

▪ Performance Analysis
➢ Invariances and Robustness
➢ Complexity
➢ Reliability and Accuracy

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 18
1. Introduction
❖ Performance Analysis: Invariances and Robustness
▪ First, the algorithms are analysed to check which invariances they exhibit and what level of robustness they have.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 19
1. Introduction
❖ Performance Analysis: Complexity
▪ Second, the algorithms are compared with regard to complexity, especially in terms of computational load and memory usage.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 22
1. Introduction
❖ Performance Analysis: Reliability and Accuracy

The development of accuracy rates of traditional computer vision and deep learning regarding ImageNet.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 26
Chapter 7. Object Recognition

❖1. Introduction

❖2. Pattern Matching

❖3. Feature-based Methods

❖4. Artificial Neural Networks

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 28
2. Pattern Matching
❖ Template matching is a technique for finding areas of an image that match (are similar) to
a template image (patch).
❖ How does it work?
▪ We need two primary components:
▪ Source image (I): the image in which we expect to find a match to the template image.
▪ Template image (T): the patch image which will be compared to the source image; our goal is to detect the highest matching area:

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 29
2. Pattern Matching

❖ Template matching

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 30
2. Pattern Matching
❖ Template matching
▪ To identify the matching area, we have to compare the template image against the source
image by sliding it:

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 31
2. Pattern Matching
❖ Template matching
▪ By sliding, we mean moving the patch one pixel at a time (left to right, top to bottom). At each location, a metric is calculated that represents how "good" or "bad" the match at that location is (or how similar the patch is to that particular area of the source image).
▪ For each location of T over I, you store the metric in the result matrix R. Each location (x,y) in R contains the match metric:

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 32
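A minimal OpenCV sketch of the sliding-and-scoring procedure described above. The file names source.png and template.png are placeholders assumed for this example, and TM_CCORR_NORMED is just one of the available metrics.

    import cv2

    I = cv2.imread("source.png", cv2.IMREAD_GRAYSCALE)    # source image I
    T = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)  # template (patch) T

    # R holds the match metric for every placement of T over I.
    R = cv2.matchTemplate(I, T, cv2.TM_CCORR_NORMED)

    # For TM_CCORR_NORMED the best match is the maximum of R.
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(R)
    top_left = max_loc
    h, w = T.shape
    bottom_right = (top_left[0] + w, top_left[1] + h)

    # Draw the matched rectangle (top-left corner plus template size) on the source image.
    cv2.rectangle(I, top_left, bottom_right, 255, 2)
    cv2.imwrite("result.png", I)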
2. Pattern Matching
❖ Template matching

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 33
2. Pattern Matching
❖ Template matching
▪ The image above is the result R of sliding the patch with a metric
TM_CCORR_NORMED. The brightest locations indicate the highest matches. As you can
see, the location marked by the red circle is probably the one with the highest value, so
that location (the rectangle formed by that point as a corner and width and height equal to
the patch image) is considered the match.

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 34
2. Pattern Matching
❖ Template matching
▪ What matching methods are available in OpenCV?

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 35
2. Pattern Matching
❖ Template matching
▪ What matching methods are available in OpenCV?

https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 36
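The method formulas on these two slides did not survive extraction. OpenCV's matchTemplate offers six metrics, selected through the flags below; note that for the squared-difference methods the best match is the minimum of R, while for the correlation-based methods it is the maximum. A short sketch of handling both cases:

    import cv2

    methods = [
        cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED,   # lower is better
        cv2.TM_CCORR,  cv2.TM_CCORR_NORMED,    # higher is better
        cv2.TM_CCOEFF, cv2.TM_CCOEFF_NORMED,   # higher is better
    ]

    def best_location(result, method):
        # result: the matrix R returned by cv2.matchTemplate for the given method
        min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
        # Squared-difference methods are minimized; the others are maximized.
        return min_loc if method in (cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED) else max_loc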
Chapter 7. Object Recognition

❖1. Introduction

❖2. Pattern Matching

❖3. Feature-based Methods

❖4. Artificial Neural Networks

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 37
3. Feature-based Methods

▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching

38
3. Feature-based Methods
❖ Feature detectors

Image pairs with extracted patches below. Notice how some patches
can be localized or matched with higher accuracy than others.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 39
3. Feature-based Methods
❖ Feature detectors
▪ The simplest possible matching criterion for comparing two image patches:

where I0 and I1 are the two images being compared, u = (u, v) is the displacement vector, w(x) is a
spatially varying weighting (or window) function, and the summation i is over all the pixels in the patch.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 40
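The criterion itself did not survive extraction; restated in LaTeX from the variable definitions above, it is the weighted summed squared difference used by Szeliski:

    E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,\bigl[ I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i) \bigr]^2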
3. Feature-based Methods
❖ Feature detectors

Aperture problems for different image patches: (a) stable (“corner-like”) flow; (b) classic aperture problem
(barber-pole illusion); (c) textureless region. The two images I0 (yellow) and I1 (red) are overlaid. The red
vector u indicates the displacement between the patch centers and the w(xi) weighting function (patch
window) is shown as a dark circle.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 41
3. Feature-based Methods
❖ Feature detectors
▪ auto-correlation function or surface

Three auto-correlation surfaces E_AC(Δu) shown as both grayscale images and surface plots: (a) the original image is marked with three red crosses to denote where the auto-correlation surfaces were computed; (b) this patch is from the flower bed (good unique minimum); (c) this patch is from the roof edge (one-dimensional aperture problem); and (d) this patch is from the cloud (no good peak). Each grid point in figures b–d is one value of Δu.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 42
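The auto-correlation surface E_AC(Δu) referred to in the caption compares a patch of I0 against itself under a small displacement Δu; written in the same notation as the matching criterion above:

    E_{\mathrm{AC}}(\Delta\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,\bigl[ I_0(\mathbf{x}_i + \Delta\mathbf{u}) - I_0(\mathbf{x}_i) \bigr]^2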
3. Feature-based Methods
❖ Feature detectors
▪ auto-correlation function or surface

Uncertainty ellipse corresponding to an eigenvalue analysis of the auto-correlation matrix A.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 43
3. Feature-based Methods
❖ Feature detectors
▪ Forstner–Harris

Interest operator responses: (a) Sample image, (b) Harris response, and (c) DoG response. The circle sizes
and colors indicate the scale at which each interest point was detected. Notice how the two detectors
tend to respond at complementary locations.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 44
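The Harris response shown in (b) is computed from the auto-correlation (second moment) matrix A built from the image gradients I_x, I_y. A commonly used form of the Harris measure, with α typically chosen around 0.04–0.06, is:

    A = \sum_i w(\mathbf{x}_i)
        \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},
    \qquad
    R = \det(A) - \alpha\,\operatorname{trace}^2(A)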
3. Feature-based Methods
❖ Feature detectors
▪ Adaptive non-maximal suppression (ANMS)

Adaptive non-maximal suppression (ANMS) (Brown, Szeliski, and Winder 2005): The upper two images show the strongest 250 and 500 interest points, while the lower two images show the interest points selected with adaptive non-maximal suppression, along with the corresponding suppression radius r. Note how the latter features have a much more uniform spatial distribution across the image.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 45
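A minimal NumPy sketch of the ANMS idea described in the caption: each keypoint is assigned the radius to its nearest significantly stronger neighbour, and the points with the largest radii are kept. The O(n²) loop, the robustness factor of 0.9 and the function name are assumptions of this sketch.

    import numpy as np

    def anms(points, strengths, num_keep=500, c_robust=0.9):
        # points: (N, 2) array of keypoint coordinates; strengths: (N,) corner responses
        n = len(points)
        radii = np.full(n, np.inf)
        for i in range(n):
            stronger = c_robust * strengths > strengths[i]      # significantly stronger points
            if np.any(stronger):
                d = np.linalg.norm(points[stronger] - points[i], axis=1)
                radii[i] = d.min()                              # suppression radius r_i
        return np.argsort(-radii)[:num_keep]                    # indices of the kept keypoints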
3. Feature-based Methods
❖ Feature detectors
▪ Scale invariance

Multi-scale oriented patches (MOPS) extracted at five pyramid levels (Brown, Szeliski, and
Winder 2005). The boxes show the feature orientation and the region from which the
descriptor vectors are sampled.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 46
3. Feature-based Methods
❖ Feature detectors
▪ Scale invariance

Scale-space feature detection using a sub-octave Difference of Gaussian pyramid (Lowe 2004): (a) Adjacent
levels of a sub-octave Gaussian pyramid are subtracted to produce Difference of Gaussian images; (b) extrema
(maxima and minima) in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 47
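The Difference of Gaussian images in (a) are obtained by subtracting adjacent levels of the Gaussian pyramid, i.e. (Lowe 2004):

    D(x, y, \sigma) = \bigl( G(x, y, k\sigma) - G(x, y, \sigma) \bigr) * I(x, y)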
3. Feature-based Methods
❖ Feature detectors
▪ Rotational invariance and orientation estimation

A dominant orientation estimate can be computed by creating a histogram of all the gradient orientations (weighted by their magnitudes or after thresholding out small gradients) and then finding the significant peaks in this distribution (Lowe 2004).

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 48
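A minimal NumPy sketch of the dominant-orientation estimate described above: gradient orientations in a patch are accumulated into a magnitude-weighted histogram, and every peak within 80% of the maximum is returned. The 36-bin histogram and the 80% peak rule follow Lowe (2004); the function name and interface are assumptions of this sketch.

    import numpy as np

    def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
        # patch: 2D grayscale array centred on the keypoint
        gy, gx = np.gradient(patch.astype(float))                # image gradients
        mag = np.hypot(gx, gy)                                   # gradient magnitudes
        ang = np.rad2deg(np.arctan2(gy, gx)) % 360.0             # orientations in [0, 360)
        hist, edges = np.histogram(ang, bins=num_bins, range=(0.0, 360.0), weights=mag)
        peaks = np.flatnonzero(hist >= peak_ratio * hist.max())  # significant peaks
        return (edges[peaks] + edges[peaks + 1]) / 2.0           # bin centres, in degrees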
3. Feature-based Methods
❖ Feature detectors
▪ Rotational invariance and orientation estimation

Affine region detectors used to match two images taken from dramatically different viewpoints
(Mikolajczyk and Schmid 2004)
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 49
3. Feature-based Methods
❖ Feature detectors
▪ Affine invariance

Affine normalization using the second moment matrices, as described by Mikolajczyk, Tuytelaars, Schmid et al. (2005): after image coordinates are transformed using the matrices A_0^{-1/2} and A_1^{-1/2}, they are related by a pure rotation R, which can be estimated using a dominant orientation technique.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 50
3. Feature-based Methods
❖ Feature detectors
▪ Affine invariance

Maximally stable extremal regions (MSERs) extracted and matched from a number of images
(Matas, Chum, Urban et al. 2004)

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 51
3. Feature-based Methods

▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching

52
3. Feature-based Methods
❖ Feature descriptors

Feature matching: how can we extract local descriptors that are invariant to inter-image variations and yet
still discriminative enough to establish correct correspondences?

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 53
3. Feature-based Methods
❖ Feature descriptors
▪ Bias and gain normalization (MOPS)

MOPS descriptors are formed using an 8×8 sampling of bias and gain normalized intensity values, with a
sample spacing of five pixels relative to the detection scale (Brown, Szeliski, and Winder 2005). This low
frequency sampling gives the features some robustness to interest point location error and is achieved by
sampling at a higher pyramid level than the detection scale.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 54
3. Feature-based Methods
❖ Feature descriptors
▪ Scale invariant feature transform (SIFT)
A schematic representation of Lowe’s
(2004) scale invariant feature transform
(SIFT): (a) Gradient orientations and
magnitudes are computed at each pixel
and weighted by a Gaussian fall-off
function (blue circle). (b) A weighted
gradient orientation histogram is then
computed in each subregion, using trilinear
interpolation. While this figure shows an 8
× 8 pixel patch and a 2 × 2 descriptor array,
Lowe’s actual implementation uses 16 × 16
patches and a 4 × 4 array of eight-bin
histograms.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 55
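In practice the SIFT detector and descriptor are rarely re-implemented by hand; OpenCV ships an implementation (cv2.SIFT_create, available in OpenCV 4.4 and later). A minimal sketch, assuming a grayscale image file image.png (a placeholder name):

    import cv2

    img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()

    # keypoints carry location, scale and orientation; descriptors is an N x 128 array.
    keypoints, descriptors = sift.detectAndCompute(img, None)

    out = cv2.drawKeypoints(img, keypoints, None)
    cv2.imwrite("sift_keypoints.png", out)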
3. Feature-based Methods
❖ Feature descriptors
▪ Gradient location-orientation histogram (GLOH)

The gradient location-orientation histogram (GLOH) descriptor uses log-polar bins instead of square bins to compute orientation histograms (Mikolajczyk and Schmid 2005).

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 56
3. Feature-based Methods
❖ Feature descriptors

Spatial summation blocks for SIFT, GLOH, and some newly developed feature descriptors (Winder and Brown 2007): (a)
The parameters for the new features, e.g., their Gaussian weights, are learned from a training database of (b) matched
real-world image patches obtained from robust structure from motion applied to Internet photo collections (Hua, Brown,
and Winder 2007).

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 57
3. Feature-based Methods

▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching

58
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

Recognizing objects in a cluttered scene (Lowe 2004). Two of the training images in the database are shown on the left.
These are matched to the cluttered scene in the middle using SIFT features, shown as small squares in the right image.
The affine warp of each recognized database image onto the scene is shown as a larger parallelogram in the right image.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 59
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

False positives and negatives: The black digits 1 and 2 are features being matched against a database of features in other images. At the current threshold setting (the solid circles), the green 1 is a true positive (good match), the blue 1 is a false negative (failure to match), and the red 3 is a false positive (incorrect match). If we set the threshold higher (the dashed circles), the blue 1 becomes a true positive but the brown 4 becomes an additional false positive.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 60
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

The number of matches correctly and incorrectly estimated by a feature matching algorithm, showing the number of
true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The columns sum up to the actual
number of positives (P) and negatives (N), while the rows sum up to the predicted number of positives (P’) and
negatives (N’). The formulas for the true positive rate (TPR), the false positive rate (FPR), the positive predictive value
(PPV), and the accuracy (ACC) are given in the text.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 61
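The formulas referred to in the caption ("given in the text") are the standard confusion-matrix rates:

    \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
    \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
    \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
    \mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN}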
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 62
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

ROC curve and its related rates: (a) The ROC curve plots the true positive rate against the false positive rate for a particular combination of feature extraction and matching algorithms. Ideally, the true positive rate should be close to 1, while the false positive rate is close to 0. The area under the ROC curve (AUC) is often used as a single (scalar) measure of algorithm performance. Alternatively, the equal error rate is sometimes used. (b) The distribution of positives (matches) and negatives (non-matches) as a function of inter-feature distance d. As the threshold θ is increased, the number of true positives (TP) and false positives (FP) increases.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 63
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 64
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 65
3. Feature-based Methods
❖ Feature matching
▪ Nearest neighbor distance ratio

where d1 and d2 are the nearest and second nearest neighbor distances, DA is the target
descriptor, and DB and DC are its closest two neighbors

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 66
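The equation itself did not survive extraction; from the variable definitions above, the nearest neighbor distance ratio is

    \mathrm{NNDR} = \frac{d_1}{d_2} = \frac{\lVert D_A - D_B \rVert}{\lVert D_A - D_C \rVert}

A common way to apply it in practice is Lowe's ratio test on the two best matches returned by a brute-force matcher. A minimal OpenCV sketch, where the 0.8 default is the threshold suggested by Lowe (2004) and the descriptor arrays are assumed to come from a detector such as SIFT:

    import cv2

    def nndr_match(des1, des2, ratio=0.8):
        # des1, des2: descriptor arrays, e.g. from cv2.SIFT_create().detectAndCompute
        bf = cv2.BFMatcher(cv2.NORM_L2)
        matches = bf.knnMatch(des1, des2, k=2)   # two nearest neighbours per query descriptor
        # Keep a match only if its distance is clearly smaller than the second-best distance.
        return [pair[0] for pair in matches
                if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]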
3. Feature-based Methods

❖ Feature matching

Performance of the feature descriptors evaluated by Mikolajczyk and Schmid (2005), shown for three matching
strategies: (a) fixed threshold; (b) nearest neighbor; (c) nearest neighbor distance ratio (NNDR). Note how the
ordering of the algorithms does not change that much, but the overall performance varies significantly between the
different matching strategies.

Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 67
Chapter 7. Object Recognition

❖1. Introduction

❖2. Pattern Matching

❖3. Feature-based Methods

❖4. Artificial Neural Networks

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 68
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network
▪ (Deep) convolutional neural networks (CNN): the term deep means that there is at least one hidden layer, and convolutional implies the use of convolution layers. The basic principles of CNNs are inspired by the biological visual cortex of humans.
▪ The architecture of an example CNN can be seen on Slide 70. Input images with 28x28 pixels are convolved with a filter to obtain 3D feature maps. The succeeding sub-sampling layer, often called a pooling layer, further reduces the amount of data. This procedure is continued until a one-dimensional vector, which represents the different classes, is obtained.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 69
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network

One example architecture of a convolutional neural network using subsampling and convolution hidden layers.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 70
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Like most object recognition algorithms, CNNs need training to adapt all the weights of the neurons. During that phase, different levels of features are extracted (see Slide 72).
▪ Low-level features contain colour, lines or contrast, whereas edges and corners belong to mid-level features. High-level features already include class-specific forms or sections.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 71
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network

Intermediate results from hidden layers. From left to right: low-level, mid-level and high-level features.

Simon Achatz, State of the art of object recognition techniques, Technische Universitat Muchen. 72
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Architecture Overview
➢ Regular Neural Nets: Neural Networks receive an input (a single vector), and
transform it through a series of hidden layers.
➢ Regular Neural Nets don’t scale well to full images.
➢ 3D volumes of neurons: width, height, depth.
➢ A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an
input 3D volume to an output 3D volume with some differentiable function that
may or may not have parameters.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 73
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Architecture Overview

A regular 3-layer Neural Network.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 74
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Architecture Overview

A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers.
Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this
example, the red input layer holds the image, so its width and height would be the dimensions of the image,
and the depth would be 3 (Red, Green, Blue channels).
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 75
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Three main types of layers to build ConvNet architectures:
▪ Convolutional Layer
▪ Pooling Layer, and
▪ Fully-Connected Layer

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 76
4. Artificial Neural Networks

➢ Example Architecture: Overview: a simple ConvNet for CIFAR-10 classification

The activations of an example ConvNet architecture.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 77
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Example Architecture: Overview: a simple ConvNet for CIFAR-10 classification
▪ INPUT [32x32x3]: an image of width 32, height 32, and with three color
channels R,G,B.
▪ CONV layer will compute the output of neurons that are connected to local
regions in the input → volume [32x32x12] if we decided to use 12 filters.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 78
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Example Architecture: Overview: a simple ConvNet for CIFAR-10 classification
▪ RELU layer: elementwise activation function → the size of the volume
unchanged ([32x32x12]).
▪ POOL layer: downsampling operation → volume such as [16x16x12].
▪ FC (i.e. fully-connected) layer will compute the class scores, resulting in
volume of size [1x1x10].

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 79
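A minimal PyTorch sketch of exactly this INPUT → CONV → RELU → POOL → FC stack. PyTorch is my choice here, not something the slides prescribe; the 12 filters, 3x3 kernel and padding of 1 reproduce the [32x32x12] and [16x16x12] volumes quoted above.

    import torch
    import torch.nn as nn

    class TinyCifarNet(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # [32x32x3] -> [32x32x12]
            self.relu = nn.ReLU()                                   # size unchanged
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)       # [32x32x12] -> [16x16x12]
            self.fc = nn.Linear(16 * 16 * 12, num_classes)          # -> [1x1x10] class scores

        def forward(self, x):
            x = self.pool(self.relu(self.conv(x)))
            return self.fc(torch.flatten(x, start_dim=1))

    scores = TinyCifarNet()(torch.randn(1, 3, 32, 32))   # output shape: (1, 10)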
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ A ConvNet architecture is in the simplest case a list of Layers that transform the
image volume into an output volume (e.g. holding the class scores)
➢ There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far
the most popular)
➢ Each Layer accepts an input 3D volume and transforms it to an output 3D
volume through a differentiable function
➢ Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL
don’t)
➢ Each Layer may or may not have additional hyperparameters (e.g.
CONV/FC/POOL do, RELU doesn’t)
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 80
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Input image

4x4x3 RGB Image

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 81
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Convolutional Layer

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 82
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Convolutional Layer

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 83
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Convolutional Layer

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 84
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Convolutional Layer
Parameters that control the behavior of each convolutional layer:
▪ Stride
▪ Padding (zero-padding)
▪ Number of filters (depth of the next layer)
▪ Size of the filter
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 85
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Stride

A stride of 1 with a 3x3 filter on a 7x7 image; a stride of 2 with a 3x3 filter on a 7x7 image.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 86
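The spatial output size follows the standard formula below, which reproduces the stride examples above: with W = 7, F = 3 and P = 0, a stride of S = 1 gives (7 − 3)/1 + 1 = 5, and S = 2 gives (7 − 3)/2 + 1 = 3. For a "same" convolution with stride 1, choose P = (F − 1)/2.

    O = \frac{W - F + 2P}{S} + 1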
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Stride

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 87
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Padding (Zero-padding)

An image padded with zero-padding of size 2.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 88
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Padding (Zero-padding)
• same convolution: preserves the dimensions of the image
• wide convolution: adds zero-padding
• narrow convolution: uses no zero-padding

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 89
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Number of filters (depth of the next layer)
• Example: a 6x6x3 image with four 3x3 filters.
• After convolving, we obtain a 4x4xn volume, where n depends on the number of filters (in other words, on the number of feature detectors) used. In this case, n is 4.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 90
4. Artificial Neural Networks

➢ Convolutional Layer
▪ Size of the filter
• The size of the filter is usually an odd number, so that the filter has a "central pixel" ("central vision") that defines its position.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 91
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Activation Function

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 92
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Activation Function
▪ Sigmoid Activation Function

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 93
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Activation Function
▪ ReLU Activation Function

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 94
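The two activation functions named above, as a minimal NumPy sketch (the elementwise ReLU is what the RELU layer applies to a whole feature map):

    import numpy as np

    def sigmoid(x):
        # squashes any real value into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        # keeps positive values and zeroes out negative ones, elementwise
        return np.maximum(0.0, x)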
4. Artificial Neural Networks

➢ Activation Function
▪ ReLU Activation Function

Applying ReLU to an image.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 95
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ The Pooling Layer: spatial pooling reduces the dimensionality of each feature map while retaining the most important information of the image.
▪ Average pooling
▪ Max pooling
▪ Sum pooling

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 96
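A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single feature map; average and sum pooling would simply replace .max() with .mean() or .sum(). The function name and interface are assumptions of this sketch.

    import numpy as np

    def max_pool2d(x, size=2, stride=2):
        # x: a single 2D feature map
        h, w = x.shape
        out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
        out = np.empty((out_h, out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                window = x[i * stride:i * stride + size, j * stride:j * stride + size]
                out[i, j] = window.max()   # keep the strongest activation in each window
        return out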
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ The Pooling Layer

Max-pooling in 2D image.
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 97
4. Artificial Neural Networks

➢ The Pooling Layer

Max-pooling on a 3D volume, which is the case we normally deal with.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 98
4. Artificial Neural Networks

➢ The Pooling Layer

Applying max/sum pooling to the image after ReLU has been applied.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 99
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


▪ Layers used to build ConvNets
➢ Fully Connected Layer (FC)

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 100
4. Artificial Neural Networks

➢ Fully Connected Layer (FC)

Fully connected layer.


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 101
4. Artificial Neural Networks

➢ Fully Connected Layer (FC)


▪ Softmax function: takes a vector of arbitrary real-valued scores and squashes
it to a vector of values between zero and one that sum to one.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 102
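A minimal, numerically stable NumPy sketch of the softmax described above; subtracting the maximum score does not change the result but avoids overflow.

    import numpy as np

    def softmax(scores):
        # scores: 1D vector of arbitrary real-valued class scores
        shifted = scores - np.max(scores)   # for numerical stability
        exp = np.exp(shifted)
        return exp / exp.sum()              # values in (0, 1) that sum to 1

    print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.659 0.242 0.099]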
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ LeNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 103
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ AlexNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 104
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ ZFNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 105
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ Inception-v4

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 106
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ VGGNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 107
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ VGGNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 108
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ ResNet

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 109
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ Example: LeNet

110
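The LeNet figures did not survive extraction. As a hedged sketch only, the classic LeNet-5 layout (two convolution + pooling stages followed by three fully connected layers, for 32x32 single-channel digit images) can be written in PyTorch as follows; PyTorch and the tanh/average-pooling choices are assumptions of this sketch, not taken from the slides.

    import torch
    import torch.nn as nn

    class LeNet5(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
                nn.Tanh(),
                nn.AvgPool2d(2),                  # -> 14x14x6
                nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16
                nn.Tanh(),
                nn.AvgPool2d(2),                  # -> 5x5x16
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
                nn.Linear(120, 84), nn.Tanh(),
                nn.Linear(84, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    logits = LeNet5()(torch.randn(1, 1, 32, 32))   # output shape: (1, 10)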
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ Example: LeNet

111
4. Artificial Neural Networks

❖ CNN - Convolutional Neural Network


➢ Example: LeNet

112
