
COMPUTER VISION

LECTURE I
INTRODUCTION AND OVERVIEW

DR. GEORGE KARRAZ, Ph. D.


This lecture is an overview of the ideas and
techniques to be covered during the course.



Outline
Computer Vision
• Definition
• Branches
• Computer Vision among Related Fields
• Computer Vision Prerequisites
• Applications
Course Topics
Examples of Completed Works
References
Computer Vision Definition
Computer vision is a field of artificial intelligence that
trains computers to interpret and understand the visual
world. Using digital images from cameras and videos
and deep learning models, machines can accurately
identify and classify objects and then react to what
they “see”.

(Figure: the relationship between computer vision, artificial intelligence, and machine learning.)
Computer Vision Branches
Sub-domains of computer vision include:
• Scene Reconstruction.
• Object Detection.
• Event Detection.
• Video Tracking.
• Object Recognition.
• Motion Estimation.
• 3D Scene Modeling.
• Image Restoration.



Computer Vision among Related Fields

(Figure: computer vision positioned among its related fields.)
Computer Vision Prerequisites

Good understanding of:

• Linear Algebra.
• Probability Theory and Statistics.
• Digital Image Processing (OpenCV or MATLAB).
• Digital Signal Processing (MATLAB).
• A programming language (Python, C++, …).



Computer Vision Applications
• Self-driving cars (autonomous vehicles).
• Automatic analysis and classification of medical images (X-ray,
CT scan, MRI, PET, etc.), e.g. cancer detection.
• Mineral and oil exploration.
• Space science.
• Surveillance cameras.
• Pedestrian detection.
• Parking occupancy detection.
• Traffic flow analysis and road condition monitoring.
• ….



Course Topics
1. Introduction
2. Digital Signals Processing and Analysis
3. Digital Images
4. Digital Image Processing and Analysis
5. Edge & Structure Extraction
6. Local Image Features
7. Geometric Transformations
8. Face Detection and Recognition
9. Videos
10. Motion Tracking & Estimation
Examples of Completed Works
As an introduction to computer vision, this lecture presents a set of
recent related research examples completed in the last few years, arranged
from newest to oldest (follow the link of each paper for more information):
1- George Karraz, “An Intelligent System to Analyze the Functional
Magnetic Resonance Imaging fMRI”, International Journal of Innovative
Science and Research Technology, 2023, 8 (1):1672-1681.
DOI: https://zenodo.org/record/7814289
2- Rawan Abo Zidan and George Karraz, “Gaussian Pyramid for Nonlinear
Support Vector Machine”, Applied Computational Intelligence and Soft
Computing, Hindawi, 2022, Article ID 5255346, pp. 1-9.
DOI: https://www.hindawi.com/journals/acisc/2022/5255346/
Examples of Completed Works, cont..
3- George Karraz, “Effect of Adaptive Line Enhancement Filters on Noise
Cancellation in ECG Signals”, Serbian Journal of Electrical Engineering,
2021, 18(3):291-302.
DOI: http://www.doiserbia.nb.rs/Article.aspx?ID=1451-48692103291K
4- Haneen Shhadeh, Wesam Bachir and George Karraz, “A Sensitive Fibre
Optic Probe for Autofluorescence Spectroscopy of Oral Tongue Cancer:
Monte Carlo Simulation Study”, BioMed Research International,
Hindawi, 2020, Article ID 1936570, pp. 1-11.
DOI: https://doi.org/10.1155/2020/1936570
5- Kawthar M. K. Alghourani, Wesam Bachir and George Karraz, “Effect of
Absorption and Scattering on Fluorescence of Buried Tumours”, Journal
of Spectroscopy, Hindawi, 2020, Article ID 8730471, pp. 1-7.
DOI: https://doi.org/10.1155/2020/8730471



Examples of Completed Works, cont..
6- George Karraz, “A Novel Technique to Predict and Detect Lung Cancer
in the Computerized Tomography Images”, Al-Baath University Journal,
Syria, 2019, 41(26):135-155.
DOI: https://zenodo.org/record/7814289
7- George Karraz, Sonia Jalgha and Ali Aji, “Estimation of Porosity using
Artificial Neural Networks in Sazaba Oil Fields”, Journal of Basic
Science, Damascus University, Syria, 2019, 61(2):422-436.
DOI: https://zenodo.org/record/7698234
8- George Karraz, “Intelligent System to Reduce Size and Time of Video
Display”, Tishreen University Journal for Research and Scientific Studies -
Engineering Sciences Series, Syria, 2018, 40(4):31-47.
DOI: https://zenodo.org/record/7698443



References

Textbooks
D. Forsyth, J. Ponce
Computer Vision – A Modern Approach
Prentice Hall, 2002

R. Hartley, A. Zisserman
Multiple View Geometry in Computer Vision
2nd Ed., Cambridge Univ. Press, 2004



Through this course:
We focus on general computer vision techniques and
methodologies that have proven useful in many applications.

THANK YOU!

NEXT: DIGITAL SIGNALS PROCESSING & ANALYSIS

COMPUTER VISION
LECTURE II
DIGITAL SIGNALS PROCESSING & ANALYSIS

DR. GEORGE KARRAZ, Ph. D.


Contents:
1. Waveforms and Sampling Theorem.
2. Digital Signal (Audio) Processing.

Waveforms and Sampling Theorem:
• Frequency is the number of cycles per second and is measured in Hertz (Hz).
• Wavelength is inversely proportional to frequency, i.e. wavelength
varies as 1/frequency. (Figure: simple waveforms.)
• The general form of a sampled sine wave is:
y[n] = A sin(2π n Fw / Fs)
where Fs is the sample frequency, n is the sample index, and Fw is the
frequency of the waveform.
• Fs must be ≥ 2 · max(Fw) (Nyquist sampling theorem).
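As a minimal MATLAB sketch of this formula (the 440 Hz tone and the 8 kHz
sample rate are example values, not from the slides):

% Synthesize one second of a 440 Hz sine sampled at 8 kHz.
Fs = 8000;                  % sample frequency (Hz); must satisfy Fs >= 2*Fw
Fw = 440;                   % waveform frequency (Hz)
A = 1;                      % amplitude
n = 0:Fs-1;                 % sample indices for one second
y = A * sin(2*pi*n*Fw/Fs);
plot(n(1:100), y(1:100));   % plot the first 100 samples
% sound(y, Fs);             % uncomment to play the tone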



Digital Signal Processing:

• The Decibel (dB): when referring to measurements of power or
intensity, we express these in decibels (dB):
• XdB = 10 log10(X/X0)
• X is the actual value of the quantity being measured.
X0 is a specified or implied reference level.
XdB is the quantity expressed in units of decibels, relative to X0.
• X and X0 must have the same dimensions; they must measure the
same type of quantity in the same units.
• The reference level itself is always at 0 dB, as shown by setting X =
X0 (note: log10(1) = 0).
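A quick MATLAB check of this definition (the reference level and measured
values are arbitrary example numbers):

X0 = 1e-3;                  % reference level (example value)
X = [1e-3 2e-3 0.5e-3];     % measured values
XdB = 10 * log10(X / X0)    % -> [0  3.0103  -3.0103] dB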



Digital Signal Processing:

• Why Use Decibel Scales?
1- When there is a large range in frequency or magnitude, logarithmic
units are often used.
2- If X is greater than X0, then XdB is positive (power increase).
3- If X is less than X0, then XdB is negative (power decrease).
4- Power magnitude = |X(i)|², so (with respect to the reference level)
XdB = 10 log10(|X(i)|²) = 20 log10(|X(i)|), which is an expression of dB
we often come across.



Digital Signal Processing:
• Why Use Decibel Scales?
5- dB is commonly used to quantify sound levels relative to some 0 dB reference.
6- The reference level is typically set at the threshold of human perception.
7- The human ear is capable of detecting a very large range of sound pressures.
8- The ratio of the sound pressure that causes permanent damage from short
exposure to the limit that (undamaged) ears can hear is above a million, so
120 dB is the quoted threshold of pain for humans.
9- Maximum human sensitivity occurs at frequencies between 2 and 4 kHz (speech).



Digital Signal Processing:

• Signal to Noise: the signal-to-noise ratio is the power ratio between
a signal (meaningful information) and the background noise:

SNR = P_signal / P_noise = (A_signal / A_noise)²

SNR_dB = 20 log10(A_signal / A_noise)
Both signal and noise power (or amplitude) must be measured at the same
or equivalent points in a system, and within the same system bandwidth.
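A small MATLAB sketch of this definition (the sine signal and the noise
level are synthetic, chosen only for illustration):

Fs = 8000; n = 0:Fs-1;
signal = sin(2*pi*440*n/Fs);                % synthetic sine signal
noise = 0.1 * randn(size(signal));          % synthetic Gaussian noise
SNR = mean(signal.^2) / mean(noise.^2);     % power ratio
SNR_dB = 10 * log10(SNR)                    % = 20*log10 of the amplitude
                                            %   ratio; ~17 dB here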



Digital Signal Processing:

• Algorithms and Signal Flow Graphs: it is common to represent digital
signal processing routines as visual signal flow graphs, together with a
simple difference equation that describes the algorithm.
Three Basic Building Blocks
We will need to consider three processes (combined in the sketch below):
1- Delay
2- Multiplication
3- Summation
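As a hedged sketch of how these building blocks combine, consider the
difference equation y[n] = a·x[n] + b·x[n-1]: one delay, two
multiplications, and one summation (coefficients and input are arbitrary
example values):

% y[n] = a*x[n] + b*x[n-1]: delay, multiply, and sum.
a = 0.5; b = 0.5;               % example coefficients
x = [1 2 3 4 5];                % example input sequence
y = zeros(size(x));
xprev = 0;                      % the delayed sample x[n-1], initially zero
for n = 1:length(x)
    y(n) = a*x(n) + b*xprev;    % multiplications and summation
    xprev = x(n);               % update the delay element
end
y                               % -> [0.5 1.5 2.5 3.5 4.5]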



Digital Signal Processing:

• Signal Flow Graphs (Delay): (figure)

• Signal Flow Graphs (Multiplication): (figure)

• Signal Flow Graphs (Summation): (figure)


Digital Signal Processing:

• Signal Flow Graphs: we can combine all of the above building blocks
to build up more complex algorithms. (figures)



THANK YOU!

NEXT: IMAGES

COMPUTER VISION
LECTURE III
IMAGES

DR. GEORGE KARRAZ, Ph. D.


Outline

• Still Images
• Vector Drawing
• Bitmaps
• Popular File Formats



Still Images

 Still images are generated by the computer in two ways :


 bitmaps (paint graphics, raster images).
 Vector-drawn graphics.
 Bitmaps
 photo-realistic images, complex drawings, fine detail
 Vector-drawn objects
 lines, boxes, circles, polygons, and other graphic
shapes,
 mathematically expressed in angles, coordinates, and distances.



Vector Drawing
• Vector-drawn objects → lines, rectangles, ovals, polygons,
complex drawings created from those objects, and text.
• Computer-aided design (CAD) programs (like AutoCAD, …) use
vector-drawn object systems to create the highly complex and
geometric renderings needed by architects and engineers.
• Graphic artists designing for print media use vector-drawn objects.
• Programs for 3-D animation also use vector-drawn graphics.
• Mathematical mapping is used for rotation, translation, …



How Vector Drawing Works
• A vector is a line that is described by the location of its two
endpoints.
• Vector drawing uses Cartesian coordinates (x, y, z).
• <object, principal attributes, options>
• <line x1="0" y1="0" x2="200" y2="100"/>
• <rect x="0" y="0" width="200" height="100" fill="#FFFFFF"/>
• <circle cx="50" cy="50" r="10" fill="none" />

• Scalable Vector Graphics (.svg) files:
• SVG files can be saved in a small amount of memory, and they are
scalable without distortion.
• Vector drawing tools use Bézier curves or paths to mathematically
represent a curve → a curve with handles (points on the path).
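A minimal, complete .svg file built from the three elements above (the
canvas size and the added stroke attributes are illustrative assumptions):

<svg xmlns="http://www.w3.org/2000/svg" width="220" height="120">
  <rect x="0" y="0" width="200" height="100" fill="#FFFFFF"/>
  <line x1="0" y1="0" x2="200" y2="100" stroke="black"/>
  <circle cx="50" cy="50" r="10" fill="none" stroke="black"/>
</svg>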



Vector Drawing

• Low memory footprint
• Faster download
• Same quality at different resolutions (no pixelation)
• Refresh time grows with the number of drawn objects



Bitmap vs. Vector



Bitmap in Brief
• Images are broken up into a grid and recorded as a sequence of
individual pixels
• Works well for complex variations
• Not flexible
• Might need a lot of memory
• Problems when scaled
• Formats: BMP, GIF, JPEG, TIFF



Bitmap Images as Matrices
• Still pictures are (when uncompressed) represented as a bitmap (a grid
of pixels).



Bitmaps
• Bitmap: The two-dimensional array of pixel values that represents the
graphics/image data.

• A bit is the simplest element in the digital world: 0,1 … on, off … true, false.
• A map is a two-dimensional matrix of these bits.
• A bitmap, then, is a simple matrix of the tiny dots that form an image,
displayed on a computer screen or printed.
• A one-dimensional matrix (1-bit depth) is used to display monochrome
images:
• a bitmap where each bit is most commonly set to black or white.
• Picture elements (pixels):
• 1-bit bitmap,
• N-bit bitmap for varying shades of color.



Bitmaps Types

• Binary images: 1-bit Images

• Gray-level Images: 8-bit per pixel

• Color Images: The most common data types for graphics and image
file formats
• 24-bit true color and;
• 8-bit pseudo color.



Bitmaps

(Figure: bitmaps at different color depths.)


Bitmaps
• These images show the color depth of bitmaps as described
in the last Figure.
• Note that Images 4 and 6 require the same memory (same
file size), but the gray-scale image is superior. If file size
(download time) is important, you can dither GIF bitmap files
to the lowest color depth that will still provide an acceptable
image.



1-Bit Images
• Each pixel is stored as a single bit (0 or 1), so also referred to as
binary image.
• Such an image is also called a 1-bit monochrome image since it
contains no color.
• Next Fig. shows a 1-bit monochrome image.

Monochrome 1-bit Lena image


8-bit Gray-level Images
• Each pixel has a gray-value between 0 and 255. Each pixel is represented by a
single byte; e.g., a dark pixel might have a value of 10, and a bright one might be
230.
• Bitmap: The two-dimensional array of pixel values that represents the
graphics/image data.
• Image resolution refers to the number of pixels in a digital image (higher
resolution usually yields better quality).
• Fairly high resolution for such an image might be 1,600 x 1,200, whereas
lower resolution might be 640 x 480.
• Each pixel is usually stored as a byte (a value between 0 and 255), so a 640 x
480 grayscale image requires 300 kB of storage (640 x 480 = 307,200 bytes).
• Next Fig. shows grayscale image.

Grayscale image



24-bit Color Images
• In a color 24-bit image, each pixel is represented by three bytes, usually
representing RGB.
• This format supports 256 x 256 x 256 possible combined colors, or a total of
16,777,216 possible colors.
• However such flexibility does result in a storage penalty: A 640 x 480 24-bit
color image would require 921.6 kB of storage without any compression.

• An important point: many 24-bit color images are actually stored as 32-
bit images, with the extra byte of data for each pixel used to store an
alpha value representing special effect information (e.g., transparency).

• Next Figure shows the image forestfire.bmp, a 24-bit image in Microsoft


Windows BMP format. Also shown are the grayscale images for just the
Red, Green, and Blue channels, for this image.
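A short MATLAB sketch of this channel separation (any 24-bit RGB image
will do; forestfire.bmp is the file named above, if available):

im = imread('forestfire.bmp');               % a 24-bit RGB image
R = im(:,:,1); G = im(:,:,2); B = im(:,:,3); % one byte per channel
subplot(2,2,1); imshow(im); title('original');
subplot(2,2,2); imshow(R); title('R channel');
subplot(2,2,3); imshow(G); title('G channel');
subplot(2,2,4); imshow(B); title('B channel');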



Fig. 3.5: High-resolution color and separate R, G, B color channel images. (a):
Example of 24-bit color image “forestfire.bmp”. (b, c, d): R, G, and B color
channels for this image.



8-bit Color Images

• Many systems can make use of 8 bits of color information (the so-
called “256 colors”) in producing a screen image.

• Such image files use the concept of a lookup table to store color
information.
• Basically, the image stores not color, but instead just a set of bytes, each of
which is actually an index into a table with 3-byte values that specify the
color for a pixel with that lookup table index.



Color Look-up Tables (LUTs)
• The idea used in 8-bit color images is to store only the index, or
code value, for each pixel. Then, e.g., if a pixel stores the value 25,
the meaning is to go to row 25 in a color look-up table (LUT).

Color LUT for 8-bit color images.
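A minimal MATLAB sketch of LUT lookup using an indexed image (the tiny
image and palette are invented for illustration):

ind = uint8([0 1; 2 3]);                % 2x2 image of LUT indices
map = [0 0 0; 1 0 0; 0 1 1; 1 1 1];     % LUT rows: black, red, cyan, white
rgb = ind2rgb(ind, map);                % expand each index to its RGB triple
imshow(rgb, 'InitialMagnification', 'fit');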



Color Look-up Tables
• A Color-picker consists of an array of fairly large blocks of color (or a
semi-continuous range of colors) such that a mouse-click will select
the color indicated.

• In reality, a color-picker displays the palette colors associated with
index values from 0 to 255.
• Next Fig. displays the concept of a color-picker: if the user selects the color
block with index value 2, then the color meant is cyan, with RGB values (0,
255, 255).

• A very simple animation process is possible via simply changing the


color table: this is called color cycling or palette animation.



Color Look-up Tables

• Color-picker for 8-bit color: each block of the color-picker corresponds
to one row of the color LUT.



• (a) shows a 24-bit color image of “Lena”,
• (b) shows the same image reduced to only 5 bits via dithering.
• A detail of the left eye is shown in (c).




Popular File Formats

• 8-bit GIF : one of the most important formats because of its historical
connection to the WWW and HTML markup language as the first
image type recognized by net browsers.

• JPEG: currently the most important common file format.



GIF
• GIF standard: (We examine GIF standard because it is so simple! yet
contains many common elements.)
• Limited to 8-bit (256) color images only, which, while producing
acceptable color images, is best suited for images with few distinctive
colors (e.g., graphics or drawing).
• GIF standard supports interlacing — successive display of pixels in
widely-spaced rows by a 4-pass display process.
• GIF actually comes in two flavors:
• 1. GIF87a: The original specification.
• 2. GIF89a: The later version. Supports simple animation via a Graphics
Control Extension block in the data, provides simple control over delay time,
a transparency index, etc.



JPEG

• JPEG: the most important current standard for image compression.
• The human vision system has some specific limitations and
JPEG takes advantage of these to achieve high rates of
compression.
• JPEG allows the user to set a desired level of quality, or
compression ratio (input divided by output).



PNG

• PNG format: standing for Portable Network Graphics — meant to
supersede the GIF standard, and extends it in important ways.
• Special features of PNG files include:
1. Support for up to 48 bits of color information — a large increase.
2. Files may contain gamma-correction information for correct display of
color images, as well as alpha-channel information for such uses as
control of transparency.
3. The display progressively displays pixels in a 2-dimensional fashion by
showing a few pixels at a time over seven passes through each 8 x 8
block of an image.



TIFF

• TIFF: stands for Tagged Image File Format.


• The support for attachment of additional information (referred to as
“tags”) provides a great deal of flexibility.
1. The most important tag is a format signifier: what type of compression
etc. is in use in the stored image.
2. TIFF can store many different types of image: 1-bit, grayscale, 8-bit color,
24-bit RGB, etc.
3. TIFF was originally a lossless format but now a new JPEG tag allows one
to opt for JPEG compression.
4. The TIFF format was developed by the Aldus Corporation in the 1980s
and was later supported by Microsoft.



THANK YOU!
NEXT: DIGITAL IMAGE PROCESSING & ANALYSIS



COMPUTER VISION
LECTURE IV
DIGITAL IMAGE PROCESSING AND ANALYSIS

DR. GEORGE KARRAZ, Ph. D.


Topics of This Lecture
 Common Types of Noise
 Linear filters
 What are they? How are they applied?
 Application: smoothing
 Gaussian filter
 What does it mean to filter an image?
 Nonlinear Filters
 Median filter

 Multi-Scale representations
 How to properly rescale an image?
 Image derivatives
 How to compute gradients robustly?



Common Types of Noise
 Salt & pepper noise
 Random occurrences of
black and white pixels

 Impulse noise
 Random occurrences of
white pixels

 Gaussian noise
 Variations in intensity drawn
from a Gaussian (“Normal”)
distribution.

 Basic Assumption
 Noise is i.i.d. (independent &
identically distributed)



Gaussian Noise

>> sigma = 0.1;                     % noise standard deviation (example value)
>> im = im2double(im);              % convert to double so noise can be added
>> noise = randn(size(im)).*sigma;
>> output = im + noise;
First Attempt at a Solution
 Assumptions:
 Expect pixels to be like their neighbors
 Expect noise processes to be independent from pixel to pixel
(“i.i.d. = independent, identically distributed”)

 Let’s try to replace each pixel with an average of all the values in
its neighborhood…



Moving Average in 2D

Slide a 3 x 3 window over the input image F and write the average of the
window into the output G (shown for the 8 x 8 interior):

Input F (10 x 10):               Output G:

0  0  0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0  0  0     0  10 20 30 30 30 20 10
0  0  0 90 90 90 90 90  0  0     0  20 40 60 60 60 40 20
0  0  0 90 90 90 90 90  0  0     0  30 60 90 90 90 60 30
0  0  0 90 90 90 90 90  0  0     0  30 50 80 80 90 60 30
0  0  0 90  0 90 90 90  0  0     0  30 50 80 80 90 60 30
0  0  0 90 90 90 90 90  0  0     0  20 30 50 50 60 40 20
0  0  0  0  0  0  0  0  0  0     10 20 30 30 30 30 20 10
0  0 90  0  0  0  0  0  0  0     10 10 10  0  0  0  0  0
0  0  0  0  0  0  0  0  0  0


Correlation Filtering
 Say the averaging window size is 2k+1 x 2k+1:

G[i,j] = 1/(2k+1)² · Σ(u=-k..k) Σ(v=-k..k) F[i+u, j+v]

The factor 1/(2k+1)² attributes uniform weight to each pixel; the double
sum loops over all pixels in the neighborhood around image pixel F[i,j].

 Now generalize to allow different weights depending on a neighboring
pixel’s relative position:

G[i,j] = Σ(u=-k..k) Σ(v=-k..k) H[u,v] F[i+u, j+v]

with non-uniform weights H[u,v].
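A minimal MATLAB sketch of the correlation formula above (a 3x3 box
filter, k = 1; the random test image is arbitrary):

F = 90 * double(rand(10) > 0.7);       % example 10x10 image
k = 1; H = ones(2*k+1) / (2*k+1)^2;    % uniform box-filter weights
G = zeros(size(F));
Fp = padarray(F, [k k], 0);            % zero-pad the borders
for i = 1:size(F,1)
    for j = 1:size(F,2)
        patch = Fp(i:i+2*k, j:j+2*k);  % neighborhood around F(i,j)
        G(i,j) = sum(sum(H .* patch)); % weighted combination
    end
end
% Equivalent built-in call: G = imfilter(F, H);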
Correlation Filtering

 This is called cross-correlation, denoted G = H ⊗ F.

 Filtering an image
 Replace each pixel by a weighted combination of its neighbors.
 The filter “kernel” or “mask” H is the prescription for the weights
in the linear combination.

(Figure: kernel H swept over image F, from (0,0) to (N,N).)


Convolution
 Convolution:
 Flip the filter in both dimensions (bottom to top, right to left)
 Then apply cross-correlation

G[i,j] = Σ(u,v) H[u,v] F[i−u, j−v],   denoted G = H ∗ F

(Figure: flipped kernel H applied to image F.)


Convolution vs. Correlation
 Correlation: G[i,j] = Σ(u,v) H[u,v] F[i+u, j+v]
 Convolution: G[i,j] = Σ(u,v) H[u,v] F[i−u, j−v]   (note the difference!)

 Note
 If H[−u,−v] = H[u,v] (a symmetric kernel), then correlation = convolution.


Shift Invariant Linear System

 Shift invariant:
 Operator behaves the same everywhere, i.e. the value of the
output depends on the pattern in the image neighborhood,
not the position of the neighborhood.
 Linear:
 Superposition: h * (f1 + f2) = (h * f1) + (h * f2)
 Scaling: h * (k f ) = k (h * f)



Properties of Convolution
 Linear & shift invariant
 Commutative: f ∗ g = g ∗ f
 Associative: (f ∗ g) ∗ h = f ∗ (g ∗ h)
 Often apply several filters in sequence: ((a ∗ b1) ∗ b2) ∗ b3
 This is equivalent to applying one filter: a ∗ (b1 ∗ b2 ∗ b3)
 Identity: f ∗ e = f
 for the unit impulse e = […, 0, 0, 1, 0, 0, …]
 Differentiation: ∂(f ∗ g)/∂x = (∂f/∂x) ∗ g


Averaging Filter
 What values belong in the kernel H[u,v] for the moving average example?
 Uniform weights over the 3 x 3 window:

H = 1/9 ·  1 1 1
           1 1 1
           1 1 1      (“box filter”)

G = H ⊗ F


Smoothing by Averaging

(Figure: the kernel depicted as a box filter: white = high value, black =
low value. Original vs. filtered image; note the “ringing” artifacts!)
Smoothing with a Gaussian

(Figure: original vs. Gaussian-filtered image.)


Gaussian Smoothing

 Gaussian kernel: G_σ(x, y) = 1/(2πσ²) · exp( −(x² + y²) / (2σ²) )

 Rotationally symmetric
 Weights nearby pixels more than distant ones
 This makes sense as ‘probabilistic’ inference about the signal

 A Gaussian gives a good model of a fuzzy blob



Gaussian Smoothing
 What parameters matter here?
 Variance σ² of the Gaussian
 Determines the extent of smoothing

(Figure: σ = 2 with a 30×30 kernel vs. σ = 5 with a 30×30 kernel.)


Gaussian Smoothing
 What parameters matter here?
 Size of the kernel or mask
 The Gaussian function has infinite support, but discrete filters use
finite kernels

(Figure: σ = 5 with a 10×10 kernel vs. σ = 5 with a 30×30 kernel.)

 Rule of thumb: set the filter half-width to about 3σ


Gaussian Smoothing in Matlab

>> hsize = 10;
>> sigma = 5;
>> h = fspecial('gaussian', hsize, sigma);
>> mesh(h);                  % view the kernel as a surface
>> imagesc(h);               % view the kernel as an image
>> outim = imfilter(im, h);  % apply the filter to image im
>> imshow(outim);


Topics of This Lecture
 Linear filters
 What are they? How are they applied?
 Application: smoothing
 Gaussian filter
 What does it mean to filter an image?

 Nonlinear Filters
 Median filter

 Multi-Scale representations
 How to properly rescale an image?
 Image derivatives
 How to compute gradients robustly?



Why Does This Work?
 A small excursion into the Fourier transform to talk
about spatial frequencies…

3 cos(x)
+ 1 cos(3x)
+ 0.8 cos(5x)
+ 0.4 cos(7x)
+ …
The Fourier Transform in Pictures

 A small excursion into the Fourier transform to talk
about spatial frequencies…

(Figure: frequency spectrum, running “high” / “low” / “high”.)

3 cos(x)
+ 1 cos(3x)
+ 0.8 cos(5x)
+ 0.4 cos(7x)
+ …
Fourier Transforms of Important Functions
 Sine and cosine transform to… ?

(Figure: sine and cosine waveforms and their unknown transforms.)


Fourier Transforms of Important Functions
 Sine and cosine transform to “frequency spikes”

(Figure: a sinusoid transforms to a pair of frequency spikes.)

 A Gaussian transforms to… ?


Fourier Transforms of Important Functions
 Sine and cosine transform to “frequency spikes”
 A Gaussian transforms to a Gaussian
 A box filter transforms to… ?
Fourier Transforms of Important Functions
 Sine and cosine transform to “frequency spikes”
 A Gaussian transforms to a Gaussian
 A box filter transforms to a sinc:   sinc(x) = sin(x) / x
 All of this is symmetric!
Effect of Convolution
 Convolving two functions in the image domain corresponds to taking
the product of their transformed versions in the frequency domain:

f ∗ g  ↔  F · G

 This gives us a tool to manipulate image spectra.
 A filter attenuates or enhances certain frequencies through this effect.


Low-Pass vs. High-Pass

(Figure: original image with its low-pass filtered and high-pass filtered
versions. Image source: S. Chenney)
Quiz: What Effect Does This Filter Have?



Sharpening Filter

(Figure: original vs. sharpened image.)

Sharpening filter: accentuates differences with the local average.
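A minimal MATLAB sketch of such a filter (implemented as unsharp masking;
the image and the strength alpha are example choices):

im = im2double(imread('cameraman.tif'));
alpha = 1.0;                                   % sharpening strength
smoothed = imfilter(im, fspecial('average', 3));
out = im + alpha * (im - smoothed);            % accentuate local differences
imshow([im out]);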


Application: High Frequency Emphasis

(Figures: original; high-pass filter result; high-frequency emphasis;
high-frequency emphasis + histogram equalization.)
Topics of This Lecture
 Linear filters
 What are they? How are they applied?
 Application: smoothing
 Gaussian filter
 What does it mean to filter an image?
 Nonlinear Filters
 Median filter

 Multi-Scale representations
 How to properly rescale an image?
 Image derivatives
 How to compute gradients robustly?



Non-Linear Filters: Median Filter
 Basic idea
 Replace each pixel by the
median of its neighbors.

 Properties
 Doesn’t introduce new pixel
values
 Removes spikes: good for
impulse, salt & pepper
noise
 Linear?
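A quick MATLAB comparison on salt & pepper noise (using the toolbox
functions imnoise and medfilt2; the noise density is an example value):

im = im2double(imread('cameraman.tif'));
noisy = imnoise(im, 'salt & pepper', 0.05);  % corrupt 5% of the pixels
med = medfilt2(noisy, [3 3]);                % 3x3 median filter
imshow([noisy med]);                         % spikes removed, edges kept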



Median Filter

(Figure: salt-and-pepper noise vs. the median-filtered result, with plots
of one image row before and after filtering.)
Median Filter

 The Median filter is edge preserving.

Median vs. Gaussian Filtering

(Figure: Gaussian vs. median filtering with 3x3, 5x5, and 7x7 kernels.)
Topics of This Lecture
 Linear filters
 What are they? How are they applied?
 Application: smoothing
 Gaussian filter
 What does it mean to filter an image?

 Nonlinear Filters
 Median filter

 Multi-Scale representations
 How to properly rescale an image?

 Image derivatives
 How to compute gradients robustly?



Motivation: Fast Search Across Scales



Image Pyramid

(Figure: image pyramid, from high resolution at the base to low
resolution at the top.)
How Should We Go About Resampling?

 Let’s resample the checkerboard by taking one sample at each circle.

 In the top-left board, the new representation is reasonable; the top
right also yields a reasonable representation.

 The bottom left is all black (dubious) and the bottom right has checks
that are too big.


Fourier Interpretation: Discrete Sampling
 Sampling in the spatial domain is like multiplying with a
spike function.

 Sampling in the frequency domain is like...



Source: S. Chenney
Fourier Interpretation: Discrete Sampling
 Sampling in the spatial domain is like multiplying with a
spike function.

 Sampling in the frequency domain is like convolving with a


spike function.



Sampling and Aliasing



Sampling and Aliasing

“Nyquist limit”

 Nyquist theorem:
 In order to recover a certain frequency f, we need to sample with at least 2f.
 This corresponds to the point at which the transformed frequency spectra start
to overlap.



Sampling and Aliasing

(Figure: spectra overlapping beyond the “Nyquist limit”.)


Aliasing in Graphics



Resampling with Prior Smoothing

 Note: we cannot recover the high frequencies, but we can avoid
artifacts by smoothing before resampling.


The Gaussian Pyramid

Low resolution
G4 = (G3 * gaussian) ↓2
G3 = (G2 * gaussian) ↓2        (at each level: blur, then subsample by 2)
G2 = (G1 * gaussian) ↓2
G1 = (G0 * gaussian) ↓2
G0 = Image
High resolution
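A minimal MATLAB sketch of this construction (impyramid performs the
Gaussian blur-and-subsample step; the image and level count are examples):

G = cell(1,5);
G{1} = im2double(imread('cameraman.tif'));  % G0 = image
for k = 2:5
    G{k} = impyramid(G{k-1}, 'reduce');     % blur with a Gaussian, subsample by 2
end
montage(G);                                 % view the pyramid levels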
Gaussian Pyramid – Stored Information
All the extra
levels add very
little overhead
for memory or
computation!



Summary: Gaussian Pyramid
 Construction: create each level from the previous one
 Smooth and sample

 Smooth with Gaussians, in part because
 Gaussian ∗ Gaussian = another Gaussian
 G(σ1) ∗ G(σ2) = G(√(σ1² + σ2²))

 Gaussians are low-pass filters, so the representation is redundant
once smoothing has been performed.
 There is no need to store smoothed images at the full original
resolution.


The Laplacian Pyramid

Construction from the Gaussian pyramid G0 … Gn:
Li = Gi − expand(Gi+1),   Ln = Gn
Reconstruction:  Gi = Li + expand(Gi+1)

(Figure: each Laplacian level is the difference between a Gaussian level
and the expanded next-coarser level.)

Why is this useful?
Laplacian ≈ Difference of Gaussians (DoG): a cheap approximation, with
no derivatives needed.


Topics of This Lecture
 Linear filters
 What are they? How are they applied?
 Application: smoothing
 Gaussian filter
 What does it mean to filter an image?

 Nonlinear Filters
 Median filter

 Multi-Scale representations
 How to properly rescale an image?
 Image derivatives
 How to compute gradients robustly?



Edges and Derivatives…

(Figure: an edge profile with its 1st derivative and 2nd derivative.)


Differentiation and Convolution
 For the 2D function f(x,y), the partial derivative is:

∂f(x,y)/∂x = lim(ε→0) [ f(x+ε, y) − f(x, y) ] / ε

 For discrete data, we can approximate this using finite differences:

∂f(x,y)/∂x ≈ [ f(x+1, y) − f(x, y) ] / 1

 To implement the above as convolution, what would be the
associated filter?


Partial Derivatives of an Image

(Figure: ∂f/∂x and ∂f/∂y derivative images.)

Candidate filters: [-1 1] or its vertical counterpart [-1; 1].
Which result shows changes with respect to x?
Assorted Finite Difference Filters

>> My = fspecial('sobel');
>> outim = imfilter(double(im), My);
>> imagesc(outim);
>> colormap gray;


Image Gradient
 The gradient of an image: ∇f = [ ∂f/∂x , ∂f/∂y ]

 The gradient points in the direction of most rapid intensity change.

 The gradient direction (orientation of the edge normal) is given by:
θ = tan⁻¹( (∂f/∂y) / (∂f/∂x) )

 The edge strength is given by the gradient magnitude:
‖∇f‖ = √( (∂f/∂x)² + (∂f/∂y)² )
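A short MATLAB sketch computing these quantities (imgradientxy wraps the
finite-difference filters discussed above; the image is an example):

im = im2double(imread('cameraman.tif'));
[gx, gy] = imgradientxy(im, 'sobel');  % partial derivatives
mag = sqrt(gx.^2 + gy.^2);             % edge strength (gradient magnitude)
theta = atan2(gy, gx);                 % gradient direction in radians
imshow(mag, []);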



Effect of Noise

 Consider a single row or column of the image.
 Plotting intensity as a function of position gives a signal.

(Figure: a noisy 1D intensity signal. Where is the edge?)
Solution: Smooth First

(Figure: signal f, Gaussian h, smoothed signal h ∗ f, and its derivative
d/dx (h ∗ f). Where is the edge? Look for peaks in d/dx (h ∗ f).)
Derivative Theorem of Convolution

 Differentiation property of convolution: d/dx (h ∗ f) = (dh/dx) ∗ f
 This saves one operation: smooth and differentiate with a single filter.


Derivative of Gaussian Filter

(I ⊗ g) ⊗ h = I ⊗ (g ⊗ h)

g (5×5 Gaussian):                            h = [1  −1]

0.0030 0.0133 0.0219 0.0133 0.0030
0.0133 0.0596 0.0983 0.0596 0.0133
0.0219 0.0983 0.1621 0.0983 0.0219
0.0133 0.0596 0.0983 0.0596 0.0133
0.0030 0.0133 0.0219 0.0133 0.0030

Why is this preferable?


Derivative of Gaussian Filters

(Figure: derivative-of-Gaussian kernels for the x-direction and y-direction.)


Source: Svetlana Lazebnik
Laplacian of Gaussian (LoG)

 Consider the second derivative d²/dx² (h ∗ f).

(Figure: signal, LoG kernel, and response. Where is the edge?
At the zero-crossings of the bottom graph.)
Summary: 2D Edge Detection Filters

(Figure: Gaussian; derivative of Gaussian; Laplacian of Gaussian.)

∇² is the Laplacian operator:  ∇²f = ∂²f/∂x² + ∂²f/∂y²


Note: Filters are Templates
 Applying a filter at some point can be seen as taking a dot product
between the image and some vector.
 Filtering the image is a set of dot products.

 Insight
 Filters look like the effects they are intended to find.
 Filters find effects they look like.


Where’s Waldo?

(Figures: the Waldo template; the scene; the detected template and the
correlation map.)


Correlation as Template Matching
 Think of filters as a dot product of the filter vector with the
image region.
 Now measure the angle between the vectors:

a · b = |a| |b| cos θ,   so   cos θ = (a · b) / (|a| |b|)

 The angle (similarity) between vectors can be measured by normalizing
the length of each vector to 1.

(Figure: template a and image region b interpreted as vectors.)
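MATLAB’s normxcorr2 computes exactly this normalized dot product at every
position; a short sketch (the template crop coordinates are arbitrary):

im = im2double(imread('cameraman.tif'));
T = im(60:100, 80:120);           % crop an arbitrary patch as the template
C = normxcorr2(T, im);            % normalized cross-correlation map
[y, x] = find(C == max(C(:)));    % the peak marks the best match
imshow(C, []);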


Summary: Mask Properties
 Smoothing
 Values positive
 Sum to 1 → constant regions stay the same as the input
 Amount of smoothing proportional to mask size
 Removes “high-frequency” components; “low-pass” filter

 Derivatives
 Opposite signs used to get a high response in regions of high contrast
 Sum to 0 → no response in constant regions
 High absolute value at points of high contrast

 Filters act as templates
• Highest response for regions that “look the most like the filter”
• Dot product as correlation


Summary: Linear Filters

• Linear filtering:
➢ Form a new image whose pixels are a weighted sum of the
original pixel values.
• Properties
➢ The output is a shift-invariant function of the input (same at
each image location).

• Examples:
• Smoothing with a box filter
• Smoothing with a Gaussian
• Finding a derivative
• Searching for a template

• Pyramid representations
• Important for describing and searching an image at all scales


THANK YOU!

NEXT: EDGE & STRUCTURE EXTRACTION

COMPUTER VISION
LECTURE V
EDGE & STRUCTURE EXTRACTION

DR. GEORGE KARRAZ, Ph. D.
Course Outline
• Image Processing Basics
➢ Image Formation
➢ Binary Image Processing
➢ Linear Filters
➢ Edge & Structure Extraction
➢ Color

• Segmentation
• Local Features & Matching
• Object Recognition and Categorization
• 3D Reconstruction
• Motion and Tracking
Recap: Gaussian Smoothing
• Gaussian kernel: G_σ(x, y) = 1/(2πσ²) · exp( −(x² + y²) / (2σ²) )

• Rotationally symmetric
• Weights nearby pixels more than distant ones
➢ This makes sense as ‘probabilistic’ inference about the signal

• A Gaussian gives a good model of a fuzzy blob
Smoothing with a Gaussian
Parameter σ is the “scale” / “width” / “spread” of the
Gaussian kernel, and controls the amount of smoothing.

fsize = 25;   % kernel size in pixels (example value; not set on the slide)
for sigma = 1:3:10
    h = fspecial('gaussian', fsize, sigma);
    out = imfilter(im, h);
    imshow(out);
    pause;
end
Recap: Derivatives and Edges…

(Figure: an edge profile with its 1st derivative and 2nd derivative.)
Recap: 2D Edge Detection Filters

(Figure: Gaussian; derivative of Gaussian; Laplacian of Gaussian.)

• ∇² is the Laplacian operator:  ∇²f = ∂²f/∂x² + ∂²f/∂y²
Topics of This Lecture
• Edge detection
➢ Recap: Gradients, scale influence
➢ Canny edge detector

• Fitting as template matching


➢ Distance transform
➢ Chamfer matching
➢ Application: traffic sign detection

• Fitting as parametric search


➢ Line detection
➢ Hough transform
➢ Extension to circles
➢ Generalized Hough transform
Edge Detection
• Goal: map image from 2D array of pixels to a set of
curves or line segments or contours.
• Why?

• Main idea: look for strong gradients, post-process

What Can Cause an Edge?

• Reflectance change: appearance information, texture
• Depth discontinuity: object boundary
• Cast shadows
• Change in surface orientation: shape
Contrast and Invariance

Recall: Images as Functions

Edges look like steep cliffs


Gradients → Edges

Primary edge detection steps:
1. Smoothing: suppress noise
2. Edge enhancement: filter for contrast
3. Edge localization
➢ Determine which local maxima from the filter output are actually
edges vs. noise
➢ Thresholding, thinning
Effect of  on Derivatives

σ = 1 pixel σ = 3 pixels

• The apparent structures differ depending on Gaussian’s


scale parameter.

 Larger values: larger scale edges detected


 Smaller values: finer features detected

14
DR. GEORGE KARRAZ, Ph. D.
So, What Scale to Choose?
• It depends on what we’re looking for…

• Too fine a scale… can’t see the forest for the trees.
• Too coarse a scale… can’t tell the maple from the cherry.
Recall: Thresholding
• Choose a threshold t.
• Set any pixels less than t to zero (off).
• Set any pixels greater than or equal to t to one (on).

FT[i,j] = { 1, if F[i,j] ≥ t
          { 0, otherwise


(Figures: original image; gradient magnitude image; thresholding with a
lower threshold; thresholding with a higher threshold.)
Designing an Edge Detector
• Criteria for an “optimal” edge detector:
➢ Good detection: the optimal detector must minimize the
probability of false positives (detecting spurious edges caused by
noise), as well as that of false negatives (missing real edges)
➢ Good localization: the edges detected must be as close as
possible to the true edges
➢ Single response: the detector must return one point only for
each true edge point; that is, minimize the number of local
maxima around the true edge

Canny Edge Detector
• This is probably the most widely used edge detector in
computer vision
• Theoretical model: step-edges corrupted by additive
Gaussian noise
• Canny has shown that the first derivative of the
Gaussian closely approximates the operator that
optimizes the product of signal-to-noise ratio and
localization

Canny Edge Detector
• Filter image with derivative of Gaussian
• Find magnitude and orientation of gradient
• Non-maximum suppression:
➢ Thin multi-pixel wide “ridges” down to single pixel width
• Linking and thresholding (hysteresis):
➢ Define two thresholds: low and high
➢ Use the high threshold to start edge curves and the low
threshold to continue them

• MATLAB:
>> E = edge(im, 'canny');   % im: a grayscale image
>> help edge
The Canny Edge Detector

(Figures: original image (Lena); norm of the gradient; after thresholding;
the gradient regions are still thick. How do we turn these thick regions
of the gradient into curves?)
Non-Maximum Suppression

• Check whether a pixel is a local maximum along its gradient direction,
selecting the single maximum across the width of the edge
➢ requires checking the interpolated pixels p and r
The Canny Edge Detector

(Figure: result after thinning (non-maximum suppression). Problem:
pixels along this edge didn’t survive the thresholding.)
Hysteresis Thresholding
• Hysteresis: a lag or momentum factor
• Idea: maintain two thresholds k_high and k_low
➢ Use k_high to find strong edges that start an edge chain
➢ Use k_low to find weak edges that continue an edge chain

• The typical ratio of thresholds is roughly k_high / k_low = 2
Hysteresis Thresholding

(Figures, courtesy of G. Loy: original image; high threshold (strong
edges); low threshold (weak edges); hysteresis threshold.)
Object Boundaries vs. Edges

(Figures: spurious edges caused by background, texture, and shadows.)
Edge Detection is Just the Beginning…

(Figure: image; human segmentation; gradient magnitude.)
Fitting
• We want to associate a model with observed features.

(Figure: for example, the model could be a line, a circle, or an
arbitrary shape.)
Topics of This Lecture
• Edge detection
➢ Recap: Gradients, scale influence
➢ Canny edge detector

• Fitting as template matching


➢ Distance transform
➢ Chamfer matching
➢ Application: traffic sign detection

• Fitting as parametric search


➢ Line detection
➢ Hough transform
➢ Extension to circles
➢ Generalized Hough transform
Fitting as Template Matching
• We’ve already seen that correlation filtering can be used for
template matching in an image.

• Let’s try this idea with “edge templates”.
➢ Example: traffic sign detection in (gray-value) video.

(Figure: sign templates.)
How Can This Be Made Efficient?
• Fast edge-based template matching
➢ Distance transform of the edge image

(Figure: original image; gradient; edges; distance transform.)

The value at (x,y) tells how far that position is from the nearest edge
point (or other binary image structure).
>> help bwdist
Distance Transform
• An image reflecting the distance to the nearest point in a point set
(e.g., edge pixels, or foreground pixels).

(Figure: distance transforms under 4-connected vs. 8-connected adjacency.)
Distance Transform Algorithm (1D)
• Two-pass O(n) algorithm for the 1D L1 norm
1. Initialize: for all j
➢ D[j] ← 0 if j is in P, ∞ otherwise

2. Forward: for j from 1 up to n-1
➢ D[j] ← min( D[j], D[j-1]+1 )

3. Backward: for j from n-2 down to 0
➢ D[j] ← min( D[j], D[j+1]+1 )
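A direct MATLAB transcription of the two-pass algorithm (1-based indexing;
the point set P is an arbitrary example):

P = logical([0 0 1 0 0 0 0 1 0 0]);   % example point set
n = numel(P);
D = inf(1, n); D(P) = 0;              % initialize
for j = 2:n                           % forward pass
    D(j) = min(D(j), D(j-1) + 1);
end
for j = n-1:-1:1                      % backward pass
    D(j) = min(D(j), D(j+1) + 1);
end
D                                     % -> [2 1 0 1 2 2 1 0 1 2]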


Distance Transform Algorithm (2D)
• 2D case analogous to 1D
➢ Initialization
➢ Forward and backward pass
– Fwd pass finds closest above and to the left
– Bwd pass finds closest below and to the right

Chamfer Matching
• Chamfer distance
➢ Average distance to the nearest feature

➢ This can be computed efficiently by correlating the edge template
with the distance-transformed image

(Figure: edge image and distance transform image.)
Chamfer Matching
• Efficient implementation
➢ Instead of correlation, sample a fixed number of points on the
template contour.
➢ The chamfer score then boils down to a series of DT lookups.
➢ The computational effort is independent of scale.

(Figure: edge image and distance transform image.)
Chamfer Matching Results

(Figure: matches found on the edge image / distance transform image.)
Chamfer Matching for Pedestrian Detection
• Organize templates in tree structure for fast matching

Summary: Chamfer Matching
• Pros
➢ Fast and simple method for matching edge-based templates.
➢ Works well for matching upright shapes with little intra-class
variation.
➢ Good method for finding candidate matches in a longer
recognition pipeline.

• Cons
➢ The chamfer score averages over the entire contour, so it is not
very discriminative in practice.
→ Further verification is needed.
➢ Low matching cost in cluttered regions with many edges.
→ Many false positive detections.
➢ To detect rotated & rescaled shapes, we need to match with
rotated & rescaled templates → can get very expensive.
Topics of This Lecture
• Edge detection
➢ Recap: Gradients, scale influence
➢ Canny edge detector

• Fitting as template matching


➢ Distance transform
➢ Chamfer matching
➢ Application: traffic sign detection

• Fitting as parametric search


➢ Line detection
➢ Hough transform
➢ Extension to circles
➢ Generalized Hough transform
Fitting as Search in Parametric Space
• Choose a parametric model to represent a set of
features
• Membership criterion is not local
➢ Can’t tell whether a point belongs to a given model just by
looking at that point.
• Three main questions:
➢ What model represents this set of features best?
➢ Which of several model instances gets which feature?
➢ How many model instances are there?
• Computational complexity is important
➢ It is infeasible to examine every possible set of parameters and
every possible combination of features

Example: Line Fitting
• Why fit lines?
Many objects characterized by presence of straight lines

• Wait, why aren’t we done just by running edge detection?


Difficulty of Line Fitting

• Extra edge points (clutter), multiple models:
➢ Which points go with which line, if any?

• Only some parts of each line are detected, and some parts are
missing:
➢ How to find a line that bridges the missing evidence?

• Noise in measured edge points and orientations:
➢ How to detect the true underlying parameters?
Voting
• It’s not feasible to check all combinations of features by
fitting a model to each possible subset.
• Voting is a general technique where we let the features
vote for all models that are compatible with it.
➢ Cycle through features, cast votes for model parameters.
➢ Look for model parameters that receive a lot of votes.
• Noise & clutter features will cast votes too, but typically
their votes should be inconsistent with the majority of
“good” features.
• Ok if some features not observed, as model can span
multiple fragments.

Fitting Lines
• Given points that belong to a line, what is the line?
• How many lines are there?
• Which points belong to which lines?

• The Hough Transform is a voting technique that can be used to
answer all of these questions.
• Main idea:
1. Record all possible lines on which each edge point lies.
2. Look for lines that get many votes.
Finding Lines in an Image: Hough Space

(Figure: a line in image space (x,y) corresponds to the point (m0, b0) in
Hough parameter space (m,b).)

• Connection between image (x,y) and Hough (m,b) spaces
➢ A line in the image corresponds to a point in Hough space.
➢ To go from image space to Hough space:
– Given a set of points (x,y), find all (m,b) such that y = mx + b
Finding Lines in an Image: Hough Space

(Figure: a point (x0, y0) in image space maps to a line in Hough space.)

➢ What does a point (x0, y0) in the image space map to?
– Answer: the solutions of b = −x0·m + y0
– This is a line in Hough space.
Finding Lines in an Image: Hough Space

(Figure: points (x0, y0) and (x1, y1) with their Hough-space lines
b = −x0·m + y0 and b = −x1·m + y1.)

• What are the line parameters for the line that contains both (x0, y0)
and (x1, y1)?
➢ It is the intersection of the lines b = −x0·m + y0 and b = −x1·m + y1.
Finding Lines in an Image: Hough Space

• How can we use this to find the most likely parameters (m,b) for the
most prominent line in image space?
➢ Let each edge point in image space vote for a set of possible
parameters in Hough space.
➢ Accumulate the votes in a discrete set of bins; the parameters with
the most votes indicate the line in image space.
Polar Representation for Lines
• Issues with the usual (m,b) parameter space: the parameters can take
on infinite values, and vertical lines are undefined.

d : perpendicular distance from the line to the origin [0,0]
θ : angle the perpendicular makes with the x-axis

x cos θ − y sin θ = d

• A point in image space maps to a sinusoid segment in Hough space.
Hough Transform Algorithm
Using the polar parameterization x cos θ − y sin θ = d, with
H: accumulator array (votes).

Basic Hough transform algorithm:
1. Initialize H[d,θ] = 0.
2. For each edge point (x,y) in the image
       for θ = 0 to 180        // some quantization
           d = x cos θ − y sin θ
           H[d,θ] += 1
3. Find the value(s) of (d,θ) where H[d,θ] is maximum.
4. The detected line in the image is given by d = x cos θ − y sin θ.

• Time complexity (in terms of number of votes)?
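MATLAB’s hough / houghpeaks / houghlines functions implement this
pipeline; a short sketch (circuit.tif ships with the Image Processing
Toolbox):

im = imread('circuit.tif');
E = edge(im, 'canny');
[H, theta, rho] = hough(E);         % accumulator array of votes
peaks = houghpeaks(H, 5);           % the 5 strongest (rho, theta) bins
lines = houghlines(E, theta, rho, peaks);
imshow(im); hold on;
for k = 1:numel(lines)
    xy = [lines(k).point1; lines(k).point2];
    plot(xy(:,1), xy(:,2), 'LineWidth', 2);
end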
Example: HT for Straight Lines

(Figure: edge coordinates in image space and the votes in (d,θ) space.
Bright value = high vote count; black = no votes.)
Example: HT for Straight Lines

(Figure: a square and its Hough votes.)
Real-World Examples

(Figure: detected lines, showing the longest segments found.)
Impact of Noise on the Hough Transform

(Figure: noisy edge coordinates in image space and the smeared votes in
(d,θ) space.)

What difficulty does this present for an implementation?
Impact of Noise on the Hough Transform

(Figure: random edge coordinates and their votes.)

Here, everything appears to be “noise”, or random edge points, but we
still see peaks in the vote space.
Extensions
Extension 1: Use the image gradient
1. same
2. For each edge point I[x,y] in the image
       θ = gradient orientation at (x,y)
       d = x cos θ − y sin θ
       H[d,θ] += 1
3. same
4. same
(Reduces the degrees of freedom)
Extensions
Extension 1: Use the image gradient
1. same
2. For each edge point I[x,y] in the image
       compute a unique (d,θ) based on the image gradient at (x,y)
       H[d,θ] += 1
3. same
4. same
(Reduces the degrees of freedom)

Extension 2
➢ Give more votes to stronger edges (use the magnitude of the gradient).
Extension 3
➢ Change the sampling of (d,θ) to give more/less resolution.
Extension 4
➢ The same procedure can be used with circles, squares, or any other
shape…
Extension: Cascaded Hough Transform
• Let’s go back to the original (m,b) parametrization
• A line in the image maps to a pencil of lines in the
Hough space
• What do we get with parallel lines or a pencil of lines?
➢ Collinear peaks in the Hough space!
• So we can apply a Hough transform to the output of the
first Hough transform to find vanishing points

Finding Vanishing Points

(Figure: vanishing points found from collinear Hough peaks.)
Cascaded Hough Transform
• Issue: Dealing with the unbounded parameter space

Hough Transform for Circles
• Circle: center (a,b) and radius r
(xi − a)² + (yi − b)² = r²

• For a fixed radius r, unknown gradient direction:

(Figure: each image point votes for a circle of possible centers in (a,b)
Hough space.)
Hough Transform for Circles
• For a fixed radius r, unknown gradient direction:

(Figure: the vote circles intersect; most votes for the center occur at
the intersection.)
Hough Transform for Circles
• For an unknown radius r, unknown gradient direction:

(Figure: each image point votes on a cone in (a,b,r) Hough space.)
Hough Transform for Circles
• For an unknown radius r, known gradient direction:

(Figure: votes restricted to a line in (a,b,r) space along the gradient
direction.)
Hough Transform for Circles
For every edge pixel (x,y):
    For each possible radius value r:
        For each possible gradient direction θ:   // or use the estimated gradient
            a = x − r cos(θ)
            b = y + r sin(θ)
            H[a,b,r] += 1
        end
    end
end
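MATLAB’s imfindcircles implements a circular Hough transform variant; a
quick sketch (the radius range is an example; coins.png ships with the
toolbox):

im = imread('coins.png');
[centers, radii] = imfindcircles(im, [15 30]);  % search radii of 15-30 px
imshow(im); viscircles(centers, radii);         % overlay the detections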
Example: Detecting Circles with Hough

(Figure: the crosshair indicates the result of the Hough transform; the
bounding box was found via motion differencing.)
Example: Detecting Circles with Hough

(Figure: original coin image; edges; votes for the penny radius.)

Note: a different Hough transform (with separate accumulators) was used
for each circle radius (quarters vs. penny).
Example: Detecting Circles with Hough

(Figure: original; edges; votes for the quarter radius; combined
detections.)
Voting: Practical Tips
• Minimize irrelevant tokens first (take edge points with
significant gradient magnitude)
• Choose a good grid / discretization
➢ Too coarse: large votes obtained when too many different lines
correspond to a single bucket
➢ Too fine: miss lines because some points that are not exactly
collinear cast votes for different buckets
• Vote for neighbors, also (smoothing in accumulator
array)
• Utilize direction of edge to reduce free parameters by 1
• To read back which points voted for “winning” peaks,
keep tags on the votes.

Hough Transform: Pros and Cons
Pros
• All points are processed independently, so can cope with
occlusion
• Some robustness to noise: noise points unlikely to
contribute consistently to any single bin
• Can detect multiple instances of a model in a single pass
Cons
• Complexity of search time increases exponentially with
the number of model parameters
• Non-target shapes can produce spurious peaks in
parameter space
• Quantization: hard to pick a good grid size
Generalized Hough Transform
• What if we want to detect arbitrary shapes defined by boundary
points and a reference point?

At each boundary point pi, compute the displacement vector r = a − pi
to the reference point a. For a given model shape, store these vectors
in a table indexed by the gradient orientation θ.

(Figure: model shape with boundary points p1, p2 at orientation θ and
reference point a.)

[Dana H. Ballard, Generalizing the Hough Transform to Detect Arbitrary Shapes, 1980]
Generalized Hough Transform
To detect the model shape in a new image:
• For each edge point
➢ Index into table with its gradient orientation θ

➢ Use retrieved r vectors to vote for position of reference point

• Peak in this Hough space is the reference point with most
supporting edges

Assuming translation is the only transformation here,
i.e., orientation and scale are fixed.
83
DR. GEORGE KARRAZ, Ph. D.
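A minimal sketch of the R-table construction and translation-only voting just described (assuming integer coordinates and gradient orientations in radians; names are illustrative):

import numpy as np
from collections import defaultdict

def build_r_table(boundary_pts, orientations, ref_point, n_bins=36):
    # store displacement vectors r = a - p_i, indexed by quantized gradient orientation
    table = defaultdict(list)
    for (px, py), theta in zip(boundary_pts, orientations):
        b = int(n_bins * (theta % (2 * np.pi)) / (2 * np.pi))
        table[b].append((ref_point[0] - px, ref_point[1] - py))
    return table

def ght_vote(edge_pts, orientations, table, shape, n_bins=36):
    H = np.zeros(shape, dtype=np.int32)
    for (x, y), theta in zip(edge_pts, orientations):
        b = int(n_bins * (theta % (2 * np.pi)) / (2 * np.pi))
        for dx, dy in table.get(b, ()):
            ax, ay = x + dx, y + dy                         # candidate reference point
            if 0 <= ax < shape[0] and 0 <= ay < shape[1]:
                H[ax, ay] += 1
    return H  # the peak is the reference point with most supporting edges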
Example: Generalized Hough Transform
Say we’ve already
stored a table of
displacement vectors
as a function of edge
orientation for this
model shape.

Model shape
DR. GEORGE KARRAZ, Ph. D.
84
Example: Generalized Hough Transform
Now we want to look
at some edge points
detected in a new
image, and vote on
the position of that
shape.

DR. GEORGE KARRAZ, Ph. D. Displacement vectors for model points 85


Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D. Range of voting locations for test point 86
Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D. Range of voting locations for test point 87
Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D.


Votes for points with θ = 88
Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D. Displacement vectors for model points 89


Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D. Range of voting locations for test point 90
Example: Generalized Hough Transform

DR. GEORGE KARRAZ, Ph. D. Votes for points with θ = 91


Application in Recognition
• Instead of indexing displacements by gradient
orientation, index by “visual codeword”.

Visual codeword with


displacement vectors
Training image

92
DR. GEORGE KARRAZ, Ph. D.
Application in Recognition
• Instead of indexing displacements by gradient
orientation, index by “visual codeword”.

Test image

• We’ll hear more about this method in lecture 14…


93
DR. GEORGE KARRAZ, Ph. D.
THANK YOU!

NEXT: LOCAL IMAGE FEATURES

DR. GEORGE KARRAZ, Ph. D.

94
COMPUTER VISION
LECTURE VI
LOCAL IMAGE FEATURES

DR. GEORGE KARRAZ, Ph. D.


Contents
• Overview of Keypoint Matching
• Harris corner detector
• Features in Computer Vision
• SIFT Features
→ Scale Invariant Feature Transform

DR. GEORGE KARRAZ, Ph. D. 2


This section: correspondence and alignment
• Correspondence: matching points, patches,
edges, or regions across images

DR. GEORGE KARRAZ, Ph. D. 3


Overview of Keypoint Matching
1. Find a set of
distinctive key-
points
A1
2. Define a region
around each
A2 A3 keypoint

3. Extract and
normalize the
region content
fA fB
4. Compute a local
descriptor from the
normalized region
d(f_A, f_B) < T
5. Match local
descriptors
DR. GEORGE KARRAZ, Ph. D. 4
Harris corner detector
• Approximate distinctiveness by the local
auto-correlation E(u, v).
• Approximate local auto-correlation by
second moment matrix
• Quantify distinctiveness (or cornerness)
as function of the eigenvalues of the
second moment matrix.
• But we don’t actually need to
compute the eigenvalues: we can use
the determinant and trace of the
second moment matrix instead.
(Figure: ellipse axes scale as (λmax)^(−1/2) and (λmin)^(−1/2).)
DR. GEORGE KARRAZ, Ph. D. 5
DR. GEORGE KARRAZ, Ph. D. 6
Harris Detector [Harris88]
• Second moment matrix
μ(σ_I, σ_D) = g(σ_I) ∗ [ I_x²(σ_D)      I_x·I_y(σ_D) ]
                       [ I_x·I_y(σ_D)   I_y²(σ_D)    ]
1. Image derivatives I_x, I_y
(optionally, blur first)
2. Squares of derivatives: I_x², I_y², I_x·I_y
3. Gaussian filter g(σ_I): g(I_x²), g(I_y²), g(I_x·I_y)
det M = λ₁·λ₂
trace M = λ₁ + λ₂
4. Cornerness function – both eigenvalues are strong:
har = det[μ(σ_I, σ_D)] − α·[trace(μ(σ_I, σ_D))]²
    = g(I_x²)·g(I_y²) − [g(I_x·I_y)]² − α·[g(I_x²) + g(I_y²)]²
5. Non-maxima suppression on har
(a NumPy sketch of these steps follows below)
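A minimal NumPy/SciPy sketch of the five steps above (the σ values and the constant α = 0.04 are typical choices, not mandated by the slides):

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma_d=1.0, sigma_i=2.0, alpha=0.04):
    img = img.astype(float)
    Ix = sobel(gaussian_filter(img, sigma_d), axis=1)       # 1. image derivatives
    Iy = sobel(gaussian_filter(img, sigma_d), axis=0)
    Sxx = gaussian_filter(Ix * Ix, sigma_i)                 # 2.+3. squares, then g(sigma_I)
    Syy = gaussian_filter(Iy * Iy, sigma_i)
    Sxy = gaussian_filter(Ix * Iy, sigma_i)
    det = Sxx * Syy - Sxy ** 2                              # lambda1 * lambda2
    trace = Sxx + Syy                                       # lambda1 + lambda2
    return det - alpha * trace ** 2                         # 4. cornerness (5. NMS still needed)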
Automatic Scale Selection

f ( I i1im ( x,  )) = f ( I i1im ( x,  ))

How to find corresponding patch sizes?

DR. GEORGE KARRAZ, Ph. D. 7


Automatic Scale Selection
• Function responses for increasing scale (scale signature)

f ( I i1im ( x,  )) f ( I i1im ( x,  ))


DR. GEORGE KARRAZ, Ph. D. 8
Features in Computer Vision
• What is a feature?
– Location of sudden change
• Why use features?
– Information content high
– Invariant to change of view point, illumination
– Reduces computational burden

DR. GEORGE KARRAZ, Ph. D. 14


(One Type of) Computer Vision
Image 1

Feature 1
Feature 2 Computer
: Vision
Feature N Algorithm

Image 2

Feature 1
Feature 2
:
Feature N

DR. GEORGE KARRAZ, Ph. D. 15


Where Features Are Used

• Calibration
• Image Segmentation
• Correspondence in multiple images (stereo, structure
from motion)
• Object detection, classification
DR. GEORGE KARRAZ, Ph. D. 16
What Makes For Good Features?

• Invariance
– View point (scale, orientation, translation)
– Lighting condition
– Object deformations
– Partial occlusion
• Other Characteristics
– Fast to compute
– Uniqueness
– Sufficiently many
– Tuned to the task
DR. GEORGE KARRAZ, Ph. D. 17
Advanced Features: Topic
SIFT Features
→ Scale Invariant Feature Transform
Want to find … in here

18
DR. GEORGE KARRAZ, Ph. D.
SIFT Features
• Invariances:
– Scaling
– Rotation
– Illumination
– Translation
• Provides
– Good localization

DR. GEORGE KARRAZ, Ph. D. 19


SIFT
• SIFT features are first extracted from a set of
reference images and stored in a database.
• A new image is matched by individually
comparing each feature from the new image to
this previous database and finding candidate
matching features based on Euclidean distance
of their feature vectors.

DR. GEORGE KARRAZ, Ph. D. 20


Invariant Local Features
• Image content is transformed into local feature coordinates that are
invariant to translation, rotation, scale, and other imaging parameters

SIFT Features
DR. GEORGE KARRAZ, Ph. D. 21
Advantages of invariant local features
• Locality: features are local, so robust to occlusion and clutter (no prior
segmentation)
• Distinctiveness: individual features can be matched to a large
database of objects
• Quantity: many features can be generated for even small objects
• Efficiency: close to real-time performance
• Extensibility: can easily be extended to wide range of differing feature
types, with each adding robustness

DR. GEORGE KARRAZ, Ph. D. 22


SIFT Algorithm
Scale-space extrema detection

Keypoint localization

Interpolation of nearby data for accurate position

Discarding low-contrast keypoints

Eliminating edge responses

Orientation assignment

Keypoint descriptor

DR. GEORGE KARRAZ, Ph. D. 23


Scale-space extrema detection

The image is convolved with Gaussian filters at different scales, and then
the difference of successive Gaussian-blurred images are taken. Keypoints
are then taken as maxima/minima of the Difference of Gaussians (DoG)
that occur at multiple scales.

DR. GEORGE KARRAZ, Ph. D. 24


SIFT On-A-Slide
1. Enforce invariance to scale: Compute Gaussian difference max, for many
different scales; non-maximum suppression, find local maxima: keypoint
candidates
2. Localizable corner: For each maximum fit quadratic function. Compute
center with sub-pixel accuracy by setting first derivative to zero.
3. Eliminate edges: Compute ratio of eigenvalues, drop key points for
which this ratio is larger than a threshold.
4. Enforce invariance to orientation: Compute orientation, to achieve
rotation invariance, by finding the strongest second derivative direction
in the smoothed image (possibly multiple orientations). Rotate patch so
that orientation points up.
5. Compute feature signature: Compute a "gradient histogram" of the
local image region in a 4x4 pixel region. Do this for 4x4 regions of that
size. Orient so that largest gradient points up (possibly multiple
solutions). Result: feature vector with 128 values (4x4 = 16 histograms
with 8 gradient bins each).
6. Enforce invariance to illumination change and camera saturation:
Normalize to unit length to increase invariance to illumination. Then
threshold all gradients, to become invariant to camera saturation.
DR. GEORGE KARRAZ, Ph. D. 25
Finding “Key points” (Corners)
Idea: Find Corners, but scale invariance

Approach:
• Run linear filter (diff of Gaussians)
• Do this at different resolutions of image
pyramid

DR. GEORGE KARRAZ, Ph. D. 26


DR. GEORGE KARRAZ, Ph. D. 27
Difference of Gaussians

Equals

Minus

DR. GEORGE KARRAZ, Ph. D. 28


DiffOfGauss
• Difference of Gaussian Pyramid
• Difference of each successive image in each
octave

DR. GEORGE KARRAZ, Ph. D. 29
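One octave of such a pyramid can be sketched in a few lines (s and sigma0 follow common SIFT settings; this is an illustration, not the full multi-octave implementation):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(img, s=3, sigma0=1.6):
    k = 2.0 ** (1.0 / s)                                    # scale step within the octave
    blurred = [gaussian_filter(img.astype(float), sigma0 * k ** i) for i in range(s + 3)]
    return [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]  # differences of successive blurs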


Gaussian Kernel Size i = 1 … 10
(Figure sequence, slides 30–39: the image convolved with Gaussian
kernels of increasing size, i = 1, 2, …, 10.)
DR. GEORGE KARRAZ, Ph. D. 30–39


• Detect maxima and
minima of difference-
of-Gaussian in scale
space (the pyramid
idea)

DR. GEORGE KARRAZ, Ph. D. 40
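The maxima/minima test compares each DoG sample with its 26 neighbors (8 in its own image, 9 in the scale above and 9 in the scale below); a minimal sketch for an interior sample (ties ignored):

import numpy as np

def is_scale_space_extremum(dog, s, y, x):
    # dog: list of DoG images in one octave; (s, y, x) must be interior
    cube = np.stack([dog[s + ds][y - 1:y + 2, x - 1:x + 2] for ds in (-1, 0, 1)])
    v = dog[s][y, x]
    return v == cube.max() or v == cube.min()               # extremum among the 27 samples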


Example of key points detection

(a) 233x189 image


(b) 832 DOG extrema
(c) 729 above threshold

Difference of Gaussian Pyramid DoG

DR. GEORGE KARRAZ, Ph. D. 41


Example of key points detection
Threshold on value at DoG peak and on ratio of principal curvatures
(Harris approach)

(c) 729 left after peak value threshold (from 832)


(d) 536 left after testing ratio of principal curvatures

DR. GEORGE KARRAZ, Ph. D. 42


Select canonical orientation
• Create histogram of local
gradient directions
computed at selected scale
• Assign canonical
orientation at peak of
smoothed histogram
• Each key specifies stable 2D
coordinates (x, y, scale,
orientation)

(orientation histogram over 0 … 2π)
DR. GEORGE KARRAZ, Ph. D. 43
SIFT vector formation
• Thresholded image gradients are sampled over
16x16 array of locations in scale space
• Create array of orientation histograms
• 8 orientations x 4x4 histogram array = 128
dimensions

DR. GEORGE KARRAZ, Ph. D. 44
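A hedged sketch of the descriptor assembly for one normalized 16x16 patch (mag/ori are precomputed gradient magnitude and orientation arrays; the 0.2 clamp is the usual saturation threshold):

import numpy as np

def sift_descriptor(mag, ori, n_cells=4, n_bins=8):
    cell = mag.shape[0] // n_cells                          # 4 pixels per cell for a 16x16 patch
    desc = []
    for i in range(n_cells):
        for j in range(n_cells):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            o = ori[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            b = ((o % (2*np.pi)) / (2*np.pi) * n_bins).astype(int) % n_bins
            desc.append(np.bincount(b, weights=m, minlength=n_bins))
    d = np.concatenate(desc)                                # 4x4x8 = 128 dimensions
    d /= np.linalg.norm(d) + 1e-12                          # normalize: illumination invariance
    d = np.minimum(d, 0.2)                                  # threshold: camera saturation
    return d / (np.linalg.norm(d) + 1e-12)                  # renormalize to unit length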


Nearest-neighbor matching to feature
database
• Ideal search: nearest neighbor (difficult in high-dim
spaces)
• Hypotheses are generated by approximate nearest
neighbor matching of each feature to vectors in the
database
– SIFT use best-bin-first (Beis & Lowe, 97) modification
to k-d tree algorithm
– Use heap data structure to identify bins in order by
their distance from query point

• Result: Can give speedup by factor of 1000 while finding


nearest neighbor (of interest) 95% of the time
45
DR. GEORGE KARRAZ, Ph. D.
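SIFT proper uses the best-bin-first approximation; as a stand-in, an exact k-d tree plus Lowe's ratio test can be sketched with SciPy (the 0.8 ratio is a common choice, not prescribed by the slides):

import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(query, database, ratio=0.8):
    tree = cKDTree(database)
    d, idx = tree.query(query, k=2)                         # two nearest neighbors each
    good = d[:, 0] < ratio * d[:, 1]                        # keep distinctive matches only
    return np.flatnonzero(good), idx[good, 0]               # query index -> database index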
• Extract outlines with
background
subtraction

DR. GEORGE KARRAZ, Ph. D. 46


3D Object Recognition
• Only 3 keys are needed
for recognition, so extra
keys provide robustness
• Affine model is no
longer as accurate

DR. GEORGE KARRAZ, Ph. D. 47


Recognition under occlusion

DR. GEORGE KARRAZ, Ph. D. 48


Test of illumination invariance
• Same image under differing illumination

273 keys verified in final match

49
DR. GEORGE KARRAZ, Ph. D.
THANK YOU!

NEXT: INTRODUCTION TO FACE DETECTION


& RECOGNITION

DR. GEORGE KARRAZ, Ph. D.

50
COMPUTER VISION
LECTURE VII
INTRODUCTION TO FACE
RECOGNITION & DETECTION
DR. GEORGE KARRAZ, Ph. D.
Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Neural networks methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

2 DR. GEORGE KARRAZ, Ph. D.


Face Recognition by Humans
• Performed routinely and effortlessly by humans
• Enormous interest in automatic processing of digital images and videos
due to wide availability of powerful and low-cost desktop embedded
computing
• Applications:
• biometric authentication,
• surveillance,
• human-computer interaction
• multimedia management

3 DR. GEORGE KARRAZ, Ph. D.


Face recognition
Advantages over other biometric technologies:
• Natural
• Non intrusive
• Easy to use

Among the six biometric attributes considered by Hietmeyer, facial


features scored the highest compatibility in a Machine Readable Travel
Documents (MRTD) system based on:
• Enrollment
• Renewal
• Machine requirements
• Public perception

4 DR. GEORGE KARRAZ, Ph. D.


Classification
A face recognition system is expected to identify faces present in images
and videos automatically. It can operate in either or both of two
modes:
Face verification (or authentication): involves a one-to-one match that
compares a query face image against a template face image whose identity is
being claimed.

Face identification (or recognition): involves one-to-many matches that


compares a query face image against all the template images in the database to
determine the identity of the query face.

First automatic face recognition system was developed by Kanade 1973.

5 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Preprocessing
• Neural networks and kernel-based methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

6 DR. GEORGE KARRAZ, Ph. D.


Face recognition processing
Face recognition is a visual pattern recognition problem.
A face, a three-dimensional object subject to varying illumination, pose, and
expression, is to be identified based on its two-dimensional image (or
three-dimensional images).

A face recognition system generally consists of 4 modules - detection,


alignment, feature extraction, and matching.
Localization and normalization (face detection and alignment) are
processing steps before face recognition (facial feature extraction and
matching) is performed.

7 DR. GEORGE KARRAZ, Ph. D.


Face recognition processing
• Face detection segments the face areas from the background.
• In the case of video, the detected faces may need to be tracked
using a face tracking component.
• Face alignment is aimed at achieving more accurate localization
and at normalizing faces, whereas face detection provides coarse
estimates of the location and scale of each face.
• Facial components and facial outline are located; based on the
location points,
• The input face image is normalized in respect to geometrical
properties, such as size and pose, using geometrical transforms
or morphing,
• The face is further normalized with respect to photometrical
properties such as illumination and gray scale.

8 DR. GEORGE KARRAZ, Ph. D.


Face recognition processing

After a face is normalized, feature extraction is performed to


provide effective information that is useful for
distinguishing between faces of different persons and
stable with respect to the geometrical and photometrical
variations.

For face matching, the extracted feature vector of the input


face is matched against those of enrolled faces in the
database; it outputs the identity of the face when a match
is found with sufficient confidence or indicates an
unknown face otherwise.

9 DR. GEORGE KARRAZ, Ph. D.


Face recognition processing

Face recognition processing flow.

10 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Preprocessing
• Neural networks and kernel-based methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

11 DR. GEORGE KARRAZ, Ph. D.


Analysis in face subspaces
Subspace analysis techniques for face recognition are based on the fact
that a class of patterns of interest, such as the face, resides in a subspace
of the input image space:

A small image of 64 × 64 having 4096 pixels can express a large number of pattern
classes, such as trees, houses and faces.

Among the 256^4096 > 10^9864 possible “configurations”, only a few correspond to
faces. Therefore, the original image representation is highly redundant, and the
dimensionality of this representation could be greatly reduced.

12 DR. GEORGE KARRAZ, Ph. D.


Analysis in face subspaces

With the eigenface or PCA approach, a small number (40 or lower) of


eigenfaces are derived from a set of training face images by using the
Karhunen-Loeve transform or PCA.

A face image is efficiently represented as a feature vector (i.e. a vector of


weights) of low dimensionality.

The features in such subspace provide more salient and richer information for
recognition than the raw image.

13 DR. GEORGE KARRAZ, Ph. D.


Analysis in face subspaces
The manifold (i.e. distribution) of all faces accounts for variation in face
appearance whereas the nonface manifold (distribution) accounts for everything else.

If we look into facial manifolds in the image space, we find them highly
nonlinear and nonconvex.

The figure (a) illustrates face versus nonface manifolds and (b) illustrates the
manifolds of two individuals in the entire face manifold.

Face detection is a task of distinguishing between the face and nonface manifolds
in the image (sub window) space and face recognition between those of
individuals in the face manifolds.

(a) Face versus nonface manifolds. (b) Face manifolds of different individuals.
14
DR. GEORGE KARRAZ, Ph. D.
Handwritten manifolds
• Two dimensional embedding of handwritten digits ("0"-"9") by Laplacian
Eigenmap, Locally Preserving Projection, and PCA
• Colors correspond to the same individual handwriting

15 DR. GEORGE KARRAZ, Ph. D.


Examples
• The Eigenfaces, Fisher faces and Laplacian faces calculated from the face
images in the Yale database.

Eigenfaces

Fisherfaces

Laplacianfaces

16 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Neural networks methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

17 DR. GEORGE KARRAZ, Ph. D.


Technical Challenges
The performance of many state-of-the-art face recognition methods
deteriorates with changes in lighting, pose and other factors. The key
technical challenges are:
• Large Variability in Facial Appearance: Whereas shape and reflectance are
intrinsic properties of a face object, the appearance (i.e. texture) is subject
to several other factors, including the facial pose, illumination, facial
expression.

Intrasubject variations in pose, illumination, expression, occlusion,


accessories (e.g. glasses), color and brightness.

18 DR. GEORGE KARRAZ, Ph. D.


Technical Challenges
• Highly Complex Nonlinear Manifolds: The entire face manifold (distribution) is highly
nonconvex and so is the face manifold of any individual under various changes. Linear
methods such as PCA, independent component analysis (ICA) and linear discriminant
analysis (LDA) project the data linearly from a high-dimensional space (e.g. the image
space) to a low-dimensional subspace. As such, they are unable to preserve the
nonconvex variations of face manifolds necessary to differentiate among individuals.
• In a linear subspace, Euclidean distance and Mahalanobis distance do not perform well
for classifying between face and nonface manifolds and between manifolds of
individuals. This limits the power of the linear methods to achieve highly accurate face
detection and recognition.

19 DR. GEORGE KARRAZ, Ph. D.


Technical Challenges
• High Dimensionality and Small Sample Size: Another challenge is the ability to
generalize as illustrated in figure. A canonical face image of 112 × 92 resides in a
10,304-dimensional feature space. Nevertheless, the number of examples per
person (typically fewer than 10) available for learning the manifold is usually
much smaller than the dimensionality of the image space; a system trained on so
few examples may not generalize well to unseen instances of the face.

20 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions:

• Statistical (learning-based)
• Geometry-based and appearance-based
• Non-linear kernel techniques
• Taxonomy

• Face detection
• Appearance-based and learning-based approaches
• Non-linear and Neural networks methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

21 DR. GEORGE KARRAZ, Ph. D.


Technical Solutions
• Feature extraction: construct a “good” feature space in which the face
manifolds become simpler i.e. less nonlinear and nonconvex than those in
the other spaces. This includes two levels of processing:

Normalize face images geometrically and photometrically, such as using


morphing and histogram equalization
Extract features in the normalized images which are stable with respect to such
variations, such as based on Gabor wavelets.

• Pattern classification: construct classification engines able to solve difficult


nonlinear classification and regression problems in the feature space and
to generalize better.

22 DR. GEORGE KARRAZ, Ph. D.


Technical Solutions
Learning-based approach - statistical learning
• Learns from training data to extract good features and construct classification
engines.

• During the learning, both prior knowledge about face(s) and variations seen in
the training data are taken into consideration.

• The appearance-based approach such as PCA and LDA based methods, has
significantly advanced face recognition techniques.

• They operate directly on an image-based representation (i.e. an array of pixel


intensities) and extracts features in a subspace derived from training images.

23 DR. GEORGE KARRAZ, Ph. D.


Technical Solutions
Appearance-based approach utilizing
geometric features
- Detects facial features such as eyes, nose, mouth and chin.
- Properties of and relations (e.g. areas, distances, angles)
between the features are used as descriptors for face recognition.

Advantages:
• economy and efficiency when achieving data reduction and insensitivity
to variations in illumination and viewpoint
Limitations:
• facial feature detection and measurement techniques are not reliable
enough if recognition is based on geometric features alone
• the rich information contained in the facial texture or appearance is
discarded; that information is still utilized in the appearance-based approach.

24 DR. GEORGE KARRAZ, Ph. D.


Technical Solutions
Nonlinear kernel techniques
Linear methods can be extended using nonlinear
kernel techniques (kernel PCA and kernel LDA) to deal
with nonlinearity in face recognition.

• A non-linear projection (dimension reduction) from the image space to


a feature space is performed; the manifolds in the resulting feature
space become simple, yet with subtleties preserved.

• A local appearance-based feature space uses appropriate image filters,


so the distributions of faces are less affected by various changes.
Examples:
• Local feature analysis (LFA)
• Gabor wavelet-based features such as elastic graph bunch matching (EGBM)
• Local binary pattern (LBP)

25 DR. GEORGE KARRAZ, Ph. D.


Taxonomy of face recognition algorithms

Taxonomy of face recognition algorithms based on pose-dependency,


face representation, and features used in matching.
26 DR. GEORGE KARRAZ, Ph. D.
Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Preprocessing
• Neural networks and kernel-based methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

27 DR. GEORGE KARRAZ, Ph. D.


Face detection
Face detection is the first step in automated face recognition.

Face detection can be performed based on several cues:


• skin color
• motion
• facial/head shape
• facial appearance or
• a combination of these parameters.
Most successful face detection algorithms are appearance-based
without using other cues.

28 DR. GEORGE KARRAZ, Ph. D.


Face detection

The processing is done as follows:


• An input image is scanned at all possible locations and scales by a
subwindow.
• Face detection is posed as classifying the pattern in the subwindow as
either face or nonface.
• The face/nonface classifier is learned from face and nonface training
examples using statistical learning methods

• Note: The ability to deal with nonfrontal faces is important for many real
applications because approximately 75% of the faces in home photos are
nonfrontal.

29 DR. GEORGE KARRAZ, Ph. D.


Appearance-based and learning based approaches
• Face detection is treated as a problem of classifying each scanned
sub window as one of two classes (i.e. face and nonface).

• Appearance-based methods avoid difficulties in modeling 3D


structures of faces by considering possible face appearances
under various conditions.

• A face/nonface classifier may be learned from a training set


composed of face examples taken under possible conditions as
would be seen in the running stage and nonface examples as well.

• Disadvantage: large variations brought about by changes in facial


appearance, lighting and expression make the face manifold or
face/non-face boundaries highly complex.

30 DR. GEORGE KARRAZ, Ph. D.


Appearance-based and learning based approaches
• Principal component analysis (PCA) or eigenface representation is
created by Turk and Pentland; only likelihood in the PCA subspace is
considered.

• Moghaddam and Pentland consider the likelihood in the orthogonal


complement subspace modeling the product of the two likelihood
estimates.

• Schneiderman and Kanade use multiresolution information for different


levels of wavelet transform.

• A nonlinear face and nonface classifier is constructed using statistics of


products of histograms computed from face and nonface examples
using AdaBoost learning. Viola and Jones built a fast, robust face
detection system in which AdaBoost learning is used to construct
nonlinear classifier.

31 DR. GEORGE KARRAZ, Ph. D.


Appearance-based and learning based approaches

• Liu presents a Bayesian Discriminating Features (BDF) method. The input image,
its one-dimensional Haar wavelet representation, and its amplitude projections
are concatenated into an expanded vector input of 768 dimensions. Assuming
that these vectors follow a (single) multivariate normal distribution for face,
linear dimension reduction is performed to obtain the PCA modes.
• Li et al. present a multi view face detection system. A new boosting algorithm,
called Float Boost, is proposed to incorporate Floating Search into AdaBoost. The
backtrack mechanism in the algorithm allows deletions of weak classifiers that
are ineffective in terms of error rate, leading to a strong classifier consisting of
only a small number of weak classifiers.
• Lienhart et al. use an extended set of rotated Haar features for dealing with in-
plane rotation and train a face detector using Gentle Adaboost with trees as base
classifiers. The results show that this combination outperforms that of Discrete
Adaboost.

32 DR. GEORGE KARRAZ, Ph. D.


Neural Networks and Kernel Based Methods
Nonlinear classification for face detection may be performed using neural
networks or kernel-based methods.

Neural methods: a classifier may be trained directly using preprocessed


and normalized face and nonface training subwindows.
• The input to the system of Sung and Poggio is derived from the six face and
six nonface clusters. More specifically, it is a vector of 2 × 6 = 12 distances in
the PCA subspaces and 2 × 6 = 12 distances from the PCA subspaces.
• The 24 dimensional feature vector provides a good representation for
classifying face and nonface patterns.
• In both systems, the neural networks are trained by back-propagation
algorithms.

Kernel SVM classifiers perform nonlinear classification for face detection


using face and nonface examples.
• Although such methods are able to learn nonlinear boundaries, a large
number of support vectors may be needed to capture a highly nonlinear
boundary. For this reason, fast realtime performance has so far been a
difficulty with SVM classifiers thus trained.
33 DR. GEORGE KARRAZ, Ph. D.
AdaBoost-based Methods

34 DR. GEORGE KARRAZ, Ph. D.


AdaBoost-based Methods
The AdaBoost learning procedure is aimed at learning a sequence of best
weak classifiers hm(x) and the best combining weights αm.

A set of N labeled training examples {(x1, y1), …, (xN, yN)} is assumed
available, where yi ∈ {+1, −1} is the class label for the example xi ∈ Rⁿ. A
distribution [w1, …, wN] of the training examples, where wi is associated
with a training example (xi, yi), is computed and updated during the
learning to represent the distribution of the training examples.

After iteration m, harder-to-classify examples (xi, yi) are given larger


weights wi(m), so that at iteration m + 1, more emphasis is placed on
these examples.

AdaBoost assumes that a procedure is available for learning a weak


classifier hm(x) from the training examples, given the distribution [wi(m)].

35 DR. GEORGE KARRAZ, Ph. D.
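A minimal discrete-AdaBoost sketch with decision stumps as the weak learners (brute-force threshold search; an illustration of the procedure above, not the Viola-Jones implementation):

import numpy as np

def adaboost_train(X, y, n_rounds=50):
    # X: (N, d) feature matrix, y: (N,) labels in {+1, -1}
    N, d = X.shape
    w = np.full(N, 1.0 / N)                                 # example distribution [w_1..w_N]
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for f in range(d):
            for t in np.unique(X[:, f]):
                for p in (+1, -1):                          # stump polarity
                    pred = p * np.where(X[:, f] > t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, p)
        err, f, t, p = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)               # combining weight alpha_m
        pred = p * np.where(X[:, f] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)                      # harder examples get larger weights
        w /= w.sum()
        ensemble.append((alpha, f, t, p))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * p * np.where(X[:, f] > t, 1, -1) for a, f, t, p in ensemble)
    return np.sign(score)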


AdaBoost-based Methods
Haar-like features
Viola and Jones propose four basic types of scalar features for face detection as shown in
figure. Such a block feature is located in a subregion of a subwindow and varies in shape
(aspect ratio), size and location inside the subwindow.

For a subwindow of size 20 × 20, there can be tens of thousands of such features for varying
shapes, sizes and locations. Feature k, taking a scalar value zk(x) ∈ R, can be considered a
transform from the n-dimensional space to the real line. These scalar numbers form an
overcomplete feature set for the intrinsically low-dimensional face pattern.

Recently, extended sets of such features have been proposed for dealing with out-of-plane
head rotation and for in-plane head rotation.

These Haar-like features are interesting for two reasons:


powerful face/nonface classifiers can be constructed based on these features
they can be computed efficiently using the summed-area table or integral image
technique.

Four types of rectangular Haar wavelet-like


features. A feature is a scalar calculated by
summing up the pixels in the white region and
subtracting those in the dark region.

36 DR. GEORGE KARRAZ, Ph. D.


AdaBoost-based Methods
Constructing weak classifiers
The AdaBoost learning procedure is aimed at learning a sequence of best
weak classifiers to combine hm(x) and the combining weights αm. It
solves the following three fundamental problems:

Learning effective features from a large feature set

Constructing weak classifiers, each of which is based on one of the


selected features

Boosting the weak classifiers to construct a strong classifier

37 DR. GEORGE KARRAZ, Ph. D.


AdaBoost-based Methods
Constructing weak classifiers (cont’d)
AdaBoost assumes that a “weak learner” procedure is available.

The task of the procedure is to select the most significant feature from a set of
candidate features, given the current strong classifier learned thus far, and then
construct the best weak classifier and combine it into the existing strong
classifier.

In the case of discrete AdaBoost, the simplest type of weak classifiers is a “stump”.
A stump is a single-node decision tree. When the feature is real-valued, a stump
may be constructed by thresholding the value of the selected feature at a certain
threshold value; when the feature is discrete-valued, it may be obtained
according to the discrete label of the feature.

A more general decision tree (with more than one node) composed of several
stumps leads to a more sophisticated weak classifier.

38 DR. GEORGE KARRAZ, Ph. D.


AdaBoost-based Methods
Boosted strong classifier
• AdaBoost learns a sequence of weak classifiers hm and boosts them into a
strong one HM effectively by minimizing the upper bound on classification
error achieved by HM. The bound can be derived as the following
exponential loss function:
J(HM) = Σᵢ exp(−yᵢ HM(xᵢ))
where i is the index for training examples.

39 DR. GEORGE KARRAZ, Ph. D.


AdaBoost learning algorithm

AdaBoost learning algorithm

40 DR. GEORGE KARRAZ, Ph. D.


AdaBoost-based Methods
FloatBoost Learning
AdaBoost attempts to boost the accuracy of an ensemble of weak classifiers. The
AdaBoost algorithm solves many of the practical difficulties of earlier boosting
algorithms. Each weak classifier is trained stage-wise to minimize the empirical error
for a given distribution reweighted according to the classification errors of the
previously trained classifiers. It is shown that AdaBoost is a sequential forward search
procedure using the greedy selection strategy to minimize a certain margin on the
training set.

A crucial heuristic assumption used in such a sequential forward search procedure is the
monotonicity (i.e. that addition of a new weak classifier to the current set does not
decrease the value of the performance criterion). The premise offered by the
sequential procedure in AdaBoost breaks down when this assumption is violated.

Floating Search is a sequential feature selection procedure with backtracking, aimed to


deal with nonmonotonic criterion functions for feature selection. A straight sequential
selection method such as sequential forward search or sequential backward search
adds or deletes one feature at a time. To make this work well, the monotonicity
property has to be satisfied by the performance criterion function. Feature selection
with a nonmonotonic criterion may be dealt with using a more sophisticated
technique, called plus-L-minus-r, which adds or deletes L features and then backtracks
r steps.

41 DR. GEORGE KARRAZ, Ph. D.


FloatBoost Algorithm
DR. GEORGE KARRAZ, Ph. D.

The Float Boost Learning procedure is


composed of several parts:
• the training input,
• initialization,
• forward inclusion,
• conditional exclusion and
• output.
In forward inclusion, the currently most
significant weak classifiers are added one
at a time, which is the same as in
AdaBoost.
In conditional exclusion, FloatBoost
removes the least significant weak
classifier from the set HM of current weak
classifiers, subject to the condition that the
removal leads to a lower cost than J^min_(M−1).
Supposing that the weak classifier
removed was the m’-th in HM, then
hm’, …, hM−1 and the αm’s must be
relearned. These steps are repeated until
no more removals can be done.

42 FloatBoost algorithm
AdaBoost-based Methods
• Cascade of Strong Classifiers: A boosted strong classifier effectively
eliminates a large portion of nonface subwindows while
maintaining a high detection rate. Nonetheless, a single strong
classifier may not meet the requirement of an extremely low false
alarm rate (e.g. 10⁻⁶ or even lower). A solution is to arbitrate
between several detectors (strong classifiers), for example, using
the “AND” operation.

A cascade of n strong classifiers (SC). The input is a subwindow x. It is sent to
the next SC for further classification only if it has passed all the previous SCs
as the face (F) pattern; otherwise it exits as nonface (N). x is finally
considered to be a face when it passes all the n SCs.

43 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Neural networks and kernel-based methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

44 DR. GEORGE KARRAZ, Ph. D.


Dealing with Head Rotations
Multiview face detection should be able to detect non frontal faces. There
are three types of head rotation:

out-of-plane rotation (look to the left – to the right)


in-plane rotation (tilted toward shoulders)
up-and-down nodding rotation (up-down)

Adopting a coarse-to-fine view-partition strategy, the detector-pyramid


architecture consists of several levels from the coarse top level to the fine
Bottom level.

Rowley et al. propose to use two neural network classifiers for detection of
frontal faces subject to in-plane rotation.
• The first is the router network, trained to estimate the orientation of an
assumed face in the sub window, though the window may contain a nonface
pattern. The inputs to the network are the intensity values in a preprocessed 20
× 20 sub window. The angle of rotation is represented by an array of 36 output
units, in which each unit represents an angular range.
• The second neural network is a normal frontal, upright face detector.

45 DR. GEORGE KARRAZ, Ph. D.


Dealing with Head Rotations
Coarse-to-fine: The partitions of the out-of-plane rotation for the three-
level detector-pyramid is illustrated in figure.

Out-of-plane view partition. Out-of-plane head rotation (row 1), the


facial view labels (row 2), and the coarse-to-fine view partitions at the
three levels of the detector-pyramid (rows 3 to 5).

46 DR. GEORGE KARRAZ, Ph. D.


Dealing with Head Rotations
Simple-to-complex: A large number of sub windows result from
the scan of the input image. For example, there can be tens to
hundreds of thousands of them for an image of size 320 × 240, the
actual number depending on how the image is scanned.

Merging from different channels. From left to right: Outputs of frontal, left and
right view channels and the final result after the merge.

47 DR. GEORGE KARRAZ, Ph. D.


Outline
• Face recognition
• Face recognition processing
• Analysis in face subspaces
• Technical challenges
• Technical solutions

• Face detection
• Appearance-based and learning based approaches
• Neural networks and kernel-based methods
• AdaBoost-based methods
• Dealing with head rotations
• Performance evaluation

48 DR. GEORGE KARRAZ, Ph. D.


Performance Evaluation
The result of face detection from an image is affected
by the two basic components:
• The face/nonface classifier: the test set consists of face icons of a fixed
size (as are used for training). This process aims to evaluate
the performance of the face/nonface classifier
(preprocessing included), without being affected by
merging.

• The postprocessing (merger): the test set consists of normal images. In
this case, the face detection results are affected by both the
trained classifier and merging; the overall system
performance is evaluated.

49 DR. GEORGE KARRAZ, Ph. D.


Performance Measures
• The face detection performance is primarily measured by two rates: the
correct detection rate (which is 1 minus the miss detection rate) and
the false alarm rate.

• As AdaBoost-based methods (with local Haar wavelet features) have so


far provided the best face detection solutions in terms of the statistical
rates and the speed
• There are a number of variants of boosting algorithms: DAB- discrete
Adaboost; RAB- real Adaboost; and GAB- gentle Adaboost, with
different training sets and weak classifiers.
• Three 20-stage cascade classifiers were trained with DAB, RAB and GAB
using the Haar-like feature set of Viola and Jones and stumps as the
weak classifiers. It is reported that GAB outperformed the other two
boosting algorithms; for instance, at an absolute count of 10 false alarms
on the CMU test set, RAB detected only 75.4% and DAB only 79.5% of
all frontal faces, while GAB achieved 82.7% at a rescale factor of 1.1.

50 DR. GEORGE KARRAZ, Ph. D.


Performance Measures
• Two face detection systems were trained: one with the basic Haar-like
feature set of Viola and Jones and one with the extended Haar-like
feature set in which rotated versions of the basic Haar features are
added.

• On average, the false alarm rate was about 10% lower for the extended
Haar-like feature set at comparable hit rates.

• At the same time, the computational complexity was comparable.

• This suggests that whereas the larger Haar-like feature set makes it
more complex in both time and memory in the boosting learning phase,
a gain is obtained in the detection phase.

51 DR. GEORGE KARRAZ, Ph. D.


Performance Measures
Regarding the AdaBoost approach, the following conclusions can be drawn:

• An over-complete set of Haar-like features are effective for face detection. The use of the
integral image method makes the computation of these features efficient and achieves scale
invariance. Extended Haar-like features help detect nonfrontal faces.
• Adaboost learning can select best subset from a large feature set and construct a powerful
nonlinear classifier.
• The cascade structure significantly improves the detection speed and effectively reduces false
alarms, with a little sacrifice of the detection rate.
• Float Boost effectively improves boosting learning result. It results in a classifier that needs
fewer weaker classifiers than the one obtained using AdaBoost to achieve a similar error rate, or
achieve a lower error rate with the same number of weak classifiers. This run time improvement
is obtained at the cost of longer training time.
• Less aggressive versions of Adaboost, such as Gentle Boost and Logit Boost may be preferable to
discrete and real Adaboost in dealing with training data containing outliers (distinct, unusual
cases).
• More complex weak classifiers (such as small trees) can model second-order and/or third-order
dependencies, and may be beneficial for the nonlinear task of face detection.

52 DR. GEORGE KARRAZ, Ph. D.


THANK YOU!
NEXT: V IOLA JONES FACE DETECTOR

DR. GEORGE KARRAZ, Ph. D.

53
COMPUTER VISION
LECTURE VIII
VIOLA JONES FACE & DETECTOR

DR. GEORGE KARRAZ, Ph. D.


The Viola/Jones Face Detector
(2001)

➢ A widely used method for real-time object detection.


➢ Training is slow, but detection is very fast.

DR. GEORGE KARRAZ, Ph. D.


Classifier is Learned from Labeled Data

• Training Data
– 5000 faces
• All frontal
– 300 million non faces
• 9400 non-face images
– Faces are normalized
• Scale, translation
• Many variations
– Across individuals
– Illumination
– Pose (rotation both in plane and out)
3 DR. GEORGE KARRAZ, Ph. D.
Key Properties of Face Detection
• Each image contains 10–50 thousand locations/scales
• Faces are rare: 0–50 per image
– 1000 times as many non-faces as faces
• Extremely small number of false positives required: 10⁻⁶

4 DR. GEORGE KARRAZ, Ph. D.


AdaBoost
• Given a set of weak classifiers
originally: h_j(x) ∈ {+1, −1}
– None much better than random
• Iteratively combine classifiers
– Form a linear combination:
C(x) = sign( Σ_t α_t·h_t(x) + b )
– Training error converges to 0 quickly
– Test error is related to training margin

5 DR. GEORGE KARRAZ, Ph. D.


AdaBoost (Freund & Schapire)
(Figure: weak classifier 1; weights of misclassified
examples increased; weak classifier 2; weak classifier 3;
the final classifier is a linear combination of the weak
classifiers.)
6 DR. GEORGE KARRAZ, Ph. D.
AdaBoost:
Super Efficient Feature Selector

• Features = Weak Classifiers


• Each round selects the optimal feature
given:
– Previous selected features
– Exponential Loss

7 DR. GEORGE KARRAZ, Ph. D.


Boosted Face Detection: Image Features

“Rectangle filters”
Similar to Haar wavelets
(Papageorgiou, et al.)

h_t(x_i) = α_t  if f_t(x_i) > θ_t
           β_t  otherwise

C(x) = sign( Σ_t h_t(x) + b )

60,000 features to choose from

8 DR. GEORGE KARRAZ, Ph. D.


The Integral Image

• The integral image


computes a value at each
pixel (x,y) that is the sum
of the pixel values above (x,y)
and to the left of (x,y),
inclusive.
• This can quickly be
computed in one pass
through the image

9 DR. GEORGE KARRAZ, Ph. D.


Computing Sum within a Rectangle

• Let A, B, C, D be the values of
the integral image at the
corners of a rectangle
(D top-left, B top-right, C bottom-left, A bottom-right)
• Then the sum of original
image values within the
rectangle can be computed:
sum = A – B – C + D
• Only 3 additions are required
for any size of rectangle!
– This is now used in many areas
of computer vision

10 DR. GEORGE KARRAZ, Ph. D.
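A short sketch of both operations (cumulative sums for the integral image, corner arithmetic for the rectangle sum; inclusive pixel coordinates, as on the slide):

import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[0..y, 0..x], inclusive
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    # sum of img[y0..y1, x0..x1] = A - B - C + D, with border guards
    total = ii[y1, x1]                                      # A (bottom-right)
    if y0 > 0:
        total -= ii[y0 - 1, x1]                             # B (top-right)
    if x0 > 0:
        total -= ii[y1, x0 - 1]                             # C (bottom-left)
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]                         # D (top-left)
    return total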


Feature Selection

• For each round of boosting:


– Evaluate each rectangle filter on each example
– Sort examples by filter values
– Select best threshold for each filter (min Z)
– Select best filter/threshold (= Feature)
– Reweight examples
• M filters, T thresholds, N examples, L learning time
– O( MT L(MTN) ) Naïve Wrapper Method
– O( MN ) Adaboost feature selector

11 DR. GEORGE KARRAZ, Ph. D.


Example Classifier for Face Detection

A classifier with 200 rectangle features was learned using AdaBoost

95% correct detection on test set with 1 in 14084


false positives.

Not quite competitive...

ROC curve for 200 feature classifier


12 DR. GEORGE KARRAZ, Ph. D.
DR. GEORGE KARRAZ, Ph. D.
Building Fast Classifiers

• Given a nested set of classifier hypothesis classes
• The trade-off between % detection and % false positives is set
per stage (ROC curve; figure)
• Computational Risk Minimization

IMAGE SUB-WINDOW → Classifier 1 →T→ Classifier 2 →T→ Classifier 3 →T→ FACE
(an F output at any stage exits to NON-FACE)

13
Cascaded Classifier

IMAGE SUB-WINDOW → [1 Feature] →T(50%)→ [5 Features] →T(20%)→ [20 Features] →T(2%)→ FACE
(an F output at any stage exits to NON-FACE)

• A 1 feature classifier achieves 100% detection rate


and about 50% false positive rate.
• A 5 feature classifier achieves 100% detection rate
and 40% false positive rate (20% cumulative)
– using data from previous stage.
• A 20 feature classifier achieves 100% detection
rate with 10% false positive rate (2% cumulative)

14 DR. GEORGE KARRAZ, Ph. D.
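The control flow of the cascade is tiny; what matters is that cheap early stages reject most subwindows, so the later, more expensive stages rarely run. A sketch with an illustrative stage representation (not the actual library API):

def cascade_classify(window, stages):
    # stages: list of (score_fn, threshold) pairs, cheapest first
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False                                    # rejected: NON-FACE, stop early
    return True                                             # passed every stage: FACE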


DR. GEORGE KARRAZ, Ph. D.

Output of Face Detector on Test Images

15
DR. GEORGE KARRAZ, Ph. D.
Solving other “Face” Tasks

Facial Feature Localization


Profile Detection

Demographic
Analysis

16
Feature Localization Features DR. GEORGE KARRAZ, Ph. D.

• Learned features reflect the task

17
DR. GEORGE KARRAZ, Ph. D.
Profile Detection

18
Profile Features

19 DR. GEORGE KARRAZ, Ph. D.


Review: Classifiers

• Bayes risk, loss functions


• Histogram-based classifiers
• Kernel density estimation
• Nearest-neighbor classifiers
• Neural networks

Viola/Jones face detector


• Integral image
• Cascaded classifier

20 DR. GEORGE KARRAZ, Ph. D.


THANK YOU!
NEXT: G EOMETRIC TRANSORMATIONS

DR. GEORGE KARRAZ, Ph. D.

21
COMPUTER VISION
LECTURE IX
GEOMETRIC TRANSFORMATIONS

DR. GEORGE KARRAZ, Ph. D.


Geometric transformations

Review some basics of linear algebra and


geometric transformations

DR. GEORGE KARRAZ, Ph. D. 2


Outline

• Representation
• Basics of linear algebra
• Homogeneous Coordinates
• Geometrical transformations

DR. GEORGE KARRAZ, Ph. D. 3


Representation
• Digital Pictures are 2D arrays (matrices) of numbers
• Each pixel is a measure of the brightness (intensity of light)
– that falls on an area of an sensor (typically a CCD chip)

DR. GEORGE KARRAZ, Ph. D. 4


Picture as a Vector in Dimension N
(Figure: an image unrolled into a vector of dimension N,
x = (X1, …, XN), representing its appearance.)

DR. GEORGE KARRAZ, Ph. D. 5


Vectors in Rⁿ
• We can think of vectors as points in a multidimensional space with
respect to some coordinate system
• Ordered set of numbers
• Example in two dimensions

DR. GEORGE KARRAZ, Ph. D. 6


Vectors in Rⁿ

• Notation:

DR. GEORGE KARRAZ, Ph. D. 7


Scalar Product

• A product of two vectors


• Amounts to projection of one vector onto the other
• Example in 2D:

The shown segment has length <x, y>, if x and y are unit vectors.

DR. GEORGE KARRAZ, Ph. D. 8


Scalar Product

• Various notations:

• Other names: dot product, inner product

DR. GEORGE KARRAZ, Ph. D. 9


Scalar Product in Rⁿ

• Definition: ⟨x, y⟩ = Σᵢ xᵢ·yᵢ = xᵀy

• In terms of angles: ⟨x, y⟩ = ‖x‖·‖y‖·cos(θ)

• Other properties: commutative, distributive, associative with
scalar multiplication

DR. GEORGE KARRAZ, Ph. D. 10


Basis
• A basis is a linearly independent set of vectors that spans the “whole
space”. I. e., we can write every vector in our space as linear
combination of vectors in that set.
• Every set of n linearly independent vectors in Rⁿ is a basis of Rⁿ
• Orthogonality: Two non-zero vectors x and y are orthogonal if x.y = 0

• A basis is called
– orthogonal, if every basis vector is orthogonal to all other basis
vectors
– orthonormal, if additionally all basis vectors have length 1.

DR. GEORGE KARRAZ, Ph. D. 11


Bases

• Orthonormal basis:

DR. GEORGE KARRAZ, Ph. D. 12


Overview
2D Transformations
• Basic 2D transformations
• Matrix representation
• Matrix composition
3D Transformations
• Basic 3D transformations
• Same as 2D
2D Modeling Transformations
Modeling Coordinates → World Coordinates
Initial location at (0, 0) with x- and y-axes aligned; then apply:
Scale .3, .3
Rotate -90
Translate 5, 3
(Figure sequence: the object after each successive transformation,
from modeling coordinates to world coordinates.)
Scaling
Scaling a coordinate means multiplying each of its
components by a scalar.
Uniform scaling means this scalar is the same for
all components (e.g. ×2).
Scaling
Non-uniform scaling: different scalars per
component:

X  2,
Y  0.5

How can we represent this in matrix form?


Scaling

Scaling operation:
x′ = a·x
y′ = b·y

Or, in matrix form:
[x′]   [a  0] [x]
[y′] = [0  b] [y]
       scaling matrix
2-D Rotation
(x, y) → (x′, y′): rotate by angle θ about the origin
x = r cos(φ)
y = r sin(φ)
x′ = r cos(φ + θ)
y′ = r sin(φ + θ)
Trig identity…
x′ = r cos(φ) cos(θ) − r sin(φ) sin(θ)
y′ = r sin(φ) cos(θ) + r cos(φ) sin(θ)
Substitute…
x′ = x cos(θ) − y sin(θ)
y′ = x sin(θ) + y cos(θ)
Geometric Transformations
Rotation Equations:

26
2-D Rotation
This is easy to capture in matrix form:

[x′]   [cos(θ)  −sin(θ)] [x]
[y′] = [sin(θ)   cos(θ)] [y]

Even though sin(θ) and cos(θ) are nonlinear
functions of θ,
• x′ is a linear combination of x and y
• y′ is a linear combination of x and y
Geometric Transformations
2D Translation:

28
Geometric Transformations
2D Translation Equation:
Basic 2D Transformations
Translation:
• x’ = x + tx
• y’ = y + ty
Scale:
• x’ = x * sx
• y’ = y * sy
Shear:
• x’ = x + hx*y
• y’ = y + hy*x
Rotation:
• x’ = x*cos(θ) − y*sin(θ)
• y’ = x*sin(θ) + y*cos(θ)
Transformations can be combined (with simple algebra).
Applying scale, then rotation, then translation to (x, y):
x’ = ((x*sx)*cos(θ) − (y*sy)*sin(θ)) + tx
y’ = ((x*sx)*sin(θ) + (y*sy)*cos(θ)) + ty
Outline
2D Transformations
• Basic 2D transformations
• Matrix representation
• Matrix composition
3D Transformations
• Basic 3D transformations
• Same as 2D
Matrix Representation
Represent 2D transformation by a matrix

[a  b]
[c  d]

Multiply matrix by column vector
⇔ apply transformation to point:

[x′]   [a  b] [x]      x′ = ax + by
[y′] = [c  d] [y]      y′ = cx + dy
Matrix Representation
Transformations combined by multiplication:

[x′]   [a  b] [e  f] [i  j] [x]
[y′] = [c  d] [g  h] [k  l] [y]

Matrices are a convenient and efficient way
to represent a sequence of transformations!
2x2 Matrices
What types of transformations can be
represented with a 2x2 matrix?
2D Identity?
x′ = x      [x′]   [1  0] [x]
y′ = y      [y′] = [0  1] [y]

2D Scale around (0,0)?
x′ = sx * x      [x′]   [sx  0 ] [x]
y′ = sy * y      [y′] = [0   sy] [y]
2x2 Matrices
What types of transformations can be
represented with a 2x2 matrix?
2D Rotate around (0,0)?
x′ = cos(θ) * x − sin(θ) * y      [x′]   [cos(θ)  −sin(θ)] [x]
y′ = sin(θ) * x + cos(θ) * y      [y′] = [sin(θ)   cos(θ)] [y]

2D Shear?
x′ = x + shx * y      [x′]   [1    shx] [x]
y′ = shy * x + y      [y′] = [shy  1  ] [y]
2x2 Matrices
What types of transformations can be
represented with a 2x2 matrix?
2D Mirror about Y axis?
x′ = −x      [x′]   [−1  0] [x]
y′ = y       [y′] = [ 0  1] [y]

2D Mirror over (0,0)?
x′ = −x      [x′]   [−1   0] [x]
y′ = −y      [y′] = [ 0  −1] [y]
2x2 Matrices
What types of transformations can be
represented with a 2x2 matrix?
2D Translation?
x′ = x + tx
y′ = y + ty      NO!

Only linear 2D transformations
can be represented with a 2x2 matrix.
Linear Transformations
Linear transformations are combinations of …
• Scale,
• Rotation,                 [x′]   [a  b] [x]
• Shear, and                [y′] = [c  d] [y]
• Mirror
Properties of linear transformations:
• Satisfies: T(s₁·p₁ + s₂·p₂) = s₁·T(p₁) + s₂·T(p₂)
• Origin maps to origin
• Lines map to lines
• Parallel lines remain parallel
• Ratios are preserved
• Closed under composition
Homogeneous Coordinates
Q: How can we represent translation as a 3x3
matrix?
x' = x + t x
y' = y + t y
Homogeneous Coordinates
Homogeneous coordinates
• represent coordinates in 2
dimensions with a 3-vector:
(x, y)  →  (x, y, 1)   (homogeneous coords)

Homogeneous coordinates seem unintuitive,
but they make graphics operations much
easier.
Homogeneous Coordinates
Q: How can we represent translation as a 3x3
matrix? x ' = x + t
x

y' = y + t y

A: Using the rightmost column:

              [1  0  tx]
Translation = [0  1  ty]
              [0  0  1 ]
(Figure slides: Homogeneous Coordinates; back to Cartesian
coordinates; 2D Translation using Homogeneous Coordinates.)
Homogeneous Coordinates
Example of translation:

[x′]   [1  0  tx] [x]   [x + tx]
[y′] = [0  1  ty] [y] = [y + ty]
[1 ]   [0  0  1 ] [1]   [  1   ]

tx = 2
ty = 1
Scaling Equation (figure)
Homogeneous Coordinates
Add a 3rd coordinate to every 2D point:
• (x, y, w) represents a point at location (x/w, y/w)
• (x, y, 0) represents a point at infinity
• (0, 0, 0) is not allowed
Example: (2,1,1), (4,2,2) and (6,3,3) all represent the point (2, 1).
A convenient coordinate system to represent many
useful transformations.
Basic 2D Transformations
Basic 2D transformations as 3x3 matrices:

            [1  0  tx]           [sx  0   0]
Translate = [0  1  ty]   Scale = [0   sy  0]
            [0  0  1 ]           [0   0   1]

         [cos(θ)  −sin(θ)  0]           [1    shx  0]
Rotate = [sin(θ)   cos(θ)  0]   Shear = [shy  1    0]
         [0        0       1]           [0    0    1]

(each applied as [x′, y′, 1]ᵀ = M · [x, y, 1]ᵀ)
Affine Transformations
Affine transformations are combinations of …
• Linear transformations, and      [x′]   [a  b  c] [x]
• Translations                     [y′] = [d  e  f] [y]
                                   [w′]   [0  0  1] [w]
Properties of affine transformations:
• Origin does not necessarily map to origin
• Lines map to lines
• Parallel lines remain parallel
• Ratios are preserved
• Closed under composition
Outline
2D Transformations
• Basic 2D transformations
• Matrix representation
• Matrix composition
3D Transformations
• Basic 3D transformations
• Same as 2D
Matrix Composition
Transformations can be combined by
matrix multiplication:

[x′]   [1  0  tx] [cos(θ)  −sin(θ)  0] [sx  0   0] [x]
[y′] = [0  1  ty] [sin(θ)   cos(θ)  0] [0   sy  0] [y]
[w′]   [0  0  1 ] [0        0       1] [0   0   1] [w]

p’ = T(tx, ty) R(θ) S(sx, sy) p
Matrix Composition
Matrices are a convenient and efficient way
to represent a sequence of transformations
• General purpose representation
• Hardware matrix multiply

p’ = (T * (R * (S * p)))
p’ = (T * R * S) * p
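As a concrete illustration of the composition above, a minimal NumPy sketch (the angle, scale, and translation values are arbitrary examples, not from the slides) that builds the three 3x3 matrices and checks that applying the pre-multiplied product T·R·S to a point matches applying S, R, T one at a time:

    import numpy as np

    theta, sx, sy, tx, ty = np.deg2rad(45), 2.0, 2.0, 3.0, 1.0  # arbitrary values

    # Basic 2D transformations as 3x3 homogeneous matrices
    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]])
    S = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)

    p = np.array([1.0, 0.0, 1.0])      # the point (1, 0) in homogeneous coordinates

    M = T @ R @ S                      # compose once: scale, then rotate, then translate
    assert np.allclose(M @ p, T @ (R @ (S @ p)))   # same result either way
    print((M @ p)[:2])                 # Cartesian coordinates of the transformed point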
Matrix Composition
Be aware: order of transformations matters
– Matrix multiplication is not commutative

p’ = T * R * S * p
     “Global”  …  “Local”
(the leftmost matrix acts in global coordinates; the rightmost is applied first, closest to the point)
Matrix Composition
What if we want to rotate and translate?
• Ex: Rotate line segment by 45 degrees about
endpoint a, and lengthen

[Figure: the segment before and after the transformation, anchored at endpoint a]

Multiplication Order – Wrong Way
Our line is defined by two endpoints
• Applying a rotation of 45 degrees, R(45), affects both points
• We could try to translate both endpoints to return endpoint a to
its original position, but by how much?

[Figure: Wrong: R(45) alone.  Correct: 1. T(-3)  2. R(45)  3. T(3)]
Multiplication Order - Correct
Isolate endpoint a from rotation effects:
• First translate line so a is at origin: T(-3)
• Then rotate line 45 degrees: R(45)
• Then translate back so a is where it was: T(3)
Matrix Composition
Will this sequence of operations work?

$$\begin{bmatrix} 1 & 0 & -3 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos 45^\circ & -\sin 45^\circ & 0 \\ \sin 45^\circ & \cos 45^\circ & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 3 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_x \\ a_y \\ 1 \end{bmatrix} = \begin{bmatrix} a'_x \\ a'_y \\ 1 \end{bmatrix}$$
Matrix Composition
After correctly ordering the matrices:
• Multiply the matrices together
• What results is one matrix – store it (on a stack)!
• Multiply this matrix by the vector of each vertex
• All vertices are easily transformed with one matrix
multiply
Overview
2D Transformations
• Basic 2D transformations
• Matrix representation
• Matrix composition
3D Transformations
• Basic 3D transformations
• Same as 2D
3D Transformations
Same idea as 2D transformations
• Homogeneous coordinates: (x, y, z, w)
• 4x4 transformation matrices

$$\begin{bmatrix} x' \\ y' \\ z' \\ w' \end{bmatrix} = \begin{bmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$
Basic 3D Transformations

Identity:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

Scale:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

Translation:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

Mirror about Y/Z plane:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$
Geometric Transformations
3D Translation of Points:
Basic 3D Transformations

Rotate around Z axis:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

Rotate around Y axis:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$

Rotate around X axis:
$$\begin{bmatrix} x' \\ y' \\ z' \\ w \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}$$
Geometric Transformations
3D Rotation of Points:
Geometric Transformations

DR. GEORGE KARRAZ, Ph. D. 70
Geometric Transformations
• Scaling & Translating equations

DR. GEORGE KARRAZ, Ph. D. 71
Geometric Transformations

DR. GEORGE KARRAZ, Ph. D. 72
THANK YOU!
NEXT: VIDEO MPEG

DR. GEORGE KARRAZ, Ph. D.
COMPUTER VISION
LECTURE X
VIDEO MPEG
DR. GEORGE KARRAZ, Ph. D.
Video Compression
• We need to compress video (more so than audio/images) in practice
since:
• 1. Uncompressed video (and audio) data are huge.
• In HDTV, the bit rate easily exceeds 1 Gbps: a big problem for storage
and network communications. E.g. HDTV: 1920 x 1080 at 30 frames
per second, 8 bits per (PAL) channel = 1.5 Gbps.
• 2. Lossy methods have to be employed, since the compression ratio of
lossless methods (e.g. Huffman, Arithmetic, LZW) is not high enough for
image and video compression.

2 DR. GEORGE KARRAZ, Ph. D.
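To make the HDTV arithmetic above concrete, a quick back-of-the-envelope check in Python:

    # Uncompressed HDTV: 1920 x 1080 pixels, 30 frames/s, 3 channels x 8 bits
    bits_per_second = 1920 * 1080 * 30 * 3 * 8
    print(f"{bits_per_second / 1e9:.2f} Gbps")   # ~1.49 Gbps, i.e. roughly 1.5 Gbps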


Video Compression: MPEG
• Not the complete picture studied here!
• Much more to MPEG: plenty of other tricks employed.
• We only concentrate on some basic principles of video compression:
• Earlier H.261 and MPEG 1 and 2 standards, with a brief introduction of
ideas used in newer standards such as H.264 (MPEG-4 Advanced Video
Coding).
• Image, video, and audio compression standards have been specified and
released by two main groups since 1985:
• ISO International Standards Organization: JPEG, MPEG.
• ITU International Telecommunications Union: H.261-264.
3 DR. GEORGE KARRAZ, Ph. D.


Compression Standards
• Whilst in many cases the groups have specified separate standards, there is
some crossover between the groups. E.g.:
• JPEG issued by ISO in 1989 (but adopted by ITU as ITU T.81); MPEG 1 released by
ISO in 1991; H.261 released by ITU in 1993 (based on a CCITT 1990 draft).
• CCITT stands for Comité Consultatif International Téléphonique et Télégraphique,
whose parent organization is the ITU.
• H.262 (better known as MPEG 2) released in 1994.
• H.263 released in 1996, extended as H.263+, H.263++.
• MPEG 4 released in 1998.
• H.264 released in 2002 to lower the bit rates with comparable quality video and
support a wide range of bit rates; it is now part of
• MPEG 4 (Part 10, or AVC - Advanced Video Coding).
4 DR. GEORGE KARRAZ, Ph. D.


How to Compress Video?
• Basic Idea of Video Compression:
• Exploit the fact that adjacent frames are similar.
• Spatial redundancy removal: intra-frame coding (JPEG).
• NOT ENOUGH BY ITSELF!
• Temporal: greater compression by noting the temporal
coherence/incoherence over frames. Essentially we note the difference
between frames.
• Spatial and temporal redundancy removal: intra-frame and inter-frame
coding (H.261, MPEG).
• Things are much more complex in practice, of course.
5 DR. GEORGE KARRAZ, Ph. D.


How to Compress Video?

“It has been customary in the past to transmit successive complete
images of the transmitted picture.” … “In accordance with this
invention, this difficulty is avoided by transmitting only the difference
between successive images of the object.”
6 DR. GEORGE KARRAZ, Ph. D.


Simple Motion Example
• Consider a simple image of a moving circle.
• Let's just consider the difference between 2 frames.
• It is simple to encode/decode:
7 DR. GEORGE KARRAZ, Ph. D.
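A minimal NumPy sketch of the encode/decode step (the frame sizes and random contents are stand-ins for real video frames): the encoder sends frame 1 plus the signed difference, and the decoder reconstructs frame 2 exactly. For a moving circle the difference image is mostly zeros, which compresses well.

    import numpy as np

    # Two consecutive grayscale frames (random stand-ins for real frames)
    frame1 = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
    frame2 = np.random.randint(0, 256, (240, 320), dtype=np.uint8)

    # Encode: transmit frame1 plus the signed frame difference
    diff = frame2.astype(np.int16) - frame1.astype(np.int16)

    # Decode: reconstruct frame2 from frame1 and the difference
    reconstructed = (frame1.astype(np.int16) + diff).astype(np.uint8)
    assert np.array_equal(reconstructed, frame2)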


Estimating Motion of Blocks
• We will examine methods of estimating motion vectors
in due course.

8 DR. GEORGE KARRAZ, Ph. D.


Decoding Motion of Blocks

9 DR. GEORGE KARRAZ, Ph. D.


Motion Estimation Example

10 DR. GEORGE KARRAZ, Ph. D.


How is Motion Compensation Used?

• Block Matching:
• MPEG-1/H.261 relies on block matching techniques.
• For a certain area (block) of pixels in a picture: find a good estimate of
this area in a previous (or in a future!) frame, within a specified search
area.
• Motion compensation: uses the motion vectors to compensate the
picture. Parts of a previous (or future) picture can be reused in a
subsequent picture.
• Individual parts are spatially compressed: JPEG-type compression.

11 DR. GEORGE KARRAZ, Ph. D.


Any Overheads?
• Motion estimation/compensation techniques reduce the
video bitrate significantly, but introduce extra computational
complexity. The decoder needs to buffer reference pictures for backward
and forward referencing.
• Delay.
• Let's see how such ideas are used in practice.

12 DR. GEORGE KARRAZ, Ph. D.


Overview of H.261
• Developed by CCITT in 1988-1990 for video telecommunication applications.
• Meant for videoconferencing and video telephone applications over ISDN
telephone lines.
• Baseline ISDN is 64 kbits/sec, and integral multiples (p x 64).
• Frame sizes are CCIR 601 CIF (Common Intermediate Format) (352x288) and
QCIF (176x144) images with 4:2:0 subsampling.
• Two frame types: Intra-frames (I-frames) and Inter-frames (P-frames). I-frames
use basically JPEG, but with YUV (YCrCb), larger DCT windows, and different
quantisation.
• I-frames provide us with refresh access points: key frames.
• P-frames use pseudo-differences from the previous frame (predicted), so frames
depend on each other.

13 DR. GEORGE KARRAZ, Ph. D.


H.261 Group of Pictures
• We typically have a group of pictures (GOP): one I-frame followed by
several P-frames.
• The number of P-frames following each I-frame determines the size of the
GOP; it can be fixed or dynamic.
• Why can't this be too large?

14 DR. GEORGE KARRAZ, Ph. D.


Intra-frame Coding
• Various lossless and lossy compression techniques are used, as in JPEG.
• Compression is contained only within the current frame.
• Simpler coding, but not enough by itself for high compression.
• Can't rely on intra-frame coding alone: not enough compression.
• A Motion JPEG (MJPEG) standard does exist, but it is not commonly used.
• So introduce the idea of inter-frame difference coding.
• However, we can't rely on inter-frame differences across a large number
of frames.
• So when errors get too large, start a new I-frame.

15 DR. GEORGE KARRAZ, Ph. D.


Intra-frame Coding (Cont.)
• Intra-frame coding is very similar to JPEG:

16 DR. GEORGE KARRAZ, Ph. D.


Intra-frame Coding (Cont.)
• A basic intra-frame coding scheme is as follows:
• Macroblocks are typically 16x16 pixel areas on the Y plane of the original image.
• A macroblock usually consists of 4 Y blocks, 1 Cr block, and 1 Cb block (4:2:0
chroma subsampling).
• The eye is most sensitive to luminance, less sensitive to chrominance.
• We operate in a more effective color space: YUV (YCbCr) color, which we
studied earlier.
• It is typical to use 4:2:0 macroblocks: one quarter of the chrominance
information is used.
• Quantization is by a constant value for all DCT coefficients, i.e., no quantization
table as in JPEG.
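A minimal sketch of that last point using SciPy's DCT (the block contents and the quantizer step q are arbitrary illustrative choices): unlike JPEG's per-frequency quantization table, a single constant divides every coefficient.

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.randint(0, 256, (8, 8)).astype(float)   # one 8x8 luminance block

    coeffs = dctn(block, norm="ortho")        # forward 2D DCT
    q = 16                                    # one constant step for ALL coefficients
    quantized = np.round(coeffs / q)          # these values get entropy-coded

    decoded = idctn(quantized * q, norm="ortho")   # decoder: dequantize + inverse DCT
    print(np.abs(decoded - block).max())           # quantization error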

17 DR. GEORGE KARRAZ, Ph. D.


Inter-frame (P-frame) Coding
• Intra-frame coding is limited to a spatial basis relative to a single frame.
• Considerably more compression is possible if the inherent temporal basis is
exploited as well.
• BASIC IDEA:
• Most consecutive frames within a sequence are very similar to the frames
both before (and after) the frame of interest.
• Aim to exploit this redundancy.
• Use a technique known as block-based motion-compensated prediction.
• Need to use motion estimation.
• Coding needs extensions for inter-frames, but the encoder can also support an
intra-frame subset.

18 DR. GEORGE KARRAZ, Ph. D.


Inter-frame (P-frame) Coding (Cont.)
• P-coding can be summarized as follows:

19 DR. GEORGE KARRAZ, Ph. D.


Inter-frame (P-frame) Coding (Cont.)

20 DR. GEORGE KARRAZ, Ph. D.


Inter-frame (P-frame) Coding (Cont.)

21 DR. GEORGE KARRAZ, Ph. D.


Motion Vector Search
• So we know how to encode a P-block.
• How do we find the motion vector?

22 DR. GEORGE KARRAZ, Ph. D.


Motion Estimation
• The temporal prediction technique used in MPEG video is based on
motion estimation.
• The basic premise:
• Consecutive video frames will be similar except for changes induced
by objects moving within the frames.
• Trivial case of zero motion between frames: no other differences
except noise etc.
• Easy for the encoder to predict the current frame as a duplicate of the
prediction frame.
• When there is motion in the images, the situation is not as simple.

23 DR. GEORGE KARRAZ, Ph. D.


Example
• The problem for motion estimation to solve is:
• How to adequately represent the changes, or differences, between
these two video frames.

24 DR. GEORGE KARRAZ, Ph. D.


Solution
• A comprehensive 2-dimensional spatial search is performed for each
luminance macroblock.
• Motion estimation is not applied directly to chrominance in MPEG.
• MPEG does not define how this search should be performed.
• This is a detail that the system designer can choose to implement in one of many
possible ways.
• It is well known that a full, exhaustive search over a wide 2-D area yields the
best matching results in most cases, but at extreme computational cost to
the encoder.
• Motion estimation is usually the most computationally expensive portion
of the video encoding.

25 DR. GEORGE KARRAZ, Ph. D.


Motion Estimation Example

26 DR. GEORGE KARRAZ, Ph. D.


Motion Vectors, Matching Blocks
• The previous figure shows an example of a particular macroblock from
Frame 2 of the earlier example, relative to various macroblocks of Frame 1:
• The top frame has a bad match with the macroblock to be coded.
• The middle frame has a fair match, as there is some commonality
between the 2 macroblocks.
• The bottom frame has the best match, with only a slight error between
the 2 macroblocks.
• Because a relatively good match has been found, the encoder assigns
motion vectors to that macroblock.

27 DR. GEORGE KARRAZ, Ph. D.


Final Motion Estimation Prediction

28 DR. GEORGE KARRAZ, Ph. D.


Final Motion Estimation Prediction (Cont.)
• The predicted frame is subtracted from the desired frame,
leaving a (hopefully) less complicated residual error frame which can
then be encoded much more efficiently than before motion
estimation.

29 DR. GEORGE KARRAZ, Ph. D.


Example

30 DR. GEORGE KARRAZ, Ph. D.


Example

31 DR. GEORGE KARRAZ, Ph. D.


Example

32 DR. GEORGE KARRAZ, Ph. D.


Further Coding Efficiency
• Differential Coding of Motion Vectors
• Motion vectors tend to be highly correlated between macroblocks:
• The horizontal component is compared to the previously valid
horizontal motion vector and
• only the difference is coded.
• The same difference is calculated for the vertical component.
• Difference codes are then described with a variable-length code (e.g.
Huffman) for maximum compression efficiency.

33 DR. GEORGE KARRAZ, Ph. D.


Recap: P-Frame Coding Summary

34 DR. GEORGE KARRAZ, Ph. D.


Estimating the Motion Vectors
• So how do we find the motion?
• The basic idea is to search for the macroblock (MB)
• within an n x m pixel search window.
• Work out for each window:
• Sum of Absolute Differences (SAD) (or Mean Absolute Error (MAE)).
• Choose the window where SAD/MAE is a minimum. If the encoder decides that
no acceptable match exists, then it has the option of
• coding that particular macroblock as an intra macroblock, even though it
may be in a P frame!
• In this manner, high quality video is maintained at a slight cost to coding
efficiency.

35 DR. GEORGE KARRAZ, Ph. D.


Full Search
• Search exhaustively the whole (2R + 1) x (2R + 1) window in the
reference frame.
• A macroblock centered at each of the positions within the window is
compared to the macroblock in the target frame pixel by pixel and
their respective SAD (or MAE) is computed.
• The vector (i, j) that offers the least SAD (or MAE) is designated as the
motion vector for the macroblock in the target frame.
• Full search is very costly.
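A minimal NumPy sketch of full-search block matching with SAD (the block size B, search radius R, and the synthetic shifted frames are illustrative choices, not part of any standard):

    import numpy as np

    def full_search(target, ref, by, bx, B=16, R=8):
        """Motion vector for the BxB target block at (by, bx): exhaustive SAD
        search over the (2R+1) x (2R+1) window in the reference frame."""
        block = target[by:by+B, bx:bx+B].astype(np.int32)
        best, best_mv = None, (0, 0)
        for dy in range(-R, R + 1):
            for dx in range(-R, R + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                    continue                  # candidate falls outside the frame
                sad = np.abs(block - ref[y:y+B, x:x+B].astype(np.int32)).sum()
                if best is None or sad < best:
                    best, best_mv = sad, (dx, dy)
        return best_mv, best

    # Toy test: ref is target shifted by (dy=2, dx=3), so that vector should win
    target = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    ref = np.roll(target, shift=(2, 3), axis=(0, 1))
    print(full_search(target, ref, by=24, bx=24))   # expect motion vector (3, 2)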

36 DR. GEORGE KARRAZ, Ph. D.


Full Search
• Advantages:
Guaranteed to find the optimal motion vector within the search range.
• Disadvantages:
Can only search among integer-pixel candidates. What if the motion is by a
fractional number of pixels?
• High computational complexity: O((2R + 1)²) candidate positions per macroblock.
• HOW TO IMPROVE?
1. Accuracy: consider fractional translations; this requires interpolation
(e.g. bilinear in H.263).
2. Speed: try to avoid checking unlikely candidates.
37 DR. GEORGE KARRAZ, Ph. D.


Bilinear Interpolation

38 DR. GEORGE KARRAZ, Ph. D.
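The slide's figure is not reproduced here; as a stand-in, a minimal sketch of bilinear interpolation at a fractional position (x, y) of a grayscale image, the operation that fractional-pixel (e.g. half-pixel) motion search relies on:

    import numpy as np

    def bilinear(img, x, y):
        """Interpolate img at fractional coordinates (x, y) from its 4 neighbours."""
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        a, b = x - x0, y - y0                  # fractional parts
        return ((1 - a) * (1 - b) * img[y0,     x0    ] +
                a       * (1 - b) * img[y0,     x0 + 1] +
                (1 - a) * b       * img[y0 + 1, x0    ] +
                a       * b       * img[y0 + 1, x0 + 1])

    img = np.arange(16, dtype=float).reshape(4, 4)
    print(bilinear(img, 1.5, 2.5))   # 11.5: the mean of the 4 surrounding pixels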


2D Logarithmic Search
• An approach that takes several iterations, akin to a binary search.
Computationally cheaper, suboptimal but usually effective.
• Initially only nine locations in the search window are used as seeds
for a SAD-based search (marked as '1').
• After locating the one with the minimal SAD, the center of the new
search region is moved to it and the step size is reduced to half.
• In the next iteration, the nine new locations are marked as '2' and this
process repeats. If L iterations are applied, only 9L positions are
checked altogether.
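A minimal sketch of the idea (the SAD helper mirrors the full-search example earlier; the step-halving schedule is the textbook version, and details vary between implementations):

    import numpy as np

    def sad(target, ref, by, bx, y, x, B=16):
        t = target[by:by+B, bx:bx+B].astype(np.int32)
        r = ref[y:y+B, x:x+B].astype(np.int32)
        return np.abs(t - r).sum()

    def log_search(target, ref, by, bx, B=16, R=8):
        cy, cx = by, bx                    # current search centre in the reference
        step = max(R // 2, 1)
        while True:
            # evaluate the 3x3 grid of candidates spaced `step` pixels apart
            cands = [(cy + dy * step, cx + dx * step)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if 0 <= cy + dy * step <= ref.shape[0] - B
                     and 0 <= cx + dx * step <= ref.shape[1] - B]
            cy, cx = min(cands, key=lambda p: sad(target, ref, by, bx, *p))
            if step == 1:
                break
            step //= 2                     # halve the step size and repeat
        return (cx - bx, cy - by)          # motion vector (dx, dy)

    target = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    ref = np.roll(target, shift=(2, 3), axis=(0, 1))
    print(log_search(target, ref, by=24, bx=24))   # usually recovers (3, 2)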

39 DR. GEORGE KARRAZ, Ph. D.


2D Logarithmic Search (Cont.)

40 DR. GEORGE KARRAZ, Ph. D.


Hierarchical Motion Estimation

1. Form several low-resolution versions of the target and
reference pictures.
2. Find the best-match motion vector in the lowest
resolution version.
3. Modify the motion vector level by level when going
up.

41 DR. GEORGE KARRAZ, Ph. D.


Hierarchical Motion Estimation

42 DR. GEORGE KARRAZ, Ph. D.


Performance Comparison
• Operations for 720x480 at 30 fps (GOPS):

  Search Method   p = 15    p = 7
  Full Search     29.890    6.990
  Logarithmic      1.020    0.778
  Hierarchical     0.507    0.399

43 DR. GEORGE KARRAZ, Ph. D.


MPEG Compression
• MPEG stands for:
• Motion Picture Expert Group, established circa 1990 to create standards
for delivery of audio and video.
• MPEG-1 (1991). Target: VHS quality on a CD-ROM (320 x 240 + CD audio @
1.5 Mbits/sec).
• MPEG-2 (1994). Target: television broadcast.
• MPEG-3: HDTV, but subsumed into an extension of MPEG-2.
• MPEG-4 (1998): Very Low Bitrate Audio-Visual Coding; later MPEG-4 Part
10 (H.264) for a wide range of bitrates and better compression quality.
• MPEG-7 (2001): “Multimedia Content Description Interface”.
• MPEG-21 (2002): “Multimedia Framework”.

44 DR. GEORGE KARRAZ, Ph. D.


Three Parts to MPEG
• The MPEG standard has three parts:
• Video: based on H.261 and JPEG.
• Audio: based on MUSICAM (Masking pattern adapted
Universal Sub-band Integrated Coding And Multiplexing) technology.
• System: controls interleaving of streams.

49 DR. GEORGE KARRAZ, Ph. D.


MPEG Video
• MPEG compression is essentially an attempt to overcome some
shortcomings of H.261 and JPEG:

46 DR. GEORGE KARRAZ, Ph. D.


The Need for a Bidirectional Search
• The problem here is that many macroblocks need information that is
not in the reference frame.
• For example:
• Occlusion by objects affects differencing
• Difficult to track occluded objects etc.
• MPEG uses forward/backward interpolated prediction.

47 DR. GEORGE KARRAZ, Ph. D.


MPEG B-Frames
• The MPEG solution is to add a third frame type, which is a
bidirectional frame, or B-frame.
• B-frames search for macroblocks in past and future frames.
• Typical pattern is IBBPBBPBB. The actual pattern is up to the
encoder, and need not be regular.

48 DR. GEORGE KARRAZ, Ph. D.


Example: I, P, and B frames
• Consider a group of pictures that lasts for 6 frames:
• Given: I,B,P,B,P,B, I,B,P,B,P,B, …
• I frames are coded spatially only (as before in H.261).
• P frames are forward predicted based on previous I and P frames (as before
in H.261).
• B frames are coded based on a forward prediction from a previous I or P
frame, as well as a backward prediction from a succeeding I or P frame.

49 DR. GEORGE KARRAZ, Ph. D.


Bidirectional Prediction

50 DR. GEORGE KARRAZ, Ph. D.


Example: I, P, and B frames (Cont.)

• The 1st B frame is predicted from the 1st I frame
and the 1st P frame.
• The 2nd B frame is predicted from the 1st and
2nd P frames.
• The 3rd B frame is predicted from the 2nd and
3rd P frames.
• The 4th B frame is predicted from the 3rd P
frame and the 1st I frame of the next group
of pictures.

51 DR. GEORGE KARRAZ, Ph. D.


Bidirectional Prediction

52 DR. GEORGE KARRAZ, Ph. D.


Backward Prediction Implications
• Note: backward prediction requires that the future frames that
are to be used for backward prediction be
encoded and transmitted first, i.e. out of order.
• This process is summarized:

53 DR. GEORGE KARRAZ, Ph. D.


Backward Prediction Implications (Cont.)
• Also NOTE:
• There is no defined limit to the number of consecutive B frames that may be
used in a group of pictures.
• The optimal number is application dependent.
• Most broadcast quality applications, however, have tended to use 2
consecutive B frames (I,B,B,P,B,B,P,...) as the ideal trade-off
between compression efficiency and video quality.
• MPEG suggests some standard groupings.

54 DR. GEORGE KARRAZ, Ph. D.


Advantage of Using B frames
• Coding efficiency.
• Most B frames use fewer bits.
• Quality can also be improved in the case of moving objects that reveal
hidden areas within a video sequence.
• Less error propagation: since B frames are not used to predict future frames,
errors generated will not be propagated further within the sequence.
• Disadvantages:
• Frame reconstruction memory buffers within the encoder and decoder
must be doubled in size to accommodate the 2 anchor frames.
• More delay in real-time applications.

55 DR. GEORGE KARRAZ, Ph. D.


Frame Sizes

56 DR. GEORGE KARRAZ, Ph. D.


Random Access Points

57 DR. GEORGE KARRAZ, Ph. D.


MPEG-2, MPEG-3, and MPEG-4

58 DR. GEORGE KARRAZ, Ph. D.


THANK YOU!
NEXT: MOTION TRACKING

DR. GEORGE KARRAZ, Ph. D.
COMPUTER VISION
LECTURE XI
MOTION TRACKING

DR. GEORGE KARRAZ, Ph. D.


Contents:

The Problem
Goals
Approaches
The Optical Flow Method
Algorithm

DR. GEORGE KARRAZ, Ph. D. 2


The Problem

Given a set of images in time which are similar but not identical,
derive a method for identifying the motion that has occurred (in
2d) between different images.

DR. GEORGE KARRAZ, Ph. D. 3


Goals
Input:
➢ an image sequence
➢ captured with a fixed camera
➢ containing one or more moving objects of interest
Processing goals: determine the image regions where significant
motion has occurred
Output: an outline of the motion within the image sequence

DR. GEORGE KARRAZ, Ph. D. 4


Motion Detection and Estimation

Image differencing
➢ based on the thresholded difference of successive images
➢ difficult to reconstruct moving areas
Background subtraction
➢ foreground objects are obtained by calculating the difference between an image
in the sequence and the background image (previously obtained)
➢ remaining task: determine the movement of these foreground objects
between successive frames
Block motion estimation
➢ calculates the motion vector between frames for sub-blocks of the image
➢ mainly used in image compression
➢ too coarse
Optical Flow
DR. GEORGE KARRAZ, Ph. D. 5
What Is Optical Flow?

Optical flow is the displacement field for
each of the pixels in an image sequence.
For every pixel, a velocity vector $\left( \frac{dx}{dt}, \frac{dy}{dt} \right)$
is found which says:
➢ how quickly a pixel is moving across
the image
➢ the direction of its movement.

DR. GEORGE KARRAZ, Ph. D. 6


Optical Flow Examples

[Figure: example flow fields for Translation, Rotation, and Scaling]

DR. GEORGE KARRAZ, Ph. D. 7


Algorithm

Optical flow: maximum one pixel large
movements
Optical flow: larger movements
Morphological filter
Contour detection (demo purposes)

DR. GEORGE KARRAZ, Ph. D. 8


Optical Flow: maximum one pixel large
movements

The optical flow for a pixel (i, j), given 2
successive images k and k + 1, is:

$$m_k(i, j) = (x, y) \quad\text{such that}\quad \left| I_k(i, j) - I_{k+1}(i + x, j + y) \right| \tag{1}$$

is minimum for $-1 \le x \le 1,\; -1 \le y \le 1$.

DR. GEORGE KARRAZ, Ph. D. 9


[Figure: frames k and k+1 side by side]
Optical Flow: maximum one pixel large
movements (2)

More precision: consider a 3×3 window around
the pixel.

The optical flow for pixel (i, j) becomes:

$$m_k(i, j) = (x, y) \quad\text{such that}\quad \left| \sum_{u=-1}^{1}\sum_{v=-1}^{1} I_k(i+u, j+v) - \sum_{u=-1}^{1}\sum_{v=-1}^{1} I_{k+1}(i+u+x, j+v+y) \right| \tag{2}$$

is minimum for $-1 \le x \le 1,\; -1 \le y \le 1$.

DR. GEORGE KARRAZ, Ph. D. 10
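A minimal NumPy sketch of equations (1)-(2) (the frame contents and the toy shift are illustrative): for each pixel, test the 9 one-pixel displacements and keep the one minimizing the windowed difference.

    import numpy as np

    def one_pixel_flow(Ik, Ik1, i, j):
        """Optical flow (x, y) at pixel (i, j) per Eq. (2): compare 3x3 window
        sums over the 9 candidate displacements -1 <= x, y <= 1."""
        w = Ik[i-1:i+2, j-1:j+2].astype(np.int32).sum()
        best, best_m = None, (0, 0)
        for x in (-1, 0, 1):
            for y in (-1, 0, 1):
                w1 = Ik1[i+x-1:i+x+2, j+y-1:j+y+2].astype(np.int32).sum()
                d = abs(w - w1)
                if best is None or d < best:
                    best, best_m = d, (x, y)
        return best_m

    Ik = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
    Ik1 = np.roll(Ik, shift=1, axis=0)      # everything moves one pixel down
    print(one_pixel_flow(Ik, Ik1, 10, 10))  # expect (1, 0)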
Optical Flow: larger movements

Reduce the size of the image
=> reduced size of the movement

Solution: multi-resolution analysis of the images

Advantages: computing efficiency, stability
DR. GEORGE KARRAZ, Ph. D. 11
Multi-resolution Analysis

Coarse-to-fine optical flow estimation:

[Figure: Gaussian pyramids of original image k and original image k+1,
with levels 32×32, 64×64, 128×128, and 256×256]


DR. GEORGE KARRAZ, Ph. D. 12
Gaussian Pyramid

Lowest level g_0: the original image.
Level g_l: the weighted average of values in g_{l-1}
in a 5×5 window:

$$g_l(i, j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} w(m, n)\, g_{l-1}(2i + m, 2j + n) \tag{3}$$
DR. GEORGE KARRAZ, Ph. D. 13


Gaussian Pyramid (2)

The mask G(m, n) is an approximation of the 2D
Gaussian:

$$G = \begin{bmatrix}
0.003 & 0.013 & 0.022 & 0.013 & 0.003 \\
0.013 & 0.060 & 0.098 & 0.060 & 0.013 \\
0.022 & 0.098 & 0.162 & 0.098 & 0.022 \\
0.013 & 0.060 & 0.098 & 0.060 & 0.013 \\
0.003 & 0.013 & 0.022 & 0.013 & 0.003
\end{bmatrix}$$

The mask is symmetric and separable:

$$G(m, n) = G_r(m) \cdot G_c(n) \tag{4}$$
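A minimal sketch of one pyramid reduction step implementing equations (3)-(4) (the 1D kernel [0.05, 0.25, 0.4, 0.25, 0.05] is a common choice whose outer product closely matches the 5×5 mask above):

    import numpy as np
    from scipy.ndimage import convolve

    def pyramid_reduce(g):
        """One REDUCE step: 5x5 Gaussian smoothing, then keep every 2nd pixel (Eq. 3)."""
        k1d = np.array([0.05, 0.25, 0.4, 0.25, 0.05])
        w = np.outer(k1d, k1d)           # separable: G(m, n) = G_r(m) * G_c(n)  (Eq. 4)
        smoothed = convolve(g.astype(float), w, mode="nearest")
        return smoothed[::2, ::2]        # downsample by 2 in each direction

    img = np.random.rand(256, 256)
    pyramid = [img]
    while pyramid[-1].shape[0] >= 64:    # build the 256, 128, 64, 32 levels
        pyramid.append(pyramid_reduce(pyramid[-1]))
    print([level.shape for level in pyramid])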
DR. GEORGE KARRAZ, Ph. D. 14
Optical Flow: Top-down Strategy

Algorithm (1/4 scale of resolution reduction):

Step 1: compute optical flow vectors for the highest
level of the pyramid l (smallest resolution)
Step 2: double the values of the vectors
Step 3: first approximation: the optical flow vectors for the
(2i, 2j), (2i+1, 2j), (2i, 2j+1), (2i+1, 2j+1) pixels in the l-1
level are assigned the value of the optical flow vector for
the (i, j) pixel from the l level

DR. GEORGE KARRAZ, Ph. D. 15


[Figure: one pixel at level l maps to a 2×2 block of pixels at level l-1]
Optical Flow: Top-down Strategy (2)

Step 4:
➢ adjustment of the vectors of the l-1 level in the pyramid
➢ method: detection of maximum one pixel displacements
around the initially approximated position

Step 5: smoothing of the optical flow field (Gaussian
filter)

DR. GEORGE KARRAZ, Ph. D. 16
Filtering the Size of the Detected Regions

Small isolated regions of motion detected by the
optical flow method are classified as noise and
are eliminated with the help of morphological
operations:
Step 1: Apply the opening.
Step 2: Apply the closing.
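A minimal OpenCV sketch of the two steps (the toy mask and the 5×5 elliptical kernel are arbitrary illustrative choices): opening removes small isolated blobs, and closing then fills small holes in the surviving regions.

    import cv2
    import numpy as np

    # Toy binary motion mask: one genuine moving region plus isolated noise pixels
    mask = np.zeros((100, 100), dtype=np.uint8)
    mask[30:70, 30:70] = 255
    mask[5, 5] = mask[90, 10] = 255

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)      # Step 1: opening
    cleaned = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)  # Step 2: closing
    print(cleaned[5, 5], cleaned[50, 50])   # noise removed (0), region kept (255)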

DR. GEORGE KARRAZ, Ph. D. 17


Contour Detection
For demonstration purposes, the contours of the moving regions detected
are outlined.

Method: the Sobel edge detector:

➢ Compute the intensity gradient:

$$\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (f_x, f_y) \tag{5}$$

using the Sobel masks:

$$G_x = \frac{1}{4}\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad G_y = \frac{1}{4}\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{6}$$

➢ Compute the magnitude of the gradient:

$$M(x, y) = \|\nabla f(x, y)\| = \sqrt{f_x^2 + f_y^2} \tag{7}$$

➢ if M(x, y) ≥ threshold then edge pixel,
else non-edge pixel.
DR. GEORGE KARRAZ, Ph. D. 18
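A minimal OpenCV sketch of equations (5)-(7) (the input file name and the threshold value are placeholders; note cv2.Sobel omits the 1/4 normalization, which only rescales the threshold):

    import cv2
    import numpy as np

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file

    fx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient (Eqs. 5-6)
    fy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
    M = np.sqrt(fx**2 + fy**2)                       # gradient magnitude (Eq. 7)

    threshold = 100.0                                # arbitrary illustrative value
    edges = (M >= threshold).astype(np.uint8) * 255  # edge / non-edge decision
    cv2.imwrite("edges.png", edges)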
A Block Diagram of the System

DR. GEORGE KARRAZ, Ph. D. 19


THANK YOU!
NEXT: MOTION ESTIMATION

DR. GEORGE KARRAZ, Ph. D.
COMPUTER VISION
LECTURE XII
MOTION ESTIMATION
DR. GEORGE KARRAZ, Ph. D.
Problem definition: optical flow

How to estimate pixel motion from image H to image I?


• Solve pixel correspondence problem
– given a pixel in H, look for nearby pixels of the same color in I

Key assumptions
• color constancy: a point in H looks the same in I
– For grayscale images, this is brightness constancy
• small motion: points do not move very far
This is called the optical flow problem.

2 DR. GEORGE KARRAZ, Ph. D.
Optical flow constraints (grayscale images)

Let’s look at these constraints more closely


• brightness constancy: Q: what’s the equation?

• small motion: (u and v are less than 1 pixel)


– suppose we take the Taylor series expansion of I:

3
DR. GEORGE KARRAZ, Ph. D.
Optical flow equation
Combining these two equations
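The combined equation appears as an image in the original slides; for reference, the standard derivation it depicts:

    % Brightness constancy:  I(x + u, y + v, t + 1) = I(x, y, t)
    % First-order Taylor expansion of the left-hand side:
    %   I(x + u, y + v, t + 1) ≈ I(x, y, t) + I_x u + I_y v + I_t
    % Combining the two gives the optical flow constraint equation:
    \[
      I_x u + I_y v + I_t \approx 0
      \qquad\Longleftrightarrow\qquad
      \nabla I \cdot (u, v)^\top + I_t \approx 0
    \]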

In the limit as u and v go to zero, this becomes exact

DR. GEORGE KARRAZ, Ph. D. 4


Optical flow equation

Q: how many unknowns and equations per pixel?

Intuitively, what does this constraint mean?


• The component of the flow in the gradient direction is determined
• The component of the flow parallel to an edge is unknown

DR. GEORGE KARRAZ, Ph. D. 5


Aperture problem

DR. GEORGE KARRAZ, Ph. D. 6


Aperture problem

DR. GEORGE KARRAZ, Ph. D. 7


Solving the aperture problem
How to get more equations for a pixel?
• Basic idea: impose additional constraints
– most common is to assume that the flow field is smooth locally
– one method: pretend the pixel’s neighbors have the same (u,v)
» If we use a 5x5 window, that gives us 25 equations per pixel!

DR. GEORGE KARRAZ, Ph. D. 8


RGB version
How to get more equations for a pixel?
• Basic idea: impose additional constraints
– most common is to assume that the flow field is smooth locally
– one method: pretend the pixel’s neighbors have the same (u,v)
» If we use a 5x5 window, that gives us 25*3 equations per pixel!

DR. GEORGE KARRAZ, Ph. D. 9



Lukas-Kanade flow
Problem: we have more equations than unknowns

Solution: solve a least squares problem

• the minimum least squares solution is given by the solution (in d) of:

• The summations are over all pixels in the K x K window

• This technique was first proposed by Lukas & Kanade (1981)
– described in the Trucco & Verri reading

DR. GEORGE KARRAZ, Ph. D. 10
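The normal equations themselves appear as images in the slides; a minimal NumPy sketch of one Lucas-Kanade step at a single pixel (the window size K, the gradient computation, and the synthetic test frames are standard choices assumed here, not taken from the slides): stack the gradient equations from a K x K window and solve the least-squares system for d = (u, v).

    import numpy as np

    def lucas_kanade_pixel(H, I, y, x, K=5):
        """Estimate flow (u, v) at (y, x) from frames H -> I using a KxK window."""
        r = K // 2
        win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))

        Iy, Ix = np.gradient(H.astype(float))     # spatial gradients
        It = I.astype(float) - H.astype(float)    # temporal derivative

        A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # (K*K) x 2
        b = -It[win].ravel()

        # Least-squares solution of A d = b (equivalent to the 2x2 normal equations)
        d, *_ = np.linalg.lstsq(A, b, rcond=None)
        return d    # (u, v)

    xs = np.linspace(0, 4 * np.pi, 64)
    H = np.sin(xs)[None, :] * np.sin(xs)[:, None]   # smooth synthetic frame
    I = np.roll(H, shift=1, axis=1)                  # content moves right by 1 pixel
    print(lucas_kanade_pixel(H, I, 32, 32))          # roughly (1, 0), up to linearization error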
Conditions for Solvability
• Optimal (u, v) satisfies the Lucas-Kanade equation

When is This Solvable?

• AᵀA should be invertible
• AᵀA should not be too small due to noise
– eigenvalues λ₁ and λ₂ of AᵀA should not be too small
• AᵀA should be well-conditioned
– λ₁/λ₂ should not be too large (λ₁ = larger eigenvalue)

DR. GEORGE KARRAZ, Ph. D. 11



Eigenvectors of AᵀA

Suppose (x, y) is on an edge. What is AᵀA?

• gradients along the edge all point the same direction
• gradients away from the edge have small magnitude

• the gradient direction is an eigenvector, with a large eigenvalue

• What's the other eigenvector of AᵀA?
– let N be perpendicular to the gradient direction
– N is the second eigenvector, with eigenvalue 0

The eigenvectors of AᵀA relate to edge direction and magnitude.

DR. GEORGE KARRAZ, Ph. D. 12

Edge

– large gradients, all the same
– large λ₁, small λ₂
13
Low texture region

– gradients have small magnitude
– small λ₁, small λ₂
DR. GEORGE KARRAZ, Ph. D. 14
High textured region

– gradients are different, large magnitudes
– large λ₁, large λ₂
DR. GEORGE KARRAZ, Ph. D. 15
Observation
This is a two image problem BUT
• Can measure sensitivity by just looking at one of the images!
• This tells us which pixels are easy to track, which are hard
– very useful later on when we do feature tracking...

DR. GEORGE KARRAZ, Ph. D. 16


Errors in Lukas-Kanade
What are the potential causes of errors in this procedure?
• Suppose AᵀA is easily invertible
• Suppose there is not much noise in the image
When our assumptions are violated
• Brightness constancy is not satisfied
• The motion is not small
• A point does not move like its neighbors
– window size is too large
– what is the ideal window size?

DR. GEORGE KARRAZ, Ph. D. 17


Improving accuracy
Recall our small motion assumption

This is not exact
• To do better, we need to add higher order terms back in:

This is a polynomial root-finding problem
• Can solve using Newton’s method
– Also known as the Newton-Raphson method
• The Lukas-Kanade method does one iteration of Newton’s method
– Better results are obtained via more iterations

DR. GEORGE KARRAZ, Ph. D. 18


Iterative Refinement
Iterative Lukas-Kanade Algorithm
1. Estimate velocity at each pixel by solving Lucas-Kanade equations
2. Warp H towards I using the estimated flow field
- use image warping techniques
3. Repeat until convergence

DR. GEORGE KARRAZ, Ph. D. 19


Revisiting the small motion assumption

Is this motion small enough?


• Probably not—it’s much larger than one pixel (2nd order terms dominate)
• How might we solve this problem?
DR. GEORGE KARRAZ, Ph. D. 20
Reduce the resolution!

DR. GEORGE KARRAZ, Ph. D. 21


Coarse-to-fine Optical Flow Estimation

[Figure: Gaussian pyramids of image H and image I; a motion of u = 10
pixels at full resolution becomes u = 5, u = 2.5, and u = 1.25 pixels at
successively coarser pyramid levels]

DR. GEORGE KARRAZ, Ph. D. 22

Coarse-to-fine Optical Flow Estimation

[Figure: starting at the coarsest level of the Gaussian pyramids of image H
and image I, run iterative L-K, then warp & upsample, run iterative L-K
again, and repeat down to full resolution]

DR. GEORGE KARRAZ, Ph. D. 23
Multi-resolution Lucas Kanade Algorithm

DR. GEORGE KARRAZ, Ph. D. 24
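OpenCV ships a pyramidal (multi-resolution) iterative Lucas-Kanade tracker; a minimal sketch of tracking corner points between two frames (the file names are placeholders):

    import cv2
    import numpy as np

    prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
    curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

    # Pick easy-to-track points: high-texture corners (cf. the eigenvalue discussion)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.01, minDistance=8)

    # Pyramidal iterative Lucas-Kanade: maxLevel sets the pyramid depth
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev, curr, pts, None, winSize=(15, 15), maxLevel=3)

    flow = (new_pts - pts)[status.ravel() == 1]   # displacements of tracked points
    print(flow.reshape(-1, 2)[:5])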


Optical Flow Results

DR. GEORGE KARRAZ, Ph. D. 25


Optical Flow Results

DR. GEORGE KARRAZ, Ph. D. 26


Optical Flow Results

DR. GEORGE KARRAZ, Ph. D. 27


THANK YOU!
END OF COMPUTER VISION COURSE

DR. GEORGE KARRAZ, Ph. D.