
Scale Invariant Feature Transform (SIFT) is an approach for detecting and extracting local feature descriptors that are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint. SIFT is a detector and a descriptor algorithm in one.

Types of invariance:

- Illumination
- Scale
- Rotation
- Affine
- Full perspective


Goals:

1. Extract distinctive invariant features
   - Correctly matched against a large database of features from many images
2. Invariance to image scale and rotation
3. Robustness to:
   - Affine distortion (rotation, scale, shear)
   - Change in 3D viewpoint
   - Addition of noise
   - Change in illumination

Advantages:

1. Locality: features are local, so they are robust to occlusion and clutter (unlike global features such as a histogram)
2. Distinctiveness: individual features can be matched to a large database of objects
3. Quantity: many features can be generated for even small objects
4. Efficiency: close to real-time performance

How to achieve illumination invariance?

- The easy way is normalization, such as histogram equalization.
- Difference-based metrics (random trees, Haar, and SIFT).
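The normalization route mentioned above can be sketched as a minimal pure-Python histogram equalization (the function name `equalize_histogram` and the list-of-rows image format are illustrative, not from the lecture):

```python
def equalize_histogram(img, levels=256):
    """Histogram-equalize a grayscale image, given as a list of rows of
    integer intensities in [0, levels - 1]."""
    flat = [p for row in img for p in row]
    n = len(flat)
    # Intensity histogram.
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution function.
    cdf, total = [0] * levels, 0
    for i, h in enumerate(hist):
        total += h
        cdf[i] = total
    cdf_min = next(c for c in cdf if c > 0)

    # Remap each intensity through the normalized CDF.
    def remap(p):
        return round((cdf[p] - cdf_min) / max(n - cdf_min, 1) * (levels - 1))

    return [[remap(p) for p in row] for row in img]

# A dark, low-contrast patch is stretched across the full intensity range.
print(equalize_histogram([[50, 51], [52, 53]]))  # [[0, 85], [170, 255]]
```

After equalization the intensities span the whole range, which removes much of the dependence on overall scene brightness.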

SIFT Feature Descriptor Algorithm

The basic steps for SIFT algorithm are:


Constructing a scale space:

This is the initial preparation. You create internal representations of the original image to ensure
scale invariance. This is done by generating a "scale space".

Scales:

- What should the sigma value be for Canny and LoG edge detection?
- If multiple sigma values (scales) are used, how do you combine the multiple edge maps?

Scale Space:

- Apply a whole spectrum of scales (sigma in the Gaussian)
- Plot zero-crossings vs. scale in a scale space

Laplacian-of-Gaussian (LoG)

Automatic scale selection

Intuition:

• Find the scale that gives a local maximum of some function f in both position and scale.
• Compute function responses for increasing scale (the scale signature); continue until the maxima appear in both images.


What Is A Useful Signature Function?

We look for the maximum response at each scale; this is unlike edge detection, where we look for zero-crossings.

Scale-space blob detector: Example


How to achieve scale invariance:

a) Pyramids

- Divide width and height by 2
- Take the average of 4 pixels for each output pixel (or apply a Gaussian blur)
- Repeat until the image is tiny
- Run the filter over each image size and hope it is robust
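The pyramid steps above can be sketched in pure Python with simple 2x2 box-filter averaging (Gaussian smoothing before subsampling is the better-behaved variant the list mentions; `build_pyramid` is an illustrative name):

```python
def build_pyramid(img, min_size=2):
    """Build a pyramid by repeatedly halving width and height, replacing
    each 2x2 block of pixels with its average."""
    levels = [img]
    while len(img) // 2 >= min_size and len(img[0]) // 2 >= min_size:
        img = [
            [(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(len(img[0]) // 2)]
            for r in range(len(img) // 2)
        ]
        levels.append(img)
    return levels

# A 4x4 image shrinks to 2x2: each output pixel averages a 2x2 block.
img4 = [[1, 1, 3, 3],
        [1, 1, 3, 3],
        [5, 5, 7, 7],
        [5, 5, 7, 7]]
pyr = build_pyramid(img4)
print(pyr[1])  # [[1.0, 3.0], [5.0, 7.0]]
```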

b) Scale Space (DoG method)

- Like having a nice linear scaling without the expense
- Take features from the differences of these images
- If a feature is repeatably present across Difference-of-Gaussian images, it is scale invariant and we should keep it.

In this lecture we cover the DoG method; Pyramids are addressed later.

Constructing Scale Space

Gaussian kernel used to create scale space

Scale space extrema detection:

The first step toward the detection of interest points is the convolution of the image with Gaussian
filters at different scales, and the generation of difference-of-Gaussian images from the difference of
adjacent blurred images.

This is just the difference of the Gaussian-blurred images at scales σ and kσ:

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)

That is, taking the difference of Gaussians at nearby scales (σ and kσ) approximates the LoG operation. Keypoints are found at the maxima and minima of D over the neighboring pixels and the scales above and below.

The above diagram shows the blurred images at different scales, and the computation of the difference-of-Gaussian images.

The convolved images are grouped by octave (an octave corresponds to doubling the value of σ), and the value of k is selected so that we obtain a fixed number of blurred images per octave. This also ensures that we obtain the same number of difference-of-Gaussian images per octave.

Note: The difference-of-Gaussian filter provides an approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G. The difference-of-Gaussian filter is in effect a tunable bandpass filter.
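The DoG construction D = L(kσ) − L(σ) can be sketched in one dimension (SIFT operates on 2D images; a 1D signal is used here only to keep the example short, and k = 1.6 is an assumed scale factor):

```python
import math

def gaussian_kernel(sigma):
    """Sampled, normalized 1D Gaussian with radius about 3*sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-(x * x) / (2 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur_1d(signal, sigma):
    """Convolve a 1D signal with a Gaussian, clamping at the borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    n = len(signal)
    return [sum(k[j + r] * signal[min(max(i + j, 0), n - 1)]
                for j in range(-r, r + 1))
            for i in range(n)]

def dog(signal, sigma, k=1.6):
    """Difference of Gaussians: D = L(k*sigma) - L(sigma)."""
    return [c - f for c, f in zip(blur_1d(signal, k * sigma),
                                  blur_1d(signal, sigma))]

# A step edge: the DoG response is positive on one side of the edge,
# negative on the other, and zero in flat regions (bandpass behavior).
step = [0.0] * 8 + [1.0] * 8
response = dog(step, sigma=1.0)
```

The sign change around the step and the zero response in flat regions are exactly the bandpass behavior the note describes.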

Scale Space Peak Detection (or localize interest points)

- Compare a pixel (X) with the 26 pixels in the current and adjacent scales (green circles)
- Select the pixel (X) if it is larger/smaller than all 26 pixels
- This yields a large number of extrema and is computationally expensive
- Detect the most stable subset with a coarse sampling of scales

Local extrema detection, the pixel marked x is compared against its 26 neighbors in a 3 x 3 x 3
neighborhood that spans adjacent DoG images.

Interest points (called keypoints in the SIFT framework) are identified as local maxima or
minima of the DoG images across scales. Each pixel in the DoG images is compared to its 8
neighbors at the same scale, plus the 9 corresponding neighbors at neighboring scales. If the
pixel is a local maximum or minimum, it is selected as a candidate keypoint.
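The 26-neighbor comparison described above can be sketched as follows, assuming the DoG stack is indexed as `dog[scale][row][col]` (an illustrative layout, not prescribed by the lecture):

```python
def is_scale_space_extremum(dog, s, r, c):
    """True if dog[s][r][c] is strictly larger or strictly smaller than
    all 26 neighbors in its 3x3x3 neighborhood: 8 at the same scale plus
    9 in each of the scales above and below."""
    center = dog[s][r][c]
    neighbors = [dog[s + ds][r + dr][c + dc]
                 for ds in (-1, 0, 1)
                 for dr in (-1, 0, 1)
                 for dc in (-1, 0, 1)
                 if (ds, dr, dc) != (0, 0, 0)]
    return (all(center > n for n in neighbors) or
            all(center < n for n in neighbors))

# Three 3x3 DoG slices; the middle pixel of the middle slice dominates
# all 26 neighbors, so it is a candidate keypoint.
dog = [
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],   # scale below
    [[1, 1, 1], [1, 9, 1], [1, 1, 1]],   # current scale
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],   # scale above
]
print(is_scale_space_extremum(dog, 1, 1, 1))  # True
```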

Initial Outlier Rejection

Two kinds of candidates are rejected:

1. Low-contrast candidates
2. Poorly localized candidates along an edge

- Use the Taylor series expansion of the DoG function D around a candidate point x = (x, y, σ)ᵀ:

  D(x) = D + (∂D/∂x)ᵀ x + ½ xᵀ (∂²D/∂x²) x

- Differentiating D(x) and setting the derivative equal to zero, the minimum or maximum is located at:

  x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x)

- The value of D(x) at the minimum/maximum must be large: |D(x̂)| > th (Lowe uses a threshold of 0.03 for image values in [0, 1]).

For each candidate keypoint:


- Interpolation of nearby data is used to accurately determine its position.
- Keypoints with low contrast are removed.
- Responses along edges are eliminated.
- The keypoint is assigned an orientation.
Further Outlier Rejection

- Use the trace and determinant of the Hessian, similar to the Harris corner detector.
- Compute the 2x2 Hessian of D (its eigenvalues α ≥ β give the principal curvatures):

  H = | Dxx  Dxy |
      | Dxy  Dyy |

- Remove outliers by evaluating:

  Tr(H) = Dxx + Dyy = α + β
  Det(H) = Dxx·Dyy − Dxy² = α·β

- The following quantity is minimum when the eigenvalues are equal (r = 1) and increases with r, where r = α/β:

  Tr(H)² / Det(H) = (α + β)² / (α·β) = (r + 1)² / r

- Eliminate the keypoint if Tr(H)²/Det(H) exceeds (r + 1)²/r (it lies on an edge, because the curvature is high in one direction and low in the other); Lowe uses r = 10.
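The trace/determinant test can be sketched as follows, with the Hessian estimated by finite differences on a single DoG slice (the function name is illustrative; `r_th = 10` follows Lowe's choice):

```python
def passes_edge_test(d, s, r, c, r_th=10.0):
    """Keep a keypoint only if Tr(H)^2 / Det(H) < (r_th + 1)^2 / r_th,
    where H is the 2x2 Hessian of the DoG slice d[s], estimated with
    central finite differences at (r, c)."""
    img = d[s]
    dxx = img[r][c + 1] - 2 * img[r][c] + img[r][c - 1]
    dyy = img[r + 1][c] - 2 * img[r][c] + img[r - 1][c]
    dxy = (img[r + 1][c + 1] - img[r + 1][c - 1]
           - img[r - 1][c + 1] + img[r - 1][c - 1]) / 4.0
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # curvatures of opposite sign: reject immediately
        return False
    return tr * tr / det < (r_th + 1) ** 2 / r_th

# A blob (similar curvature in both directions) passes; a ridge, where
# one principal curvature is near zero, is rejected as an edge response.
blob  = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
ridge = [[[0, 1, 0], [0, 1, 0], [0, 1, 0]]]
print(passes_edge_test(blob, 0, 1, 1))   # True: corner-like, keep
print(passes_edge_test(ridge, 0, 1, 1))  # False: edge-like, reject
```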
Orientation Assignment

To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint (using the Gaussian image at the scale closest to the keypoint's scale).

- To achieve rotation invariance, all the properties of the keypoint are measured relative to the keypoint orientation. Once the dominant direction is assigned, a rotated object (image) can be normalized with respect to that dominant direction.
- Compute central derivatives, and the gradient magnitude and direction, of L (the smoothed image) at the scale of the keypoint (x, y):

m(x, y) = √([L(x+1, y) − L(x−1, y)]² + [L(x, y+1) − L(x, y−1)]²)
θ(x, y) = tan⁻¹([L(x, y+1) − L(x, y−1)] / [L(x+1, y) − L(x−1, y)])
- Create a weighted direction histogram in a neighborhood of the keypoint (36 bins, i.e. 10° per bin, since 360/10 = 36).
- The weights are:
  - Gradient magnitudes (the longer the gradient vector, the more important)
  - A spatial Gaussian filter with σ = 1.5 × <scale of keypoint>
- Sum all gradient magnitudes that fall in the same direction bin.
- Select the peak as the direction of the keypoint.
- Introduce additional keypoints at the same location, with different directions, if another peak is within 80% of the maximum peak of the histogram.
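The steps above can be sketched as follows (the spatial Gaussian weighting is omitted for brevity; the function name and flat input lists are illustrative):

```python
def dominant_orientations(magnitudes, directions, n_bins=36, peak_ratio=0.8):
    """Build a 36-bin gradient-orientation histogram (10 degrees per bin),
    weighting each sample by its gradient magnitude, and return the
    starting angle of every bin whose vote reaches 80% of the maximum
    (each such peak becomes a keypoint orientation)."""
    bin_width = 360 // n_bins
    hist = [0.0] * n_bins
    for m, theta in zip(magnitudes, directions):
        hist[int(theta % 360 // bin_width) % n_bins] += m
    peak = max(hist)
    return [b * bin_width for b, v in enumerate(hist)
            if v >= peak_ratio * peak]

# Two strong gradient directions (~92 and ~268 degrees) whose votes are
# within 80% of each other yield two keypoint orientations.
print(dominant_orientations([5.0, 4.5, 1.0], [92.0, 268.0, 10.0]))  # [90, 260]
```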

SIFT feature representation

Once a keypoint orientation has been selected, the feature descriptor is computed as a set of orientation histograms on 4 x 4 pixel neighborhoods. The orientation histograms are relative to the keypoint orientation, and the orientation data comes from the Gaussian image closest in scale to the keypoint's scale. Just as before, the contribution of each pixel is weighted by the gradient magnitude, and by a Gaussian with σ equal to 1.5 times the scale of the keypoint.

- Compute the relative orientation and magnitude in a 16x16 neighborhood around the keypoint
- Form a weighted histogram (8 bins) for each 4x4 region
- Weight by gradient magnitude and a spatial Gaussian
- Concatenate the 16 histograms into one long vector of 128 dimensions

Example: an 8x8 neighborhood reduced to 2x2 descriptors

The histograms contain 8 bins each, and each descriptor contains a 4x4 array of histograms around the keypoint. This leads to a SIFT feature vector with 4 x 4 x 8 = 128 elements. The vector is normalized to enhance invariance to changes in illumination.
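The descriptor assembly can be sketched as below, assuming the per-pixel orientations in the 16x16 patch have already been rotated relative to the keypoint orientation (the Gaussian weighting and Lowe's clamp-and-renormalize step are omitted for brevity):

```python
def sift_descriptor(orientations, magnitudes, n_bins=8):
    """Assemble a 128-D descriptor from a 16x16 patch of gradient
    orientations (degrees) and magnitudes: one 8-bin histogram per 4x4
    cell, 16 cells concatenated, then L2-normalized."""
    bin_width = 360 // n_bins  # 45 degrees per bin
    desc = []
    for cell_r in range(0, 16, 4):
        for cell_c in range(0, 16, 4):
            hist = [0.0] * n_bins
            for r in range(cell_r, cell_r + 4):
                for c in range(cell_c, cell_c + 4):
                    b = int(orientations[r][c] % 360 // bin_width)
                    hist[b] += magnitudes[r][c]
            desc.extend(hist)
    norm = sum(v * v for v in desc) ** 0.5 or 1.0
    return [v / norm for v in desc]

# Uniform gradients at 0 degrees: the result has 4 x 4 x 8 = 128 entries,
# unit length, with all weight in bin 0 of each cell.
ori = [[0.0] * 16 for _ in range(16)]
mag = [[1.0] * 16 for _ in range(16)]
d = sift_descriptor(ori, mag)
print(len(d))  # 128
```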
SIFT feature matching

- Find the nearest neighbor in a database of SIFT features from training images.
- For robustness, use the ratio of the distance to the nearest neighbor to the distance to the second-nearest neighbor.
- Finding the neighbor with minimum Euclidean distance is an expensive search.
- Use an approximate, fast method to find the nearest neighbor with high probability.
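The nearest-neighbor ratio test can be sketched as follows (brute-force Euclidean search for clarity; the 0.8 threshold follows Lowe's suggestion, and 2-D vectors stand in for 128-D descriptors):

```python
def match_ratio_test(query, database, ratio=0.8):
    """Lowe's ratio test: accept the nearest database descriptor only if
    its distance is below `ratio` times the second-nearest distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    order = sorted(range(len(database)), key=lambda i: dist(query, database[i]))
    best, second = order[0], order[1]
    if dist(query, database[best]) < ratio * dist(query, database[second]):
        return best
    return None  # ambiguous match: reject

db = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(match_ratio_test([1.0, 0.05], db))  # 0: clearly closest to db[0]
print(match_ratio_test([0.5, 0.5], db))   # None: db[0] and db[1] tie
```

Rejecting matches whose two best candidates are nearly equidistant discards most false matches at the cost of a few correct ones.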

Example:

Gaussian blurred images and Difference of Gaussian images

Gaussian and DoG images grouped by octave


Keypoint detection

a) Maxima of the DoG across scales.
b) Remaining keypoints after removal of low-contrast points.
c) Remaining keypoints after removal of edge responses.

Final keypoints with selected orientation and scale

Warped image and extracted keypoints
