
Object Detection versus Object Recognition

You must have frequently heard the terms "object detection" and "object recognition", and they are often mistaken for the same thing. However, there is a distinct difference between the two.

Object detection refers to detecting the presence of an object in a given scene; at this stage, we don't know what the object might be.

Object recognition is the process of identifying an object in a given image. For instance,
an object recognition system can tell you if a given image contains a dress or a pair of
shoes.

In fact, we can train an object recognition system to identify many different objects. The
problem is that object recognition is a really difficult problem to solve. It has eluded
computer vision researchers for decades now, and has become the holy grail of
computer vision.

Humans can identify a wide variety of objects very easily. We do it every day and we do
it effortlessly, but computers are unable to do it with that kind of accuracy.

Let's consider the following image of a latte cup:

An object detector will give you the following information:


Now, consider the following image of a teacup:

If you run it through an object detector, you will see the following result:
As you can see, the object detector detects the presence of the teacup, but nothing
more than that. If you train an object recognizer, it will give you the following
information, as shown in the image below:

If you consider the second image, it will give you the following information:
As you can see, a perfect object recognizer would give you all the information
associated with that object. An object recognizer functions more accurately if it knows
where the object is located. If you have a big image and the cup is a small part of it, then
the object recognizer might not be able to recognize it. Hence, the first step is to detect
the object and get the bounding box. Once we have that, we can run an object
recognizer to extract more information.

Object Detection vs Object Recognition vs Image Segmentation

Object Recognition: 

Object recognition is the technique of identifying the objects present in images and videos. It is one of the most important applications of machine learning and deep learning. The goal of this field is to teach machines to understand (recognize) the content of an image just as humans do.
  
Object Recognition Using Machine Learning

HOG (Histogram of Oriented Gradients) feature extractor and SVM (Support Vector Machine) model: Before the era of deep learning, this was a state-of-the-art method for object detection. It computes HOG descriptors for both positive samples (images that contain the object) and negative samples (images that do not), and trains an SVM model on them.
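As a rough illustration of that pipeline (not code from the text), the sketch below assumes scikit-image and scikit-learn are available, and that positive_images and negative_images are placeholder lists of equal-sized grayscale crops:

# Minimal HOG + linear SVM sketch (placeholder data, for illustration only)
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(images):
    # One HOG descriptor per image (e.g. 64x128 grayscale crops)
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

X = np.vstack([extract_hog(positive_images), extract_hog(negative_images)])
y = np.hstack([np.ones(len(positive_images)), np.zeros(len(negative_images))])

clf = LinearSVC(C=0.01)
clf.fit(X, y)   # train on positive and negative HOG descriptors
# clf.decision_function(extract_hog([window])) then scores a new sliding window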

Bag of features model: Just as the bag-of-words model treats a document as an orderless collection of words, this approach represents an image as an orderless collection of local image features. Examples of such features are SIFT, MSER, etc.

Viola-Jones algorithm: This algorithm is widely used for face detection in images or real-time video. It extracts Haar-like features from the image, which produces a large number of features. These features are then passed into a boosting classifier, generating a cascade of boosted classifiers that performs the detection. An image region must pass each classifier in the cascade to produce a positive (face found) result. The advantage of Viola-Jones is its speed (a detection rate of about 2 fps), which allows it to be used in a real-time face detection system.
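OpenCV ships pre-trained Haar cascades, so a minimal face-detection sketch might look like the following (input.jpg is a placeholder test image; the cascade file is the one bundled with OpenCV):

# Face detection with the pre-trained Haar cascade bundled with OpenCV
import cv2

face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

img = cv2.imread('input.jpg')          # placeholder: any test image with faces
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Run the cascade; returns one (x, y, w, h) rectangle per detected face
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('Faces', img)
cv2.waitKey(0)
cv2.destroyAllWindows()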

Object Recognition Using Deep Learning

The Convolutional Neural Network (CNN) is one of the most popular approaches to object recognition. It is widely used, and most state-of-the-art neural networks employ it for object recognition related tasks such as image classification. A CNN takes an image as input and outputs the probability of each class. If an object is present in the image, the probability of its class is high, while the probabilities of the other classes are low or negligible. The advantage of deep learning is that, compared with classical machine learning, we do not need to hand-engineer feature extraction from the data.
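As a rough sketch of such a classifier (using Keras, which the text does not prescribe, and assuming 32x32 RGB inputs with 10 classes):

# Minimal CNN classifier sketch (assumes 32x32 RGB images and 10 classes)
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),   # one probability per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=10) trains the network, and
# model.predict(image_batch) returns a probability for each of the 10 classes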
Challenges of Object Recognition:

• The output generated by the last (fully connected) layer of the CNN model is a single class label. So, a simple CNN approach will not work if more than one class label is present in the image.
• If we want to localize the presence of an object with a bounding box, we need a different approach that outputs not only the class label but also the bounding box location.

Image Classification:

Image classification takes an image as input and outputs a classification label for that image, along with some metric (probability, loss, accuracy, etc.). For example, an image of a cat can be classified with the class label "cat", or an image of a dog with the class label "dog", with some probability.

Object Localization: This algorithm locates the presence of an object in the image and represents it with a bounding box. It takes an image as input and outputs the location of the bounding box in the form of a position, width, and height.

Object Detection:
Object detection algorithms act as a combination of image classification and object localization. They take an image as input and produce one or more bounding boxes, with a class label attached to each bounding box. These algorithms can handle multi-class classification and localization, as well as objects that occur multiple times.

Challenges of Object Detection:

• In object detection, the bounding boxes are always rectangular, so they do not help in determining the shape of an object that has curved parts.
• Object detection cannot accurately estimate measurements such as the area or perimeter of an object from an image.
Image Segmentation:

Image segmentation is a further extension of object detection, in which we mark the presence of an object through a pixel-wise mask generated for each object in the image. This technique is more granular than bounding box generation because it helps us determine the shape of each object present in the image. This granularity is useful in various fields such as medical image processing, satellite imaging, etc.

There are primarily two types of segmentation:

• Instance Segmentation: Identifying the boundaries of each individual object and labeling its pixels with a different color (mask) per instance.
• Semantic Segmentation: Labeling each pixel in the image (including the background) with a color based on its category or class label.

Applications:
The object recognition techniques discussed above can be utilized in many fields, such as:
• Driverless Cars: Object recognition is used for detecting road signs, other vehicles, etc.
• Medical Image Processing: Object recognition and image processing techniques can help detect diseases more accurately. For example, Google's AI for breast cancer detection has been reported to detect cancer more accurately than doctors.
• Surveillance and Security: Face recognition, object tracking, activity recognition, etc.

Support Vector Machine Algorithm


A Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for both classification and regression problems. However, it is primarily used for classification problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily place a new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are separated using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.

We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature.

The SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), so it will look at the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the diagram below:


The SVM algorithm can be used for face detection, image classification, text categorization, and so on.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood using an example. Suppose we have a dataset that has two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the image below:

Since this is a 2-d space, we can separate these two classes with just a straight line. But there can be multiple lines that separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
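A minimal scikit-learn sketch of this idea, using a tiny synthetic two-feature dataset purely for illustration:

# Linear SVM on a toy two-feature dataset (synthetic data, for illustration only)
import numpy as np
from sklearn import svm

X = np.array([[1, 2], [2, 3], [3, 3],      # "blue" points
              [6, 5], [7, 7], [8, 6]])     # "green" points
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print(clf.support_vectors_)    # the extreme points that define the margin
print(clf.predict([[4, 4]]))   # classify a new (x1, x2) pair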
Non-Linear SVM:

If data is linearly arranged, we can separate it by using a straight line, but for non-linear data we cannot draw a single straight line. Consider the image below:

So to separate these data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we will add a third dimension, z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes as shown in the image below:

So now SVM will divide the datasets into classes in the following way. Consider the image below:

Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
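A sketch of the same idea in code: we can either add the z = x² + y² dimension by hand and fit a linear SVM, or let a non-linear (RBF) kernel perform a similar lifting implicitly. The circular dataset below is synthetic.

# Non-linear SVM sketch on circular (non-linearly separable) data
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05)

# Option 1: add the third dimension z = x^2 + y^2 and use a linear SVM
z = (X ** 2).sum(axis=1).reshape(-1, 1)
clf_linear = svm.SVC(kernel='linear').fit(np.hstack([X, z]), y)

# Option 2: let an RBF kernel perform a similar lifting implicitly
clf_rbf = svm.SVC(kernel='rbf', gamma=2).fit(X, y)

print(clf_linear.score(np.hstack([X, z]), y), clf_rbf.score(X, y))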

Bag-of-Words Model
The bag-of-words model is a way of representing text data when modeling text with
machine learning algorithms.

The bag-of-words model is simple to understand and implement and has seen great
success in problems such as language modeling and document classification.

The Problem with Text

A problem with modeling text is that it is messy, and techniques like machine learning
algorithms prefer well defined fixed-length inputs and outputs.

Machine learning algorithms cannot work with raw text directly; the text must be
converted into numbers. Specifically, vectors of numbers.

In language processing, the vectors x are derived from textual data, in order to reflect
various linguistic properties of the text.

This is called feature extraction or feature encoding.

A popular and simple method of feature extraction with text data is called the bag-of-
words model of text.
What is a Bag-of-Words?

A bag-of-words model, or BoW for short, is a way of extracting features from text for
use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways to extract features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of
words in the document is discarded. The model is only concerned with whether known
words occur in the document, not where in the document.

A very common feature extraction procedure for sentences and documents is the bag-of-words approach (BoW). In this approach, we look at the histogram of the words within the text, i.e. we consider each word count as a feature.

The intuition is that documents are similar if they have similar content. Further, that
from the content alone we can learn something about the meaning of the document.

The bag-of-words can be as simple or complex as you like. The complexity comes both in
deciding how to design the vocabulary of known words (or tokens) and how to score the
presence of known words.

We will take a closer look at both of these concerns.

Example of the Bag-of-Words Model

Let’s make the bag-of-words model concrete with a worked example.

Step 1: Collect Data

It was the best of times,


it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
For this small example, let’s treat each line as a separate “document” and the 4 lines as
our entire corpus of documents.

Step 2: Design the Vocabulary

Now we can make a list of all of the words in our model vocabulary.

The unique words here (ignoring case and punctuation) are:

 “it”
 “was”
 “the”
 “best”
 “of”
 “times”
 “worst”
 “age”
 “wisdom”
 “foolishness”

That is a vocabulary of 10 words from a corpus containing 24 words.

Step 3: Create Document Vectors

The next step is to score the words in each document.

The objective is to turn each document of free text into a vector that we can use as
input or output for a machine learning model.

Because we know the vocabulary has 10 words, we can use a fixed-length document
representation of 10, with one position in the vector to score each word.

The simplest scoring method is to mark the presence of words as a Boolean value, 0 for
absent, 1 for present.

Using the arbitrary ordering of words listed above in our vocabulary, we can step
through the first document (“It was the best of times“) and convert it into a binary
vector.

The scoring of the document would look as follows:


 “it” = 1
 “was” = 1
 “the” = 1
 “best” = 1
 “of” = 1
 “times” = 1
 “worst” = 0
 “age” = 0
 “wisdom” = 0
 “foolishness” = 0

As a binary vector, this would look as follows:

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

The other three documents would look as follows:

"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

All ordering of the words is nominally discarded and we have a consistent way of
extracting features from any document in our corpus, ready for use in modeling.

New documents that overlap with the vocabulary of known words, but may contain
words outside of the vocabulary, can still be encoded, where only the occurrence of
known words is scored and unknown words are ignored.
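The same binary vectors can be reproduced with scikit-learn's CountVectorizer (a sketch, not part of the original example; note that CountVectorizer orders its columns alphabetically rather than in the order listed above):

# Binary bag-of-words for the four "documents" above (scikit-learn sketch)
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectorizer = CountVectorizer(binary=True, lowercase=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the 10-word vocabulary (alphabetical order)
print(X.toarray())                         # one 10-element binary vector per line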

Deep Learning Algorithms used for Object Detection and Recognition
1. R-CNN Model Family
The R-CNN family of methods refers to R-CNN, which may stand for "Regions with CNN Features" or "Region-Based Convolutional Neural Network."

This includes the techniques R-CNN, Fast R-CNN, and Faster R-CNN, designed and demonstrated for object localization and object recognition.
R-CNN
The proposed R-CNN model is comprised of three modules:

Module 1: Region Proposal. Generate and extract category-independent region proposals, e.g. candidate bounding boxes.
Module 2: Feature Extractor. Extract features from each candidate region, e.g. using a deep convolutional neural network.
Module 3: Classifier. Classify the features as one of the known classes, e.g. using a linear SVM classifier model.

The architecture of the model is summarized in the image below

Fig : Summary of the R-CNN Model Architecture taken from Rich feature hierarchies for
accurate object detection and semantic segmentation.

A computer vision technique called "selective search" is used to propose candidate regions or bounding boxes of potential objects in the image, although the flexibility of the design allows other region proposal algorithms to be used.

The feature extractor used by the model was the AlexNet deep CNN. The output of the CNN was a 4,096-element vector describing the contents of the region, which is fed to a linear SVM for classification; specifically, one SVM is trained for each known class.

It is a relatively simple and straightforward application of CNNs to the problem of object localization and recognition. A downside of the approach is that it is slow, requiring a CNN-based feature extraction pass on each of the candidate regions generated by the region proposal algorithm.
Fast R-CNN
Fast R-CNN addresses the limitations of R-CNN, which can be summarized as follows:

Training is a multi-stage pipeline. It involves the preparation and operation of three separate models.
Training is expensive in space and time. Training a deep CNN on so many region proposals per image is very slow.
Object detection is slow. Making predictions using a deep CNN on so many region proposals is very slow.

Fast R-CNN is proposed as a single model instead of a pipeline to learn and output
regions and classifications directly.

The architecture of the model takes a photograph and a set of region proposals as input, which are passed through a deep convolutional neural network. A pre-trained CNN, such as VGG-16, is used for feature extraction. The end of the deep CNN is a custom layer called a Region of Interest Pooling layer, or RoI Pooling, that extracts features specific to a given input candidate region.

The output of the CNN is then interpreted by a fully connected layer, after which the model bifurcates into two outputs: one for the class prediction via a softmax layer, and another with a linear output for the bounding box. This process is then repeated multiple times for each region of interest in a given image.

The architecture of the model is summarized in the image below.


Fig: Summary of the Fast R-CNN Model Architecture. Taken from: Fast R-CNN.

The model is significantly faster to train and to make predictions, yet still requires a set
of candidate regions to be proposed along with each input image.

Faster R-CNN
The model architecture was further improved for both speed of training and detection.
The architecture was designed to both propose and refine region proposals as part of
the training process, referred to as a Region Proposal Network, or RPN.

These regions are then used in concert with a Fast R-CNN model in a single model
design. These improvements both reduce the number of region proposals and
accelerate the test-time operation of the model to near real-time with then state-of-
the-art performance.

Although it is a single unified model, the architecture is comprised of two modules:

Module 1: Region Proposal Network. Convolutional neural network for proposing regions and the type of object to consider in the region.

Module 2: Fast R-CNN. Convolutional neural network for extracting features from the proposed regions and outputting the bounding box and class labels.

Both modules operate on the same output of a deep CNN. The region proposal network
acts as an attention mechanism for the Fast R-CNN network, informing the second
network of where to look or pay attention.

The architecture of the model is summarized in the image below.


Fig: Summary of the Faster R-CNN Model Architecture. Taken from: Faster R-CNN: Towards Real-Time Object Detection With Region Proposal Networks.
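For readers who want to try a model from this family without training one, torchvision provides a Faster R-CNN pre-trained on COCO. The sketch below assumes torchvision is installed and that street.jpg is a placeholder test image:

# Running a pre-trained Faster R-CNN from torchvision (sketch; COCO classes)
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = transforms.ToTensor()(Image.open('street.jpg').convert('RGB'))

with torch.no_grad():
    predictions = model([img])[0]     # boxes, labels, scores for one image

for box, label, score in zip(predictions['boxes'],
                             predictions['labels'],
                             predictions['scores']):
    if score > 0.8:                   # keep only confident detections
        print(label.item(), score.item(), box.tolist())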

2. YOLO Model Family

Another popular family of object recognition models is referred to collectively as YOLO, or "You Only Look Once."

The R-CNN models may be generally more accurate, yet the YOLO family of models is fast, much faster than R-CNN, achieving object detection in real time.

YOLO
The approach involves a single neural network trained end to end that takes a
photograph as input and predicts bounding boxes and class labels for each bounding
box directly. The technique offers lower predictive accuracy (e.g. more localization
errors), although operates at 45 frames per second and up to 155 frames per second for
a speed-optimized version of the model.

The model works by first splitting the input image into a grid of cells, where each cell is responsible for predicting a bounding box if the center of a bounding box falls within it. Each grid cell predicts a bounding box in terms of the x, y coordinates, the width and height, and a confidence score. Each cell also predicts class probabilities.
For example, an image may be divided into a 7×7 grid, and each cell in the grid may predict 2 bounding boxes, resulting in 98 proposed bounding box predictions. The class probability map and the bounding boxes with confidences are then combined into a final set of bounding boxes and class labels. The image below summarizes the two outputs of the model.

Fig: Summary of Predictions made by YOLO Model.

YOLOv2 (YOLO9000) and YOLOv3


Although this variation of the model is referred to as YOLO v2, an instance of the model
is described that was trained on two object recognition datasets in parallel, capable of
predicting 9,000 object classes, hence given the name “YOLO9000.”

A number of training and architectural changes were made to the model, such as the
use of batch normalization and high-resolution input images.

Like Faster R-CNN, the YOLOv2 model makes use of anchor boxes: pre-defined bounding boxes with useful shapes and sizes that are tailored during training. The choice of anchor boxes for the dataset is pre-processed using a k-means analysis on the training dataset.
Importantly, the predicted representation of the bounding boxes is changed to allow
small changes to have a less dramatic effect on the predictions, resulting in a more
stable model. Rather than predicting position and size directly, offsets are predicted for
moving and reshaping the pre-defined anchor boxes relative to a grid cell and
dampened by a logistic function.

Fig: Example of the Representation Chosen when Predicting Bounding Box Position and
Shape
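A small sketch of that decoding step, following the equations in the YOLOv2 paper (b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)); the variable names below are illustrative:

# Decoding YOLOv2-style box predictions relative to a grid cell and anchor box
# (sketch following the YOLOv2 paper; names are illustrative)
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h):
    # Center is an offset from the top-left corner (cell_x, cell_y) of the cell,
    # dampened by the logistic function so it stays inside the cell
    b_x = cell_x + sigmoid(t_x)
    b_y = cell_y + sigmoid(t_y)
    # Width and height rescale the pre-defined anchor box
    b_w = anchor_w * np.exp(t_w)
    b_h = anchor_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

print(decode_box(0.2, -0.1, 0.3, 0.1, cell_x=3, cell_y=4,
                 anchor_w=1.5, anchor_h=2.0))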

Object Tracking Techniques


1. Frame differencing
2. Colorspace based tracking
3. Feature based tracking

1. Frame differencing
This is, possibly, the simplest technique we can use to see what parts of the video are
moving. When we consider a live video stream, the difference between successive
frames gives us a lot of information. The concept is fairly straightforward! We just take
the difference between successive frames and display the differences.

If I move my laptop rapidly from left to right, we will see something like this:

If I rapidly move the TV remote in my hand, it will look something like this:
As you can see from the previous images, only the moving parts in the video get
highlighted. This gives us a good starting point to see what areas are moving in the
video.

Here is the code to do this:

import cv2

# Compute the frame difference
def frame_diff(prev_frame, cur_frame, next_frame):
    # Absolute difference between the current frame and the next frame
    diff_frames1 = cv2.absdiff(next_frame, cur_frame)

    # Absolute difference between the current frame and the previous frame
    diff_frames2 = cv2.absdiff(cur_frame, prev_frame)

    # Return the result of bitwise 'AND' between the
    # above two resultant images
    return cv2.bitwise_and(diff_frames1, diff_frames2)

# Capture the frame from the webcam
def get_frame(cap):
    # Capture the frame
    ret, frame = cap.read()

    # Resize the image
    frame = cv2.resize(frame, None, fx=scaling_factor,
            fy=scaling_factor, interpolation=cv2.INTER_AREA)

    # Return the grayscale image (OpenCV frames are BGR)
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

if __name__=='__main__':
    cap = cv2.VideoCapture(0)
    scaling_factor = 0.5

    prev_frame = get_frame(cap)
    cur_frame = get_frame(cap)
    next_frame = get_frame(cap)

    # Iterate until the user presses the ESC key
    while True:
        # Display the result of frame differencing
        cv2.imshow("Object Movement", frame_diff(prev_frame, cur_frame, next_frame))

        # Update the variables
        prev_frame = cur_frame
        cur_frame = next_frame
        next_frame = get_frame(cap)

        # Check if the user pressed ESC
        key = cv2.waitKey(10)
        if key == 27:
            break

    cv2.destroyAllWindows()

2. Colorspace based tracking

Frame differencing gives us some useful information, but we cannot use it to build
anything meaningful. In order to build a good object tracker, we need to understand
what characteristics can be used to make our tracking robust and accurate.

So, let's take a step in that direction and see how we can use colorspaces to come up with a good tracker. The HSV colorspace is very informative when it comes to human perception. We can convert an image to the HSV space, and then use colorspace thresholding to track a given object.

Consider the following frame in the video:


If you run it through the colorspace filter and track the object, you will see something
like this:

As we can see here, our tracker recognizes a particular object in the video, based on the
color characteristics. In order to use this tracker, we need to know the color distribution
of our target object.
Following is the code:

import cv2
import numpy as np

# Capture the input frame from the webcam
def get_frame(cap, scaling_factor):
    # Capture the frame from the video capture object
    ret, frame = cap.read()

    # Resize the input frame
    frame = cv2.resize(frame, None, fx=scaling_factor,
            fy=scaling_factor, interpolation=cv2.INTER_AREA)

    return frame

if __name__=='__main__':
    cap = cv2.VideoCapture(0)
    scaling_factor = 0.5

    # Iterate until the user presses the ESC key
    while True:
        frame = get_frame(cap, scaling_factor)

        # Convert to the HSV colorspace
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

        # Define the 'blue' range in the HSV colorspace
        lower = np.array([60,100,100])
        upper = np.array([180,255,255])

        # Threshold the HSV image to get only the blue color
        mask = cv2.inRange(hsv, lower, upper)

        # Bitwise-AND the mask and the original image
        res = cv2.bitwise_and(frame, frame, mask=mask)
        res = cv2.medianBlur(res, 5)

        cv2.imshow('Original image', frame)
        cv2.imshow('Color Detector', res)

        # Check if the user pressed the ESC key
        c = cv2.waitKey(5)
        if c == 27:
            break

    cv2.destroyAllWindows()
3. Feature based tracking

Feature based tracking refers to tracking individual feature points across successive
frames in the video. We use a technique called optical flow to track these features.
Optical flow is one of the most popular techniques in computer vision. We choose a
bunch of feature points and track them through the video stream.

When we detect the feature points, we compute the displacement vectors and show
the motion of those key points between consecutive frames. These vectors are called
motion vectors.

There are many ways to do this, but the Lucas-Kanade method is perhaps the most
popular of all these techniques. We start the process by extracting the feature points.
For each feature point, we create 3x3 patches with the feature point in the center.

The assumption here is that all the points within each patch will have a similar motion.
We can adjust the size of this window depending on the problem at hand.

For each feature point in the current frame, we take the surrounding 3x3 patch as our
reference point. For this patch, we look in its neighborhood in the previous frame to get
the best match. This neighborhood is usually bigger than 3x3 because we want to get
the patch that's closest to the patch under consideration.

Now, the path from the center pixel of the matched patch in the previous frame to the
center pixel of the patch under consideration in the current frame will become the
motion vector. We do that for all the feature points and extract all the motion vectors.

Let's consider the following frame:


If I move in a horizontal direction, you will see the motion vectors in a horizontal
direction:

If I move away from the webcam, you will see something like this:
So, if you want to play around with it, you can let the user select a region of interest in
the input video (like we did earlier). You can then extract feature points from this region
of interest and track the object by drawing the bounding box.

Here is the code to perform optical flow based tracking:

import cv2
import numpy as np

def start_tracking():
    # Capture the input frame
    cap = cv2.VideoCapture(0)

    # Downsampling factor for the image
    scaling_factor = 0.5

    # Number of frames to keep in the buffer when you
    # are tracking. If you increase this number,
    # feature points will have more "inertia"
    num_frames_to_track = 5

    # Skip every 'n' frames. This is just to increase the speed.
    num_frames_jump = 2

    tracking_paths = []
    frame_index = 0

    # 'winSize' refers to the size of each patch. These patches
    # are the smallest blocks on which we operate and track
    # the feature points. You can read more about the parameters
    # here: http://goo.gl/ulwqLk
    tracking_params = dict(winSize=(11, 11), maxLevel=2,
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))

    # Iterate until the user presses the ESC key
    while True:
        # Read the input frame
        ret, frame = cap.read()

        # Downsample the input frame
        frame = cv2.resize(frame, None, fx=scaling_factor,
                fy=scaling_factor, interpolation=cv2.INTER_AREA)
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        output_img = frame.copy()

        if len(tracking_paths) > 0:
            prev_img, current_img = prev_gray, frame_gray
            feature_points_0 = np.float32([tp[-1] for tp in
                    tracking_paths]).reshape(-1, 1, 2)

            # Compute feature points using optical flow. You can
            # refer to the documentation to learn more about the
            # parameters here: http://goo.gl/t6P4SE
            feature_points_1, _, _ = cv2.calcOpticalFlowPyrLK(prev_img,
                    current_img, feature_points_0, None, **tracking_params)
            feature_points_0_rev, _, _ = cv2.calcOpticalFlowPyrLK(current_img,
                    prev_img, feature_points_1, None, **tracking_params)

            # Compute the difference between the forward and
            # backward tracked feature points
            diff_feature_points = abs(feature_points_0 -
                    feature_points_0_rev).reshape(-1, 2).max(-1)

            # Threshold and keep only the good points
            good_points = diff_feature_points < 1

            new_tracking_paths = []
            for tp, (x, y), good_points_flag in zip(tracking_paths,
                    feature_points_1.reshape(-1, 2), good_points):
                if not good_points_flag:
                    continue

                tp.append((x, y))

                # Using the queue structure i.e. first in, first out
                if len(tp) > num_frames_to_track:
                    del tp[0]

                new_tracking_paths.append(tp)

                # Draw green circles on top of the output image
                cv2.circle(output_img, (int(x), int(y)), 3, (0, 255, 0), -1)

            tracking_paths = new_tracking_paths

            # Draw green lines on top of the output image
            cv2.polylines(output_img, [np.int32(tp) for tp in tracking_paths],
                    False, (0, 150, 0))

        # 'if' condition to skip every 'n'th frame
        if not frame_index % num_frames_jump:
            mask = np.zeros_like(frame_gray)
            mask[:] = 255
            for x, y in [np.int32(tp[-1]) for tp in tracking_paths]:
                cv2.circle(mask, (x, y), 6, 0, -1)

            # Extract good features to track. You can learn more
            # about the parameters here: http://goo.gl/BI2Kml
            feature_points = cv2.goodFeaturesToTrack(frame_gray,
                    mask=mask, maxCorners=500, qualityLevel=0.3,
                    minDistance=7, blockSize=7)

            if feature_points is not None:
                for x, y in np.float32(feature_points).reshape(-1, 2):
                    tracking_paths.append([(x, y)])

        frame_index += 1
        prev_gray = frame_gray

        cv2.imshow('Optical Flow', output_img)

        # Check if the user pressed the ESC key
        c = cv2.waitKey(1)
        if c == 27:
            break

if __name__ == '__main__':
    start_tracking()
    cv2.destroyAllWindows()

Stereo Correspondence
When we capture images, we project the 3D world around us on a 2D image plane. So
technically, we only have 2D information when we capture those photos. Since all the
objects in that scene are projected onto a flat 2D plane, the depth information is lost.
We have no way of knowing how far an object is from the camera or how the objects
are positioned with respect to each other in the 3D space. This is where stereo vision
comes into the picture.

Humans are very good at inferring depth information from the real world. The reason is
that we have two eyes positioned a couple of inches from each other. Each eye acts as a
camera and we capture two images of the same scene from two different viewpoints,
that is, one image each using the left and right eyes.
So, our brain takes these two images and builds a 3D map using stereo vision. This is
what we want to achieve using stereo vision algorithms. We can capture two photos of
the same scene using different viewpoints, and then match the corresponding points to
obtain the depth map of the scene.
Let's consider the following image:
Now, if we capture the same scene from a different angle, it will look like this:

As you can see, there is a large amount of movement in the positions of the objects in
the image. If you consider the pixel coordinates, the values of the initial position and
final position will differ by a large amount in these two images.

Consider the following image:


If we consider the same line of distance in the second image, it will look like this:
The difference between d1 and d2 is large.

Now, let's bring the box closer to the camera:


Now, let's move the camera by the same amount as we did earlier, and capture the
same scene from this angle:
As you can see, the movement between the positions of the objects is not much. If you
consider the pixel coordinates, you will see that the values are close to each other. The
distance in the first image would be:

If we consider the same line of distance in the second image, it will be as shown in the
following image:
The difference between d3 and d4 is small. We can say that the absolute difference
between d1 and d2 is greater than the absolute difference between d3 and d4. Even
though the camera moved by the same amount, there is a big difference between the
apparent distances between the initial and final positions.

This happens because when we bring the object closer to the camera, the apparent movement between the two images captured from different angles decreases. This is the concept behind stereo correspondence: we capture two images of a scene from different viewpoints and use this relationship to extract depth information from the scene.
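OpenCV provides a block-matching implementation of stereo correspondence. A minimal sketch, assuming left.png and right.png are placeholder filenames for a rectified stereo pair:

# Computing a disparity map from a rectified stereo pair (sketch)
import cv2

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# Block matcher: numDisparities must be a multiple of 16
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# Normalize for display: nearer objects have larger disparity and appear brighter
disp_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype('uint8')
cv2.imshow('Disparity', disp_vis)
cv2.waitKey(0)
cv2.destroyAllWindows()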

What does An Augmented Reality System look like?


Augmented Reality refers to the superposition of computer-generated input such as
imagery, sounds, graphics, and text on top of the real world.

Augmented reality tries to blur the line between what's real and what's computer-
generated by seamlessly merging the information and enhancing what we see and feel.
It is actually closely related to a concept called mediated reality where a computer
modifies our view of the reality. As a result of this, the technology works by enhancing
our current perception of reality.
Now the challenge here is to make it look seamless to the user. It's easy to just overlay
something on top of the input video, but we need to make it look like it is part of the
video. The user should feel that the computer-generated input is closely following the
real world. This is what we want to achieve when we build an augmented reality system.
Computer vision research in this context explores how we can apply computer-
generated imagery to live video streams so that we can enhance the perception of the
real world.

Augmented reality technology has a wide variety of applications including, but not
limited to, head-mounted displays, automobiles, data visualization, gaming,
construction, and so on. Now that we have powerful smartphones and smarter
machines, we can build high-end augmented reality applications with ease.

Let's consider the following figure:

As we can see here, the camera captures the real world video to get the reference point.
The graphics system generates the virtual objects that need to be overlaid on top of the
video. Now the video-merging block is where all the magic happens. This block should
be smart enough to understand how to overlay the virtual objects on top of the real
world in the best way possible.

Geometric Transformations for Augmented Reality


The outcome of augmented reality is amazing, but there is a lot of mathematics going on underneath. Augmented reality uses many geometric transformations and the associated mathematical functions to make sure everything looks seamless.
When talking about a live video for augmented reality, we need to precisely register the
virtual objects on top of the real world. To understand it better, let's think of it as an
alignment of two cameras—the real one through which we see the world, and the
virtual one that projects the computer generated graphical objects.
In order to build an augmented reality system, the following geometric transformations
need to be established:
Object-to-scene: This transformation refers to transforming the 3D coordinates of a
virtual object and expressing them in the coordinate frame of our real-world scene. This
ensures that we are positioning the virtual object in the right location.

Scene-to-camera: This transformation refers to the pose of the camera in the real
world. By "pose", we mean the orientation and location of the camera. We need to
estimate the point of view of the camera so that we know how to overlay the virtual
object.

Camera-to-image: This refers to the calibration parameters of the camera. This defines
how we can project a 3D object onto a 2D image plane. This is the image that we will
actually see in the end.

Consider the following image:


As we can see here, the car is trying to fit into the scene but it looks very artificial. If we
don't convert the coordinates in the right way, it looks unnatural. This is what we were
talking about in the object-to-scene transformation! Once we transform the 3D
coordinates of the virtual object into the coordinate frame of the real world, we need to
estimate the pose of the camera:

We need to understand the position and rotation of the camera because that's what the
user will see. Once we estimate the camera pose, we are ready to put this 3D scene on a
2D image.

Once we have these transformations, we can build the complete system.
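As a sketch of the scene-to-camera and camera-to-image steps using OpenCV, where the 3D points, their 2D image locations, and the camera matrix are placeholders you would normally obtain from a marker detector or camera calibration:

# Estimating the camera pose (scene-to-camera) and projecting a virtual point
# (camera-to-image). The correspondences and camera matrix are placeholders.
import cv2
import numpy as np

# Known 3D points in the scene and where they appear in the image
object_points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float32)
image_points = np.array([[320, 240], [420, 245], [415, 340], [315, 335]], dtype=np.float32)

# Camera-to-image: intrinsic calibration parameters (placeholder values)
camera_matrix = np.array([[800, 0, 320],
                          [0, 800, 240],
                          [0, 0, 1]], dtype=np.float32)
dist_coeffs = np.zeros(4)

# Scene-to-camera: recover the camera pose (rotation and translation)
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)

# Project a virtual 3D point into the image using the estimated pose
virtual_point = np.array([[0.5, 0.5, -1.0]], dtype=np.float32)
projected, _ = cv2.projectPoints(virtual_point, rvec, tvec, camera_matrix, dist_coeffs)
print(projected.ravel())   # 2D pixel location where the virtual point would be drawn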
