
Q1: Image Rectification

Rectification refers to a transformation process that re-projects both the left and the right
images onto a common image plane parallel to the baseline, as shown in the figure below,
where the image planes drawn with dotted and bold lines are the image planes before and
after rectification, respectively. In this case the cameras are parallel, and consequently the
image x-axes are parallel to the baseline. The corresponding epipolar lines are therefore
horizontal, i.e., they have the same y-coordinate. This stereo imaging setup is called a
standard or canonical stereo setup.

The canonical stereo setup makes the stereo matching problem much easier. This is because
the search for matching points can be done along a horizontal line in the rectified images
instead of over the entire image. Rectification thus reduces the dimensionality of the
search space for matching points from two dimensions to one. It can be done with the
help of a 3×3 homography matrix H. Mathematically, transforming the coordinates of the
original image plane to the common image plane can be written as follows:
$$\begin{bmatrix} U' \\ V' \\ W' \end{bmatrix} = H \begin{bmatrix} U \\ V \\ W \end{bmatrix}$$
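
As an illustration, below is a minimal Python sketch of applying such a homography to image points in homogeneous coordinates. The matrix H used here is just a placeholder (the identity); in practice it is computed from the stereo geometry (e.g., OpenCV's stereoRectifyUncalibrated returns such rectifying homographies).

import numpy as np

def apply_homography(H, points):
    """Map Nx2 pixel coordinates (u, v) through a 3x3 homography H."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # [U, V, W=1]
    mapped = pts_h @ H.T                                    # [U', V', W'] = H [U, V, W]
    return mapped[:, :2] / mapped[:, 2:3]                   # back to pixel coordinates

H = np.eye(3)                                    # placeholder rectifying homography
points = np.array([[10.0, 20.0], [30.0, 40.0]])
print(apply_homography(H, points))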
Q2: Components of a Computer Vision System
Q3: Weak Perspective Projection and Orthographic Projection
In the human visual system, our eyes collapse the 3D world onto a 2D retinal image, and
the brain has to reconstruct the 3D information. In computer vision, this collapse from
3D to 2D is modelled by projection.

For this arrangement, the size of the image formed by the projection process is given by:

$$y_s = d\,\frac{y}{z}$$
The optical center (center of projection) is put at the origin, and the image plane
(projection plane) is placed in front of the center of projection (COP) to avoid inverted
images. The camera looks down the negative z-axis. From similar triangles, it is observed
that the point $(x, y, z)$ is mapped to $\left(-d\,\frac{x}{z},\, -d\,\frac{y}{z},\, -d\right)$.
The projection coordinates on the image are obtained by discarding the last coordinate:

$$(x, y, z) \rightarrow \left(-d\,\frac{x}{z},\; -d\,\frac{y}{z}\right)$$

This is known as perspective projection.
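
The mapping above can be expressed directly in code. Below is a minimal Python sketch, assuming points with negative z (in front of the camera) and an illustrative focal distance d:

import numpy as np

def perspective_project(points, d=1.0):
    """Map Nx3 points (x, y, z), z < 0, to image coordinates (-d*x/z, -d*y/z)."""
    z = points[:, 2:3]
    return -d * points[:, :2] / z

P = np.array([[1.0, 2.0, -4.0]])
print(perspective_project(P, d=2.0))   # -> [[0.5 1. ]]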

Let us now consider that the relative depths of the points on the object are much smaller
than the average distance $Z_{av}$ of the object from the COP. For each point on the
object, the projection equation then becomes

$$(x, y, z) \rightarrow \left(-d\,\frac{x}{Z_{av}},\; -d\,\frac{y}{Z_{av}}\right)$$
So, projection is reduced to a uniform scaling of all the object point coordinates. This is
called weak perspective projection. In this case, points at about the same depth are
grouped together, and each point is divided by the common depth of its group.
Now suppose $d \to \infty$ in the perspective projection model, with $z \to -\infty$ such
that the ratio $-d/z \to 1$. The point $(x, y, z)$ is then mapped to $(x, y)$. This is called
orthographic projection or parallel projection.
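
For comparison, here is a minimal Python sketch of both simplified models; the average depth z_av used here is an illustrative value:

import numpy as np

def weak_perspective_project(points, d=1.0, z_av=-10.0):
    """Uniform scaling: every point is divided by the same average depth."""
    return -d * points[:, :2] / z_av   # (x, y, z) -> (-d*x/z_av, -d*y/z_av)

def orthographic_project(points):
    """Limit -d/z -> 1: simply drop the z coordinate."""
    return points[:, :2]               # (x, y, z) -> (x, y)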

Q9: Neural Network Structures for Pattern Recognition


Layers:
1. Convolutional Layer: The convolutional layer is the core building block of a CNN
   and performs most of its computation. Each neuron in the output volume is
   associated with a 3D array of weights, which covers a small spatial region but
   extends through the full depth of the input volume. Each entry of a feature map is
   computed by an element-wise multiplication of the 3D weight array with a small
   region of the input volume followed by a summation, i.e., a convolution. This
   weighted sum is then passed through a non-linearity. The most popular non-linear
   function is the rectified linear unit (ReLU), which is simply the half-wave rectifier
   f(x) = max(0, x). The purpose of the convolutional layer is to detect local features
   present in the previous layer (input image).
2. Pooling Layer: These low-level features are then merged to form higher-level
   features (say, motifs). It is the pooling layer that merges semantically similar
   features into higher-level ones. Several stages of convolution, non-linearity and
   pooling are stacked to detect and merge features at different levels. In a typical
   pooling operation, the maximum value of a small patch of the feature map is taken;
   this is called "max pooling." Similarly, we can have average pooling, where the
   average value of the patch is taken instead. Since pooling is usually done in a
   non-overlapping fashion, the spatial size of the output volume is reduced. This
   lowers the computational requirement of the subsequent layers and, more
   importantly, creates an invariance to small shifts and distortions. (A minimal
   numerical sketch of one convolution and pooling step is given after this list.)
3. Fully Connected Layers: After the stack of convolutional and pooling layers, there
   is a "fully connected layer," which performs high-level reasoning based on the
   results obtained so far. From the 2D feature maps of the previous layers, the fully
   connected layer finally produces a 1D score vector over the different classes.
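
The following is a minimal Python sketch (using NumPy, with illustrative sizes) of the two core operations described above: computing one feature-map entry as an element-wise multiply-and-sum followed by ReLU, and max pooling over non-overlapping 2×2 patches.

import numpy as np

def conv_entry(input_volume, weights, row, col):
    """One feature-map entry: element-wise product of a 3D filter with a
    small region of the input volume, summed, then passed through ReLU."""
    kh, kw, _ = weights.shape
    region = input_volume[row:row + kh, col:col + kw, :]
    return max(0.0, np.sum(region * weights))

def max_pool(feature_map, size=2):
    """Max pooling: the maximum of each non-overlapping size x size patch."""
    h, w = feature_map.shape
    return feature_map[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.random.rand(8, 8, 3)    # toy 8x8 RGB input volume
w = np.random.randn(3, 3, 3)   # one 3x3 filter spanning the full input depth
fmap = np.array([[conv_entry(x, w, r, c) for c in range(6)] for r in range(6)])
pooled = max_pool(fmap)        # 6x6 feature map -> 3x3 after 2x2 max pooling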
An Example

The conceptual diagram of a CNN employed for the image classification task is shown in
the figure above. For simplicity, let us consider that the input image can be one of four
animals, viz., dog, cat, lion and bird. In the initial few stages (convolution plus ReLU,
and pooling), layers are stacked as shown. The purpose of this series of layers is to
extract different classes of features at different levels of abstraction, as has already
been discussed. The output of the final pooling layer is flattened, which results in a
1D vector. If the last pooling volume is of size 5 × 5 × 16, then the flattened vector is
of length 400. The last stage then consists of one or more fully connected layers followed
by a classification rule (say, the softmax approach). Finally, a score vector is generated
whose length equals the number of classes (4 in our case). This score vector assigns a
real number between 0 and 1 to each of the classes. A decision is made in favour of the
class having the highest score. In our example, the class "dog" gets the highest score
(say 0.94); in other words, the machine identifies the input image as belonging to the
class with the highest score.
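
A minimal PyTorch sketch of this pipeline is given below. The 20 × 20 RGB input size and the channel counts are illustrative choices, picked so that the final pooling volume is 5 × 5 × 16, i.e., a flattened vector of length 400 as in the example above.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),                                  # half-wave rectifier
    nn.MaxPool2d(2),                            # 20x20 -> 10x10
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 10x10 -> 5x5, 16 channels
    nn.Flatten(),                               # 5 * 5 * 16 = 400
    nn.Linear(400, 4),                          # fully connected layer, 4 classes
)

image = torch.randn(1, 3, 20, 20)               # one toy input image
scores = torch.softmax(model(image), dim=1)     # scores between 0 and 1
predicted = scores.argmax(dim=1)                # class with the highest score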
