
1. Convolutional neural network.

Answer the following questions for a CNN that processes RGB color images with the following input size and two layers:
• Input: 256 × 256 RGB image
• Conv1: 2D convolution, K = 9 × 9 filters, No = 12 output channels, valid mode.
• MaxPool1: 2D max pooling, K = 4 × 4 pool size, horizontal and vertical stride s = 2.

(a) What are the dimensions of the input and the outputs for the first two layers for a mini-
batch of 100 images?

(b) How many parameters are there in the Conv1 and MaxPool1 layers?

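As a quick self-check for (a) and (b), here is a minimal Python sketch, assuming stride 1 for the convolution ("valid" mode, no padding) and the usual floor-division rule for the pooling output size:

```python
# Shape and parameter bookkeeping for Conv1 and MaxPool1 (assumptions noted above).
batch, H, W, C_in = 100, 256, 256, 3          # mini-batch of 100 RGB images
K, C_out = 9, 12                              # Conv1: 9x9 filters, 12 output channels
pool, stride = 4, 2                           # MaxPool1: 4x4 window, stride 2

# Conv1, valid mode (no padding), stride 1: output spatial size = H - K + 1
conv_h = H - K + 1
conv_w = W - K + 1
conv_params = (K * K * C_in) * C_out + C_out  # weights + one bias per output channel

# MaxPool1: floor((size - pool) / stride) + 1; pooling has no learnable parameters
pool_h = (conv_h - pool) // stride + 1
pool_w = (conv_w - pool) // stride + 1

print("Input:   ", (batch, H, W, C_in))
print("Conv1:   ", (batch, conv_h, conv_w, C_out), "params:", conv_params)
print("MaxPool1:", (batch, pool_h, pool_w, C_out), "params:", 0)
```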

2. Let the ground truth bounding box be [A1 = 50, B1 = 100, C1 = 200, D1 = 300] and the predicted bounding box be [A2 = 80, B2 = 120, C2 = 220, D2 = 310]. Compute the IoU.
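As a quick check, here is a small Python sketch of the standard corner-coordinate IoU computation, assuming each box is given as [x_min, y_min, x_max, y_max] (A, B as the top-left corner and C, D as the bottom-right corner); if the lists encode the coordinates differently, the result changes accordingly:

```python
def iou(box_a, box_b):
    """IoU for boxes in [x_min, y_min, x_max, y_max] format (assumed convention)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero if the boxes do not overlap)
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

# Ground truth [A1, B1, C1, D1] and prediction [A2, B2, C2, D2] from the question
print(iou([50, 100, 200, 300], [80, 120, 220, 310]))  # ~0.617 under this convention
```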

Consider a convolutional neural network block whose input size is 64 × 64 × 8.


The block consists of the following layers:

• A convolutional layer with 32 filters of height and width 3 and 0 padding, which has both weights and biases (i.e. CONV3-32)
• A 2 × 2 max-pooling layer with stride 2 and 0 padding (i.e. POOL-2)
• A batch normalization layer (i.e. BATCHNORM)

Compute the output activation volume dimensions and number of parameters of the layers.
You can write the activation shapes in the format (H, W, C), where H, W, C are the height,
width, and channel dimensions, respectively.

1. What are the output activation volume dimensions and number of parameters for
CONV3-32?
2. What are the output activation volume dimensions and number of parameters for
POOL-2?
3. What are the output activation volume dimensions and number of parameters for
BATCHNORM?
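A compact Python sketch of the bookkeeping for these three sub-questions follows, assuming stride 1 for the convolution and a per-channel batch norm with one learnable scale and one learnable shift per channel:

```python
H, W, C_in = 64, 64, 8

# CONV3-32: 3x3 kernel, 32 filters, padding 0, stride 1 (assumed)
K, C_out, pad, stride = 3, 32, 0, 1
conv_h = (H - K + 2 * pad) // stride + 1          # 62
conv_w = (W - K + 2 * pad) // stride + 1          # 62
conv_params = (K * K * C_in) * C_out + C_out      # weights + biases = 2336
print("CONV3-32 :", (conv_h, conv_w, C_out), "params:", conv_params)

# POOL-2: 2x2 max pooling, stride 2, padding 0; no learnable parameters
pool_h, pool_w = conv_h // 2, conv_w // 2         # 31, 31
print("POOL-2   :", (pool_h, pool_w, C_out), "params:", 0)

# BATCHNORM: shape unchanged; gamma and beta per channel (running stats not counted)
print("BATCHNORM:", (pool_h, pool_w, C_out), "params:", 2 * C_out)
```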

Suppose you want to redesign the AlexNet architecture to reduce the number of arithmetic
operations required for each backprop update. (i) Would you try to cut down on the number
of weights, units, or connections? Justify your answer. (ii) Would you modify the convolution
layers or the fully connected layers? Justify your answer.

Reducing Arithmetic Operations in AlexNet Backprop

To reduce the number of arithmetic operations required for each backprop update in AlexNet, you would focus on:
(i) Cutting down on the number of weights.
Justification:
• Backpropagation involves calculating gradients for each weight in the network; the more weights there are, the more calculations are needed.
• Reducing the number of weights directly translates to fewer multiplications and additions during backprop.
Connections and units are indirectly related to the number of weights: connections define how units are linked, and each connection carries a weight (in convolutional layers, one weight is shared by many connections). The primary focus should still be on weights, since removing weights also removes the connections, and hence the calculations, that use them.
(ii) You would modify the convolution layers more than the fully connected
layers.
Justification:
• In AlexNet, the convolutional layers account for most of the arithmetic: each filter weight is shared across many spatial positions, so a single weight participates in thousands of multiply-adds per image.
• Reducing the filter size, number of filters, or input channels in the convolutional layers can therefore significantly decrease the arithmetic per update without drastically affecting the network's ability to learn features.
• The fully connected layers hold most of AlexNet's weights, but each of those weights is used only once per example, so they account for a smaller share of the arithmetic. Modifying them would have a less pronounced effect on reducing backprop computations.

Here are some additional strategies to consider:

• Downsampling techniques: Employ techniques like max pooling or strided convolutions to reduce the size of the feature maps, leading to fewer connections in subsequent layers (and fewer weights in the first fully connected layer).
• Network pruning: During training, identify and remove weights with minimal impact on the network's performance.
• Channel pruning: Reduce the number of channels in convolutional filters, leading to fewer weight updates.
Trade-off: Reducing weights might necessitate adjustments to the network
architecture to maintain accuracy. It's essential to strike a balance between
computational efficiency and model performance.
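To make the weights-versus-arithmetic trade-off concrete, here is a rough back-of-the-envelope sketch in Python, using commonly cited AlexNet layer sizes (conv1: 96 filters of 11 × 11 × 3 applied over a 55 × 55 output grid; fc6: 4096 units on a 6 × 6 × 256 = 9216-dimensional input). Treat the exact numbers as illustrative.

```python
# conv1: few weights, but each weight is reused at every spatial position
conv1_weights = 11 * 11 * 3 * 96                  # ~35K weights
conv1_macs = conv1_weights * 55 * 55              # ~105M multiply-adds per image

# fc6: many weights, but each weight is used exactly once per image
fc6_weights = 9216 * 4096                         # ~37.7M weights
fc6_macs = fc6_weights                            # ~37.7M multiply-adds per image

print(f"conv1: {conv1_weights:>12,d} weights, {conv1_macs:>14,d} MACs/image")
print(f"fc6:   {fc6_weights:>12,d} weights, {fc6_macs:>14,d} MACs/image")
```

The fully connected layer holds roughly a thousand times more weights, yet the convolutional layer performs several times more arithmetic per image, which is why compute-oriented changes target the convolution layers.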

In AlexNet, the input image is 227 × 227 × 3 and the first convolutional layer
contains 96 filters with K = 11 and stride = 4.
1. What would be the width, height and depth of the output with padding = 0?
2. What would be the width, height and depth of the output with padding = 2?

Here's the calculation of the output dimensions for the first convolutional layer in
AlexNet:

Input Image:
• Width: 227 pixels
• Height: 227 pixels
• Depth (channels): 3 (RGB)
Convolutional Layer Parameters:
• Kernel size (K): 11 × 11
• Number of filters (No): 96
• Stride: 4
• Padding: 0 (Case 1) or 2 (Case 2)

1. Padding = 0 (Case 1): In this case, no padding is added around the input image. This can potentially lead to a smaller output size compared to the input.
2. Padding = 2 (Case 2): Padding adds a border of zeros around the input image, allowing the filter to "see" some of the neighboring pixels beyond the original image boundaries.
Output Dimensions:

We can use the following formula to calculate the output height (H_out) and width
(W_out) for convolutional layers:

H_out = (Input Height - Kernel Height + 2 * Padding) / Stride + 1

W_out = (Input Width - Kernel Width + 2 * Padding) / Stride + 1

where the division is integer (floor) division.

Case 1: Padding = 0
H_out = (227 - 11 + 2 * 0) / 4 + 1 = 54 + 1 = 55
W_out = (227 - 11 + 2 * 0) / 4 + 1 = 54 + 1 = 55
Output dimensions: (55, 55, 96) - Width: 55 pixels, Height: 55 pixels, Depth: 96 channels (one for each filter)

Case 2: Padding = 2
H_out = (227 - 11 + 2 * 2) / 4 + 1 = 55 + 1 = 56
W_out = (227 - 11 + 2 * 2) / 4 + 1 = 55 + 1 = 56
Output dimensions: (56, 56, 96) - Width: 56 pixels, Height: 56 pixels, Depth: 96 channels
Note:
• In the case of padding = 0, the output size is smaller than the input size due to the stride of 4, which skips pixels during convolution.
• Padding helps preserve some spatial information in the output by creating a larger "effective" input for the filters.

For the YOLO algorithm, assume we have the following two boxes: the lower-right box with a size of 2 × 3 and the upper-left box with a size of 2 × 2, and the size of the overlapping region is 1 × 1. What is the IoU (Intersection over Union) between these two boxes?

Box Dimensions:
• Lower-right box: 2 × 3
• Upper-left box: 2 × 2
• Overlapping region: 1 × 1
Calculating IoU:
1. Intersection Area: The overlapping region is 1 × 1, giving an area of 1 square unit.
2. Union Area:
   - Area of each box:
     - Lower-right box: 2 (width) * 3 (height) = 6 square units
     - Upper-left box: 2 (width) * 2 (height) = 4 square units
   - The overlapping region is counted once in each box's area, so we subtract it to avoid double-counting.
   - Union Area = Area of lower-right box + Area of upper-left box - Overlapping Area = 6 + 4 - 1 = 9 square units
3. IoU = Intersection Area / Union Area = 1 / 9 ≈ 0.1111

Therefore, the IoU between the two boxes is approximately 0.1111.
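The same arithmetic as a tiny Python sketch (areas in square units):

```python
def iou_from_areas(area_a, area_b, overlap):
    # The union counts the overlap once, so subtract it from the summed areas
    return overlap / (area_a + area_b - overlap)

print(iou_from_areas(2 * 3, 2 * 2, 1 * 1))  # 1 / 9 ≈ 0.1111
```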

Suppose you are running non-max suppression during the YOLO algorithm on the
predicted boxes as shown below. Assume that boxes with probability less than or equal to
0.4 are discarded, and the IoU threshold for deciding whether two boxes overlap is 0.5. How
many boxes will remain after the non-max suppression stage?

Class        Probability   Bounding box size (length × breadth, in units)
Tree         0.46          5 × 4
Tree         0.74          2 × 4
Motorcycle   0.58          2 × 2
Car          0.62          2 × 4
Car          0.73          2 × 4
Car          0.26          3 × 3
Tree         0.46          5 × 4
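The table gives class scores and box sizes but not box coordinates, so the overlaps cannot be computed here; the procedure itself, however, is mechanical. Below is a minimal Python sketch of per-class non-max suppression, assuming boxes are given as [x_min, y_min, x_max, y_max] and reusing a corner-coordinate iou(box_a, box_b) helper like the one sketched earlier.

```python
def non_max_suppression(detections, score_thresh=0.4, iou_thresh=0.5):
    """detections: list of (class_name, score, box); returns the surviving detections."""
    # 1. Discard low-confidence boxes (score <= 0.4 per the question)
    detections = [d for d in detections if d[1] > score_thresh]
    kept = []
    # 2. NMS is applied independently per class
    for cls in {d[0] for d in detections}:
        boxes = sorted((d for d in detections if d[0] == cls),
                       key=lambda d: d[1], reverse=True)
        while boxes:
            best = boxes.pop(0)          # highest-scoring remaining box
            kept.append(best)
            # 3. Suppress same-class boxes that overlap the kept box too much
            boxes = [d for d in boxes if iou(best[2], d[2]) < iou_thresh]
    return kept
```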
******************************************************************

To compute the Average Precision (AP) for each object class (A and B) and the Mean
Average Precision (mAP), we need to follow these steps:

1. Compute the Precision and Recall for each detection.

2. Compute the Precision-Recall curve for each object class.

3. Compute the Average Precision (AP) for each object class.


4. Compute the Mean Average Precision (mAP) by averaging the APs for all object
classes.

Let's go through these steps:

1. Compute the Precision and Recall:

For each detection, we need to determine whether it's a true positive (TP) or false positive
(FP) based on its IoU (Intersection over Union) with the ground truth bounding box. We'll
use a threshold of 0.5 for IoU.

Then, we'll compute the Precision and Recall at different thresholds.

2. Compute the Precision-Recall curve:

We'll plot Precision against Recall for each object class.

3. Compute the Average Precision (AP):

We'll compute the area under the Precision-Recall curve for each object class.

4. Compute the Mean Average Precision (mAP):

We'll average the APs for all object classes.

Let's perform these computations:

First, let's calculate the IoU for each bounding box and ground truth. Then, we'll
determine whether each detection is a TP or FP based on the IoU threshold. Finally, we'll
compute Precision and Recall at different thresholds.

To calculate the Intersection over Union (IoU) for each bounding box and ground truth,
we'll use the formula:

IoU = Area_of_Overlap / Area_of_Union

Where:

- Area_of_Overlap is the area of intersection between the bounding box and the ground
truth.

- Area_of_Union is the area of union between the bounding box and the ground truth.
Let's calculate IoU for each detection and ground truth:

For Object A:

- Detection 1 in Image 1: IoU = Intersection / Union = 7 / (37 + 21 - 7) = 7 / 51 ≈ 0.137

- Detection 2 in Image 1: IoU = Intersection / Union = 7 / (31 + 37 - 7) = 7 / 61 ≈ 0.115

- Detection 3 in Image 2: IoU = Intersection / Union = 0 / (140 + 42 - 0) = 0 / 182 ≈ 0

- Detection 4 in Image 2: IoU = Intersection / Union = 0 / (130 + 42 - 0) = 0 / 172 ≈ 0

For Object B:

- Detection 1 in Image 1: IoU = Intersection / Union = 5 / (28 + 28 - 5) = 5 / 51 ≈ 0.098

- Detection 2 in Image 1: IoU = Intersection / Union = 0 / (12 + 28 - 0) = 0 / 40 ≈ 0

- Detection 3 in Image 2: IoU = Intersection / Union = 1 / (9 + 42 - 1) = 1 / 50 ≈ 0.02

- Detection 4 in Image 2: IoU = Intersection / Union = 1 / (30 + 35 - 1) = 1 / 64 ≈ 0.016

Next, we'll determine whether each detection is a true positive (TP) or false positive (FP)
based on the IoU threshold of 0.5.

To determine whether each detection is a true positive (TP) or false positive (FP) based
on the IoU threshold of 0.5, we'll compare the calculated IoU values with the threshold:

For Object A:

- Detection 1 in Image 1: IoU = 0.137, TP (IoU > 0.5)

- Detection 2 in Image 1: IoU = 0.115, TP (IoU > 0.5)

- Detection 3 in Image 2: IoU = 0, FP (IoU ≤ 0.5)

- Detection 4 in Image 2: IoU = 0, FP (IoU ≤ 0.5)

For Object B:

- Detection 1 in Image 1: IoU = 0.098, FP (IoU ≤ 0.5)

- Detection 2 in Image 1: IoU = 0, FP (IoU ≤ 0.5)

- Detection 3 in Image 2: IoU = 0.02, FP (IoU ≤ 0.5)

- Detection 4 in Image 2: IoU = 0.016, FP (IoU ≤ 0.5)


Next, we'll compute Precision and Recall at different thresholds. With only a handful of detections per class, the Precision and Recall values are straightforward.

For Object A:

- TP: 2 (out of 4 detections)

- FP: 2

- Precision = TP / (TP + FP) = 2 / (2 + 2) = 0.5

- Recall = TP / (Total number of ground truth objects) = 2 / 2 = 1.0

For Object B:

- TP: 0 (out of 4 detections)

- FP: 4

- Precision = TP / (TP + FP) = 0 / (0 + 4) = 0.0

- Recall = TP / (Total number of ground truth objects) = 0 / 2 = 0.0


Now that we have computed the Precision and Recall values for each object class, let's
compute the Average Precision (AP) for each class and then compute the Mean Average
Precision (mAP).

For Object A:

- Precision = 0.5

- Recall = 1.0

To compute AP for Object A, we approximate the area under the Precision-Recall curve; since we only have one point, the area is simply the rectangle formed by the point and the axes.

AP for Object A = Precision * Recall = 0.5 * 1.0 = 0.5

For Object B:
- Precision = 0.0

- Recall = 0.0

AP for Object B = Precision * Recall = 0.0

Now, let's compute the Mean Average Precision (mAP) by averaging the APs for both
object classes:

mAP = (AP for Object A + AP for Object B) / Total number of object classes

= (0.5 + 0.0) / 2

= 0.25

So, the Mean Average Precision (mAP) for the given detections is 0.25.
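As a check on the arithmetic, here is a short Python sketch that reproduces these numbers from the TP/FP counts, using the single-operating-point approximation described above (AP taken as precision × recall):

```python
def ap_single_point(tp, fp, num_ground_truth):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision * recall          # rectangle under the single P-R point

ap_a = ap_single_point(tp=2, fp=2, num_ground_truth=2)   # 0.5
ap_b = ap_single_point(tp=0, fp=4, num_ground_truth=2)   # 0.0
print("AP_A =", ap_a, "AP_B =", ap_b, "mAP =", (ap_a + ap_b) / 2)  # mAP = 0.25
```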

This indicates the overall performance of the object detector across both classes,
considering both precision and recall.
