
DASC7606 Assignment 1

An Yuao
3036198451

March 2, 2024

Contents
1 Introduction
2 Related Work
   2.1 Architecture
   2.2 Focal Loss
3 Methodology
   3.1 Implementation
   3.2 Improvements
4 Experiment
   4.1 Dataset
   4.2 Performance
      4.2.1 Ablation study
      4.2.2 Quantitative evaluation
5 Appendix
   5.1 A. Quantitative evaluation

1 Introduction
From autonomous vehicles to medical image analysis, object detection lies at the heart of many computer vision applications. Existing object detection methods fall into traditional detection methods and deep learning approaches. Traditional object detection models were built by combining a series of hand-crafted feature extractors such as Viola-Jones[1], HOG[2], SIFT[3], and DPM[4].
The emergence of deep learning has changed the landscape of this field: traditional methods now lag behind deep learning-based approaches in both accuracy and runtime. The newer approaches, powered by convolutional neural networks (CNNs), do not rely on hand-engineered features. Instead, they automatically learn to extract features from data, which allows them to achieve significantly higher accuracy and better generalization across different datasets and domains.

RetinaNet[5] is the first one-stage object detection network that outperformed two-stage networks. Although its architecture might not be considered revolutionary, its core contribution, the Focal Loss, addresses a critical challenge in one-stage detectors: the extreme foreground-background class imbalance. This imbalance arises because one-stage detectors propose a vast number of potential object locations, most of which are background (negative samples).

In this report, we first introduce the task of object detection and some related work. We then present a comprehensive review of the RetinaNet design. A detailed experimental analysis is given in the last part of this report.

2 Related Work
In this section, we introduce the design of RetinaNet in terms of its architecture and its core mechanism, the Focal Loss.

2.1 Architecture
As shown in Figure 1, the architecture of RetinaNet has four major components.

Figure 1: Architecture of RetinaNet

RetinaNet uses a classical single-stage design with Backbone, Neck, and Head components. In this report, the Backbone is a ResNet, the Neck is a Feature Pyramid Network (FPN), and the Head consists of two parallel sub-networks.
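
A minimal sketch of how these components might be wired together in a forward pass is shown below; the class and argument names are illustrative only and do not reflect the actual repository API.

import torch
import torch.nn as nn

class RetinaNetSketch(nn.Module):
    """Conceptual wiring of Backbone -> Neck (FPN) -> two parallel Heads."""

    def __init__(self, backbone: nn.Module, fpn: nn.Module,
                 cls_head: nn.Module, box_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # e.g. ResNet, returning feature maps C3, C4, C5
        self.fpn = fpn             # builds the pyramid levels P3..P7
        self.cls_head = cls_head   # classification sub-network (shared across levels)
        self.box_head = box_head   # box regression sub-network (shared across levels)

    def forward(self, images: torch.Tensor):
        c3, c4, c5 = self.backbone(images)
        pyramid = self.fpn([c3, c4, c5])
        cls_outputs = [self.cls_head(p) for p in pyramid]   # per-level class logits
        box_outputs = [self.box_head(p) for p in pyramid]   # per-level box offsets
        return cls_outputs, box_outputs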

2.2 Focal Loss
For classification problems in deep learning, we usually use Cross Entropy (CE)[6] as the
measurement of training performance. For a binary classification problem, the CE loss is:
\[
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise,}
\end{cases}
\]
in which $p \in [0, 1]$ is the probability of the positive class predicted by the model. More generally, we can define CE as
\[
\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t),
\]
where $p_t = p$ when $y = 1$ and $p_t = 1 - p$ when $y = 0$.
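
As a concrete illustration, the binary CE can be computed directly from $p$ and $y$ via the $p_t$ substitution. The following is a minimal standalone sketch (not the project code); the function name binary_ce is chosen only for this example.

import math

def binary_ce(p: float, y: int) -> float:
    """Cross entropy for a single binary prediction.
    p is the predicted probability of the positive class, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p   # align the probability with the true class
    return -math.log(p_t)

# A confident correct prediction gives a small loss,
# a confident wrong prediction gives a large one.
print(binary_ce(0.9, 1))   # ~0.105
print(binary_ce(0.1, 1))   # ~2.303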
However, when dealing with imbalanced sample distributions, the CE can lead to poor model
performance because it treats all samples equally, thus allowing the majority class to dominate
the learning process. A common approach to tackle class imbalance is to introduce a weighting
factor for each class[7]. That is,

\[
\mathrm{CE}(p_t) = -\alpha_t \log(p_t)
\quad\text{or}\quad
\mathrm{CE}(p, y) = -\alpha_y \, y \log(p) - \alpha_{1-y} \, (1 - y)\log(1 - p)
\]

Though Weighted CE balances the importance, or contribution, of each sample, it pays no attention to distinguishing hard and easy examples within each class. Hard examples are those for which the model predicts a low probability for the correct class (e.g., a probability less than 0.5 for a positive sample), indicating uncertainty or incorrect learning. Easy examples are those that the model predicts with high confidence (e.g., a probability greater than 0.5 for a positive sample).
In response to this issue, the Focal Loss modulates the balanced CE with an additional factor that decreases the contribution of easy examples and increases the relative contribution of hard ones. The formula of the Focal Loss is:

\[
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
\]


where:

• $p_t$ is defined as $p$ if $y = 1$ and $1 - p$ otherwise, aligning the predicted probability with the true class label.

• $\alpha_t$ is a weighting factor that balances the importance of the positive/negative classes, as in Balanced CE.

• $\gamma$ is the focusing parameter that smoothly adjusts the rate at which the contribution of easy examples to the loss is down-weighted.

• $\mathrm{FL}(p_t)$ is the Focal Loss, which dynamically scales the CE loss of each sample based on the prediction's accuracy.

As shown in Figure 2, the blue curve represents the CE loss. Even for cases where the predicted probability is greater than 0.5, the CE loss remains relatively high. This can lead to a situation where, despite the background classes being predicted accurately, the sheer number of background samples vastly outweighs the positive samples. Consequently, the overall loss is dominated by the background, resulting in insufficient prioritization of the positive class. With the focusing parameter $\gamma$, the Focal Loss effectively suppresses this trend.
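
To make the effect of $\gamma$ concrete: with $\gamma = 2$, an easy example at $p_t = 0.9$ is down-weighted by $(1 - 0.9)^2 = 0.01$, so its loss shrinks by a factor of 100, while a hard example at $p_t = 0.1$ is only scaled by $0.81$. The short sketch below (illustrative only, with $\alpha_t$ omitted for clarity) compares plain CE with the Focal Loss at a few values of $p_t$.

import math

def focal_loss(p_t: float, gamma: float, alpha_t: float = 1.0) -> float:
    """Focal Loss for one sample, given the probability of the true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.9, 0.5, 0.1):
    ce = -math.log(p_t)
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t}: CE={ce:.3f}, FL(gamma=2)={fl:.3f}")
# p_t=0.9: CE=0.105, FL=0.001   (easy example, strongly suppressed)
# p_t=0.5: CE=0.693, FL=0.173
# p_t=0.1: CE=2.303, FL=1.865   (hard example, almost unchanged)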

Figure 2: Loss curves for γ ∈ {0, 0.5, 1, 2, 5}, plotted against the probability of the ground-truth class; as γ increases, well-classified examples contribute very little loss.

Figure 3: Backbone of RetinaNet

3 Methodology
3.1 Implementation
As Figure 3 shows, taking P5 as an example, we filled in the backbone code. In Part 2.2 we introduced the Focal Loss in detail; we implement it as follows:
# Backbone (FPN top-down pathway, P5 level)
self.P5_1 = conv1x1(C5_size, feature_size)                  # 1x1 lateral projection of C5
self.P5_upsampled = nn.Upsample(scale_factor=2, mode='nearest')  # upsample for the next level down
self.P5_2 = conv3x3(feature_size, feature_size, stride=1)   # 3x3 smoothing convolution

# Focal Loss
focal_weight = alpha_factor * torch.pow(focal_weight, gamma)   # alpha_t * (1 - p_t)^gamma
bce = -(targets * torch.log(classification)
        + (1.0 - targets) * torch.log(1.0 - classification))   # per-anchor binary cross entropy
cls_loss = focal_weight * bce                                   # modulated classification loss
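
For context, the adjacent P4 level would follow the same pattern, merging a 1x1 lateral projection of C4 with the upsampled P5 map. The sketch below assumes the same conv1x1/conv3x3 helpers and a C4_size argument; it mirrors the P5 code above rather than being copied from the repository.

# P4 level: lateral connection from C4 merged with the upsampled P5 map (assumed layout)
self.P4_1 = conv1x1(C4_size, feature_size)                  # 1x1 lateral projection of C4
self.P4_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
self.P4_2 = conv3x3(feature_size, feature_size, stride=1)
# in forward(): P4_x = self.P4_2(self.P4_1(C4) + P5_upsampled_x)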

3.2 Improvements
Overall speaking, opting for a larger batch size during training can offer improved resilience
against noise[8]. In this study, with the backbone network architecture consistently designated as
ResNet, we have fixed the network depth at 18 to maximize the batch size within the constraints
of predetermined GPU memory. We implement this by modifying parser module and embedding
interfaces in the code.
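
A sketch of how the relevant training options might be exposed on the command line is shown below; the argument names and defaults are assumptions chosen to match the configuration used in this report, not necessarily those of the actual parser in the codebase.

import argparse

parser = argparse.ArgumentParser(description='RetinaNet training options (illustrative)')
parser.add_argument('--depth', type=int, default=18,
                    help='ResNet depth (e.g. 18, 50 or 101)')
parser.add_argument('--batch_size', type=int, default=8,
                    help='training batch size, limited by GPU memory')
parser.add_argument('--lr', type=float, default=1e-4,
                    help='initial learning rate')
parser.add_argument('--epochs', type=int, default=50,
                    help='number of training epochs')
args = parser.parse_args()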

4 Experiment
4.1 Dataset
COCO[9] is a large-scale object detection, segmentation, and captioning dataset. In our work,
we use part of the dataset to train an object detection neural network. In total, 2031 images were used for training, 1517 for validation, and 643 for testing.

4.2 Performance
4.2.1 Ablation study
Depth of ResNet Our first experiment explores how the depth of the ResNet backbone affects performance. We use the validation Focal Loss value and mAP (mean Average Precision) as performance measures. We set the number of epochs to 10 and 30 and the learning rate to 0.0001. The results are shown below.

Depth   Epochs   Loss    mAP
18      10       0.547   0.343
50      10       0.501   0.346
101     10       0.503   0.442
18      30       0.183   0.393
50      30       0.240   0.378
101     30       0.207   0.391

Figure 5: Ablation on epoch and depth

Figure 4: Loss curve over epoch

Epoch & Learning rate We also explore the effect of different learning rates; the result is shown in Figure 4. Note that we use a multi-step scheduler in the training stage. We compare the loss curves under learning rates of 0.001 and 0.0001. The blue line, corresponding to 0.0001, shows a generally decreasing trend, indicating that the running loss decreases as the number of epochs increases. This suggests an improvement in model training over time at
this learning rate. However, the orange line, corresponding to 0.001, shows more variability. It
starts higher than the blue line, decreases, then spikes sharply upwards around epoch 10, before
decreasing and leveling off with some fluctuation. The spike suggests a significant increase in
running loss at that point in training, which could be indicative of instability or an issue with
the training process at this higher learning rate.
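
The multi-step schedule mentioned above can be set up as in the sketch below; the optimizer choice, milestone epochs, and decay factor are illustrative values, not necessarily the ones used in our training runs.

import torch
import torch.optim as optim

model = torch.nn.Linear(4, 2)              # placeholder standing in for the RetinaNet model
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# Decay the learning rate by a factor of 10 at the chosen milestone epochs.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over the COCO subset goes here ...
    optimizer.step()                        # stand-in for the per-batch parameter updates
    scheduler.step()                        # advance the schedule once per epoch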

Final model We finally trained a model with the following configuration: depth 18, learning rate 0.0001, 50 epochs, batch size 8. The mAP of this model is 0.416.

4.2.2 Quantitative evaluation


Figures are in Appendix 5.1. Our model detects a diverse array of objects, ranging from small to large, across categories from buses to birds. This illustrates the strong performance of our model across varied scenarios.

References
[1] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple
features. In Proceedings of the 2001 IEEE computer society conference on computer vision
and pattern recognition. CVPR 2001, volume 1, pages I–I. IEEE, 2001.
[2] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005
IEEE computer society conference on computer vision and pattern recognition (CVPR’05),
volume 1, pages 886–893. IEEE, 2005.
[3] David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the
seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. IEEE,
1999.
[4] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object
detection with discriminatively trained part-based models. IEEE transactions on pattern
analysis and machine intelligence, 32(9):1627–1645, 2009.

[5] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense
object detection. In Proceedings of the IEEE international conference on computer vision,
pages 2980–2988, 2017.
[6] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

[7] Yuri Sousa Aurelio, Gustavo Matheus De Almeida, Cristiano Leite de Castro, and Anto-
nio Padua Braga. Learning from imbalanced data sets with weighted cross-entropy function.
Neural processing letters, 50:1937–1949, 2019.
[8] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model
of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

[9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

5 Appendix
5.1 A. Quantitative evaluation
