
2021 34th International Symposium on Computer-Based Medical Systems (CBMS)

DETR and YOLOv5: Exploring Performance and Self-Training for Diabetic Foot Ulcer Detection

Raphael Brüngel¹ and Christoph M. Friedrich¹,²,∗, Member, IEEE
¹ Department of Computer Science, University of Applied Sciences and Arts Dortmund, 44227 Dortmund, Germany
² Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, 45122 Essen, Germany

∗ Corresponding author: christoph.friedrich@fh-dortmund.de

Abstract—Diabetic feet are a long-term effect of diabetes mellitus that are at risk of ulceration due to neuropathy and ischemia. Early ulcer stages show subtle changes hard to recognize by the human eye, especially on darker skin types. Acquired ulcers may become chronic for various reasons, requiring extensive documentation to monitor healing progression. For early stage detection and documentation support, object detection algorithms are a key technology for prevention and care improvement. However, attendant symptoms like malformed toenails, hyperkeratosis, and rhagades pose challenges regarding faulty detections. The research at hand explores two disparate state-of-the-art detection frameworks: Detection Transformer (DETR) as representative of the novel transformer-based architectures for computer vision, and You Only Look Once v5 (YOLOv5) as an expedited PyTorch port of YOLOv4 with explicit mobile-focus. Both are compared on a recently released dataset for diabetic foot ulcer detection with images typical for common wound care documentation. In addition, effects of self-training for performance improvement are investigated. Achieved results outperform those of other state-of-the-art methods. These are discussed highlighting differences and potential for further optimization.

Index Terms—Machine learning, object detection, diabetic foot ulcer, DETR, YOLOv5

I. INTRODUCTION

Globally, in 2017 approx. 425 million people were affected by diabetes mellitus, with an expected rise to 629 million cases until the year 2045 [1]. The diabetic foot is an associated long-term complication with a prevalence of approx. 6 % [2] that, beside other symptoms, can manifest with neuropathy-related pressure ulcers and ischemia-related gangrenes. Due to condition-related impaired wound healing capabilities, affected persons are at risk of developing wounds of chronic state.

Once acquired, treatment of chronic wounds is a delicate task. Backlashes like dressing-related contact dermatitis [3] or colonization [4] are frequent and necessitate strategic switches. Wounds require frequent monitoring and detailed documentation to trace changes. This is hard to accomplish in overburdened care systems.

Availability of reliable check-up and monitoring methods [5] can prevent chronic wound development, respectively prolonged or absent healing progression. Machine learning-based point-of-care applications on mobile devices [6], [7] for these purposes are a key technology to enhance care and relieve care systems. They have the potential to disburden caregivers and to decrease the vast expenses regarding the cost-of-illness [8]. Applications for detection of early-stage pressure ulcers [9] in particular, possibly not yet visible to the human eye, may prove as a game changer in this context.

As computational costs of methods, especially those of deep learning-based approaches, are high, gathered images are commonly transmitted to remote servers for inference. However, due to the sensitivity of medical records this procedure may be criticized in the light of transport security [10] and storage in regards to data protection frameworks. There may also not always be availability of services, e.g., during ambulant care in rural regions. On-device processing of images is thus of interest for various scenarios.

The work at hand compares two state-of-the-art detection frameworks, Detection Transformer (DETR) [11] and You Only Look Once v5 (YOLOv5) [12], in regards to their out-of-the-box detection performance for diabetic foot ulcer (DFU) detection. As a representative of the recent trend of transformer-based architectures, DETR [11] is currently in an initial phase and has no explicit mobile-focus. In contrast, YOLOv5 features a direct model export for mobile machine learning platforms and reached a matured state with its recent release. A performance comparison is done on the DFU Challenge (DFUC) 2020 [13] dataset [14], following a simplified approach used in prior work [15] involving self-training [16]. It is not to be understood as a competitive comparison, but shall present fair perspectives regarding the capabilities of two novel frameworks with different foci, used for a non-trivial task. Further, effects of self-training for both shall be explored.

Detection of DFUs has been investigated by [17] and for use in mobile applications by [6], [7]. Further recognition of ischemia and infection was assessed by [18]. The topic was recently evaluated in a broader context of the DFUC 2020 [13] in [15]. An approach used by [7] utilized Faster R-CNN [19] with Inception V2 [20]. A benchmark on the DFUC 2020 dataset was conducted by [14], comparing Faster R-CNN with backbones ResNet-101 [21], Inception V2-ResNet-101, and R-FCN [22], YOLOv5, and EfficientDet [23]. Top-performing approaches of the DFUC 2020 reported in [15] utilized variants of Faster R-CNN [24]–[26], YOLOv3 [27], the initial release v1.0 [28] of YOLOv5, EfficientDet, and a new approach named Cascade Attention DetNet based on Cascade R-CNN [24] with a DetNet backbone [29].

In the following sections used methods, achieved results, a discussion on their interpretation, and drawn conclusions are presented.

Sec. II gives detailed information on the used frameworks, the dataset and evaluation measures, as well as on the working environment. The approach related to prior work is elaborated in Sec. III. In Sec. IV achieved scores for different configurations are listed. These are addressed and compared in Sec. V, highlighting differences, suggesting optimizations, and stating limitations. A brief summary is given in Sec. VI that closes with conclusions on individual framework performance and potential use cases.

II. METHODS

In the following, DETR and YOLOv5 are elaborated, the DFUC 2020 dataset with related and prior work is introduced, and the environment in which experiments took place is described.

A. Detection Transformer (DETR)

DETR [11] was released in May 2020 and is currently available in release v0.2. It represents a novel framework for object detection and segmentation that uses the rising transformer [30] technology, which previously revolutionized natural language processing approaches [31]. Its detection performance is comparable to that of Faster R-CNN [11], significantly outperforming it on large objects [11].

Its architecture [11] is based on a standard transformer encoder-decoder architecture [30], yet differs in parallel decoding of an arbitrary amount of objects. A set-based global loss is implemented after the Hungarian algorithm [32]. Use of a conventional Convolutional Neural Network (CNN)-based backbone, e.g., ResNet [21], is incorporated. This is used to learn 2D representations that are flattened and enriched with positional encoding prior to passing them into an encoder. A decoder is supplied with so-called object queries and encoder output. Each decoder output is then passed to a Feedforward Neural Network (FNN) that forms the prediction heads.

The created object detection pipeline is simplified by removing steps depending on a priori knowledge [11], e.g., settings for non-maximum suppression (NMS), due to its set-based loss. DETR can further be utilized for segmentation purposes and features panoptic segmentation [33].
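To illustrate the set-based output format, the following is a minimal inference sketch, assuming the facebookresearch/detr torch.hub entry point and its documented output dictionary; the input tensor is a stand-in for a preprocessed image:

```python
# Minimal sketch: loading a COCO pre-trained DETR and keeping only
# confident detections (no NMS step is needed due to the set-based loss).
import torch

model = torch.hub.load("facebookresearch/detr", "detr_resnet101", pretrained=True)
model.eval()

image = torch.rand(1, 3, 480, 640)           # placeholder for a preprocessed image
with torch.no_grad():
    outputs = model(image)                   # dict with 'pred_logits' and 'pred_boxes'

# Per object query: class probabilities; the last class is the "no object" slot.
probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]
scores, labels = probs.max(-1)

keep = scores > 0.9                          # high confidence level, as used later on
boxes = outputs["pred_boxes"][0, keep]       # normalized (cx, cy, w, h) boxes
```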
B. You Only Look Once v5 (YOLOv5)

YOLOv5 was released in May 2020 and is currently available in release v4.0 [12]. It basically started as a PyTorch port of YOLOv4 [34] and thus incorporated some of its principles, but was expedited further. However, despite its naming it is not to be confused with an official successor of YOLOv4. Regarding its naming and the presented performance comparison to YOLOv4 there is a controversy¹; a scientific paper is yet pending.

It uses a one-stage detector. The backbone comprises a Cross Stage Partial Network (CSPNet) [35] and a Spatial Pyramid Pooling network (SPP) [36] for dynamic input sizes and robustness against object deformations. The head uses a Path Aggregation Network (PANet) [37] for instance segmentation. Four base models are provided: the s-, m-, l-, and x-model, differing in their increasing amount of layers and parameters.

Models have been refined over time. In release v4.0 the standard models range from ∼7.3 M parameters in the s-model up to ∼87.7 M parameters in the x-model, and from 283 layers up to 607 layers respectively (including activations, referring to outputs). Common Objects in Context (COCO) [38] pre-trained models are available; sizes range from ∼14.1 MB for the s-model up to ∼168.0 MB for the x-model.

A novelty of release v4.0 is a full switch to the Sigmoid Linear Unit (SiLU) [39] activation function that now unifies activation throughout the models and simplifies the architecture. Another novelty are P6 models². These are variants of base models, specialized for extra-large object detection, in which the architecture was extended with another output layer. Small model sizes, high inference speed, and model export³ to TorchScript, for TensorFlow Lite⁴ and CoreML⁵, underline its focus on application on mobile platforms.

¹ YOLOv4 maintainer discussion on YOLOv5: https://github.com/pjreddie/darknet/issues/2198 (2021-02-12)
² P6 models: https://github.com/ultralytics/yolov5/issues/2110 (2021-02-12)
³ Export: https://github.com/ultralytics/yolov5/issues/251 (2021-02-12)
⁴ TensorFlow Lite: https://www.tensorflow.org/lite (2021-02-12)
⁵ CoreML: https://developer.apple.com/documentation/coreml (2021-02-12)
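How the framework is typically consumed can be sketched as follows, assuming the ultralytics/yolov5 torch.hub entry point; the image path is a placeholder:

```python
# Minimal sketch: running the YOLOv5 x-model through torch.hub and reading
# detections after the built-in NMS post-processing.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True)
model.conf = 0.6                    # confidence threshold
model.iou = 0.45                    # NMS IoU threshold (the default kept in this work)

results = model("foot_image.jpg")   # accepts paths, URLs, PIL images, or tensors
detections = results.xyxy[0]        # tensor rows: (x1, y1, x2, y2, confidence, class)
```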
C. Diabetic Foot Ulcer Challenge (DFUC) 2020 Dataset

The DFUC 2020 [13] focused on DFU detection. The provided dataset [14] was released in April 2020 and comprises images annotated by experts for single-class detection.

It consists of 4,200 images and is split into three parts [14]: (i) the training part with 2,000 training images, (ii) the validation part with 200 images, and (iii) the test part with 2,000 images including some pitfall images [15] for increased complexity. The training part holds 2,496 annotations [14] while the test part holds 2,097 [15].

Images have a resolution of 640 × 480 px (portrait, landscape). The majority of training part images contains small to tiny ulcers [14]: 1,849 (74.08 %) with an area of less than 5 %, 1,250 (50.08 %) with less than 2 %. Contents are heterogeneous [14] regarding distance, angle, orientation, lighting, focus, and background objects. They contain [14] multiple ulcers and feet, same subjects over time, amputations, deformations, infections, ischemia, and stained wound dressings. Multiple ethnicities are present, yet light skin types are most common.

For test part predictions the DFUC 2020 portal⁶ yields the achieved F1-Score and mean Average Precision (mAP) for a Jaccard index [40] (also known as Intersection over Union (IoU)) threshold of 0.5 as evaluation measures.

⁶ DFUC 2020 portal: https://dfu2020.grand-challenge.org (2021-02-12)
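As a worked example of these measures, the following sketch (helper names are ours) shows the IoU criterion and the F1-Score computed from true positives, false positives, and false negatives:

```python
# Sketch of the evaluation measures: a detection counts as a true positive
# when its IoU (Jaccard index) with a ground-truth box reaches 0.5.
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7   # intersection 1, union 7
```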
D. Environment

The working environment in which experiments were performed comprised an NVIDIA® DGX-1 super-computer for deep learning with eight NVIDIA® V100 tensor core graphical processing units (GPUs), each with 16 GB memory. Used drivers had version 418.126.02, the Compute Unified Device Architecture (CUDA) had version 10.1. For DETR and YOLOv5 recent commits⁷ were pulled to benefit from further improvements. Both were executed in Docker containers, run via Nvidia-Docker⁸ in version 19.03.5. For DETR a Deepo⁹ image¹⁰ was used, for YOLOv5 it was the provided image¹¹.

⁷ DETR finetune commit e42a3b1 (https://github.com/woctezuma/detr/tree/finetune (2021-02-21)) with increased usability for custom data training, based on DETR commit 4e1a928; YOLOv5 commit c9bda11
⁸ Nvidia-Docker: https://github.com/NVIDIA/nvidia-docker (2021-02-12)
⁹ Deepo: https://github.com/ufoym/deepo (2021-02-12)
¹⁰ DockerHub image ufoym/deepo with ID 75fad69ff121
¹¹ DockerHub image ultralytics/yolov5 with ID e11fea2f791c

III. APPROACH

The approach outlined in the following, shown in Fig. 1, is based on prior work [15] but differs in simplified pre- and post-processing for a fair comparison. It uses baseline data and does not apply dataset-specific optimizations.

Fig. 1: Simplified process of the presented approach.

A. Pre-Processing

Duplicate images in the training part of the dataset have been identified via AntiDupl¹² in release v2.3.10. 39 pairs of duplicate images in the training dataset were found and doubles were removed, keeping their bounding boxes. These were merged with those of the now unique images. Intersections were resolved keeping the outer coordinates of intersecting bounding boxes, yielding a single expanded bounding box. These were checked manually for consistency. This resulted in 1,961 images with 2,453 annotations in the cleansed training part of the dataset. Annotation data was then converted to the COCO and the YOLO data format to be able to train DETR and YOLOv5, depending on these formats.

¹² AntiDupl: https://github.com/ermig1979/AntiDupl (2021-02-12)
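The described box merge can be sketched as follows; the function names and the iterative fusion loop are our illustration, not the exact tooling used:

```python
# Sketch of the annotation merge for duplicate images: intersecting
# bounding boxes are fused into one box spanning their outer coordinates.
def intersects(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge_outer(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_annotations(boxes):
    """Repeatedly fuse intersecting (x1, y1, x2, y2) boxes until none overlap."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if intersects(boxes[i], boxes[j]):
                    boxes[i] = merge_outer(boxes[i], boxes[j])
                    del boxes[j]           # the fused box replaces both
                    merged = True
                    break
            if merged:
                break
    return boxes
```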
B. Models

For both, DETR and YOLOv5, the largest of the available base models were chosen. In case of DETR this applied to ResNet-101, for YOLOv5 it was the x-model. Both were used with COCO pre-trained weights. For DETR the pre-trained model head was dropped and re-built for a single class; in the model configuration of YOLOv5 the amount of classes was set to one respectively.
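The single-class head rebuild for DETR can be sketched as follows, assuming the torch.hub model exposes its classification head as class_embed, as in the reference implementation:

```python
# Minimal sketch: dropping the COCO head of DETR and rebuilding it for one
# class. DETR reserves one extra logit for the "no object" slot.
import torch
from torch import nn

model = torch.hub.load("facebookresearch/detr", "detr_resnet101", pretrained=True)

num_classes = 1                                  # single class: ulcer
hidden_dim = model.class_embed.in_features       # transformer hidden size
model.class_embed = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
```

For YOLOv5 the equivalent change is a single entry in the model configuration file (the amount of classes, commonly the nc field).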
C. Data Augmentation

The default data augmentation technique sets built into DETR and YOLOv5 were used and neither altered nor extended. Augmentation methods are reported in the following.

DETR uses a basic set [11]. Images are scaled, assuring that the shorter side has 480 px up to 800 px and the longer side 1,333 px at most. During training random crops are performed: a rectangle is taken and resized again with a probability of 0.5.

YOLOv5 incorporates a broad set. For image alteration, photometric (hue, saturation, value (HSV) colorspace alterations) and geometric (rotation, translation, scaling, shearing, perspective alteration, and up-down and left-right flipping) distortions are applied. For image combination MixUp [41] is used. The novel mosaic data loading combines four images with a random ratio as tiles of a new one. Self-adversarial training generates deceiving images that should not yield detections. Occurrence probabilities are dependent on the used hyperparameter set, of which two defaults, regular training (hyp.scratch.yaml) and finetuning (hyp.finetune.yaml), exist.
D. Base Training

Base training represents the first of two training steps, during which models with COCO [38] pre-trained weights were trained on the training part of the dataset.

Hyperparameters used for DETR and YOLOv5 display the default sets. DETR hyperparameters were hard-coded default settings for input parameters in the training script (main.py); YOLOv5 hyperparameters were pre-configured in the configuration file for regular training (hyp.scratch.yaml). Chosen batch sizes were the maximum a single GPU would hold: 4 for DETR and 18 for YOLOv5. The initial learning rate of DETR was 1e−4 and that of YOLOv5 1e−2. Used optimizers were AdamW [42] for DETR and Stochastic Gradient Descent (SGD) [43] for YOLOv5. Training optima were estimated based on a five-fold cross-validation with an 80:20 split of the training part, choosing 30 epochs for DETR with a learning rate drop at 20 epochs, and 25 epochs for YOLOv5. The default seed for DETR was 42, that for YOLOv5 was 0.
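Such an epoch estimation could look as follows; the use of scikit-learn here is our illustration, as the paper does not state its splitting tooling:

```python
# Sketch of estimating training optima: five-fold cross-validation on the
# cleansed training part, where each fold yields an 80:20 split.
from sklearn.model_selection import KFold

image_ids = list(range(1961))  # the 1,961 cleansed training images
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_ids)):
    # ~1,569 training vs. ~392 validation images per fold (an 80:20 split)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```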

E. Self-Training

Self-training [16] represents the second of two training steps and was performed on individual extended training datasets for DETR and YOLOv5. These comprise (i) the original cleansed training part of the DFU dataset, and (ii) shares of the validation and test part. For the latter, pseudo-labels were created by predictions via the priorly trained DETR and YOLOv5 base models. Hence, for both an individual training data extension was created. Predicted pseudo-labels compensate missing ground truth labels for validation and test images, and thus enable training with them. Images for which base models did not yield predictions were not included, which is why the extended training sets only hold shares of the DFU dataset validation and test part.

For DETR base model predictions on the validation and test part a minimum confidence level of 0.9 was chosen, for YOLOv5 base model predictions it was 0.6. These levels were chosen as base models achieved the highest F1-Scores with these. Details on inference and post-processing are given in Sec. III-F; base model performance is reported in Sec. IV-A.

Performing pseudo-label prediction on the dataset validation and test part, the DETR base model yielded 2,367 object predictions for 2,090 images, and the YOLOv5 base model yielded 2,149 object predictions for 1,945 images. Thus, the new extended training sets held 4,820 objects in 4,051 images for DETR, and 4,602 objects in 3,906 images for YOLOv5.

Self-training was then performed on the extended training parts using the weights of base models as pre-trained weights, aiming for further generalization of the models. Hyperparameters for the self-trained DETR model were the same as for base model training, but the learning rate was dropped to 1e−5. For the self-trained YOLOv5 model the pre-configured hyperparameter set for finetuning (hyp.finetune.yaml) was chosen, which features a dropped learning rate of 3.2e−3 and altered data augmentation probabilities.

Self-training optima were not approachable via a cross-validation. Instead, epochs were chosen based on experience from prior experiments. For the DETR self-trained model 5 epochs with a learning rate drop at 3 were chosen, for YOLOv5 it was 25 epochs.
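The pseudo-labeling step can be sketched as follows; the prediction format and function name are ours, and the YOLO-format output is one possible target:

```python
# Sketch of the pseudo-labeling step: keep only predictions at or above the
# framework-specific confidence level and emit YOLO-format label lines.
# Images without confident predictions are skipped, which is why the extended
# training sets only hold shares of the validation and test part.
def pseudo_labels(predictions, min_confidence):
    """predictions: iterable of (image_id, [(cls, x, y, w, h, conf), ...])."""
    labeled = {}
    for image_id, boxes in predictions:
        kept = [b for b in boxes if b[5] >= min_confidence]
        if kept:  # images without confident detections are not included
            labeled[image_id] = [
                f"{c} {x:.6f} {y:.6f} {w:.6f} {h:.6f}" for c, x, y, w, h, _ in kept
            ]
    return labeled

# Hypothetical predictions; in this work the levels were 0.9 (DETR) and 0.6 (YOLOv5).
example = [("val_0001", [(0, 0.52, 0.48, 0.20, 0.15, 0.93)]),
           ("val_0002", [(0, 0.40, 0.60, 0.10, 0.08, 0.42)])]
print(pseudo_labels(example, min_confidence=0.9))  # val_0002 is skipped
```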
F. Inference and Post-Processing

Predictions via DETR and YOLOv5 were performed at different confidence levels from 0.1 to 0.9 in steps of 0.1.

For YOLOv5, an additional comparison track using test-time augmentation¹³ (TTA) was examined. YOLOv5 performs predictions on images with a maximum size of 640 px per side by default. TTA increases image sizes to 832 px, left-right flips them, and performs inference on three differently scaled image instances. Detections on these are merged prior to the default NMS. Inference speed decreases approx. threefold as a consequence, but F1-Score and mAP benefit. DETR does not provide such a feature in its current state.

Contrary to [15], NMS was not configured manually for post-processing on YOLOv5 predictions. The default IoU threshold of 0.45 was kept and applies independently from TTA usage. For DETR predictions no such manual post-processing was necessary due to its simplified pipeline.

¹³ TTA: https://github.com/ultralytics/yolov5/issues/303 (2021-02-12)
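The kept post-processing default and the confidence sweep can be sketched with torchvision's NMS operator; the boxes below are placeholders for real predictions:

```python
# Sketch of the described post-processing: NMS with the kept default IoU
# threshold of 0.45, followed by a confidence sweep from 0.1 to 0.9.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 98., 102.],      # heavy overlap with the first box
                      [200., 50., 260., 120.]])
scores = torch.tensor([0.91, 0.80, 0.65])

keep = nms(boxes, scores, iou_threshold=0.45)     # suppresses the overlapping box

for conf in [round(0.1 * i, 1) for i in range(1, 10)]:
    kept = [i for i in keep.tolist() if scores[i] >= conf]
    print(f"confidence {conf}: {len(kept)} detections")
```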
IV. RESULTS

In the following the individual performance of DETR and YOLOv5 is reported for base and self-trained models. Results of experiments are reported in Tab. I.

A. Base Model Performance

In regards to the confidence levels, base model detection performance differs notably between DETR and YOLOv5. DETR yields generally high confidence levels for correct detections, which coincides with [11]. The best F1-Score of 0.7355 was achieved at a confidence level of 0.9 with an mAP of 0.6587. YOLOv5 achieved its best F1-Score of 0.7302 at a confidence level of 0.6 with an mAP of 0.6243. Use of TTA increased the F1-Score to 0.7351 with an mAP of 0.6501 at an increased confidence level of 0.7. Best performances of DETR and YOLOv5 with use of TTA are similar.

The highest mAP of 0.7284 was achieved by DETR at the confidence level 0.1, while YOLOv5 achieved 0.6752, respectively 0.7080 using TTA. Yet, at the same level, DETR showed a low F1-Score of 0.3557, while YOLOv5 achieved a moderate F1-Score of 0.6621, respectively 0.6175 using TTA.

DETR tends to yield notably higher detection amounts at decreasing confidence levels than YOLOv5, which shows a moderate increase. Yet, use of TTA facilitates an increased amount of detections over all confidence levels. Further, while the mAP generally benefits from TTA usage, at higher confidence levels also the F1-Score is increased.

B. Self-Trained Model Performance

For DETR and YOLOv5, self-training showed a remarkable increase of F1-Scores overall, but with highest impact on low confidence levels. With a slight increase, the highest F1-Score achieved by DETR rose to 0.7384 at a confidence level of 0.9. Notable increases were achieved for YOLOv5 with its highest F1-Score of 0.7407 at confidence level 0.5, and the overall highest F1-Score of 0.7474 at confidence level 0.7 using TTA.

The highest mAPs at confidence level 0.1 either showed slight to moderate decreases or remained similar. The mAP of DETR decreased to 0.7125 and that of YOLOv5 with TTA to 0.7025. YOLOv5 remained similar with an mAP of 0.6754.

Amounts of detected objects by DETR decreased over all confidence levels, with low confidence levels most affected. YOLOv5 detections decreased for very low, but increased with moderate confidence levels. Use of TTA also showed similar behavior, yet an increase of detections did only occur with high confidence levels. While the amount of images with detections by DETR was slightly decreased over all confidence levels, that of YOLOv5 increased.

TABLE I: Detection performance comparison at different confidence levels for DETR and YOLOv5 base and self-trained models. YOLOv5 shows an additional track for use of TTA. Images with detected objects as well as F1-Score and mAP for an IoU threshold of 0.5 are listed. The best F1-Score of each track is marked with an asterisk.

                            |            DETR              |           YOLOv5             |       YOLOv5 with TTA
Model         Confidence   | Images Objects   F1     mAP  | Images Objects   F1     mAP  | Images Objects   F1     mAP
Base             0.1       | 2,000   8,216  0.3557 0.7284 | 1,943   2,815  0.6621 0.6752 | 1,971   3,383  0.6175 0.7080
                 0.2       | 2,000   5,500  0.4726 0.7232 | 1,919   2,493  0.6980 0.6682 | 1,953   2,934  0.6619 0.7011
                 0.3       | 1,999   4,477  0.5370 0.7180 | 1,891   2,316  0.7138 0.6597 | 1,938   2,673  0.6902 0.6959
                 0.4       | 1,997   3,883  0.5826 0.7134 | 1,864   2,183  0.7206 0.6488 | 1,924   2,496  0.7072 0.6893
                 0.5       | 1,996   3,498  0.6159 0.7092 | 1,812   2,058  0.7259 0.6371 | 1,895   2,332  0.7185 0.6788
                 0.6       | 1,995   3,185  0.6456 0.7047 | 1,758   1,935  0.7302* 0.6243 | 1,866  2,178  0.7298 0.6684
                 0.7       | 1,986   2,872  0.6782 0.6993 | 1,638   1,755  0.7139 0.5886 | 1,798   2,006  0.7351* 0.6501
                 0.8       | 1,965   2,554  0.7039 0.6853 | 1,286   1,345  0.6566 0.4931 | 1,580   1,673  0.7167 0.5918
                 0.9       | 1,896   2,134  0.7355* 0.6587 |   136     137  0.1182 0.0613 |   306     308  0.2453 0.1384
Self-trained     0.1       | 2,000   5,699  0.4602 0.7125 | 1,949   2,603  0.6928 0.6754 | 1,975   2,982  0.6552 0.7025
                 0.2       | 1,998   4,242  0.5531 0.7052 | 1,919   2,342  0.7213 0.6670 | 1,961   2,616  0.6955 0.6955
                 0.3       | 1,997   3,699  0.5973 0.7006 | 1,894   2,221  0.7309 0.6593 | 1,944   2,415  0.7137 0.6865
                 0.4       | 1,995   3,352  0.6291 0.6966 | 1,866   2,128  0.7399 0.6541 | 1,926   2,279  0.7299 0.6822
                 0.5       | 1,988   3,093  0.6555 0.6933 | 1,830   2,037  0.7407* 0.6428 | 1,896  2,168  0.7381 0.6744
                 0.6       | 1,982   2,881  0.6758 0.6882 | 1,782   1,948  0.7367 0.6280 | 1,867   2,087  0.7424 0.6670
                 0.7       | 1,973   2,653  0.6973 0.6807 | 1,687   1,809  0.7266 0.6017 | 1,819   1,976  0.7474* 0.6560
                 0.8       | 1,955   2,406  0.7151 0.6665 | 1,343   1,396  0.6585 0.4985 | 1,607   1,685  0.7213 0.5962
                 0.9       | 1,883   2,088  0.7384* 0.6447 |   239     240  0.1917 0.1030 |   480     482  0.3544 0.2122

V. DISCUSSION

Compared to the best results of the DFUC 2020 [15], base model performance of DETR and YOLOv5 with out-of-the-box settings is reasonable. Using TTA, YOLOv5 outperforms the best DFUC 2020 F1-Score of 0.7437 [15]. DETR performs comparably. Both DETR and YOLOv5 using TTA outperform the best DFUC 2020 mAP of 0.6940 [15] with base and self-trained models. This is achieved following a rather rough approach without further tuning.

In regards to the F1-Score, especially DETR will benefit from cautious adjustments of high confidence levels, as best performance is present in a small band. In contrast, YOLOv5 shows reasonable performance in a rather broad band. At very low levels, YOLOv5 can still achieve moderate F1-Scores while that of DETR suffers from high amounts of detections for all images. This also indicates that DETR easily falls for pitfall images, even at high levels.

Both benefit from self-training in different ways, but YOLOv5 seems to benefit more notably. Self-trained models achieve higher F1-Scores over all confidence levels, especially at lower ones. This is most prominently visible in the self-trained DETR model. DETR mAPs are impaired moderately while YOLOv5 mAPs remain similar, respectively increase. When using TTA, YOLOv5 mAPs slightly decrease for low to moderate levels and increase for higher ones.

It is suggested that performance of DETR for F1-Score and mAP could strongly benefit from inclusion of further data augmentation methods for training and a TTA feature for inference. When using TTA, further performance improvements for YOLOv5 regarding the F1-Score may be achieved by manually adjusting the NMS IoU threshold. However, this may just increase performance on a specific dataset but not in general. DETR does not involve manual NMS due to pipeline simplifications. Performance of YOLOv5 models could be further increased using its convenience feature for saving the best-performing epoch weights during training alongside those of the last trained epoch. This was not used for reasons of a fair comparison, as DETR does not offer such a feature.

Regarding the presented performance comparison and self-training approach there are limitations. As the comparison was performed for a single-class detection task with pre-trained weights, behavior of both frameworks may vary considerably for a multi-class task and when trained from scratch. In addition, peak performance for base and self-trained models was not meant to be achieved in the first place; further tuning is possible. Generally, when considering self-training, a less aggressive approach may further facilitate generalization. When aiming to max out performance, choosing higher confidence levels is advised for pseudo-labeling to exclude as many false-positive detections as possible. Thus, self-trained models may be able to achieve even higher scores. Behavior may also differ for multi-class tasks, yet a convenient performance increase can also be expected [16].

VI. CONCLUSIONS

F1-Score focused out-of-the-box performance of DETR and YOLOv5 base models is similar for the given task. When used with TTA, YOLOv5 performs nearly equal. This is promising for mobile-focused use cases relying on on-device inference, or generally where less computation-intensive methods are needed. For use cases that involve additional segmentation tasks, use of DETR may prove beneficial as extension is easy. Both DETR and YOLOv5 models benefit from further self-training, strongly stabilizing the F1-Scores at low confidences and moderately increasing peak performances.

The base model of DETR achieved the highest mAP and the self-trained YOLOv5 model using TTA achieved the highest F1-Score. These outperform the best F1-Score and the best mAP achieved during the DFUC 2020 [15], not using image pre-processing or detection post-processing. Performance is expected to further increase by refining the self-training approach, choosing confidence levels for pseudo-labels carefully.

Both frameworks could increase effectiveness of computer-aided medical documentation by decreasing the need for manual corrections, and foster reliability of medical applications, e.g., early stage detection and healing progression monitoring. Future work will cover further experiments on a refined self-training approach for the used frameworks. In addition, community-driven DETR variations [44], [45] addressing discussed shortcomings will be investigated.

ACKNOWLEDGMENT

The authors thank Johannes Rückert, Department of Computer Science, University of Applied Sciences and Arts Dortmund, 44227 Dortmund, Germany, for technical advice.

REFERENCES

[1] N. Cho, J. Shaw, S. Karuranga, Y. Huang, J. da Rocha Fernandes et al., “IDF Diabetes Atlas: Global estimates of diabetes prevalence for 2017 and projections for 2045,” Diabetes Research and Clinical Practice, vol. 138, pp. 271–281, 2018.
[2] P. Zhang, J. Lu, Y. Jing, S. Tang, D. Zhu, and Y. Bi, “Global epidemiology of diabetic foot ulceration: a systematic review and meta-analysis,” Annals of Medicine, vol. 49, no. 2, pp. 106–116, 2016.

[3] A. Alavi, R. G. Sibbald, B. Ladizinski, A. Saraiya, K. C. Lee et al., “Wound-Related Allergic/Irritant Contact Dermatitis,” Advances in Skin & Wound Care, vol. 29, no. 6, pp. 278–286, 2016.
[4] A. R. Siddiqui and J. M. Bernstein, “Chronic wound infection: Facts and controversies,” Clinics in Dermatology, vol. 28, no. 5, pp. 519–526, 2010.
[5] A. J. Boulton, L. Vileikyte, G. Ragnarson-Tennvall, and J. Apelqvist, “The global burden of diabetic foot disease,” The Lancet, vol. 366, no. 9498, pp. 1719–1724, 2005.
[6] M. H. Yap, K. E. Chatwin, C.-C. Ng, C. A. Abbott, F. L. Bowling et al., “A New Mobile Application for Standardizing Diabetic Foot Images,” Journal of Diabetes Science and Technology, vol. 12, no. 1, pp. 169–173, 2017.
[7] M. Goyal, N. D. Reeves, S. Rajbhandari, and M. H. Yap, “Robust Methods for Real-Time Diabetic Foot Ulcer Detection and Localization on Mobile Devices,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 4, pp. 1730–1741, 2019.
[8] B. Chan, S. Cadarette, W. Wodchis, J. Wong, N. Mittmann, and M. Krahn, “Cost-of-illness studies in chronic ulcers: a systematic review,” Journal of Wound Care, vol. 26, no. sup4, pp. S4–S14, 2017.
[9] M. H. Yap, C.-C. Ng, K. Chatwin, C. A. Abbott, F. L. Bowling et al., “Computer Vision Algorithms in the Detection of Diabetic Foot Ulceration,” Journal of Diabetes Science and Technology, vol. 10, no. 2, pp. 612–613, 2015.
[10] J. Müthing, R. Brüngel, and C. M. Friedrich, “Server-Focused Security Assessment of Mobile Health Apps for Popular Mobile Platforms,” Journal of Medical Internet Research, vol. 21, no. 1, p. e9818, 2019.
[11] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Computer Vision – ECCV 2020. Springer International Publishing, 2020, pp. 213–229.
[12] G. Jocher, A. Stoken, J. Borovec, NanoCode012, ChristopherSTAN et al., “ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration,” 2021. [Online]. Available: https://zenodo.org/record/4418161
[13] M. H. Yap, N. Reeves, A. Boulton, S. Rajbhandari, D. Armstrong et al., Diabetic Foot Ulcers Grand Challenge 2020. [Online]. Available: https://zenodo.org/record/3731068
[14] B. Cassidy, N. D. Reeves, P. Joseph, D. Gillespie, C. O’Shea et al., “DFUC2020: Analysis Towards Diabetic Foot Ulcer Detection,” arXiv:2004.11853, 2020.
[15] M. H. Yap, R. Hachiuma, A. Alavi, R. Brüngel, M. Goyal et al., “Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation,” arXiv:2010.03341, 2020.
[16] S. Koitka and C. M. Friedrich, “Optimized Convolutional Neural Network Ensembles for Medical Subfigure Classification,” in Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Lecture Notes in Computer Science (LNCS). Springer International Publishing, 2017, pp. 57–68.
[17] M. Goyal, N. D. Reeves, A. K. Davison, S. Rajbhandari, J. Spragg, and M. H. Yap, “DFUNet: Convolutional Neural Networks for Diabetic Foot Ulcer Classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 5, pp. 728–739, 2020.
[18] M. Goyal, N. D. Reeves, S. Rajbhandari, N. Ahmad, C. Wang, and M. H. Yap, “Recognition of ischaemia and infection in diabetic foot ulcers: Dataset and techniques,” Computers in Biology and Medicine, vol. 117, p. 103616, 2020.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[22] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object Detection via Region-Based Fully Convolutional Networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., 2016, pp. 379–387.
[23] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[24] Z. Cai and N. Vasconcelos, “Cascade R-CNN: High Quality Object Detection and Instance Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[25] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets V2: More Deformable, Better Results,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
[26] Y. Cao, K. Chen, C. C. Loy, and D. Lin, “Prime Sample Attention in Object Detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[27] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767, 2018.
[28] G. Jocher, L. Changyu, A. Hogan, L. Yu, changyu98 et al., “ultralytics/yolov5: Initial Release,” 2020. [Online]. Available: https://zenodo.org/record/3908560
[29] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “DetNet: Design Backbone for Object Detection,” in Computer Vision – ECCV 2018. Springer International Publishing, 2018, pp. 339–354.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus et al., Eds., vol. 30. Curran Associates, Inc., 2017.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.
[32] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[33] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, “Panoptic Segmentation,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
[34] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv:2004.10934, 2020.
[35] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “CSPNet: A new backbone that can enhance learning capability of CNN,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2020.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[37] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018.
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014. Springer International Publishing, 2014, pp. 740–755.
[39] D. Hendrycks and K. Gimpel, “Gaussian Error Linear Units (GELUs),” arXiv:1606.08415, 2020.
[40] P. Jaccard, “The Distribution of the Flora in the Alpine Zone,” New Phytologist, vol. 11, no. 2, pp. 37–50, Feb. 1912.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” in 6th International Conference on Learning Representations (ICLR) 2018. ICLR, 2018.
[42] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in 7th International Conference on Learning Representations (ICLR) 2019. ICLR, 2019.
[43] L. Bottou, “Large-Scale Machine Learning with Stochastic Gradient Descent,” in Proceedings of COMPSTAT'2010. Physica-Verlag HD, 2010, pp. 177–186.
[44] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in 9th International Conference on Learning Representations (ICLR) 2021. ICLR, 2021.
[45] T. Prangemeier, C. Reich, and H. Koeppl, “Attention-Based Transformers for Instance Segmentation of Cells in Microstructures,” in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020.
