CVPR 2022 Submission #9240. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Contour-Hugging Heatmaps for Landmark Detection

Anonymous CVPR submission

Paper ID 9240
Abstract
We propose an effective and easy-to-implement method for simultaneously performing landmark detection in images and obtaining an ingenious uncertainty measurement for each landmark. Uncertainty measurements for landmarks are particularly useful in medical imaging applications: rather than giving an erroneous reading, a landmark detection system is more useful when it flags its level of confidence in its prediction. When an automated system is unsure of its predictions, the accuracy of the results can be further improved manually by a human. In the medical domain, being able to review an automated system's level of certainty significantly improves a clinician's trust in it. This paper obtains landmark predictions with uncertainty measurements using a three stage method: 1) we train our network on one-hot heatmap images, 2) we calibrate the uncertainty of the network using temperature scaling, 3) we calculate a novel statistic called 'Expected Radial Error' to obtain uncertainty measurements. We find that this method not only achieves localisation results on par with other state-of-the-art methods but also an uncertainty score which correlates with the true error for each landmark, thereby bringing an overall step change in what a generic computer vision method for landmark detection should be capable of. In addition we show that our uncertainty measurement can be used to classify, with good accuracy, which landmark predictions are likely to be inaccurate. Code available at: https://github.com/jfm15/ContourHuggingHeatmaps.git

Figure 1. Images demonstrating the difference in how our method quantifies the uncertainty of its landmark positions compared to previous approaches. (a) Gaussian distributions output by Kumar et al. [1] - LUVLi Landmarks. (b) Gaussian distributions output by Lee et al. [2], which uses a Bayesian CNN. (c) Contours of the heatmaps output by our method. The dark blue dots are the predicted landmark points and the bright green dots are the ground truth. We call our heatmaps contour hugging because of the way they bend around the edges (in this case of the head). Our probability distributions are not restricted to being symmetrical and uni-modal.

1. Introduction

Automatic landmark detection from images is an important task in a number of applications, from monitoring a driver's vital signs [3] to medical imaging applications on numerous body parts including the knee, spine and lungs [4-6]. Most modern approaches to landmark detection use a deep learning pipeline and obtain impressive localisation results. However, these deep learning methods always detect some landmarks erroneously during testing. Take, for example, the task of cephalometric landmark detection from x-rays of the head. These landmarks are used to compute clinically useful angles and measurements from which clinicians can diagnose patients [7]. However, even the latest deep learning approaches detect at least 3% of landmarks incorrectly (greater than 4mm error) [8, 9], so it could be dangerous to build these systems into safety-critical clinical workflows, especially if there was no human expert supervision. In this paper we address this problem by formulating the task of landmark detection as a classification task over all pixels in an image. This allows us to obtain more expressive and interpretable heatmaps, as shown in Figure 1c. These heatmaps can be calibrated (Section 3.3) and then analysed using our novel statistic called Expected Radial Error, ERE (Section 3.4). This statistic correlates well with the true localisation error and can be used to flag potentially erroneous predictions (Section 5.3.1).
Figure 2. (a) Shows a commonly used method for generating target heatmaps: this heatmap is a Gaussian distribution, like in most recent methods. However it isn't representative of the position of the landmark, because this point at the end of a chin is unlikely to be placed within the chin itself or out in space; it is more likely to be placed along the contour of the chin by a human. (b) Shows how our network, trained on one-hot heatmaps, can express a multi-modal distribution in its output: here it outputs a multi-modal distribution heatmap for the landmark in the center of the image. This nuance would not be captured by previous approaches which encode the uncertainty as Gaussian distributions.

2. Background

In recent literature, fully convolutional neural networks (CNNs) have established themselves as the state of the art in landmark detection, overtaking previous approaches such as random forests [10, 11]. This began with Tompson et al. [12], who used a CNN to regress target heatmaps, achieving state of the art performance on the human pose detection problem. Shortly afterwards, fully convolutional networks [13], including the U-Net [14], became very popular for segmentation tasks, and this encoder-decoder architecture began to be applied to landmark detection as well, such as in Payer et al. [15]. More recent architectures in the area have stacked or cascaded models like this sequentially [8, 16] or in more complicated configurations [17]. However, there is still evidence that the standard U-Net can perform at a high level when its hyper-parameters are tuned correctly, in both segmentation [18] and landmark detection problems [19].

The works mentioned so far produce landmark predictions but do not produce any value of how 'sure' or how 'uncertain' their model is in that prediction. One reason for this is that the majority of existing approaches train networks on synthetically generated heatmaps created by a Gaussian distribution [9, 15, 20]. This has the disadvantage that the network is being trained on heatmaps which do not represent the uncertainty of where that landmark could realistically be placed, and so the output of those models is not calibrated either [1]. This is illustrated in Fig 2a.

2.1. Uncertainty Estimation Methods

Recent works which aim to address this problem include Lee et al. [2], which uses a Bayesian CNN to output 2D Gaussian probability distributions for each landmark representing the probabilities of where that landmark could be placed, and Kumar et al. [1], which regresses the position of the landmark as well as the values of a covariance matrix representing the uncertainty of its position. The problem with these approaches is that they restrict their output probability distribution to a Gaussian distribution, which is unrealistic for many real world tasks because it is uni-modal and symmetric.

2.2. Our Work

We address this disadvantage by formulating the problem of landmark detection as a classification problem over all pixels in the image, to obtain output heatmaps which are not restricted. We theorise that we can make these output heatmaps well calibrated using a temperature scaling method described in Guo et al. [21] and thus provide accurate heatmaps. We validate these heatmaps by assessing their calibration using reliability diagrams and measuring the Expected Calibration Error (ECE) in Section 5.2.

In addition, we propose a statistic called Expected Radial Error (ERE) to summarise how uncertain our model is based on the heatmap output. We firstly validate this statistic by showing there is a correlation between it and localisation error. Then, secondly, we perform experiments to see whether we can filter landmarks, on an individual basis, by applying a threshold to this statistic to flag up when our model is likely to have made an inaccurate prediction (Section 5.3.1). This is relevant in real world applications: an AI system which can flag up when an error in its output is likely to occur is most valuable. Such functionality also leads to increased users' trust in the system.


2.3. Contributions

Our innovations are to:

1. Present a reproducible network which achieves performance comparable to the state-of-the-art on the cephalometric landmark detection task, and can run on a modest 8GB GPU. We share the code at: https://github.com/jfm15/ContourHuggingHeatmaps.git.

2. Show that it is possible to obtain near SOTA localisation performance by formulating the landmark detection task as a classification task (Table 1).

3. Demonstrate that the probabilities in the output heatmaps can be calibrated using temperature scaling (Section 5.2.1).

4. Show that our novel Expected Radial Error (ERE) statistic correlates with the localisation error (Section 5.3) and build a binary classifier based on ERE to flag up potentially erroneous predictions with a good degree of accuracy (Section 5.3.1).

Figure 3. Output heatmaps displayed with their Expected Radial Error (ERE) statistics. (a) Shows a full cephalometric image with boxes to highlight where 3b and 3c are cropped from. (b) The model is more uncertain on the positioning of the 2 landmarks in the left of this patch, which is reflected in the higher ERE scores. (c) The model is reasonably certain on the positioning of these landmarks, although the spread out heatmaps have slightly higher ERE scores.

3. Method

The principal novelty in our work comes from how we have formulated the problem of landmark detection as a classification problem and how we have validated the utility of the output heatmaps in a qualitative and quantitative way. This work uses the well-established U-Net as the main network architecture, in a way which is described in the subsequent section.

3.1. Architecture

We perform our experiments using a U-Net [14] with a ResNet-34 encoder pretrained on ImageNet [22]. We chose the U-Net architecture because it is easy to implement, reproducible and has evidence of obtaining good results on landmark detection problems [19]. Our decoder has 5 levels of upsampling with 256, 128, 64, 32 and 32 channels in each of the levels from the bottom level to the top.¹ After each convolution there is a batch normalisation layer and a ReLU activation function. We then have a final 1 × 1 convolutional layer to squash the 32 channels in the top layer into N channels, each of which represents the heatmap for one of the landmarks, N being the number of landmarks to be detected. This is implemented using the pytorch segmentation models library.²

We then apply a 2D softmax activation function to each of these channels³ to convert them into a probability distribution over all pixels in the image. Formally, our network outputs a tensor comprised of n channels: {c_1, c_2, ..., c_n}, c_l \in R^{w \times h}, where w and h are the width and height of the input images. The 2D softmax function works on each channel independently such that:

    \sigma_l(i, j) = \frac{e^{c_l(i,j)}}{\sum_{s=1}^{w} \sum_{t=1}^{h} e^{c_l(s,t)}}    (1)

where tensor \sigma_l(\cdot, \cdot) is calculated for each channel. We use a negative log likelihood loss⁴ to train the network. At test time we obtain predicted landmark points by selecting the hottest point in the heatmap or, in other words, the mode of the output distribution.

¹ We ran our experiments on an 8GB GPU and were quite restricted in how many channels each layer could have.
² https://github.com/qubvel/segmentation_models.pytorch
³ Unless we are performing temperature scaling, see Section 3.3, in which case we scale each channel by a temperature parameter first.
⁴ https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html
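The 2D softmax of Eq. (1) and the hottest-point (mode) selection can be sketched as follows; this is a minimal numpy sketch of the operations, not the released implementation, and the function names are ours:

```python
import numpy as np

def spatial_softmax(channel):
    """Eq. (1): softmax over all pixels of one w-by-h logit channel."""
    # Subtracting the max is for numerical stability only; it does not change the result.
    e = np.exp(channel - channel.max())
    return e / e.sum()

def predicted_landmark(heatmap):
    """Return the (i, j) coordinate of the hottest point, i.e. the mode."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

c = np.array([[0.0, 1.0], [2.0, 5.0]])     # toy 2x2 logit channel
h = spatial_softmax(c)
assert abs(h.sum() - 1.0) < 1e-9           # a valid probability distribution
assert predicted_landmark(h) == (1, 1)     # the largest logit stays the mode
```

At test time each of the N channels is processed this way independently, one heatmap per landmark.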
3.2. Heatmap Generation

As mentioned in Section 2.2, we formulate the landmark detection task as a classification problem. We do this by training our model on heatmaps which contain a single 1 spike at the ground-truth point, with 0s at every other position. Formally this is defined as:

    H(i, j) = \begin{cases} 1, & \text{if } i = x \text{ and } j = y \\ 0, & \text{otherwise} \end{cases}    (2)

where H(i, j) denotes the pixel value of the heatmap at each point (i, j), and (x, y) denotes the coordinates of the ground truth landmark point.
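Generating the one-hot target of Eq. (2) is straightforward; a minimal numpy sketch (the function name and the (i, j) = (width, height) indexing convention follow the notation above):

```python
import numpy as np

def one_hot_heatmap(w, h, x, y):
    """Eq. (2): target heatmap with a single 1 at the ground-truth pixel (x, y)."""
    H = np.zeros((w, h), dtype=np.float32)
    H[x, y] = 1.0
    return H

target = one_hot_heatmap(4, 4, 2, 1)
assert target.sum() == 1.0        # exactly one spike
assert target[2, 1] == 1.0        # placed at the ground-truth coordinate
```

Because the target is itself a valid probability distribution, it can be compared directly against the softmax output with a negative log likelihood loss.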
3.3. Temperature Scaling

An advantage of formulating landmark detection as a classification problem is that we can perform temperature scaling to calibrate the heatmap probabilities. Temperature scaling is well described in Guo et al. [21]. After we train our model using the heatmaps described in Section 3.2, we add an additional parameter T for each landmark to the model such that each channel pixel c_l(\cdot, \cdot) is divided by scalar T_l before being put through the softmax activation function. So the softmax output becomes:

    \sigma_l(i, j) = \frac{e^{c_l(i,j)/T_l}}{\sum_{s=1}^{w} \sum_{t=1}^{h} e^{c_l(s,t)/T_l}}    (3)

We freeze all the other parameters in our trained network and fine tune the network to optimize the T_l parameters using the same negative log likelihood loss. This does not change the localisation accuracy of the network because the hottest points will remain the same no matter what values T_l takes. Section 5.2.1 shows how T_l calibrates the network.
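The tempered softmax of Eq. (3) and its key property, that scaling by T_l leaves the hottest point unchanged, can be sketched as follows (a minimal numpy sketch under our own naming; the fitted T_l values themselves come from the fine-tuning step described above):

```python
import numpy as np

def tempered_softmax(channel, T):
    """Eq. (3): divide the logits by a temperature T before the spatial softmax."""
    z = channel / T
    e = np.exp(z - z.max())
    return e / e.sum()

c = np.array([[0.0, 1.0], [2.0, 5.0]])   # toy logit channel
p1 = tempered_softmax(c, 1.0)
p2 = tempered_softmax(c, 2.0)

# The mode is unchanged, so localisation accuracy is unaffected...
assert np.argmax(p1) == np.argmax(p2)
# ...but T > 1 softens the distribution, lowering the peak confidence.
assert p2.max() < p1.max()
```

This is why the calibration step can be applied after training without any cost to the localisation results reported in Table 1.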
3.4. Uncertainty Estimation

We obtain an uncertainty measurement for each landmark by calculating a statistic we call the Expected Radial Error (ERE). This statistic is calculated using the following equation:

    \mathrm{ERE}_{\delta} = \sum_{i=1}^{w} \sum_{j=1}^{h} \delta(i, j) \sqrt{(i - \tilde{x})^2 + (j - \tilde{y})^2}    (4)

where w and h are the width and height of the image; \delta(\cdot, \cdot) is the heatmap / probability distribution output by the softmax activation function at the end of our network; and (\tilde{x}, \tilde{y}) are the coordinates of the predicted landmark or, in other words, the hottest point in \delta. It is worth noting that \delta must be pre-processed before ERE can be calculated. This pre-processing consists of converting all values in \delta which are below 5% of the hottest point to 0 and then re-normalizing \delta by dividing by the sum of its values. We show that ERE is correlated with localisation error in Section 5 and hypothesize that it can be used to flag up erroneous predictions, which we validate in Section 5.3.1. Examples of ERE scores next to heatmaps are given in Figure 3.
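Eq. (4), including the 5% pre-processing step, can be sketched as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def expected_radial_error(delta):
    """Eq. (4): probability-weighted distance from the hottest point,
    after the pre-processing step described in Section 3.4."""
    # Zero out values below 5% of the peak, then re-normalise to sum to 1.
    d = np.where(delta >= 0.05 * delta.max(), delta, 0.0)
    d = d / d.sum()
    x, y = np.unravel_index(np.argmax(d), d.shape)  # predicted landmark (x~, y~)
    i, j = np.indices(d.shape)
    return float((d * np.sqrt((i - x) ** 2 + (j - y) ** 2)).sum())

peaked = np.zeros((5, 5))
peaked[2, 2] = 1.0
spread = np.full((5, 5), 1.0 / 25.0)
assert expected_radial_error(peaked) == 0.0   # all mass at the mode: no uncertainty
assert expected_radial_error(spread) > 1.0    # diffuse heatmap: large ERE
```

A confident, tightly-peaked heatmap therefore gets a low ERE, while a spread-out or multi-modal heatmap gets a high one, which is the behaviour visible in Figure 3.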

4. Experiments

4.1. Dataset

We perform our experiments on the publicly available cephalometric dataset [23]. The dataset contains 400 x-rays, split into 150 training images and two test sets, Test Set 1 and Test Set 2, comprised of 150 and 100 images respectively. Each image in the dataset has a resolution of 1935 × 2400, where each pixel represents a 0.1mm square, and each comes with two sets of ground truth annotations for 19 landmarks, one from a senior clinician and one from a junior clinician. As in previous works [23] we take the average of the landmark points given by the two clinicians as our ground truth landmark points, which we train and test on. Before passing these images into our network for training or testing we resize them to 640 × 800 pixels.

4.2. Training The Network

We train our U-Net architecture by passing down-sampled cephalometric x-ray images of size 640 × 800 through our network and 2D softmax function (Eq 1) to obtain predicted heatmaps. We then use a negative log likelihood function against our ground-truth heatmaps to generate a loss. We train the network using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 4. We step down the learning rate by a factor of 0.1 at epochs 4, 6 and 8.

                          ------------ Test Set 1 ------------    ------------ Test Set 2 ------------
                          MRE     SDR (%)                         MRE     SDR (%)
Model                     (mm)    2mm    2.5mm  3mm    4mm        (mm)    2mm    2.5mm  3mm    4mm

No uncertainties:
Ibragimov et al. [10]     1.87    71.70  77.40  81.90  88.00      -       62.74  70.47  76.53  85.11
Lindner et al. [11]       1.67    74.95  80.28  84.56  89.68      -       66.11  72.00  77.63  87.42
Arik et al. [24]          -       75.37  80.91  84.32  88.25      -       67.68  74.16  79.11  84.63
Yao et al. [19]           1.24    84.84  90.52  93.75  97.40      1.61    71.89  80.63  86.36  93.68
Chen et al. [25]          1.17    86.67  92.67  95.54  98.53      1.48    75.05  82.84  88.53  95.05
Zhong et al. [9]          1.12    86.91  91.82  94.88  97.90      1.42    76.00  82.90  88.74  94.32

Ours (with uncertainty)   1.20    83.47  89.16  92.60  96.49      1.46    74.63  83.58  87.21  93.79

Table 1. Localisation results for our method compared to existing methods. Our method reports results which improve on old methods and are on a par with recent SOTA methods, whilst having the significant benefits of being a simpler architecture and outputting an uncertainty measurement. It is worth noting that Lee et al. [2] also localise landmarks on cephalometric data and produce uncertainty measurements; however those results do not belong in the table because their test set is a combination of Test Sets 1 and 2. When we performed the same experiment, our method obtained an MRE of 1.30 compared to their 1.54.
A large amount of data augmentation is implemented using the imgaug library.⁵ We implement the following augmentations: X and Y translation by a maximum of 10 pixels, intensity scaling of all pixels by a random factor between 0.5 and 1, scaling the image up or down so that it is between 0.95 and 1.05 of its original scale, rotating the image either anti-clockwise or clockwise by at most 3°, and finally elastically transforming each image. As no validation set was given with the dataset we optimized our hyper-parameters by holding out 30 images from the training set to use as a validation set. Then, once we had found the best hyper-parameters, we trained on the whole of the training set for 15 epochs (15 epochs was chosen because localisation results plateaued after epoch 15 on the validation set) to obtain our final models.

⁵ https://imgaug.readthedocs.io/en/latest/
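Two of these augmentations can be sketched in plain numpy as follows; this is a toy stand-in for the imgaug pipeline (the function name, seed, and image size are ours, and np.roll wraps around rather than padding as a true translation would):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Random translation (up to 10 px) and intensity scaling (factor in [0.5, 1]),
    mirroring two of the augmentations listed above."""
    dx, dy = rng.integers(-10, 11, size=2)
    shifted = np.roll(image, (int(dx), int(dy)), axis=(0, 1))  # toy translation
    factor = rng.uniform(0.5, 1.0)
    return shifted * factor

img = rng.random((32, 32))
out = augment(img)
assert out.shape == img.shape
assert out.max() <= img.max()   # intensity scaling never brightens the image
```

Because the targets are single-pixel spikes (Section 3.2), any geometric augmentation must be applied identically to the image and to the ground-truth landmark coordinates.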
5. Evaluation

We evaluate our model not just on the accuracy of its predicted points but also on how well-calibrated its output heatmaps are and whether they can be used to flag up erroneous results. Given a particular computer vision application, these measurements can quantify its real-life utility.

5.1. Localisation Results

Once an image is put through the trained network, scaled by the temperature parameters and passed through the 2D softmax layer, we select the hottest point on the heatmap as our final predicted heatmap position. To validate the accuracy of these positions we used statistics established since the cephalometric dataset was released [7]. These are the Mean Radial Error (MRE), which is the average euclidean distance between the predicted landmark points and the ground-truth landmark points measured in mm, and the Success Detection Rate (SDR) for 4 thresholds: 2mm, 2.5mm, 3mm and 4mm. The SDR is the percentage of points predicted with a radial error less than a given threshold. We compare our results both to methods proposed when the dataset was originally released [10, 11] and to state of the art methods with novel deep learning architectures [9, 19, 25]. The results are presented in Table 1.
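The MRE and SDR statistics can be sketched as follows (a minimal numpy sketch with toy data; the function name is ours):

```python
import numpy as np

def mre_and_sdr(pred, gt, thresholds=(2.0, 2.5, 3.0, 4.0)):
    """Mean Radial Error (mm) and Success Detection Rate (%) per threshold."""
    errors = np.linalg.norm(pred - gt, axis=1)          # euclidean distance per landmark
    mre = errors.mean()
    sdr = {t: 100.0 * (errors < t).mean() for t in thresholds}
    return mre, sdr

pred = np.array([[0.0, 0.0], [3.0, 4.0]])   # toy predicted landmarks, in mm
gt   = np.array([[0.0, 1.0], [3.0, 4.0]])   # toy ground-truth landmarks, in mm
mre, sdr = mre_and_sdr(pred, gt)
assert mre == 0.5               # individual errors are 1.0 mm and 0.0 mm
assert sdr[2.0] == 100.0        # both predictions fall under the 2 mm threshold
```

On the cephalometric dataset the pixel-to-millimetre conversion (0.1mm per pixel at full resolution, Section 4.1) must be applied before computing these statistics.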
5.2. Validating Our Heatmaps

The qualitative analysis of our heatmaps (examples shown in Figure 3b and Figure 3c) is particularly appealing because the heatmaps hug the contours of the feature (the head, in this case); they offer realistic potential locations of where each landmark could be placed.

In addition to this geometric consideration, we also show numerically that these heatmaps are realistic. To do this we demonstrate that the probabilities in the heatmap are well calibrated by using a reliability diagram (Figure 4b).

5.2.1 Reliability Diagrams

To assess the calibration of our model after training we produce reliability diagrams like in Guo et al. [21]. The reliability diagrams are histograms which show accuracy as a function of confidence. In our case we define confidence to be the probability at the hottest point in the output heatmap, or, in other words, the value at our predicted landmark point.

To produce the histograms shown in Figure 4 we put each predicted landmark into a bin. There are 10 bins, each of equal width, such that the first bin will contain predicted points with confidences between 0 and M/10, the second will contain points with confidences between M/10 and 2M/10, etc., where M is the maximum confidence of any prediction in the validation set. The confidence for each bin is defined as the average confidence of predictions in that bin.
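The binning procedure just described can be sketched as follows (a minimal numpy sketch; the function name and toy data are ours, and "correct" here stands for the per-prediction accuracy indicator defined in the next paragraph):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into n_bins equal-width confidence bins up to M, the
    maximum observed confidence; return (bin, mean confidence, accuracy) per
    non-empty bin."""
    m = confidences.max()
    # Bin index in [0, n_bins - 1]; the point with confidence M lands in the last bin.
    idx = np.minimum((confidences / m * n_bins).astype(int), n_bins - 1)
    stats = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            stats.append((b, confidences[mask].mean(), correct[mask].mean()))
    return stats

conf = np.array([0.05, 0.5, 0.95, 1.0])   # toy peak-probability confidences
hit = np.array([0.0, 0.0, 1.0, 1.0])      # toy correctness indicators
stats = reliability_bins(conf, hit)
assert len(stats) == 3                    # only three bins are occupied here
```

Plotting mean confidence against accuracy for each occupied bin gives exactly the histogram pairs shown in Figure 4.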

We then define the accuracy of each bin as the percentage of correct predictions in that bin; in this case a prediction is correct when the hottest point on the output heatmap is at exactly the same coordinate as the ground-truth point (at the average position between our expert human annotations).

If the network is perfectly calibrated we would expect the confidence of each bin to be exactly the same as the accuracy for that bin. We display this information in reliability diagrams, which overlay two histograms: one plotting the accuracy of each bin and the second plotting its confidence. These diagrams make it easy to see if the network is underestimating or overestimating the confidence in its predictions for each bin. If it is underestimating, then the accuracy of a bin will be higher than its confidence, so we will see a solid blue piece at the top of the bar for that bin in the diagram. If it is overestimating, we will see a solid lime green piece at the top of the bar for that bin.

From these diagrams we can calculate an Expected Calibration Error (ECE), which is the difference between the accuracy and confidence in each bin, weighted proportionally to how many landmarks are in each bin and then summed together. The lower the ECE, the more calibrated the model. Figure 4b shows the reliability diagram for our model, with its ECE score. Our model achieves an ECE score of 0.70. For comparison we trained a model (also on the cephalometric dataset) using Gaussian heatmaps like in Figure 2a and found that its output heatmaps were poorly calibrated, especially in the 0.09-0.1 bin, as shown in Figure 4a.

Figure 4. On the left is a poorly calibrated model trained on Gaussian heatmaps (σ = 1). On the right is our model trained on one-hot heatmaps (Section 3.2). (a) Reliability graph for a landmark detection model trained with Gaussian heatmaps, like in most existing approaches. We can see that the model is under-confident in the 0.09-0.1 bin. (b) The reliability graph for the confidences of our model after temperature scaling. The model's uncertainties are generally well calibrated.

5.3. Validating The ERE Statistic

We show that our 'Expected Radial Error' score correlates with the 'True Radial Error' in Figure 5, where the True Radial Error is the distance between the predicted landmark location and its ground truth location (average of the 2 clinicians' placements) over Test Set 1. This graph is created by putting each predicted landmark into a bin depending on its ERE score. Each bin has 36 landmarks put into it, such that the 36 landmarks with the lowest ERE scores are placed into one bin, then the next biggest 36 are put into another bin, etc. Once the landmarks are put into the bins we calculate the average ERE score (x axis value) and the average True Radial Error (y axis value) of all landmarks in the bin and plot that data point. We find there is a strong correlation between the ERE and True Radial Error, which means it can be used as a good indicator for how accurate a prediction is likely to be.

Figure 5. Correlation between the ERE statistic and the true radial error for bins of 36 landmarks over the cephalometric Test Set 1. The radial error is the euclidean distance between the predicted and the ground truth point.

5.3.1 Flagging Up Potential Bad Readings Using The ERE Score

The final part of this work is to show that the ERE score (Eq. 4) calculated from our heatmaps has practical utility. We know that there is a high correlation between the ERE score and the potential accuracy of the landmark position (Figure 5), so we hypothesize that if we apply a threshold to the ERE score of a landmark when it is calculated, we can filter out predictions which are likely to be inaccurate or 'erroneous'. This is valuable because the model can flag up when it is unsure of its prediction this way, which is a particularly desirable feature in medical applications where it is better to be cautious.
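The thresholding rule described above can be sketched as follows (a minimal numpy sketch with toy data; the function name is ours, while the 1.59 threshold is the one chosen later in this section):

```python
import numpy as np

def flag_erroneous(ere_scores, threshold=1.59):
    """Flag landmarks whose ERE exceeds the threshold as likely 'erroneous'."""
    return ere_scores > threshold

ere = np.array([0.4, 0.9, 1.2, 2.1, 3.0])         # toy ERE scores
radial_err = np.array([0.8, 1.1, 1.5, 4.6, 5.2])  # toy true radial errors (mm)

flagged = flag_erroneous(ere)
# Split predictions into 'good' and 'erroneous' groups, as in Table 2.
good_mre = radial_err[~flagged].mean()
bad_mre = radial_err[flagged].mean()
assert bad_mre > good_mre   # the flagged group is less accurate on average
```

In a clinical workflow the flagged landmarks would be passed to a human expert for review rather than used directly.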

The experiment we conduct to test this hypothesis consists of the following steps:

1. Apply the model to all images in Test Set 1 to obtain heatmaps representing landmark locations.
2. Take the hottest points of these heatmaps to obtain predicted landmark points and calculate ERE scores for each landmark, like in Figure 3b and 3c.
3. Compare the predicted landmark points to the ground truth points and classify each point as 'good' if its localisation error is < 4mm, or 'erroneous' if its localisation error is > 4mm.
4. Plot a Receiver Operating Characteristic (ROC) curve which describes how well different threshold values applied to the ERE scores can discriminate between 'good' and 'erroneous' predictions.

Our ROC curve is shown in Figure 6. A true positive is when the classifier (thresholding the value of the ERE for a landmark) predicted a localisation error of over 4mm and the predicted point was incorrectly placed by at least 4mm. A false positive is when the classifier predicted a localisation error of over 4mm but the predicted point was within 4mm of the ground truth.

Once we have obtained the ROC curve we can choose what True Positive Rate (TPR) we would like for our application; in our experiment we choose a rate of 0.5, which corresponds to a threshold of 1.59. In other words, if our model outputs a heatmap for a landmark which has an ERE score of over 1.59 (such as the two landmarks shown in Fig. 3b) we classify it as erroneous and 'flag it up'.

When we apply this threshold to Test Set 2 we find that it has classified 1794 landmarks (94%) as 'good' and 106 landmarks (6%) as 'erroneous'. The MRE and SDR statistics for each group are in Table 2. The MREs of the 'good' and 'erroneous' groups are 1.39 and 2.67 respectively, which makes it clear that the 'erroneous' group contains landmarks which have been detected significantly less accurately than the other group, thus showing that thresholding the ERE value to discriminate between accurately and inaccurately predicted landmarks is reasonably effective.

            # land-   MRE     SDR (%)
Set         marks     (mm)    2mm     2.5mm   3mm     4mm
Overall     1900      1.46    74.63   83.58   87.21   93.79
Good        1794      1.39    75.92   84.56   88.18   94.43
Erroneous   106       2.67    52.83   66.98   68.87   81.13

Table 2. Results of our model over Test Set 2, and over Test Set 2 again after it has been split by thresholding the ERE value of each landmark's heatmap (ERE > 1.59 means a landmark is put into the Erroneous group). MRE is the Mean Radial Error and SDR is the Success Detection Rate over all the landmarks in the group, as described in Section 5.1.

Figure 6. The ROC curve created by measuring the TPR and FPR as we increase the ERE threshold above which a landmark is classified as 'erroneous'. The Area Under the Curve is a measure of how well our ERE statistic discriminates between 'good' and 'erroneous' predictions, 1 being the best achievable score.

6. Conclusion

We have shown that formulating the problem of landmark detection as a classification task, by using one-hot heatmaps during training, has several practical advantages. The heatmaps output by the model are more easily interpretable and visually intuitive because they hug the contours of objects; they are more expressive than previous approaches because they can be multi-modal or asymmetric. We have shown that near state of the art localisation performance can be achieved when formulating the problem this way, even with a U-Net architecture.

We then went on to validate the output heatmaps quantitatively, by showing that they were well calibrated using reliability diagrams, and that an Expected Radial Error (ERE) statistic could be calculated from them which is well correlated with the true radial error and thus can be thresholded to create a classifier which can 'flag up' potentially erroneous results.

Our approach to landmark detection should be valuable to the Computer Vision research community.

7
CVPR CVPR
#9240 #9240
CVPR 2022 Submission #9240. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

7. Future Work

We plan to validate the heatmaps produced using this method more thoroughly by comparing them against measurements taken by multiple clinicians. In addition, a current limitation of the approach is that the expressive heatmap must be condensed into a single statistic (in this case ERE) for its classification as a 'good' or 'erroneous' prediction. Instead of taking a single statistic, we would like to take multiple statistics, or even pass the heatmaps into another CNN which could classify more accurately. We will also train more complex deep learning architectures with one-hot heatmaps to improve the localisation accuracy and uncertainty measurements. Experimenting with more datasets will help confirm that our approach can obtain comparable results in other situations.

8. Compliance with Ethical Standards

The aim of this research is to improve the safety and accuracy of existing landmark detection systems. However, when bringing this technology into contact with the general public it is always important to carefully check what data the system is handling and to ensure that the system has a suitable amount of human supervision. It is also important to consider how such a system could be broken, for example via an adversarial attack, and to take measures to prevent that from happening.

This research study was conducted using human subject data from publicly available sources, which are known to have had ethical approval for the use of the data in research. There are no conflicts of interest to declare.

References

[1] Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. LUVLi face alignment: Estimating landmarks' location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8236-8246, 2020.

[2] Jeong-Hoon Lee, Hee-Jin Yu, Min-ji Kim, Jin-Woo Kim, and Jongeun Choi. Automated cephalometric landmark detection with confidence regions using Bayesian convolutional neural networks. BMC Oral Health, 20(1):1-10, 2020.

[3] Ewa Magdalena Nowara, Tim K Marks, Hassan Mansour, and Ashok Veeraraghavan. SparsePPG: Towards driver monitoring using camera-based vital signs estimation in near-infrared. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1272-1281, 2018.

[4] Florian Kordon, Peter Fischer, Maxim Privalov, Benedict Swartman, Marc Schnetzke, Jochen Franke, Ruxandra Lasowski, Andreas Maier, and Holger Kunze. Multi-task localization and segmentation for X-ray guided planning in knee surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 622-630. Springer, 2019.

[5] Jingru Yi, Pengxiang Wu, Qiaoying Huang, Hui Qu, and Dimitris N Metaxas. Vertebra-focused landmark detection for scoliosis assessment. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 736-740. IEEE, 2020.

[6] Martin Urschler, Christopher Zach, Hendrik Ditt, and Horst Bischof. Automatic point landmark matching for regularizing nonlinear intensity registration: Application to thoracic CT images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 710-717. Springer, 2006.

[7] Ching-Wei Wang, Cheng-Ta Huang, Meng-Che Hsieh, Chung-Hsing Li, Sheng-Wei Chang, Wei-Cheng Li, Rémy Vandaele, Raphaël Marée, Sébastien Jodogne, Pierre Geurts, et al. Evaluation and comparison of anatomical landmark detection methods for cephalometric X-ray images: a grand challenge. IEEE Transactions on Medical Imaging, 34(9):1890-1900, 2015.

[8] Minmin Zeng, Zhenlei Yan, Shuai Liu, Yanheng Zhou, and Lixin Qiu. Cascaded convolutional networks for automatic cephalometric landmark detection. Medical Image Analysis, 68:101904, 2021.

[9] Zhusi Zhong, Jie Li, Zhenxi Zhang, Zhicheng Jiao, and Xinbo Gao. An attention-guided deep regression model for landmark detection in cephalograms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 540-548. Springer, 2019.

[10] Bulat Ibragimov, Boštjan Likar, F Pernus, and Tomaž Vrtovec. Automatic cephalometric X-ray landmark detection by applying game theory and random forests. In Proc. ISBI Int. Symp. on Biomedical Imaging, pages 1-8, 2014.

[11] Claudia Lindner, Paul A Bromiley, Mircea C Ionita, and Tim F Cootes. Robust and accurate shape model matching using random forest regression-voting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1862-1874, 2014.

[12] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. Advances in Neural Information Processing Systems, 27:1799-1807, 2014.

[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.

[14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[15] Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Medical Image Analysis, 54:207-219, 2019.

[16] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.

[17] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693-5703, 2019.

[18] Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian Wirkert, et al. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. In Bildverarbeitung für die Medizin 2019, pages 22-22. Springer, 2019.

[19] Qingsong Yao, Zecheng He, Hu Han, and S. Kevin Zhou. Miss the point: Targeted adversarial attack on multiple landmark detection, 2020.

[20] Heqin Zhu, Qingsong Yao, Li Xiao, and S Kevin Zhou. You only learn once: Universal anatomical landmark detection. arXiv preprint arXiv:2103.04657, 2021.

[21] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[23] Ching-Wei Wang, Cheng-Ta Huang, Jia-Hong Lee, Chung-Hsing Li, Sheng-Wei Chang, Ming-Jhih Siao, Tat-Ming Lai, Bulat Ibragimov, Tomaž Vrtovec, Olaf Ronneberger, et al. A benchmark for comparison of dental radiography analysis algorithms. Medical Image Analysis, 31:63-76, 2016.

[24] Sercan Ö Arik, Bulat Ibragimov, and Lei Xing. Fully automated quantitative cephalometry using convolutional neural networks. Journal of Medical Imaging, 4(1):014501, 2017.

[25] Runnan Chen, Yuexin Ma, Nenglun Chen, Daniel Lee, and Wenping Wang. Cephalometric landmark detection by attentive feature pyramid fusion and regression-voting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 873-881. Springer, 2019.
