
Article
Towards Automation in Radiotherapy Planning: A Deep Learning
Approach for the Delineation of Parotid Glands in Head and
Neck Cancer
Ioannis Kakkos 1, *,† , Theodoros P. Vagenas 1,† , Anna Zygogianni 2 and George K. Matsopoulos 1

1 Biomedical Engineering Laboratory, National Technical University of Athens, 15773 Athens, Greece;
tpvagenas@biomed.ntua.gr (T.P.V.); gmatsopoulos@biomed.ntua.gr (G.K.M.)
2 Radiation Oncology Unit, 1st Department of Radiology, ARETAIEION University Hospital,
11528 Athens, Greece; annazygo1@yahoo.gr
* Correspondence: ikakkos@biomed.ntua.gr; Tel.: +30-210772282
† These authors contributed equally to this work.

Abstract: The delineation of parotid glands in head and neck (HN) carcinoma is critical to assess radiotherapy (RT) planning. Segmentation processes ensure precise target position and treatment precision, facilitate monitoring of anatomical changes, enable plan adaptation, and enhance overall patient safety. In this context, artificial intelligence (AI) and deep learning (DL) have proven exceedingly effective in precisely outlining tumor tissues and, by extension, the organs at risk. This paper introduces a DL framework using the AttentionUNet neural network for automatic parotid gland segmentation in HN cancer. Extensive evaluation of the model is performed on two public and one private dataset, while segmentation accuracy is compared with other state-of-the-art DL segmentation schemas. To assess replanning necessity during treatment, an additional registration method is implemented on the segmentation output, aligning images of different modalities (Computed Tomography (CT) and Cone Beam CT (CBCT)). AttentionUNet outperforms similar DL methods (Dice Similarity Coefficient: 82.65% ± 1.03, Hausdorff Distance: 6.24 mm ± 2.47), confirming its effectiveness. Moreover, the subsequent registration procedure displays increased similarity, providing insights into the effects of RT procedures for treatment planning adaptations. The implementation of the proposed methods indicates the effectiveness of DL not only for automatic delineation of the anatomical structures, but also for the provision of information for adaptive RT support.

Keywords: head and neck cancer; CT; parotid glands; artificial intelligence; deep learning; radiation therapy; segmentation; registration

1. Introduction

Radiation therapy (RT) stands as the primary treatment for head and neck (HN) cancer, either utilized independently or in conjunction with adjuvant treatments such as surgery and/or chemotherapy [1]. Despite its effectiveness in cancer treatment, RT is not without challenges, as even the partial irradiation of healthy cells can inadvertently damage normal tissues, referred to as organs at risk (OARs), leading to complications [2,3]. Typically, physicians manually determine tumor and OAR boundaries on Computed Tomography (CT) images. However, this is a time-consuming and labor-intensive process (since radiologists have to work slice by slice), making any adjustments in treatment planning particularly challenging [4]. This is further exacerbated by the repeated and swift adaptations required to the treatment plan, particularly in cases that present 20–30% tumor-related volumetric changes between RT sessions [5,6]. On this premise, adaptive RT necessitates frequent updates of OARs and target volumes during treatment, making it imperative to delineate these structures quickly and efficiently.
The advent of artificial intelligence (AI) and machine learning has revolutionized the
assessment of RT treatment, ranging from computer-aided detection and diagnosis systems
to predicting treatment alteration requirements [7,8]. However, many studies still involve
manual segmentation of parotid glands combined with machine learning for predicting the
necessity of RT replanning, which could be further automated [9–11].
Specifically, in automatic segmentation, machine learning has proven very effective in
contouring OARs within a reasonable timeframe, enhancing the safety and efficiency of
treatments [12–14]. Despite this, producing atlases for the segmentation task can reduce the model's ability to generalize to images from different patients, since a single atlas may not represent all the variations arising from different anatomies and acquisition protocols [15]. Recent advancements in deep learning (DL)
have shown superior performance in auto-segmentation, offering better feature extraction,
capturing local relationships, reducing observer variation, and minimizing contouring
time [16–19]. However, these studies suffer from limitations that can impede their usage
and applicability. Although interactive segmentation can reduce time and errors compared
to manual segmentation, fully automatic segmentation can achieve both at a greater level.
In addition, 2D or slice-based segmentation cannot capture the 3D anatomical features,
which are essential to enhance the DL model’s accuracy. The scarcity of labeled data is
another challenge. DL models typically require a large amount of annotated data for
training, and obtaining a comprehensive dataset with accurately annotated parotid glands
in diverse clinical scenarios can be challenging. Moreover, the potential for class imbalance,
where certain anatomical structures or pathologies are underrepresented in the dataset,
may affect the model’s performance, leading to biased results [20].
Several recent studies have evaluated DL-based methods for auto-segmentation, show-
ing promising results. For instance, Gibbons et al. [21] evaluated the accuracy of atlas-based
and DL-based auto-segmentation methods for critical OARs in HN, thoracic, and pelvic
cancer. Their results showed significant improvements in DL-based segmentation com-
pared to atlas-based segmentation both on Dice Similarity Coefficient (DSC) and Hausdorff
Distance (HD) metrics, while contour adjustment time was significantly improved com-
pared to manual and atlas-based segmentation. In a similar manner, a recent study [22]
compared the automatically generated OAR delineations acquired from a deep learning
auto-segmentation software to manual contours in the HN region created by experts in
the field. The software provided good delineation of OAR volumes, with most structures
reporting DSC values higher than 0.85, while the HD values indicated a good agreement
between the automatic and manual contours. Furthermore, Chen et al. [23] employed
a WBNet automatic segmentation method and compared it with three additional algo-
rithms, utilizing a large number of whole-body CT images. They reported an average DSC
greater than or equal to 0.80 for 39 out of 50 OARs, outperforming the other algorithms.
Furthermore, Zhan et al. [24] utilized a Weaving Attention U-net deep learning approach
for multi-organ segmentation in HN CT images. Their method integrated convolutional
neural networks (CNNs) and a self-attention mechanism and was compared against other
state-of-the-art segmentation algorithms, showing superior or similar performance. As
such, it was able to generate contours that closely resembled the ground truth for ten
OARs, with DSC scores ranging from 0.73 to 0.95. Taking the above into consideration, it is
evident that DL-based segmentation methods have the potential to enhance RT treatment
for HN cancer patients by improving the precision of organ at risk delineation. However,
it should be noted that DL algorithms are data-driven methods whose results can be in-
fluenced by sample imbalance, variance among different patients and anatomical regions,
and observer bias. To address these issues, recent studies apply interconnected networks and employ larger datasets from different hospitals to ensure effectiveness across different
patient populations [25–27].

Our study builds on these advancements, employing an AttentionUNet-based DL segmentation method for HN CT images. This method focuses on targeted structures, sup-
pressing non-related structures to reduce false positives. We conducted a comprehensive
evaluation on three datasets, including two public and one private dataset, comparing
our method with state-of-the-art DL frameworks for automatic segmentation. The results
indicate the high performance of our automatic segmentation approach, outperforming
other DL methods significantly. As an extension of our segmentation framework, we also
investigated the efficiency of our model in RT replanning based on volumetric changes in
the parotid gland. Our registration framework, aligning the planning CT with the Cone
Beam CT (CBCT) scans from different weeks, demonstrated the validity of the registration
schema. The mean volumetric percentage difference in parotid gland volume between the first
and fifth week of treatment sessions, when comparing automated registration with the
ground truth, showed very low values, confirming the effectiveness of our framework in
cases requiring replanning during RT or between RT sessions. Overall, the contributions of
this study can be summarized as follows:
• A fast and automatic DL segmentation method based on the AttentionUNet is utilized
for the segmentation of the parotid glands from HN CT images. The AttentionUNet-
based method achieves both a reduction in false positive regions and the enhancement
of the parotid glands’ segmentation.
• An extensive evaluation of the segmentation method was performed on two public
datasets and one private dataset with varying acquisition parameters. Both qualitative
and quantitative results support the model’s ability to generalize effectively in images
with anatomical and imaging variability.
• A registration-based framework was designed and implemented to transfer the seg-
mentation masks produced by the AttentionUNet model to the CBCT images of the
following weeks. The results indicate the effectiveness of our framework in cases
where replanning is required during or between RT sessions.
The subsequent sections of the paper are organized as follows. Section 2 provides a
comprehensive overview of the datasets and the DL methodological framework for seg-
mentation and registration. In Section 3, the results of the proposed schema are separately
presented for segmentation and registration. The implications of our methods and results
are discussed in Section 4. The entire paper is summarized in Section 5.

2. Materials and Methods


2.1. Datasets
2.1.1. Public Dataset
Two public datasets were combined to evaluate the segmentation accuracy of the
parotid glands. The first dataset is the PDDCA dataset [28], which consists of 48 head and
neck CT images from the Radiation Therapy Oncology Group (RTOG). Images are accom-
panied by manual segmentation masks from several organs at risk (brainstem, mandible,
chiasm, bilateral optic nerves, bilateral parotid glands, and bilateral submandibular glands).
From these masks, we utilized only the two parotid glands, which are the target of this
investigation. The reconstruction matrix for the data was 512 × 512 pixels with in-plane
spacing from 0.76 × 0.76 to 1.27 × 1.27 mm. The spacing of the z dimension ranged from
1.25 mm to 3 mm and the number of slices ranged between 110 and 190.
The second dataset is HaN-Seg, the HN OAR CT and MR Segmentation Dataset [29],
with 42 head and neck CT and MRI images from patients included in the publicly avail-
able training set (set 1). The dataset includes masks of 30 OARs delineated from the CT
images, where we selected the parotid glands. CT scans were conducted using either the
Philips Brilliance Big Bore from Philips Healthcare, Netherlands or the Siemens Somatom
Definition AS scanner from Siemens Healthineers, Germany. Standard clinical protocols
were employed, with the X-ray tube voltage set between 100 and 140 kV and the field of view (FoV) ranging from 500 × 500 to 800 × 800 mm². The matrix size ranged between 512 × 512 and 1024 × 1024, while the number of slices varied from 116 to 323. In-plane spacing ranged from 0.52 to 1.56 mm and z-dimension spacing from 2.0 to 3.0 mm.
Both datasets included images with different spacing values, a fact that introduced
image variability and made the subsequent segmentation more challenging. On this
premise, the two datasets were unified and a 5-fold cross-validation technique was applied.
During the 5-fold cross-validation, we ensured that in each fold, samples from both datasets
were included to test the model's effectiveness on heterogeneous images regarding size and
number of slices.

2.1.2. Private Dataset


The private dataset comprised planning CT (pCT) image data, collected from
23 patients (14 male) with squamous cell carcinoma of the HN from the ATTIKON Univer-
sity Hospital, Athens (Protocol EBD304/11-05-2022, 7 June 2022). Each pCT image included
87 to 197 slices with a size of 512 × 512 pixels for each slice and a 3 mm interslice distance.
The private dataset also included images with a variable number of slices.

2.2. Data Preprocessing


Prior to the DL model application, a series of preprocessing steps were performed
in order to reduce the effect of variability of CT images in the datasets and enhance the
neural network’s training progress. In order to transform all images into a common spatial
setting, images were resampled to a common spacing, where the in-plane spacing was set
to 0.87 mm × 0.87 mm and the spacing in z dimension to 3 mm. The spacing values were
selected by considering the mean and median spacing values. The previously mentioned
resampling was also utilized, since both the in-plane spacing and the interslice spacing
were variable across the images of the datasets. Linear and nearest neighbors’ interpolation
methods were applied to images and label masks, respectively. Normalization and clipping
were applied to reduce outlier values and assist the network to be trained. Hounsfield
units (HU) in the CT images were rescaled and clipped to the range (−160, 240) HU,
which is more suitable for the detection of the parotids corresponding to soft tissues and
neighboring areas [30].
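A minimal sketch of this preprocessing chain, assuming MONAI dictionary transforms and placeholder file names (illustrative only, not the exact training pipeline), is shown below.

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Spacingd, ScaleIntensityRanged
)

# Illustrative preprocessing chain; file names are placeholders.
preprocess = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    # Resample to the common spacing (0.87 x 0.87 x 3 mm): linear interpolation
    # for the CT image, nearest neighbors for the label masks.
    Spacingd(keys=["image", "label"], pixdim=(0.87, 0.87, 3.0),
             mode=("bilinear", "nearest")),
    # Clip Hounsfield units to the soft-tissue window (-160, 240) HU and
    # rescale the result to [0, 1].
    ScaleIntensityRanged(keys=["image"], a_min=-160, a_max=240,
                         b_min=0.0, b_max=1.0, clip=True),
])

sample = preprocess({"image": "ct_volume.nii.gz", "label": "parotid_masks.nii.gz"})
```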

2.3. Segmentation Model


The main neural network architecture of the segmentation implemented for the delin-
eation of the parotid glands from the pCT images was the AttentionUNet model [31]. The
AttentionUNet-based architecture follows a UNet-like structure with the attention gates at
the skip connections. Below, the attention gate and the main architecture are presented.

2.3.1. Attention Gate


The attention gate enables the network to focus only on the most important structures
for the target task. It can also reduce false positive predictions and improve accuracy [31].
To achieve this, feature maps from the encoder path pass through skip connections, which
include attention gates (Figure 1).
In the attention gate block, both the gating signal (g) and input features (x) are linearly
transformed by a convolutional layer with a kernel size of 1 (which is performed in a
channel-wise manner). In this way, aligned weights are amplified and unaligned weights
are reduced. The summation of the outputs of the two convolutions passes through a ReLU
activation function. Subsequently, the signal is linearly transformed by a new convolution
with a kernel size of 1. The following sigmoid layer is used to normalize the output, and
consequently the weights, and assist convergence. Finally, a grid resampling technique
using trilinear interpolation is applied to produce attention coefficients based on the spatial
information of each point in the grid. In this procedure, attention aggregates information
from multiple scales. The output of the attention gate is calculated by the product of the
input features (x) and the attention coefficients.
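A simplified 3D attention gate consistent with this description can be sketched in PyTorch as follows (an illustrative implementation with the channel sizes left as arguments, not the exact code used in this work).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate3D(nn.Module):
    """Additive attention gate (illustrative sketch): 1x1x1 convolutions on the
    skip features x and the gating signal g, ReLU on their sum, a 1x1x1
    convolution followed by a sigmoid to produce attention coefficients, and
    trilinear resampling to the spatial grid of x. The output is x scaled by
    the coefficients, suppressing regions unrelated to the target structures."""

    def __init__(self, x_channels: int, g_channels: int, inter_channels: int):
        super().__init__()
        self.theta_x = nn.Conv3d(x_channels, inter_channels, kernel_size=1)
        self.phi_g = nn.Conv3d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv3d(inter_channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        theta = self.theta_x(x)
        phi = F.interpolate(self.phi_g(g), size=theta.shape[2:],
                            mode="trilinear", align_corners=False)
        alpha = torch.sigmoid(self.psi(F.relu(theta + phi)))  # attention coefficients
        return x * alpha


# Example: gating a 32-channel skip connection with a coarser 64-channel signal.
gate = AttentionGate3D(x_channels=32, g_channels=64, inter_channels=16)
x = torch.randn(1, 32, 48, 48, 48)
g = torch.randn(1, 64, 24, 24, 24)
out = gate(x, g)   # same shape as x: (1, 32, 48, 48, 48)
```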
Figure 1. The architecture of the AttentionUNet-based method. The attention gate block is presented in the bottom right.

2.3.2. AttentionUNet Architecture

The proposed neural network architecture follows the UNet structure, consisting of an encoder, a bottleneck layer, and a decoder path (Figure 1). In the encoder path, each block includes two successive convolutional blocks and a downsampling layer in order to reduce the dimensions of the feature maps by 2 in each step. Each convolutional block comprises a convolutional layer with the corresponding number of filters, a batch normalization layer, and a ReLU activation function. Batch normalization standardizes the output of the convolutional layers, by ensuring that the mean and standard deviation are constant across mini-batches, and makes training more stable. The downsampling layer refers to a max pooling operation with a stride of 2, aiming to reduce the spatial information (and thus the memory requirements), while extracting the most important context information. After each downsampling step, the number of filters is doubled. The bottleneck layer also consists of two consecutive convolutional blocks. In our current design, the encoder layers include convolutions with 16, 32, 64, and 128 filters, respectively. The convolutional blocks of the bottleneck have 256 filters. The bottleneck layer constitutes the middle layer, which connects the encoder path with the decoder. The process of compression (which takes place in the encoder) is completed in the bottleneck layer, where the high-level features extracted from the input image capture the contextual information essential for segmentation. Across the encoder's layers, the context information to be extracted is increased, while the spatial information becomes coarser. This is achieved by the application of sequential convolutional blocks, enlarging the effective receptive field.

On the contrary, the decoder path aims at reconstructing the output from the input feature maps. In this regard, at the end of each decoder layer, an upsampling technique (based on transpose convolutions) is applied. Transpose convolutions enable the upsampling of the feature maps from the previous layer of the decoder by inserting trainable parameters. In consensus with the common UNet structure, skip connections are also utilized to transfer information from the encoder to the decoder. However, attention gates are inserted into the skip connections to suppress irrelevant feature responses and extract the most important regions. By applying the attention gates before concatenation, the network focuses only on the most useful activations. In each decoder block, the concatenation of the output of the attention gate in the skip connection and the upsampled output of the previous layer is followed by a simple convolutional layer to merge the feature maps and halve the number of feature maps. In the final layer, a convolutional layer with the number of filters equal to the number of classes and a softmax layer to extract the predictions for the segmentation are implemented. The details on the size of the feature maps across the proposed architecture are presented in Tables 1 and 2.

Table 1. Architecture sizes of the encoder and the bottleneck layers.

Layer Blocks Input Size Output Size


Encoder Block 1 [1, 96, 96, 96] [16, 96, 96, 96]
Encoder Block 2 [16, 96, 96, 96] [32, 48, 48, 48]
Encoder Block 3 [32, 48, 48, 48] [64, 24, 24, 24]
Encoder Block 4 [64, 24, 24, 24] [128, 12, 12, 12]
Encoder Block 5 [128, 12, 12, 12] [256, 6, 6, 6]

Table 2. Architecture sizes of the decoder layers.

Layer Blocks Input Size Output Size


Decoder Block 1 [256, 6, 6, 6] [128, 12, 12, 12]
Decoder Block 2 [128, 12, 12, 12] [64, 24, 24, 24]
Decoder Block 3 [64, 24, 24, 24] [32, 48, 48, 48]
Decoder Block 4 [32, 48, 48, 48] [16, 96, 96, 96]
Decoder Block 5 [16, 96, 96, 96] [3, 96, 96, 96]

In Table 1, the input and output sizes of the main blocks of the encoder path are
presented in detail. The example input corresponds to the size of windows utilized during
this study, which was [96, 96, 96]. As previously mentioned, in each block of the encoder,
the dimensions of the feature maps are downsampled by 2 and concurrently the number of
features is doubled. In the decoder path, the input and output sizes are omitted (since they
are the same as the encoder but in reverse order).
In Table 2, the input and output sizes of the main blocks of the decoder are pre-
sented. In each one of the first 4 blocks of the decoder, the feature maps are upsampled by
2 and the number of filters is reduced by 2. The last layer utilizes 3 filters for the 3 classes
(background, left parotid gland, right parotid gland).
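For reference, one possible instantiation of an AttentionUNet with the above channel configuration through the MONAI library is sketched below (illustrative code; the (96, 96, 96) window matches the training crops described in Section 2.5).

```python
import torch
from monai.networks.nets import AttentionUnet

# Illustrative instantiation with the filter sizes described above.
model = AttentionUnet(
    spatial_dims=3,
    in_channels=1,                      # single CT channel
    out_channels=3,                     # background, left parotid, right parotid
    channels=(16, 32, 64, 128, 256),    # encoder filters; 256 in the bottleneck
    strides=(2, 2, 2, 2),               # downsample by 2 at each encoder step
)

window = torch.randn(1, 1, 96, 96, 96)  # one (96, 96, 96) training window
logits = model(window)                  # -> (1, 3, 96, 96, 96); softmax applied at inference
```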

2.3.3. Evaluation Scheme and Loss Function


To assess the model’s performance, a 5-fold cross-validation methodology was em-
ployed on the two created datasets, the public and the private one. For the unified public
dataset (consisting of the two public subsets), the folds were created by proportionally
splitting the images from the two subsets in order to include images from both of them in
the training and validation sets. For ground truth, the segmentation masks from parotid
glands (annotated manually by experts) were utilized. For the segmentation task, we used
masks with three labels, where the values 1 and 2 correspond to the left and right parotid
and the value 0 to the background. In the evaluation, the metrics were calculated for the united parotid glands' masks (both glands considered together) to provide a clearer overall picture.
As such, the segmentation methodology evaluation included the Dice Similarity
Coefficient (DSC), Hausdorff Distance (HD) metric, Recall, and Precision [32,33]. Briefly,
the DSC coefficient quantifies the overlap between the ground truth mask and the predicted
mask, while HD measures the spatial distance between the ground truth and the produced
segmentation mask and provides an evaluation for the boundary delineation accuracy. For
the segmentation evaluation, the mean value of DSC, Precision, and Recall was estimated,
while for HD, the 95th percentile was calculated to reduce the influence of the outlier values.
The formulas employed for the calculation of DSC, HD, Recall, and Precision are presented
in Appendix A.
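As an illustration, these metrics can be computed with the MONAI metric classes as sketched below (toy tensors stand in for real predictions and expert masks; the 95th-percentile HD is reported in millimeters once image spacing is accounted for).

```python
import torch
from monai.metrics import DiceMetric, HausdorffDistanceMetric

dice_metric = DiceMetric(include_background=False, reduction="mean")
hd95_metric = HausdorffDistanceMetric(include_background=False, percentile=95)

# Toy one-hot volumes of shape (batch, classes, D, H, W); in practice these come
# from the model predictions and the expert masks (classes: background, left, right).
y_pred = torch.zeros(1, 3, 32, 32, 32)
y_true = torch.zeros(1, 3, 32, 32, 32)
y_pred[:, 1, 8:16, 8:16, 8:16] = 1
y_true[:, 1, 9:17, 9:17, 9:17] = 1
y_pred[:, 2, 20:28, 8:16, 8:16] = 1
y_true[:, 2, 20:28, 9:17, 9:17] = 1

dice_metric(y_pred=y_pred, y=y_true)
hd95_metric(y_pred=y_pred, y=y_true)
print("mean DSC:", dice_metric.aggregate().item())
print("HD95 (voxel units in this toy example):", hd95_metric.aggregate().item())
```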
The utilized loss function consists of the weighted sum of the cross-entropy and the Dice loss, defined as 1 − DSC. Cross-entropy quantifies the
difference between two probability distributions for a given random variable. For the
segmentation task, cross entropy is calculated in a pixel-wise manner and the average
value is used as the result [34]. In this framework, cross-entropy enhances the training,
especially in the first steps when DSC is not able to provide adequate information. In
addition, because the two classes, referring to the left and right parotids, are almost equally
represented with a similar number of voxels through the image, the simple form of cross-
entropy without additional weights is utilized. The final loss is calculated through the sum
of the two previously mentioned losses with equal weights of 1. The calculation formulas
for the loss function are also presented in Appendix A.
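A compact way to express this combined loss is sketched below, using MONAI's DiceCELoss as a stand-in for the described sum of the Dice loss and cross-entropy with equal weights.

```python
import torch
from monai.losses import DiceCELoss

# Equal-weight sum of (1 - DSC) and pixel-wise cross-entropy.
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True, lambda_dice=1.0, lambda_ce=1.0)

logits = torch.randn(2, 3, 96, 96, 96, requires_grad=True)  # network output
labels = torch.randint(0, 3, (2, 1, 96, 96, 96))             # 0 background, 1 left, 2 right
loss = loss_fn(logits, labels)
loss.backward()
```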

2.4. Registration Expansion Framework


To further promote the clinical impact of this study, the DL segmentation method was
expanded on a registration procedure based on the premise that significant anatomical
alterations during RT necessitate treatment planning adaptation [5,6]. As such, we opted
for automatic calculation of the parotid volume in the follow-up CBCT scans, utilizing
the segmented masks from the pCT. To achieve this, a registration between the pCT as a
moving image and the corresponding CBCT image as a target image was implemented.

2.4.1. Registration Dataset


From the collected private dataset, additional low-dosage Cone Beam CT (CBCT)
scans were acquired for 16 out of 23 patients over a span of 6 weeks. Each of the 16 patients
underwent volumetric modulated arc therapy (5 sessions per week) with a 66 Gy dose
delivered in 30 fractions.

2.4.2. Registration Preprocessing


To achieve an accurate transformation of the initial segmentation masks to the CBCT
of the target week, the following methodology was implemented. As such, due to the
large difference in the acquisition parameters and intensities between the pCT and the
CBCTs, we initially applied a registration step between the pCT and the CBCT1 (week 1) to
move the labels in the CBCT1 space. Then, CBCT1 was used as the moving image to be
transformed for the next week (week 5), allowing us to calculate the volumetric differences.
In this regard, CBCT1 was registered to CBCT5 in the experiments where CBCT5 was
used as the fixed (target) image. The only difference concerns the interpolation method for
the segmentation masks, which was changed to nearest neighbors on account of existing
labeling. As a preprocessing step, histogram matching was applied to all the pairs to
be registered, thereby reducing the difference between the intensity distributions of the
different image modalities. By minimizing the differences in the intensity distribution, the
convergence of the registration was enhanced, especially in cases where images exhibit
variations in illumination and contrast (CT-CBCT). Affine registration was utilized to align
and match the pair of pCT-CBCT1 and CBCT1-CBCT5 by employing a combination of
translation, rotation, scaling, and shearing transformations. This allowed us to further
minimize the discrepancies between the corresponding points in the two images and reduce
the burden for the subsequent deformable registration step.
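A sketch of the histogram matching step with SimpleITK is given below (placeholder file names and illustrative parameter values).

```python
import SimpleITK as sitk

# Placeholder file names; parameter values are illustrative.
fixed = sitk.ReadImage("cbct_week1.nii.gz", sitk.sitkFloat32)
moving = sitk.ReadImage("planning_ct.nii.gz", sitk.sitkFloat32)

matcher = sitk.HistogramMatchingImageFilter()
matcher.SetNumberOfHistogramLevels(1024)
matcher.SetNumberOfMatchPoints(7)
matcher.ThresholdAtMeanIntensityOn()        # ignore low-intensity background
moving_matched = matcher.Execute(moving, fixed)
```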

2.4.3. Registration Procedures


As mentioned above, registration included alignment and matching of pCT-CBCT1
and CBCT1-CBCT5 in distinct procedures. For the registration of pCT to CBCT1 images, the
Advanced Normalization Tools (ANTs) registration library was utilized [35]. Registration
parameters are presented in Table 3. Figure 2 presents the optimization procedure employed
for the affine image registration. In detail, firstly, the affine transformation matrix is
initialized. Then, the moving image is warped iteratively (using the current transformation)
and the similarity metric (mutual information) between the fixed and warped moving
images is calculated. An optimizer (regular step gradient descent) is used to update the
transformation parameters to maximize this similarity metric (or minimize an error loss).
This repetition continues until the convergence criteria are met, i.e., until the maximum
number of iterations is reached (200 iterations) or the similarity metric converges. Finally,
the transformation matrix and the warped moving image were generated, to be used as
input for the deformable registration.
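For illustration, a comparable pCT-to-CBCT1 registration can be sketched with the ANTsPy interface as follows (placeholder file names; the preset "SyN" transform chains rigid, affine, and symmetric normalization stages, broadly matching the parameters of Table 3).

```python
import ants

# Placeholder file names; 'SyN' chains rigid + affine + deformable stages.
fixed = ants.image_read("cbct_week1.nii.gz")
moving = ants.image_read("planning_ct.nii.gz")

reg = ants.registration(fixed=fixed, moving=moving,
                        type_of_transform="SyN",
                        aff_metric="mattes",   # MI-based metric for the linear stages
                        syn_metric="CC")       # cross-correlation for the SyN stage

# Transfer the parotid segmentation masks to the CBCT1 space with nearest
# neighbor interpolation so that the label values are preserved.
labels = ants.image_read("parotid_masks_pct.nii.gz")
labels_cbct1 = ants.apply_transforms(fixed=fixed, moving=labels,
                                     transformlist=reg["fwdtransforms"],
                                     interpolator="nearestNeighbor")
```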
Table 3. ANT registration parameters.

Parameter Value
Interpolation linear
Intensity clipping [0.005, 0.995]
Histogram-matching Yes
Initial transform Rigid [0.1]
Initial transform metric MI (32 bins)
Iterations 50 × 50 × 25 × 10
Minimum convergence 1.00 × 10−6
Second transform Affine [0.1]
Second transform metric MI (32 bins)
Iterations 100 × 70 × 50 × 20
Transform SyN
Transform metric Cross-Correlation
Iterations 100 × 70 × 50 × 20

Figure 2. Affine registration flowchart implemented for the pCT to CBCT1 and CBCT1 to CBCT5 steps.

Regarding the different transformations, an initial rigid registration was utilized to capture potential rotations and translations in the pairs. Next, an affine registration step was implemented assessing global transformations (such as translation, rotation, scaling, and shearing). For both transformations, Mutual Information (MI) was used as the metric to be maximized during the optimization procedure, while a regularization value of 0.1 was selected to control the elastic deformation and preserve the smoothness of the structures. These procedures allowed an approximate alignment between the pairs. Following this, a deformable step based on normalized symmetric demons was employed, detecting more complex and local deformations. In the affine part of the registration, a random selection procedure for the tie points was employed, by including 10% of the total points of the fixed image. The same procedure was also used for the next step of the registration but at 25% of the total points. The denser sampling was used to improve the registration accuracy by including more details in the MI calculation (utilizing 32 bins).

For the registration of CBCT1 to CBCT5 images, a deformable registration methodology based on the fast symmetric demons registration from the SimpleITK library [36,37] was utilized for the initial affine alignment. The rationale behind this was to include correspondences as a hidden variable in the registration procedure. In this procedure, the Demons algorithm was used as global energy optimization, with the following regularization criterion employed as a prior to the smoothness of the transformation [37]. In this
manner, flexibility in the overall representation is allowed, permitting a small number of errors rather than demanding optimal precision of the point correspondences between image pixels (a vector field c). For the Symmetric Forces Demons algorithm, the standard
deviation value was set to 3. The main equations for the Demons algorithm (for global
energy and similarity) are presented in Appendix B.
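A sketch of such a deformable step with SimpleITK's fast symmetric forces Demons filter is shown below (placeholder file names; the smoothing standard deviation of 2 and the 600 iterations follow the implementation details of Section 2.5).

```python
import SimpleITK as sitk

fixed = sitk.ReadImage("cbct_week5.nii.gz", sitk.sitkFloat32)
moving = sitk.ReadImage("cbct_week1_matched.nii.gz", sitk.sitkFloat32)

demons = sitk.FastSymmetricForcesDemonsRegistrationFilter()
demons.SetNumberOfIterations(600)
demons.SetStandardDeviations(2.0)            # Gaussian smoothing of the field
displacement_field = demons.Execute(fixed, moving)

# Warp the moving image; with sitk.sitkNearestNeighbor the same transform can
# be applied to the propagated parotid masks.
transform = sitk.DisplacementFieldTransform(displacement_field)
moved = sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)
```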
In this work, the fast symmetric demons registration method [37] was implemented to
evaluate the applicability of our segmentation method to the correct identification of volume
changes in parotids and consequently to RT replanning necessity. To assess the registration
framework efficiency, different metrics were calculated, including Mean Squared Error
(MSE), Mutual Information (MI), and Normalized Cross Correlation (NCC) [38]. MSE
indicates an error that should be reduced after registration, while MI and NCC constitute
a measure of similarity that should be increased after a successful registration step. The
calculation formulas are briefly presented in Appendix B. Furthermore, the volumetric
percentage difference between the 1st and the 5th week of treatment sessions of the CBCT
registration outputs was estimated and compared with the ground truth (i.e., the volumetric
percentage difference of the CBCT1/CBCT5 images, annotated by experts). As such, the
Mean Absolute Error (MAE) (used to measure the average absolute differences between
predicted values and the actual values), the Root Mean Squared Error (RMSE) (used to
measure the average magnitude of errors between predicted values and actual values),
and the Pearson Correlation between the volumetric differences of the prediction and the
ground truth were calculated to measure the strength of their relation.
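These agreement measures can be computed directly from the per-patient volumetric percentage differences, as sketched below with toy numbers in place of the study data.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy per-patient volumetric percentage differences (CBCT1 vs. CBCT5);
# the first array stands for the registration output, the second for the
# expert annotations.
pred_diff = np.array([-8.1, -12.4, -5.0, -15.3, -9.7])
true_diff = np.array([-6.5, -14.0, -4.2, -17.1, -10.2])

mae = np.mean(np.abs(pred_diff - true_diff))
rmse = np.sqrt(np.mean((pred_diff - true_diff) ** 2))
pcc, p_value = pearsonr(pred_diff, true_diff)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, PCC = {pcc:.3f} (p = {p_value:.3f})")
```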

2.5. Implementation Details


The proposed segmentation framework was implemented using PyTorch and the
MONAI Python library [39]. The training and evaluation experiments were carried out on
a desktop computer with 64 GB of RAM and an Nvidia 4090 GPU with 24 GB of VRAM.
During the training, cropped boxes with dimensions (96, 96, 96) were extracted from
the images by randomly selecting centers inside or outside the provided masks with
probabilities of 80% and 20%, respectively. For the network’s training, the Adam optimizer
with a learning rate of 0.0001 was applied. From each volume, four boxes were extracted
each time. By utilizing smaller boxes throughout the training, GPU memory requirements
were substantially reduced, and the training was performed by effectively exploiting the
available GPU VRAM. For the registration algorithms, the SimpleITK [36,37] and ANTs [35]
libraries were utilized, whereas for the registration of CBCT1 to CBCT5 with SimpleITK,
the standard deviation was set to 2 and the number of iterations to 600.
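A sketch of the described patch sampling and optimizer settings, expressed with MONAI and PyTorch (the pos/neg ratio of 4:1 approximates the stated 80%/20% center selection), is given below.

```python
import torch
from monai.networks.nets import AttentionUnet
from monai.transforms import RandCropByPosNegLabeld

model = AttentionUnet(spatial_dims=3, in_channels=1, out_channels=3,
                      channels=(16, 32, 64, 128, 256), strides=(2, 2, 2, 2))

# Four (96, 96, 96) patches per volume, with crop centers drawn inside the
# parotid masks roughly 80% of the time.
crop = RandCropByPosNegLabeld(keys=["image", "label"], label_key="label",
                              spatial_size=(96, 96, 96),
                              pos=4, neg=1, num_samples=4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```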

3. Results
3.1. Training and Validation Loss
To evaluate the performance of a model during the training process, the error or the
difference between the model’s predictions and the actual target values in the training
data (training loss), as well as the degree of the model’s effectiveness to generalize to new,
unseen data (validation loss) was estimated. Figure 3 presents the learning curves of the
proposed neural network model.
Training was performed for a constant number of epochs (600 epochs), which was
selected by observing the trajectory of the loss function (reaching its final values approx-
imately after 300 epochs). The final number of epochs was selected to achieve the same
trajectory in all models and ablation experiments. To avoid overfitting, the model with the
best DSC value in the validation set was saved for each experiment. Training and validation
losses present a very close trajectory, which indicates that overfitting is mitigated, and the
model can achieve high accuracy on unseen data.

Figure 3. Training and validation loss for 600 epochs.

3.2. Segmentation Comparison Experiments

In order to evaluate the model stability, the proposed segmentation DL network is compared with representative state-of-the-art segmentation models (including neural networks with residual connections, transformer-based neural networks, etc.), further emphasizing its performance. Due to the small size of the private dataset, the weights calculated for the public datasets were utilized. In this way, the pre-trained model from the public dataset was also tested for its ability to perform well in an external (private) set with different acquisition parameters. In Table 4, a short description of each implementation and the parameters used are provided to enhance the reproducibility of our results.

Table 4. Segmentation comparison models.

UNet [40,41]: Fully convolutional neural network. Consists of an encoder path, which extracts the image's context, and a decoder path, which reconstructs the spatial information. Skip connections from the different levels of the encoder to the corresponding levels of the decoder transfer useful information between them. Parameters: encoder layers with 16, 32, 64, 128, 256 filters; decoder layers are inversed.

SwinUNETR [42]: Exploits the transformer blocks to model long-range dependencies in an image. It includes a Swin transformer encoder to capture features in different resolutions, using shifted windows in the self-attention module. The decoder part is connected via skip connections to the encoder and consists of convolutional blocks. Parameters: features dimension: 48; number of heads: 3, 6, 12, 24.

UNETR [43]: Architecture refers to a UNet-like structure with encoder/decoder parts and additional transformer blocks in the encoder blocks replacing convolutions. Parameters: hidden layer dimension: 768; attention heads: 16; feedforward layer dimension: 2048.

SegResNet [44]: Includes a larger encoder to extract useful information and a smaller decoder to produce the segmentation mask. It is based on the MONAI library without the use of the Variational Autoencoder. The main construction block of the encoder is ResNet blocks containing residual connections. Parameters: number of encoder layer filters: 8, 16, 32, 64, 128; kernel size of three; group normalization.
3.3. Segmentation Quantitative Comparison Analysis


The proposed methodology, along with the additional state-of-the-art segmentation
networks presented above, is compared in terms of DSC (Overlap), Recall, Precision, and
HD. Results of the 5-fold cross-validation for both the public and private datasets are pre-
sented in Tables 5 and 6, corresponding to the public and the private datasets, respectively.
Notably, the proposed AttentionUNet-based method achieved the best results in the two
main metrics, while it performed comparably to the other two metrics in both datasets.

Table 5. Results on the public concatenated dataset. Mean values of DSC, Recall, Precision, and HD
are presented.

Method DSC (%) Recall (%) Precision (%) HD (mm)


UNet 80.81 (±0.66) 84.10 (±1.02) 80.28 (±1.21) 10.76 (±2.10)
SwinUNETR 81.02 (±1.93) 81.55 (±0.79) 83.40 (±2.61) 39.75 (±15.37)
UNETR 78.82 (±0.91) 82.12 (±0.21) 78.53 (±1.26) 33.06 (±5.44)
SegResNet 81.75 (±1.00) 81.11 (±2.16) 84.73 (±0.98) 12.07 (±4.93)
AttentionUNet 82.65 (±1.03) 82.66 (±2.10) 84.77 (±0.76) 6.24 (±2.47)
Standard deviations from the cross-validation are placed inside the parentheses. Bold indicates the overall best
result for each metric.

Table 6. Results on the private dataset. Mean values of DSC, Recall, Precision, and HD are presented.

Method DSC (%) Recall (%) Precision (%) HD (mm)


UNet 77.11 (±3.57) 80.07 (±4.13) 76.59 (±2.46) 5.78 (±0.44)
SwinUNETR 74.46 (±5.53) 80.44 (±3.70) 73.45 (±6.75) 30.46 (±19.37)
UNETR 74.12 (±6.62) 75.08 (±6.13) 75.38 (±5.45) 52.91 (±26.82)
SegResNet 78.54 (±4.30) 77.55 (±7.20) 81.63 (±1.07) 16.92 (±19.98)
AttentionUNet 78.93 (±3.37) 79.99 (±3.78) 80.04 (±1.82) 5.35 (±0.42)
Standard deviations from the cross-validation are placed inside the parentheses. Bold indicates the overall best
result for each metric.

In the public datasets, UNETR achieved the lowest DSC, with a value of 78.82%, and
the second highest HD, with 33 mm. Although SwinUNETR achieved a DSC of 81%, the HD value was considerably high (39.75 mm), making it unsuitable for the current application of radiotherapy. This could be explained by the fact that transformer-based networks require a large number of training samples to learn to produce effective segmentations. UNet presented a low DSC value of 80.81%, which indicates a limited ability to capture details in the difficult-to-separate regions of the CT image, but achieved a small HD of about 10 mm. SegResNet
achieved the second-best results in most metrics while the proposed AttentionUNet-based
method outperformed the others in the DSC value and HD value. In this regard, it produces
segmentations with better overlap with the ground truth and also with the smallest distance
(mean value of 6.24 mm and std of 2.47 mm).
Regarding the results on the private dataset (Table 6), the weights from the public
dataset training were used as initialization for all the models, as well as the same cross-
validation technique. Due to the differences in the acquisition parameters between the
public and the private datasets and the small number of samples in the private dataset,
evaluation metrics presented lower values in comparison. Recall values were very close
for all models, while Precision was higher in SegResNet, indicating that less false positive
voxels were estimated. Similar to the public dataset, the AttentionUNet-based method
outperformed all other models in overlap and distance between prediction and ground
truth, with a higher mean DSC and the most stable metrics, as can be inferred from the
small values of the standard deviation. The HD metric was also small (5.35 mm) and stable.
Although SegResNet performed comparably in the first three metrics, the large difference in
the HD supports its lower prediction accuracy against the proposed AttentionUNet model.

3.4. Segmentation Qualitative Comparison Analysis


Segmentation examples of the proposed AttentionUNet network and the different
comparison models are presented in Figure 4a for the public dataset and in Figure 4b
for the private dataset. Regarding the public dataset, the AttentionUNet-based models
achieved visually better results in comparison with the other DL models that produced
slightly rotated masks in the same gland. As such, the proposed framework generated a smoother region and suppressed a fraction of the over-segmented regions of the other networks. By contrast, UNet, SwinUNETR, and SegResNet have over-segmented a small part of the parotid gland on the left and concurrently have produced fewer smooth masks. In a similar manner, SwinUNETR presented a larger difference in the parotid gland when compared with the ground truth.

Figure 4. Segmentation results of the tested models on (a) the public dataset; (b) the private dataset. Each column represents different algorithmic designs. The segmentation masks of the parotid glands are displayed in red.

Concerning the utilization of the private dataset as an external validation, it can be observed that the proposed AttentionUNet model best approximates the ground truth as it reduces over-segmentation, while also producing smoother segmentations. In comparison, UNet failed to outline a large portion of the parotid gland, whereas the other models produced slightly different structures on the left and under-segmented the right parotid. In this example, although AttentionUNet did not capture the whole structure on the right, it reduced false positives and produced very accurate results on the left one.

3.5. Results on the Registration Expansion Framework

Following the segmentation framework, the registration feasibility study was implemented. In Figure 5, an example output from the registration process is presented. Moving refers to the image to be aligned with the fixed image. Transformed image refers to the results of the registration process. In the final two images of each row, the edges of the fixed image (reference) are presented on top of the moving and moved images in order to qualitatively assess the performance of the utilized registration method according to the edges of the image. The first row shows an example of the first registration step, where the pCT image is registered to the CBCT1 (week 1), while the second row shows an example of the second registration step, where the CBCT1 is registered to the CBCT5 (week 5). In Figure 5a, the pCT and CBCT images present a larger difference and the number of slices is not the same. Due to this fact, moving and fixed images are far apart and the edges of the fixed image do not match the edges of the moving image. However, registration manages to bring them to approximately the same space, which can be seen both in the similarity between the fixed and the transformed image and in the alignment of the edges of the fixed image when they are superimposed on the transformed image. In Figure 5b, an example pair of the registration between CBCT1 and CBCT5 shows the smaller difference between them before registration, as well as the ability of the second registration step to align images and their edges.

Figure 5. An example of the registration processes for (a) pCT to CBCT1; (b) CBCT1 to CBCT5.
As presented in Figure 5a,b, the transformed image is more similar to the fixed one,
whereas the edges of the fixed image are closer to the real edges in the transformed image.
On the moving image, it can be observed that the edges of the fixed image initially do not
match the real edges of the moving one, while after registration, the edges of the fixed
image are aligned with the transformed image.
Bioengineering 2024, 11, x FOR PEER REVIEW 13 of 20

Bioengineering 2024, 11, 214 difference between them before registration, as well as the ability of the second registra-
13 of 20
tion step to align images and their edges.

(a)

(b)

Figure 5. An example of the registration processes for (a) pCT to CBCT1;


CBCT1; (b)
(b) CBCT1
CBCT1 to
to CBCT5.
CBCT5.

In
AsTable 7, theinvalues
presented Figureof5a,b,
the the
metrics for the assessment
transformed of the
image is more registration
similar between
to the fixed one,
the
whereas the edges of the fixed image are closer to the real edges in the transformedMutual
two images are presented. Mean values of the Mean Squared Error (MSE), image.
Information
On the moving (MI), and Normalized
image, Cross-correlation
it can be observed that the edges(NCC) arefixed
of the calculated for the pairs
image initially of
do not
images before (fixed and moving images) and after (fixed and transformed)
match the real edges of the moving one, while after registration, the edges of the fixed registration
and
imageareare
presented
aligned for
withboth
the tasks (i.e., pCT
transformed to CBCT1 and CBCT1 to CBCT5).
image.
Table 7. Results on the registration tasks.

Task                                 MSE         MI       NCC
pCT-CBCT1 (before registration)      468.52353   0.1752   0.5785
pCT-CBCT1 (after registration)       306.0259    0.4638   0.8116
CBCT1-CBCT5 (before registration)    94.1346     0.4684   0.8939
CBCT1-CBCT5 (after registration)     16.7582     0.6675   0.9681

The calculated metrics support the registration accuracy, as the MSE is reduced and the MI and NCC are increased in both tasks. Of note is that the initial step of pCT to CBCT1 was performed in order to enhance the step of CBCT1 to CBCT5, which was the target of this investigation, providing indices for required RT replanning. As expected, the registration between pCT and CBCT1 was more challenging, confirmed by the large MSE before the registration. After the registration, the MSE was reduced and the similarity measures were increased, indicating effective alignment between the different modality images. In the second task, the MSE was significantly reduced from 94.13 to 16.76, the MI was increased from 0.47 to 0.67, and the NCC from 0.89 to 0.97, further demonstrating successful registration.

Subsequently, we calculated the volumetric percentage difference between the first week and fifth week of RT, based on the premise that significant volumetric alterations during RT or between RT sessions necessitate planning adaptations. As such, the parotid volume difference between CBCT1 and CBCT5 as a result of the registration tasks was estimated and compared to the ground truth (i.e., the volume difference as annotated by experts). Table 8 presents the MAE, the RMSE, and the Pearson Correlation Coefficient (PCC) of the volumetric percentage difference between the registration output and the ground truth of the CBCT1/CBCT5 images.

Table 8. Volumetric percentage difference.

Method          MAE (%)   RMSE (%)   PCC
CBCT1→CBCT5     7.577     9.256      0.723

The volumetric percentage difference demonstrated very small values for both MAE and RMSE, indicating limited divergence from the ground truth. In fact, the MAE of 7.577 suggests that the mean percentage difference deviation between the experts' annotation and the registration output was less than 8%. Moreover, the results show a statistically significant positive correlation coefficient of 0.723 (p-value < 0.002), which indicates that the variables move in the same direction with relatively high correlation.
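For illustration, the following minimal sketch shows how the volumetric percentage difference and the MAE/RMSE/PCC agreement metrics of Table 8 could be computed from per-parotid volumes; the array values and function names below are hypothetical and not taken from the study data.

```python
import numpy as np
from scipy.stats import pearsonr

def percent_volume_change(v_week1, v_week5):
    """Volumetric percentage difference between week 1 and week 5."""
    v1, v5 = np.asarray(v_week1, dtype=float), np.asarray(v_week5, dtype=float)
    return 100.0 * (v5 - v1) / v1

# Hypothetical per-parotid volumes (cm^3): expert annotations vs. registration output
gt_change = percent_volume_change([28.1, 31.4, 25.0, 29.3], [24.3, 27.9, 22.8, 25.1])
reg_change = percent_volume_change([28.1, 31.4, 25.0, 29.3], [25.0, 26.5, 23.5, 24.4])

mae = np.mean(np.abs(reg_change - gt_change))            # Mean Absolute Error
rmse = np.sqrt(np.mean((reg_change - gt_change) ** 2))   # Root Mean Squared Error
pcc, p_value = pearsonr(reg_change, gt_change)           # Pearson Correlation Coefficient

print(f"MAE = {mae:.3f}%, RMSE = {rmse:.3f}%, PCC = {pcc:.3f} (p = {p_value:.3f})")
```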

4. Discussion
Automated segmentation in HN cancer is of great significance for RT planning, as
it can reduce time and provide crucial information on the anatomical structures of the
region of interest. In this study, a DL segmentation framework is proposed, aiming for fast,
automatic, and accurate delineation of the parotid glands. Although automated systems for
enhancing adaptive radiotherapy exist (e.g., Varian ETHOS, RT-Plan, etc.), we propose a
low-requirement AI-based decision support system that can provide insight and effectively
assist radiotherapy planning [45,46]. The close alignment of our study with the current clinical workflow, combined with further extensive evaluation experiments and technical improvements, could yield a tool applicable to real clinical procedures. In this
regard, the utilized AttentionUNet architecture achieved high overall performance in both
public and private datasets, supporting the potential applicability of the proposed method
to the clinical workflow.
Regarding the proposed DL model, the AttentionUNet architecture utilized in this
study exploits the addition of an attention mechanism to the UNet structure [31]. As such,
the UNet-like structure with five layers in the encoder enables the network to extract useful
context information in the encoder path and then reconstruct the input and the spatial
information to produce the final mask in the decoder. In addition, by including learnable
weights, the network learns to focus on the regions inside the image that are more relevant
to the task, in this case, the parotid glands. In this way, false positive regions are effectively
suppressed, especially in the CT images where the structures are not easy to separate [47].
Moreover, the integration of attention gates in the model’s design allowed us to incorporate
features from deeper encoder layers that introduce better context information in each layer
of the decoder [48]. Further enhancement of the segmentation accuracy was accomplished with a combined Dice and cross-entropy loss, which was especially helpful in the early epochs, when the weights could otherwise not be pushed in the right direction.
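As a rough illustration of this design, the sketch below instantiates an attention-gated UNet with five encoder levels and a combined Dice/cross-entropy objective using the MONAI library cited in this work; the channel sizes, strides, dropout, and patch size are illustrative assumptions rather than the study's exact configuration.

```python
import torch
from monai.losses import DiceCELoss
from monai.networks.nets import AttentionUnet

# Attention-gated UNet with five resolution levels in the encoder
# (channel sizes, strides, and dropout are assumptions for illustration)
model = AttentionUnet(
    spatial_dims=3,
    in_channels=1,                    # single-channel CT input
    out_channels=2,                   # background + parotid gland
    channels=(16, 32, 64, 128, 256),  # five encoder levels
    strides=(2, 2, 2, 2),
    dropout=0.1,
)

# Combined Dice and cross-entropy loss with equal weights (cf. Equation (A8))
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True, lambda_dice=1.0, lambda_ce=1.0)

# Dummy forward/backward pass on a random patch to illustrate usage
image = torch.rand(1, 1, 64, 64, 64)              # (batch, channel, D, H, W)
label = torch.randint(0, 2, (1, 1, 64, 64, 64))   # voxel-wise parotid mask
loss = loss_fn(model(image), label)
loss.backward()
print(f"training loss: {loss.item():.4f}")
```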
For a robust and unbiased evaluation of the segmentation framework, the segmenta-
tion accuracy was tested on a unified public dataset (consisting of two public subsets) and
a private dataset. By including images with different acquisition parameters and settings,
we aimed to validate the generalizability of our method, enhancing the model’s ability to
work effectively under different clinical protocols [49]. As such, the model demonstrated high performance on the public dataset, achieving DSC and HD values of 82.65% and 6.24 mm, respectively. The private dataset was also used as an external validation set, where delineation was considerably more difficult (variations in the acquisition parameters compared with the public training datasets further complicated the task). Nevertheless, the proposed method presented satisfactory results that support its ability to generalize to other datasets, achieving a DSC of 78.93% and an HD of 5.35 mm on the
private dataset. In this setting, the pre-trained model resulting from the training in the
public dataset was applied in a 5-fold cross-validation in the private set, also achieving
high metrics, albeit with a small reduction deemed inevitable due to the large difference in
the acquisition parameters among the datasets [50]. It should be noted that the proposed
AttentionUNet-based network outperformed many state-of-the-art DL segmentation meth-
ods, both in terms of overlap between the predicted and the ground truth masks and in
terms of the distance of their edges. On this premise, HD is required to be as low as possible
due to its applicability in RT (where the distance between the true region and the delineated
regions should be short, to increase tumor coverage and alleviate OAR overdose) [51].
Subsequent to the segmentation framework, the predicted masks from the pCT of
the private dataset were further utilized in a registration schema, assessing the method’s
potential use in clinical practice for RT planning adaptations. To achieve this, two image
registration tasks were designed and implemented in order to transfer the predicted masks
to the CBCT of week 5 of the radiotherapy workflow. The ANTs registration method was
utilized to align the pCT with the CBCT1, and the calculated transformation was applied
to the labels of the parotids to align the labels to the CBCT1. This was an intermediate
step, because the distance between pCT and CBCT of week 1 was smaller than that of week
5. This step allowed the registration to focus on the large differences in the acquisition
parameters without introducing large differences in the patient’s anatomy during the RT
treatment. In the second task, CBCT1 was aligned to CBCT5 (through Fast Symmetric
Demons registration) to transform the labels of the parotid glands and calculate their
volume in week 5. Of note is that CBCT images were more similar to each other in
terms of the acquisition parameters and the intensity values, which in turn enhanced
the registration procedure. As presented in the registration results section, the proposed
workflow was able to approximate the volumetric changes of the parotid glands with
high accuracy. Interestingly, the volumetric percentage difference between CBCT1 and
CBCT5 indicated an MAE of 7.577 when compared to the experts' observations. This indicates
the high performance of the registration process, taking into account that variations in
the registration of cancer CT scans can also occur among different observers. In fact,
observers may have individual techniques, preferences, or interpretations when performing
registrations, leading to differences in how they align or match the images [21]. The use of
automated registration tools can contribute to reducing variability among observers.
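To make the first step of this workflow concrete, the sketch below outlines how the pCT-to-CBCT1 registration and the label transfer could be scripted with the ANTsPy bindings of the ANTs toolkit; the file names, the choice of the SyN transform, and the interpolator are assumptions for illustration, not the study's exact settings.

```python
import ants

# Hypothetical file names for a single patient (illustrative assumptions)
pct = ants.image_read("patient01_pCT.nii.gz")            # planning CT (moving image)
cbct1 = ants.image_read("patient01_CBCT_week1.nii.gz")   # week-1 CBCT (fixed image)
parotid_pred = ants.image_read("patient01_parotids_pred.nii.gz")  # predicted masks on the pCT

# Step 1: deformable registration of the pCT to CBCT1
# (the SyN transform type is an assumption, not necessarily the study's setting)
reg = ants.registration(fixed=cbct1, moving=pct, type_of_transform="SyN")

# Transfer the parotid labels to CBCT1; nearest-neighbour interpolation keeps labels discrete
labels_on_cbct1 = ants.apply_transforms(
    fixed=cbct1,
    moving=parotid_pred,
    transformlist=reg["fwdtransforms"],
    interpolator="nearestNeighbor",
)
ants.image_write(labels_on_cbct1, "patient01_parotids_on_CBCT1.nii.gz")
```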

Limitations and Future Recommendations


Some considerations should be given when interpreting the results of this study.
First and foremost, this study included a relatively small sample data size. Although
three datasets were combined, the significant variance among patients could limit the
generalizability of the findings. As such, the framework proposed in this study seems
to lack the requisite robustness for practical deployment in real-world scenarios. The
performance metrics and results (although encouraging) suggest limitations in the model’s
ability to generalize to diverse and complex real-world scenarios. In addition, the segmentation and subsequent registration processes were applied to delineate only the parotid gland; the model was therefore trained on the shape constraints of a small organ, leaving out the shape characteristics of larger OARs. This might introduce biases and limitations when expanding to
other organs. Moreover, the study focused on segmentation and registration algorithmic
performance, omitting potential limitations associated with their implementation in clinical
practice, such as computational resources, training time, and integration with existing
medical imaging systems. As a result, caution should be exercised before deploying the DL
model in real-world settings, and further refinements or improvements may be necessary
to support the adoption of automated DL segmentation systems. In this regard, we aim to
expand our framework to involve additional patient cohorts and expert observations to
enhance its efficacy and reliability.

5. Conclusions
In this study, a novel DL neural network framework (AttentionUNet) is implemented
for the automated segmentation of parotid glands in HN cancer. The model is evaluated
across two publicly available datasets and one private dataset, and compared against other
state-of-the-art DL segmentation approaches. The proposed model demonstrates high
segmentation results with a Dice Similarity Coefficient of 82.65% ± 1.03 and a Hausdorff Distance of 6.24 mm ± 2.47, outperforming the other DL methods. Following this, a registration schema is applied to the segmentation output to align pCT and CBCT images. The registration results exhibit high similarity when compared to the ground truth,
offering valuable insights into the impact of radiotherapy treatment for adaptive treatment
planning adjustments. The implementation of these proposed methods underscores the
efficacy of DL and its potential applicability in the clinical workflow of image-guided RT
for applications such as the identification of the volumetric changes in OARs.

Author Contributions: Conceptualization, G.K.M. and I.K.; methodology, T.P.V. and A.Z.; software,
T.P.V.; validation, I.K. and T.P.V.; formal analysis, I.K.; investigation, T.P.V.; resources, A.Z.; data
curation, A.Z.; writing—original draft preparation, I.K. and T.P.V.; writing—review and editing, I.K.;
visualization, I.K.; supervision, G.K.M.; project administration, G.K.M. All authors have read and
agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration
of Helsinki and approved by the Ethics Committee of ATTIKON University Hospital, Athens, Greece
(protocol EBD304/11-05-2022, 7 June 2022).
Informed Consent Statement: Patient consent was waived as all patient data were analyzed retro-
spectively after being anonymized.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author. The data are not publicly available due to privacy reasons.
Conflicts of Interest: The authors declare no conflicts of interest.

Appendix A
Here, the calculation of the evaluation metrics and the loss function of the segmentation
procedures are presented.
As mentioned in the main manuscript, the DSC coefficient quantifies the overlap
between the ground truth mask and the predicted mask. The DSC value is calculated by
the following formula:

D(y, \tilde{y}) := \frac{2\,|y \cap \tilde{y}|}{|y| + |\tilde{y}|}    (A1)

where y and \tilde{y} represent the sets of voxels in the ground truth and predicted segmentations, and |y| and |\tilde{y}| represent the total number of voxels in each set, respectively.
Regarding the directed HD from A to B, it is calculated by the following equation:

h(A, B) = \max_{a \in A} \min_{b \in B} \| a - b \|    (A2)

and the final HD is calculated as follows:

H(A, B) = max{h(A, B), h(B, A)} (A3)

Recall/Sensitivity was calculated by the number of true positives divided by the sum
of true positives and false negatives.

Recall = \frac{TP}{TP + FN}    (A4)

Precision is calculated by the number of true positives divided by the sum of true positives and false positives.

Precision = \frac{TP}{TP + FP}    (A5)

where TP, FN, and FP correspond to the true positives, false negatives, and false positives,
respectively.
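For reference, a small NumPy-based sketch of these voxel-wise metrics is given below; it is a simplified illustration (e.g., the Hausdorff distance is computed in voxel units on coordinate lists), not the exact evaluation code used in the study.

```python
import numpy as np
from scipy.spatial.distance import cdist

def confusion_counts(pred, gt):
    """True positives, false positives, and false negatives between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return np.sum(pred & gt), np.sum(pred & ~gt), np.sum(~pred & gt)

def dsc(pred, gt):
    tp, fp, fn = confusion_counts(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)           # cf. Equation (A6): L_DSC = 1 - DSC

def recall(pred, gt):
    tp, _, fn = confusion_counts(pred, gt)
    return tp / (tp + fn)                        # Equation (A4)

def precision(pred, gt):
    tp, fp, _ = confusion_counts(pred, gt)
    return tp / (tp + fp)                        # Equation (A5)

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance (Equations (A2) and (A3)), in voxel units."""
    a, b = np.argwhere(pred.astype(bool)), np.argwhere(gt.astype(bool))
    d = cdist(a, b)                              # pairwise Euclidean distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy example with small random binary masks
rng = np.random.default_rng(0)
pred, gt = rng.random((16, 16, 16)) > 0.8, rng.random((16, 16, 16)) > 0.8
print(dsc(pred, gt), recall(pred, gt), precision(pred, gt), hausdorff(pred, gt))
```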
In this study, the utilized loss function consists of the weighted sum of the cross-
entropy and DSC coefficient. Here, the loss function, 1 − DSC, was calculated as:

L_{DSC} = 1 - DSC = 1 - \frac{2TP}{2TP + FP + FN}    (A6)

For the segmentation task, cross entropy was calculated in a pixel-wise manner and
the average value was used as the result. In the multi-class formulation, the categorical
cross-entropy was calculated from the following equation:

L_{CCE}(y, p) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \cdot \log(p_{i,c})    (A7)

where y_{i,c} represents the one-hot encoded ground truth labels and p_{i,c} refers to the predicted values per class, with "c" iterating through the classes and "i" through the pixels.
The final loss is calculated through the sum of the two previously mentioned loss functions,
with equal weights of 1.
Loss = w_1 \cdot L_{DSC} + w_2 \cdot L_{CCE}    (A8)
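As an illustration of Equations (A6)–(A8), a compact PyTorch sketch of the combined loss is shown below; it uses the standard soft (differentiable) Dice formulation in place of the hard TP/FP/FN counts, assumes softmax probabilities and one-hot targets, and is a didactic simplification rather than the training code used in the study.

```python
import torch

def combined_loss(probs, target_onehot, w1=1.0, w2=1.0, eps=1e-6):
    """Weighted sum of a soft Dice loss (A6) and categorical cross-entropy (A7), as in (A8).

    probs:          softmax probabilities, shape (num_pixels, num_classes)
    target_onehot:  one-hot ground truth,  shape (num_pixels, num_classes)
    """
    # Soft Dice loss: probabilities stand in for the TP/FP/FN counts of Equation (A6)
    intersection = torch.sum(probs * target_onehot)
    dice = 2.0 * intersection / (torch.sum(probs) + torch.sum(target_onehot) + eps)
    l_dsc = 1.0 - dice

    # Pixel-wise categorical cross-entropy, averaged over the pixels (Equation (A7))
    l_cce = -torch.mean(torch.sum(target_onehot * torch.log(probs + eps), dim=1))

    return w1 * l_dsc + w2 * l_cce

# Toy example: five pixels, two classes (background / parotid)
probs = torch.softmax(torch.randn(5, 2), dim=1)
target = torch.nn.functional.one_hot(torch.tensor([0, 1, 1, 0, 1]), num_classes=2).float()
print(combined_loss(probs, target))
```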

Appendix B
Here, the three metrics calculated for the evaluation of both registration tasks are
presented, as well as the global energy and similarity of the Demons algorithm. The fixed
or reference image is declared with f and the output transformed image with t.
The Demons algorithm global energy and similarity are defined as:

E(c, s) = \frac{1}{\sigma_i^2} Sim(F, M \circ c) + \frac{1}{\sigma_x^2} dist(s, c)^2 + \frac{1}{\sigma_T^2} Reg(s)    (A9)

Sim(F, M \circ s) = \frac{1}{2} \| F - M \circ s \|^2 = \frac{1}{2|\Omega_P|} \sum_{p \in \Omega_P} | F(p) - M(s(p)) |^2    (A10)

where F and M represent the fixed and moving images, respectively; Ω_P denotes the overlap between F and M ◦ s; s is the transformation; σ_i the noise; σ_T quantifies the regularization; σ_x the spatial uncertainty; and ‖c − s‖ the distance.
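A minimal sketch of the second registration step (CBCT1 to CBCT5) with SimpleITK's fast symmetric forces Demons filter is given below; the iteration count, smoothing standard deviation, and file names are illustrative assumptions rather than the study's exact settings.

```python
import SimpleITK as sitk

# Hypothetical inputs (illustrative assumptions); intensities cast to float for the Demons filter
cbct1 = sitk.Cast(sitk.ReadImage("patient01_CBCT_week1.nii.gz"), sitk.sitkFloat32)  # moving
cbct5 = sitk.Cast(sitk.ReadImage("patient01_CBCT_week5.nii.gz"), sitk.sitkFloat32)  # fixed
labels_cbct1 = sitk.ReadImage("patient01_parotids_on_CBCT1.nii.gz")

# Fast symmetric forces Demons registration of CBCT1 to CBCT5
demons = sitk.FastSymmetricForcesDemonsRegistrationFilter()
demons.SetNumberOfIterations(100)   # iteration count is an assumption
demons.SetStandardDeviations(1.5)   # Gaussian smoothing of the update field (assumption)
displacement_field = demons.Execute(cbct5, cbct1)
transform = sitk.DisplacementFieldTransform(displacement_field)

# Warp the parotid labels onto CBCT5 and estimate the week-5 parotid volume
labels_cbct5 = sitk.Resample(labels_cbct1, cbct5, transform,
                             sitk.sitkNearestNeighbor, 0, labels_cbct1.GetPixelID())
arr = sitk.GetArrayFromImage(labels_cbct5)
voxel_volume_mm3 = 1.0
for s in cbct5.GetSpacing():
    voxel_volume_mm3 *= s
volume_cm3 = (arr > 0).sum() * voxel_volume_mm3 / 1000.0
print(f"Estimated parotid volume at week 5: {volume_cm3:.2f} cm^3")
```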
The evaluation metrics of both registration tasks were as follows:
Mean Squared Error (MSE):
MSE(f, t) = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}    (A11)
where n is the number of observations (pixels), yi is the ith observed value, and ŷi is the ith
predicted value (pixel).
Mutual Information (MI):
 
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left( \frac{p(x, y)}{p(x)\,p(y)} \right)    (A12)

where p(x), p(y) are the marginal probability mass functions of x, y, respectively, and p(x, y)
is the joint probability mass function of x, y.
Normalized Cross-Correlation (NCC):

NCC = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}    (A13)

where x and y are the images’ values and x is the mean.

References
1. Argiris, A.; Karamouzis, M.V.; Raben, D.; Ferris, R.L. Head and Neck Cancer. Lancet Lond. Engl. 2008, 371, 1695–1709. [CrossRef]
[PubMed]
2. Mnejja, W.; Daoud, H.; Fourati, N.; Sahnoun, T.; Siala, W.; Farhat, L.; Daoud, J. Dosimetric Impact on Changes in Target Volumes
during Intensity-Modulated Radiotherapy for Nasopharyngeal Carcinoma. Rep. Pract. Oncol. Radiother. J. Gt. Cancer Cent. Poznan
Pol. Soc. Radiat. Oncol. 2020, 25, 41–45. [CrossRef] [PubMed]

3. Zhang, Z.; Zhao, T.; Gay, H.; Zhang, W.; Sun, B. ARPM-Net: A Novel CNN-Based Adversarial Method with Markov Random
Field Enhancement for Prostate and Organs at Risk Segmentation in Pelvic CT Images. Med. Phys. 2021, 48, 227–237. [CrossRef]
[PubMed]
4. Zhang, Z.; Zhao, T.; Gay, H.; Zhang, W.; Sun, B. Semi-Supervised Semantic Segmentation of Prostate and Organs-at-Risk on 3D
Pelvic CT Images. Biomed. Phys. Eng. Express 2021, 7, 65023. [CrossRef]
5. Figen, M.; Çolpan Öksüz, D.; Duman, E.; Prestwich, R.; Dyker, K.; Cardale, K.; Ramasamy, S.; Murray, P.; Şen, M. Radiotherapy
for Head and Neck Cancer: Evaluation of Triggered Adaptive Replanning in Routine Practice. Front. Oncol. 2020, 10, 579917.
[CrossRef]
6. Iliadou, V.; Economopoulos, T.L.; Karaiskos, P.; Kouloulias, V.; Platoni, K.; Matsopoulos, G.K. Deformable Image Registration to
Assist Clinical Decision for Radiotherapy Treatment Adaptation for Head and Neck Cancer Patients. Biomed. Phys. Eng. Express
2021, 7, 55012. [CrossRef] [PubMed]
7. Iliadou, V.; Kakkos, I.; Karaiskos, P.; Kouloulias, V.; Platoni, K.; Zygogianni, A.; Matsopoulos, G.K. Early Prediction of Planning
Adaptation Requirement Indication Due to Volumetric Alterations in Head and Neck Cancer Radiotherapy: A Machine Learning
Approach. Cancers 2022, 14, 3573. [CrossRef]
8. Siddique, S.; Chow, J.C.L. Artificial Intelligence in Radiotherapy. Rep. Pract. Oncol. Radiother. J. Gt. Cancer Cent. Poznan Pol. Soc.
Radiat. Oncol. 2020, 25, 656–666. [CrossRef]
9. Kushnure, D.T.; Talbar, S.N. HFRU-Net: High-Level Feature Fusion and Recalibration UNet for Automatic Liver and Tumor
Segmentation in CT Images. Comput. Methods Programs Biomed. 2022, 213, 106501. [CrossRef]
10. Li, W.; Song, H.; Li, Z.; Lin, Y.; Shi, J.; Yang, J.; Wu, W. OrbitNet—A Fully Automated Orbit Multi-Organ Segmentation Model
Based on Transformer in CT Images. Comput. Biol. Med. 2023, 155, 106628. [CrossRef]
11. Yousefi, S.; Sokooti, H.; Elmahdy, M.S.; Lips, I.M.; Manzuri Shalmani, M.T.; Zinkstok, R.T.; Dankers, F.J.W.M.; Staring, M.
Esophageal Tumor Segmentation in CT Images Using a Dilated Dense Attention Unet (DDAUnet). IEEE Access 2021, 9,
99235–99248. [CrossRef]
12. Qazi, A.A.; Pekar, V.; Kim, J.; Xie, J.; Breen, S.L.; Jaffray, D.A. Auto-Segmentation of Normal and Target Structures in Head and
Neck CT Images: A Feature-Driven Model-Based Approach. Med. Phys. 2011, 38, 6160–6170. [CrossRef]
13. Fortunati, V.; Verhaart, R.F.; Niessen, W.J.; Veenland, J.F.; Paulides, M.M.; Walsum, T. van Automatic Tissue Segmentation of Head
and Neck MR Images for Hyperthermia Treatment Planning. Phys. Med. Biol. 2015, 60, 6547. [CrossRef]
14. Fritscher, K.D.; Peroni, M.; Zaffino, P.; Spadea, M.F.; Schubert, R.; Sharp, G. Automatic Segmentation of Head and Neck CT Images
for Radiotherapy Treatment Planning Using Multiple Atlases, Statistical Appearance Models, and Geodesic Active Contours.
Med. Phys. 2014, 41, 51910. [CrossRef]
15. Costea, M.; Zlate, A.; Durand, M.; Baudier, T.; Grégoire, V.; Sarrut, D.; Biston, M.-C. Comparison of Atlas-Based and Deep
Learning Methods for Organs at Risk Delineation on Head-and-Neck CT Images Using an Automated Treatment Planning
System. Radiother. Oncol. 2022, 177, 61–70. [CrossRef]
16. Men, K.; Geng, H.; Cheng, C.; Zhong, H.; Huang, M.; Fan, Y.; Plastaras, J.P.; Lin, A.; Xiao, Y. Technical Note: More Accurate and
Efficient Segmentation of Organs-at-Risk in Radiotherapy with Convolutional Neural Networks Cascades. Med. Phys. 2019, 46,
286–292. [CrossRef] [PubMed]
17. Rasmussen, M.E.; Nijkamp, J.A.; Eriksen, J.G.; Korreman, S.S. A Simple Single-Cycle Interactive Strategy to Improve Deep
Learning-Based Segmentation of Organs-at-Risk in Head-and-Neck Cancer. Phys. Imaging Radiat. Oncol. 2023, 26, 100426.
[CrossRef] [PubMed]
18. Kawahara, D.; Tsuneda, M.; Ozawa, S.; Okamoto, H.; Nakamura, M.; Nishio, T.; Saito, A.; Nagata, Y. Stepwise Deep Neural
Network (Stepwise-Net) for Head and Neck Auto-Segmentation on CT Images. Comput. Biol. Med. 2022, 143, 105295. [CrossRef]
[PubMed]
19. Cubero, L.; Castelli, J.; Simon, A.; de Crevoisier, R.; Acosta, O.; Pascau, J. Deep Learning-Based Segmentation of Head and Neck
Organs-at-Risk with Clinical Partially Labeled Data. Entropy 2022, 24, 1661. [CrossRef] [PubMed]
20. Zhang, Y.; Liao, Q.; Ding, L.; Zhang, J. Bridging 2D and 3D Segmentation Networks for Computation-Efficient Volumetric Medical
Image Segmentation: An Empirical Study of 2.5D Solutions. Comput. Med. Imaging Graph. 2022, 99, 102088. [CrossRef] [PubMed]
21. Gibbons, E.; Hoffmann, M.; Westhuyzen, J.; Hodgson, A.; Chick, B.; Last, A. Clinical Evaluation of Deep Learning and Atlas-Based
Auto-Segmentation for Critical Organs at Risk in Radiation Therapy. J. Med. Radiat. Sci. 2023, 70, 15–25. [CrossRef]
22. D’Aviero, A.; Re, A.; Catucci, F.; Piccari, D.; Votta, C.; Piro, D.; Piras, A.; Di Dio, C.; Iezzi, M.; Preziosi, F.; et al. Clinical Validation
of a Deep-Learning Segmentation Software in Head and Neck: An Early Analysis in a Developing Radiation Oncology Center.
Int. J. Environ. Res. Public. Health 2022, 19, 9057. [CrossRef]
23. Chen, X.; Sun, S.; Bai, N.; Han, K.; Liu, Q.; Yao, S.; Tang, H.; Zhang, C.; Lu, Z.; Huang, Q.; et al. A Deep Learning-
Based Auto-Segmentation System for Organs-at-Risk on Whole-Body Computed Tomography Images for Radiation Therapy.
Radiother. Oncol. 2021, 160, 175–184. [CrossRef]
24. Zhang, Z.; Zhao, T.; Gay, H.; Zhang, W.; Sun, B. Weaving Attention U-Net: A Novel Hybrid CNN and Attention-Based Method
for Organs-at-Risk Segmentation in Head and Neck CT Images. Med. Phys. 2021, 48, 7052–7062. [CrossRef]
25. CSAF-CNN: Cross-Layer Spatial Attention Map Fusion Network for Organ-At-Risk Segmentation in Head and Neck CT
Images|IEEE Signal Processing Society Resource Center. Available online: https://rc.signalprocessingsociety.org/conferences/
isbi-2020/spsisbivid0331 (accessed on 16 February 2024).

26. Lei, W.; Mei, H.; Sun, Z.; Ye, S.; Gu, R.; Wang, H.; Huang, R.; Zhang, S.; Zhang, S.; Wang, G. Automatic Segmentation of
Organs-at-Risk from Head-and-Neck CT Using Separable Convolutional Neural Network with Hard-Region-Weighted Loss.
Neurocomputing 2021, 442, 184–199. [CrossRef]
27. Gao, Y.; Huang, R.; Yang, Y.; Zhang, J.; Shao, K.; Tao, C.; Chen, Y.; Metaxas, D.N.; Li, H.; Chen, M. FocusNetv2: Imbalanced Large
and Small Organ Segmentation with Adversarial Shape Constraint for Head and Neck CT Images. Med. Image Anal. 2021, 67,
101831. [CrossRef] [PubMed]
28. Raudaschl, P.F.; Zaffino, P.; Sharp, G.C.; Spadea, M.F.; Chen, A.; Dawant, B.M.; Albrecht, T.; Gass, T.; Langguth, C.; Lüthi, M.; et al.
Evaluation of Segmentation Methods on Head and Neck CT: Auto-Segmentation Challenge 2015. Med. Phys. 2017, 44, 2020–2036.
[CrossRef] [PubMed]
29. Podobnik, G.; Strojan, P.; Peterlin, P.; Ibragimov, B.; Vrtovec, T. HaN-Seg: The Head and Neck Organ-at-Risk CT and MR
Segmentation Dataset. Med. Phys. 2023, 50, 1917–1927. [CrossRef] [PubMed]
30. Hoang, J.K.; Glastonbury, C.M.; Chen, L.F.; Salvatore, J.K.; Eastwood, J.D. CT Mucosal Window Settings: A Novel Approach to
Evaluating Early T-Stage Head and Neck Carcinoma. AJR Am. J. Roentgenol. 2010, 195, 1002–1006. [CrossRef] [PubMed]
31. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al.
Attention U-Net: Learning Where to Look for the Pancreas 2018. arXiv 2018, arXiv:1804.03999.
32. Eelbode, T.; Bertels, J.; Berman, M.; Vandermeulen, D.; Maes, F.; Bisschops, R.; Blaschko, M.B. Optimization for Medical Image
Segmentation: Theory and Practice When Evaluating With Dice Score or Jaccard Index. IEEE Trans. Med. Imaging 2020, 39,
3679–3690. [CrossRef]
33. Müller, D.; Soto-Rey, I.; Kramer, F. Towards a Guideline for Evaluation Metrics in Medical Image Segmentation. BMC Res. Notes
2022, 15, 210. [CrossRef]
34. Yeung, M.; Sala, E.; Schönlieb, C.-B.; Rundo, L. Unified Focal Loss: Generalising Dice and Cross Entropy-Based Losses to Handle
Class Imbalanced Medical Image Segmentation. Comput. Med. Imaging Graph. 2022, 95, 102026. [CrossRef]
35. Avants, B.; Tustison, N.J.; Song, G. Advanced Normalization Tools: V1.0. Insight J. 2009, 2, 1–35. [CrossRef]
36. Lowekamp, B.C.; Chen, D.T.; Ibáñez, L.; Blezek, D. The Design of SimpleITK. Front. Neuroinformatics 2013, 7, 45. [CrossRef]
37. Vercauteren, T.; Pennec, X.; Perchant, A.; Ayache, N. Diffeomorphic Demons Using ITK’s Finite Difference Solver Hierarchy.
Insight J. 2008. [CrossRef]
38. Lee Rodgers, J.; Nicewander, W.A. Thirteen Ways to Look at the Correlation Coefficient. Am. Stat. 1988, 42, 59–66. [CrossRef]
39. Cardoso, M.J.; Li, W.; Brown, R.; Ma, N.; Kerfoot, E.; Wang, Y.; Murrey, B.; Myronenko, A.; Zhao, C.; Yang, D.; et al. MONAI: An
Open-Source Framework for Deep Learning in Healthcare. arXiv 2022, arXiv:2211.02701.
40. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings
of the Medical Image Computing and Computer-Assisted Intervention—MICCAI, Munich, Germany, 5–9 October 2015;
Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015;
pp. 234–241.
41. Kerfoot, E.; Clough, J.; Oksuz, I.; Lee, J.; King, A.P.; Schnabel, J.A. Left-Ventricle Quantification Using Residual U-Net. In Statistical
Atlases and Computational Models of the Heart. Atrial Segmentation and LV Quantification Challenges; Pop, M., Sermesant, M., Zhao, J.,
Li, S., McLeod, K., Young, A., Rhode, K., Mansi, T., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 371–380.
42. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D
Medical Image Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision
(WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1748–1758.
43. Myronenko, A. 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In Proceedings of the Brainlesion: Glioma
Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Crimi, A., Bakas, S., Kuijf, H.,
Keyvan, F., Reyes, M., van Walsum, T., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 311–320.
44. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation
of Brain Tumors in MRI Images. In Proceedings of the Brainlesion: Glioma Multiple Sclerosis, Stroke and Traumatic Brain
Injuries, Singapore, 18 September 2022; Crimi, A., Bakas, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022;
pp. 272–284.
45. El-qmache, A.; McLellan, J. Investigating the Feasibility of Using Ethos Generated Treatment Plans for Head and Neck Cancer
Patients. Tech. Innov. Patient Support Radiat. Oncol. 2023, 27, 100216. [CrossRef]
46. Miura, H.; Tanooka, M.; Inoue, H.; Fujiwara, M.; Kosaka, K.; Doi, H.; Takada, Y.; Odawara, S.; Kamikonya, N.; Hirota, S.
DICOM-RT Plan Complexity Verification for Volumetric Modulated Arc Therapy. Int. J. Med. Phys. Clin. Eng. Radiat. Oncol. 2014,
3, 117–124. [CrossRef]
47. Guo, W.; Li, Q. High Performance Lung Nodule Detection Schemes in CT Using Local and Global Information. Med. Phys. 2012,
39, 5157–5168. [CrossRef]
48. Yin, X.-X.; Sun, L.; Fu, Y.; Lu, R.; Zhang, Y. U-Net-Based Medical Image Segmentation. J. Healthc. Eng. 2022, 2022, 4189781.
[CrossRef]
49. Acosta, J.N.; Falcone, G.J.; Rajpurkar, P.; Topol, E.J. Multimodal Biomedical AI. Nat. Med. 2022, 28, 1773–1784. [CrossRef]

50. Althnian, A.; AlSaeed, D.; Al-Baity, H.; Samha, A.; Dris, A.B.; Alzakari, N.; Abou Elwafa, A.; Kurdi, H. Impact of Dataset Size on
Classification Performance: An Empirical Evaluation in the Medical Domain. Appl. Sci. 2021, 11, 796. [CrossRef]
51. Kong, V.C.; Marshall, A.; Chan, H.B. Cone Beam Computed Tomography: The Challenges and Strategies in Its Application for
Dose Accumulation. J. Med. Imaging Radiat. Sci. 2016, 47, 92–97. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
