Sagittal Cervical Spine Landmark Point Detection in X-Ray Using Deep Convolutional Neural Networks

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final

publication. Citation information: DOI
10.1109/ACCESS.2022.3180028, IEEE Access
Date of publication xxx 00, 0000, date of current version xxx 00, 0000.
Digital Object Identifier 10.1109/
Sagittal Cervical Spine Landmark Point

Detection in X-Ray Using Deep
Convolutional Neural Networks
ALI POURRAMEZAN FARD1,3, JOE FERRANTELLI2, ANNE-LISE DUPUIS2 and MOHAMMAD
H. MAHOOR1,3, (Senior Member, IEEE)
1
Ritchie School of Engineering and Computer Science, University of Denver, 2155 E Wesley Ave, Denver, CO 80208, USA (e-mail: {ali.pourramezanfard,
mmahoor}@du.edu)
2
PostureCo, Inc, Trinity, FL 34655, USA (email:{joe.ferrantelli, annelise.dupuis}@postureco.com)
3
DreamFace Technologies LLC, Centennial, CO USA
Corresponding author: Ali Pourramezan Fard (e-mail: ali.pourramezanfard@du.edu) This

work was supported by PostureCo.
ABSTRACT Sagittal cervical spine alignment measured on X-Ray is a key objective measure for clinicians
caring for patients with a multitude of presenting symptoms. Despite its applications, there has been no
research available in this field yet. This paper presents a framework for automatic detection of the
Sagittal cervical spine landmark point. Inspired by UNet, we propose an encoder-decoder Convolutional
Neural Network (CNN) called PoseNet. In developing our model, we first review the weaknesses of
widely used regression loss functions such as the L1, and L2 losses. To address these issues, we propose
a novel loss function specifically designed to improve the accuracy of the localization task under
challenging situations (extreme neck pose, low or high brightness and illumination, X-Ray noises, etc.)
We validate our model and loss function on a dataset of X-Ray images. The results show that our
framework is capable of performing precise sagittal cervical spine landmark point detection even for
challenging X-Ray images.
INDEX TERMS Neck Landmark Point Detection, Landmark Point Detection, Intensity Aware Loss, Custom
Loss Function, Medical Image Processing, X-Ray Image Processing, Convolutional Neural Network,
Computer Vision, Deep Learning
I. INTRODUCTION Typically, in a medical or chiropractic practice, X-Rays are

The sagittal cervical spine has been both modeled and obtained during the examination and analyzed “by hand”
validated for biomechanical alignment with regard to a using traditional X-Ray software annotation tools which
normal range [1]–[4]. Using these normal ranges of accompany most PACS (Picture Archiving and
alignment as guides, clinicians are better equipped to Communication System) software. Traditional PACS
gauge the health of their patient’s spine as related to their software allows for simple overlaying of annotation
presenting symptoms. For example, reversal of the drawings for both angular and linear measurements. In
expected normal cervical lordosis (known as cervical addition to PACS systems, more sophisticated radiographic
kyphosis) is a spinal deformity and is correlated with EMR systems, such as PostureRay® from PostureCo, Inc.
disability and pain [5], [6]. Furthermore, abnormal cervical take mensuration one step further allowing a clinician to
sagittal spine alignment is also correlated with symptoms truly digitize anatomical landmarks of the sagittal cervical
such as headache [7]–[10], migraine [10], [11], as well as spine leading to more efficient and accelerated objective
related to increased incidence of cervical radiculopathy quantification of vertebral rotations and translations for
and myelopathy [12], [13]. Furthermore, a forward head clinical documentation. However, this digitization remains
posture which is also measured on X-Ray is significantly a completely manual process thus a very timeconsuming
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final
10.110 publication. Citation information:
/ACCESS.2022.3180028, IEEE DOI
9 Access
Fardet al.: Sagittal Cervical Spine Landmark Points Detection in X-Ray Using Deep Convolutional Neural Networks
related to neck pain as well as distorted nervous system process for today’s clinician. To expedite this manual
function both sensorimotor and autonomic [14]. process, we have implemented an innovative neural
Consequently, spinal radiography is vital for the clinical network to automatically detect the anatomical location of
evaluation of sagittal cervical spine alignment to address the vertebrae which drastically reduces the clinician’s time
functional losses, pain, and weakness [15]. in mensuration of the sagittal cervical spine.
VOLUME 4, 2016 1
Heatmap regression studied by [16]–[23] is one intensity value of the pixels in this region more
of the most successful and widely used methods accurately than the BG region but not as accurate
among the proposed solutions for landmark point as the CF region. Accordingly, the model can learn
detection tasks. In heatmapbased regression, for the Gaussian distribution of heatmap channels.
each landmark point, we generate a channel (a Although almost all (to the best of our
two-dimensional matrix) using a Gaussian knowledge) of the previous work in landmark point
distribution centered at the location of the detection considered landmark localization as a
landmark point. We can model heatmap-based regression task, our proposed loss function
regression as F(X) = Y , where X ∈ RW×H×3 is an input implicitly considers such task as classification as
colored-image with the width and height of W and well. We propose the categorical loss, called CLoss,
H, respectively. Likewise, Y ∈ RWhm×Hhm×k is the which utilizes the cross-entropy (CE) loss to guide
predicted heatmap which is a kdimensional tensor the model to classify the pixels as CF, BG, or FG.
where the width, height and number of channels More specifically, for each pixel that belongs to FG,
are Whm, Hhm and k (the number of landmark and BG regions, CLoss guides the model to learn
point), respectively. Finding F(), a function that the region they belong to. Consequently, to guide
maps the input image to the set of heatmaps is the the model to focus on predicting the intensity
goal of the landmark localization task. In other value of the pixels located in the CF region more
words, the regression task is to predict the value of accurately, CLoss stops penalizing the model as the
each pixel in each heatmap channel. model correctly classifies the pixels in the BG and
L2 loss is widely used in heatmap-based FG regions. Consequently, CLoss guides the model
regression [19], [22], [24]–[26]. However, as L2 loss to learn the proposed regions rather than the
is insensitive to small errors, it is not the most intensity value of the pixels. After learning the
suitable loss function for heatmapbased regression distribution of the defined regions, the loss
landmark point detection. Thus, we propose a function only penalizes the model for accurately
novel loss function, called ILoss, in which for each predicting the intensity value of the pixels located
heatmap channel we define 3 different regions in the CF region. Such characteristics prevent the
(see Fig. 3): 1- CoreForeground (CF) region, which network to put a huge effort into predicting the
contains the pixels with the highest intensity exact intensity value of the pixels located in the BG
values. This region is considered as the most and FG regions, which can result in a more
crucial region since any inaccuracy in the accurate prediction of the intensity values of the
prediction of the intensity values of the pixels in pixels belonging to the CF region.
this region, will likely result in inaccurately The contributions of our approach are
predicted coordinates. 2- Background (BG) region, summarized as follows:
which is the region containing the pixels with the • We propose ILoss, which spans each
smallest intensity values. Regressing the intensity heatmap channel to the BG, CF and FG
value of the pixels in this region can be taken as a regions based on the intensity value of the
less-important task for the network. In other pixels.
words, since the coordinates of the facial landmark • We propose CLoss, which consider the
points heavily rely on the pixels that are located in landmark localization task as a classification
the CF region, finding the exact intensity value of problem and guides the model to classify the
the pixels in the BG region does not result in a pixels as CF, BG, or FG.
more accurate landmark localization task. 3- • We propose IC-Loss as a combination of ILoss
Foreground (FG) region, a region which can be and
considered as a boundary between the BG and CF
CLoss for localizing the sagittal cervical spine
regions. We penalize the model to predict the
landmark point more accurately.
2 VOLUME 4, 2016
publication.
10.1109 Citation information:IEEE
/ACCESS.2022.3180028, DOI Access
Fard et al. : Sagittal Cervical Spine Landmark Point Detection in X-Ray Using Deep Convolutional Neural Networks
• We propose a new network architecture, automated cephalometric landmark localization

called PoseNet, and train it using our method under the framework of GAN to learn the
proposed IC-Loss. mapping from features to the distance map of a
The remainder of this paper is organized as specific target landmark. Zeng et al. [35] proposed
follows. Sec. II reviews the related work in a cascaded three-stage CNN for localization of
landmark localization. Sec. III describes our cephalometric landmark point. A contextguided
proposed loss function and how it improves the CNN was proposed by Zhang et al. [36] for joint
accuracy of heatmap-based regression for the landmark localization and segmentation of
sagittal cervical spine landmark point detection craniomaxillofacial bone. Zhong et al. [37]
task. Sec. IV-D provides the experimental results proposed a Coarse-to-Fine CNN heatmap-based
and analysis. Then, in Sec.IV-E we provide an regression method that is capable of localizing
extensive ablation study to investigate the effect Adolescent idiopathic scoliosis assessment from
of each component of our proposed loss function. X-Ray CT images. Ma et al. [38] proposed Loc-Net
Finally, Sec. V concludes the paper with some for fast and memory-efficient 3D landmark
discussions on the proposed method and future detection and applied it to computed tomography
research directions. angiography scans of the sagittal cervical spine to
localize the bifurcation of the left and right carotid
II. RELATED WORK arteries. Chen et al. [39] proposed Adaptive Error
To the best of our knowledge, this is the first work Correction Net (AEC-Net) to estimate the Cobb
on sagittal cervical spine landmark point angles from spinal X-rays as a high-precision
detection in X-Ray images. However, face regression. Most of the mentioned work has
alignment as well as body joint tracking tasks, can focused on designing or modifying CNN to make
be taken as similar to sagittal cervical spine the models better fit the corresponding
landmark point detection. landmark point application or make the prediction accuracy
detection has a wide variety of applications both better. We proposed our model architecture,
in medical and non-medical image processing. PoseNet, inspired by UNet [28]. Although there
Accordingly, we first review the application of might be more accurate CNN for landmark
landmark points on medical images and then go localization task such as Stacked hourglass [40] or
through the non-medical applications including HRNet [41] models, we choose to design PoseNet
facial alignment and body joint tracking. using the idea behind UNet [28] to keep the
model efficient in terms of memory and CPU.
A. LANDMARK POINT DETECTION IN MEDICAL On Contrary, the following work mostly
CONTEXT proposed a custom methodology or an algorithm
In this section, we review some of the previously to improve the landmark point detection accuracy.
proposed landmark localization methods in the To improve the accuracy of the 3D cephalometric
medical image processing context. landmark point, Huang et al. [42] proposed a
Lee et al. [27] is among the first who applied sigmoid-based intensity transform that uses the
CNN to cephalometric landmark detection, by nonlinear optical property of X-Ray films to
training 38 independent CNN structures to increase image contrast. HeadLocNet [43] was
regress 19 landmark points. Likewise, a 2-stage proposed for localization and classification of inner
UNet [28] framework proposed by Zhong et al. ear images which can be used to estimate a point-
[29] for automatic detection of anatomical based registration with the atlas image. Urschler et
landmarks in cephalometric X-Ray images. A CNN al. [44] proposed a random-forest that uses a
architecture proposed by Qian et al. [30] inspired combination of image appearance information and
by Faster R-CNN [31] to improve the accuracy of geometric landmark configuration for anatomical
cephalometric landmarks localization. Using landmark localization. Likewise, Urschler et al. [44]
Bayesian CNN [32], Lee et al. [33] proposed a proposed a heatmap-based regression CNN for
framework to localize cephalometric landmarks anatomical landmark localization, to split the
while providing a confidence region of the localization task into two simpler sub-problems to
identified landmarks considering model reduce the need for large training datasets.
uncertainty. Dai et al. [34] proposed a new Poltaretskyi et al. [45] employed a statistical shape
model to predict the premorbid anatomy of the
VOLUME 4, 2016 3
9 Access
proximal humerus. A two-stage algorithm [59] introduced Wingloss, a new loss function that
composed of a rigid image-to-image and a is capable of overcoming the widely used L2 loss
deformable surface-to-image registration is in conjunction with a strong data augmentation
proposed by Brehler et al. [46] for detecting a set method as well as a pose-based data balancing
of landmarks in the knee joint. Negrillo et al. [47] (PDB). To ease the parts variations and regresses
proposed a geometrically-based algorithm to the coordinates of different parts, TwoStage Re-
detect the landmarks of the humerus for further initialization Deep Regression MODEL (TSR) [60]
analysis of a reduction of supracondylar fractures. splits the face into several parts. Zhang et al. [61]
Likewise to the mentioned research, we choose to proposed Exemplar-based Cascaded Stacked
design a new loss function that can cope well with Auto-Encoder Network (ECSAN) for face
the challenges of landmark localization task. In alignment, which is utilized to handle partial
addition, in Sec.III we discuss the weakness of the occlusion in the image. To cope with self-
widely used L1, and L2 loss and try to design the occlusions and large face rotations, Valle et al.
IC-Loss to perform the sagittal cervical spine [62] proposed a face alignment algorithm based
localization task more accurately. on a coarse-to-fine cascade of ensembles of
regression trees, which is initialized by robustly
B. LANDMARK POINT DETECTION fitting a 3D face model to the probability maps
In general, there are three main methods for produced by a pre-trained convolutional neural
landmark point detection. We have classified and network (CNN). Fard et al. [63] proposed a
reviewed the previously proposed methods in the lightweight multi-task network for jointly
following. detecting facial landmark points as well as the
Classical models (aka template-based methods) estimation of face pose. In another work, a
are among the first methods that are designed for student-teacher network was proposed by Fard et
landmark point detection specifically face al. [64] for extracting facial landmark points using
alignment. Active Shape Model (ASM) [48] and more accurately using lightweight CNN models.
Active Appearance Model (AAM) [49], [50] which More recently, ACR Loss [65] proposed a loss
utilize Principal Components Analysis (PCA) to function to estimate the level of difficulty in
simplify the problem and learn parametric predicting each landmark point for the network
features of faces to model facial landmarks and hence improve the accuracy of the face
variations in an iterative manner. Further, Martins localization task.
et al. [50] proposed a 2.5D AAM that combines a In general, the Coordinate-based method is the
3D metric Point Distribution Model (PDM) with a most efficient model of landmark localization since
2D appearance model to match a 3D deformable the output of the model is the location of the
face model to 2D images. In addition, Cristinacce landmark points, and no postprocessing is
and Cootes [51] introduced the Constrained Local required. However, the accuracy of this approach
Model (CLM) which models the face shapes with is not very high as we lose much spatial
Procrustes analysis and principal component information.
analysis. Although CLM models and their various Heatmap-based regression models are another
extensions including [52]–[55], are among the widely used method for landmark point detection
most promising methods for face alignment, they tasks. First, the likelihood heatmaps for each
are sensitive to occlusion as well as illumination landmark point are created, and then the network
when detecting landmarks in unseen datasets. is trained to generate the heatmaps for each input
Robust Cascade Pose Regression (RCPR) [56] was image. A two-part network proposed by Yang et al.
introduced to detect occlusions explicitly and use [24], including a supervised transformation to
robust shape-indexed features. Another normalize faces and a stacked hourglass network
computationally lightweight method was Local [40], is designed to predict heatmaps. LAB [16]
Binary Features (LBF) [57]. proposed by Wu, expressed that the facial
Coordinate-based regression models predict boundary line contains valuable information.
the landmark coordinates vector from the input Hence, they utilized boundary lines as the
image directly. Mnemonic Descent Method geometric structure of a face to help facial
(MDM) [58] has utilized a recurrent convolutional landmark detection. Furthermore, for a better
network to detect facial landmarks. Feng et al. initialization to Ensemble of Regression Trees (ERT)
4 VOLUME 4, 2016
publication.
regressor, Valle et al. [66] proposed a simple CNN background pixels it is not applicable to heatmap-
to generate heatmaps of landmark locations. As based regression landmark point localization. To
well as that, JMFA [67] leveraged a stacked cope with this issue, Wang et al. [21] proposed
hourglass network for multi-view face alignment, Adaptive Wing Loss, a loss function that can adapt
and it achieved state-of-the-art accuracy and its curvature to different ground truth pixel values.
demonstrated more accuracy than the best three Consequently, it can be sensitive to small errors on
entries of the Menpo Challenge [68]. Moreover, foreground pixels while it is tolerant to small errors
HRNet being introduced by Sun et al. [22] is a high- on background pixels.
resolution network that is applicable in many Contrary to the previously proposed loss
Computer Vision tasks such as facial landmark functions, our proposed IC-Loss is designed to
detection and achieves a reasonable accuracy. firstly spans each heatmap channel to the BG, CF
Besides that, to deal with shape variations in facial and FG considering the intensity value of each
landmark detection, Iranmanesh et al. [20] pixel and penalizing the model accordingly. More
proposed an approach that provides a robust clearly, IC-Loss penalizes the model for accurately
algorithm that aggregates a set of manipulated predicting the intensity value of the pixels located
images to capture robust landmark representation. in the CF region, while it can tolerate more errors
Gaussian heatmap vectors proposed by Xiong et al. when it comes to the pixels located in either of
[19] to be used instead of the widely used the BG or the FG regions.
heatmap for facial landmark point detection. More
recently, [69] proposed a two-paired cascade III. METHODOLOGY
subnetwork to generate heatmaps and accordingly In this section, we first introduce our proposed
the coordinates of the facial landmark point to network architecture, PoseNet. Next, we illustrate
deal with the more challenging faces. how to generate the heatmap attention map.
Since the accuracy of the heatmap-based Then, we explain ILoss, CLoss, and the proposed
landmark localization is more than the Coordinate- IC-Loss.
based method, we decided to follow the former
approach. A. POSENET ARCHITECTURE
Custom Loss function is a remarkable approach Inspired by UNet [28], we propose an Encoder-
for improving the performance of both regression- Decoder network called PoseNet for the sagittal
based and classification-based [70] models. cervical spine landmark point localization task. As
Similarly, Custom loss functions play an important Fig. 1 shows, PoseNet contains 4 main modules
role in improving the accuracy of landmark including Head, Down-Sampling, Keep, and Up-
localization. However, there is only a few research Sampling.
that has proposed and studied the effect of using a The Head module contains 3 sets of 2-
taskrelated loss function in this context. Fard et al. dimensional convolution (Conv2D) layers followed
[63] have proposed ASMNet, a lightweight CNN as by a ReLU and a batchnormalization layer (see Fig.
well as ASMLoss which is designed to guide the 2). Each Conv2D layer has a 3 × 3 kernel size, 64
network toward learning a smoother version of the filters and stride as 2. We design the Head
facial landmark point. KD-Loss proposed by Fard et module to extract the feature from the input
al. [64] is a loss function designed to use the image and downscale it to of the size of the
knowledge gained by two teacher networks to original input image.
improve the accuracy of a student network for face As Fig. 2 shows, each Down-Sampling module
alignment task. GoDP [71] proposed a distance- has 2 sets of 3 × 3 Conv2D layers having the same
aware softmax loss to assign a large penalty on number of filters and stride equals 1, followed by
incorrectly classified positive samples. Then, they a ReLU and a batch-normalization layer. We
gradually reduce the penalty on the misclassified define 2 outputs for the Down-Sampling module
negative samples as the distance from nearby called Skip and Out. Skip is a skipconnection used
positive samples decreases. The Wing loss [59] is a to pass the information and features to the
modified log loss for coordinate-based face corresponding Up-Sampling module. Besides,
alignment which is designed to amplify the Down is followed by a MaxPooling layer (with
influence of small errors. However, since Wing loss pool_size=2), which is used to downscale the
[59] is very sensitive to small errors in the input by the scale of 2. In the DownSampling
VOLUME 4, 2016 5
9 Access
modules, we multiply the number of the filter of stride=1 followed by a ReLU and a batch-
the Conv2D layers by a factor of 2 as we normalization layer. Then, a 2 × 2 DeConv2D with
downscale the image. The first Down-Sampling stride=2 is used to upscale the input by the factor
module has 64 filters, the second, third, and of 2. The number of filters we use for both Conv2D
fourth have 128, 256, and 512 filters, respectively. layers and the DeConv2D are the same. While the
In the Keep module, we only use a 3×3 Conv2D first Up-Sampling module has 512 filters, we
layer with 1024 filters and stride as 1, followed by decrease the filters by a factor of 2, so the second,
a ReLU and a batchnormalization layer. Then, we third, and fourth have 256, 128, and 64 filters
have a 2 × 2 Deconvolution (DeConv2D) layer respectively.
having 512 filters with stride equals to
FIGURE 1. The architecture of the PoseNet.
Down- Up-
Head Sampling Sampling Keep B. HEATMAP ATTENTION MAP
Conv2D Conv2D We define a heatmap attention map based on the
ReLU ReLU ReLU ReLU intensity value of the ground-truth heatmap
batch-norm batch-norm batch-norm batch-norm
channels. Heatmap attention maps define the
Conv2D Conv2D Conv2D DeConv2D importance of each pixel in the ground-truth
ReLU ReLU ReLU ReLU heatmap. For an input image Img, we define
batch-norm batch-norm batch-norm batch-norm
H
Max HWhm×Hhm×N and ′
W hm×Hhm×N as the corresponding
Conv2D Pooling DeConv2D
ReLU ReLU Core
batch-norm Down Skip batch-norm Region
Conv2D
Intensity Value
FIGURE 2. The Head, Down-Sampling, Up-Sampling, and Keep Foreground

modules. Region
2 followed by a ReLU and a batch-normalization Bakcground

Region
layer (see Fig. 2). Since the input is in its smallest
scale at this module, we use a Conv2D layer with
1024 filters to extract more information from the Heatmap Channel
image.
FIGURE 3. We divide the Heatmap with respect to the intensity value of
The Up-Sampling modules are designed to firstly the pixels into 3 different regions.
extract information from the input, and then scale
up the inputs to further generate the heatmaps.
Thus, as Fig. 2 represents, in each Up-Sampling ground-truth and the predicted heatmaps,
module we use 2 sets of 3 × 3 Conv2D with respectively. The Whm, Hhm, and N are the width,
6 VOLUME 4, 2016
publication.
the height and the number of the channels of the more focus on predicting the accurate intensity
heatmaps respectively. For each heatmap value of the pixels which are important for the
channel, we define the BG, FG and CF regions landmark point detection task.
based on the intensity value of the corresponding
pixels. As Fig 3 shows, a pixel pij belongs to BG if C. ILOSS
its intensity value Iij in [0,θBG]. Likewise, pij belongs As Fig.3 shows (also discussed in Sec.I), predicting
to FG if Iij in (θBG,θFG], and to CF in case Iij in the accurate intensity value of the pixels which
(θFG,1]. Accordingly, we define wij, the attention belong to the CF region can heavily affect the final
for pj according accuracy of the landmark localization task.
to Eq. 1: Accordingly, we propose an intensity aware loss
1 ∀pij ∈ BG ∀pij ∈ function called ILoss, which is designed to penalize
 the model based on the intensity value of each
FG
wj = pixel in the ground truth heatmap. Our proposed
ωFG ∀pij ∈ CF ILoss consists of 3 different loss functions which
ωCF
are designed with respect to the fact that the
where ωFG, and ωCF are the hyper parameters that
pixels in the CF, and FG, and BG regions contribute
define the magnitude of the attention map for the
to the accuracy of the final landmark localization
BG, FG and CF regions, respectively.
task differently, and accordingly, ILoss tends to
We define each heatmap channel using 2
penalize the model with 3 different loss functions.
dimensional Gaussian function using Eq. 2:
Before introducing each part of the ILoss, we
define 2 different functions called ∆ and IM()
respectively. Firstly, we define ∆m,i,j,k using Eq. 4:
(2)
where P = (px,py) is the location of the landmark ∆m,i,j,k = |Im,i,j,k − Im,i,j,k′ |
point and σ is the standard deviation of the
distribution. Following the typical practices, we where m ∈ M (M is the number of the images in
define σ = 1.5. Since predicting the intensity value the training set) is the index of the mth item in the
of the pixels belonging to CF region accurately is training set, and I and I′ indicate the ground truth
very crucial for generating the coordinates of the and the predicted intensity value of the pixel
landmark point, we define this region to be smaller located at i ∈ Hhm, j ∈ Whm and k ∈ N heatmap
compared to the other regions. Thus, we respectively. As the Eq.4 shows, ∆ is a function
empirically set θFG = 0.9, which means the pixels that shows the difference between the intensity
with intensity values between 0.9 and 1 belong to value of the corresponding pixels in the predicted
this region. Likewise, we set θBG = 0.4. and the ground truth heatmap.
Considering the ratio of the pixels exists in each Moreover, we define the Intensity Map function,
of the CF, FG, regions compared to the pixels in the IM(), using Eq. 5:
BG region in each heatmap channel H, we set the
values of ωCF, and ωFG, respectively, using Eq. 3:
if: Im,i,j,k ∈ [lb,hb)

&
(5)
∗
where X indicates the number of pixels that exist otherwise
in the X region. To clarify, since the number of
where lb and hb are the minimum and the
pixels that exist in the BG region is much greater
maximum intensity values. In other words, if the
than the pixels in CF, and FG, the model should
intensity value of the ground truth pixel located
take a huge effort on predicting the intensity value
at m,i,j,k position is between lb and hb, the output
of the pixels in the former region, while predicting
of the function is 1, and otherwise, its output is 0.
those values accurately does not pay a crucial role
We use IM() to define ILoss for each of the
in the final accuracy of the landmark point
proposed regions.
detection task. Thus, we introduce the heatmap
attention map which guides the network to put
VOLUME 4, 2016 7
9 Access
1) ILoss for the BG Region make the influence of the loss function very low
For the pixels belonging to the BG region, there is while the prediction is considered as good
no need for a highly accurate prediction of the enough. Moreover, we define ϕBG = 0.5 which
A: BG Region B: FG Region C: CF Region
Loss Function Gradient of the Loss Function
FIGURE 4. The loss function and the gradient of the loss in different regions.
intensity values. means for each pixel in the BG region, the
Thus, we define a threshold and consider the prediction is accurate if the error is less than 0.5.
prediction as accurate enough if the prediction As Fig. 4-A shows, we define LossBG such that
error is smaller than the threshold. The intensity the influence of the loss decreases as the
value of each pixel in a heatmap channel H can magnitude of the loss decreases. This
vary between 0, and 1, so we consider the characteristic of LossBG guides the model to put
prediction as accurate enough if ∆ ≤ ϕBG. The more focus on the prediction of the intensity
influence of the loss function should be high if the value of the pixels belonging to the FG, and CF
prediction error is more than the defined regions as soon as the model predict the intensity
threshold, ϕBG, while it should drop dramatically value of the pixels in BG accurately enough.
as soon as the prediction error becomes smaller
than the threshold. Having said that, ILoss can be 2) ILoss for the FG Region
a quadratic (y = αx2) function in the BG region. Compared to the BG region, it is much more
Consequently, we define LossBG as a pieces-wise important that the model to predict the intensity
function using Equations 6 and 7: value of the pixels belonging to the FG region
if: ∆m,i,j,k > ϕBG since learning the Gaussian distribution of a
LBG heatmap channel can further help the model to
1 otherwise predict the intensity value of the pixels in the CF
(6) region. The same as the proposed loss function
for the BG region, we define the error prediction
LossBG = threshold as ∆ ≤ ϕFG. The influence of the loss
LBGm,i,j,k function should be high if the prediction error is
greater than the threshold, so we define the loss
as a quadratic function for this condition. Then,
when the prediction error is smaller than the
M N Whm Hhm threshold, we define the loss function as a linear
where C1 is a constant used to smoothly connect 2 function. Contrary to LossBG, the gradient of the
pieces of LossBG together, and α is a linear part of LossFG is a constant value, and
hyperparameter that can set the magnitude and accordingly it penalizes the model independently
the influence of the loss function. We empirically from the magnitude of the prediction error. This
define α = 0.5 (and accordingly which characteristic of LossFG guides the model to
predict the intensity value of the pixels belonging
8 VOLUME 4, 2016
publication.
to CF more accurately. we define LossFG as a has a negative consequence on the accuracy of

pieces-wise function using Equations 8 and 9: the other regions. Thus, we define a threshold
and call it ϕ2_CF and consider the prediction as
LFGm,i,j,kif: ∆m,i,j,k > ϕFG accurate enough if the prediction error is smaller
than this threshold. We use a linear loss function
otherwise
(y = x) if the prediction error is smaller than ϕ2_CF.
(8)
Since the gradient of the linear loss is a constant
LossFG = number (y′ = 1), the influence of the loss is
independent to the magnitude of the loss
LFGm,i,j,k function and thus the loss function keeps
penalizing the model to improve the accuracy. We
define LossCF as a piece-wise function using
Equations 10 and 11:
where C2 = is a constant used to smoothly connects
2 pieces of Loss FG together. As Fig. 4-B shows, the if:
influence of the quadratic part of Loss FG is related ∆m,i,j,k
to the magnitude of the loss function which guides >ϕ1_CF
the model to quickly learn the distribution of the LCFm,i,j,k =if: ϕ2_CF < ∆m,i,j,k ≤ϕ1_CF
heatmap values and predict the intensity value of otherwise
the pixels such that they can be considered as (10)
accurate enough. Then, the linear part of Loss FG,
having a constant influence value, penalizes the LossCF =
model to improve the prediction accuracy
independently of the magnitude of the loss LCFm,i,j,k
function. Following the value of ϕBG, we define ϕFG M N Whm Hhm
as (11)
0.5.
3) ILoss for the CF Region

The accuracy of the landmark point detection task
has a significant reliance on how accurately the
model predicts the intensity value of the pixels in
the CF region. Similarly to LossBG, LossFG, a
quadratic loss function (y = x2) is a suitable option
for huge errors since the influence of the loss
depends on the magnitude of the loss, and
accordingly a quadratic loss function can penalize
the model dramatically for huge errors. For the
small prediction errors where ∆ ≤ ϕ1_CF, the
influence of the quadratic loss function reduces
dramatically, and hence the model gets penalized
much less. We introduce a logarithmic loss
function (y = Ln(1 + x)) where the influence of the
loss function increases as the magnitude of the
loss decreases. In other words, the logarithmic loss
can force the model to put a huge effort into
accurately predicting the intensity value of the
pixels in the CF region.
Although the logarithmic loss function can
guide the model towards learning the very
accurate intensity values of the pixels in the CF
region, putting too much effort into this region
VOLUME 4, 2016 9
9 Access
LC LCE LCF
FIGURE 5. Examples of the X-Ray images from the dataset.

where C3 = ϕ2_CF − βLn(1 + ϕ2_CF) and C4 = βLn(1 + of each pixel located in the BG and FG regions.
are Then, we use CE loss function and define CLoss for
each pixel using Equations 13, and 14:
constants
used to smoothly connect pieces of Loss CF CEm,i,j,k = −[L(Im,i,j,k) Log(Im,i,j,k)+
together. We introduce the hyper parameter β to ′ ′
adapt the curvature of the logarithmic piece. Fig.
4-C shows, different values of β, and ϕ2_CF change
the influence and the magnitude of the loss
function. The accurate prediction of the intensity
value of the pixels belonging to the CF region L(Im,i,j,k)
affects the final landmark point detection accuracy Log(Im,i,j,k)] if:
dramatically. Thus, we conduct experiments on
different values of β, and ϕ2_CF and accordingly (14)
define β = 4, and ϕ2_CF = 0.05 (see Sec. IV-E). 1 if: otherwise
To explain, for any pixel Pm,i,j,k with the ground truth
D. CLOSS
and predicted intensity values Im,i,j,k and Im,i,j,k′
We classify any arbitrary pixel Pm,i,j,k located at i ∈
Hhm, j ∈ Whm and k ∈ N to be in one of the BG, FG, respectively, we first use CEm,i,j,k to figure out if the
or CF model has classified the pixel correctly or not.
regions. Hence, we can consider the sagittal Then, we introduce Ωm,i,j,k as a binary weight, which
cervical spine landmark localization task as a
is 0 if the classification is correct and is 1
classification problem too, where we penalize the
model to learn the region each pixels belongs to. otherwise.
Since predicting the very accurate intensity values Then, we define categorical intensity aware loss
of the pixels belonging to the BG region, the FG function for pixels belonging to the BG and FG
region do not affect the final landmark point regions using Equations 15, and 16:
detection accuracy, we can stop penalizing the
model as soon as the model classify each pixels CLossBG =
correctly. We define 2 classes called CBG = 0 and CFG LBGm,i,j,k
= 1 and accordingly label any pixel located in BG Ωm,i,j,k
and FG as CBG and CFG respectively. We define L
operator as:
M N Whm Hhm
0 if: V ∈ [0,θBG) (15)
(12)
1 if: V ∈ [θBG,θFG)
CLossFG =
where V is the intensity value of the input pixel.
LFGm,i,j,k
We introduce L operator to define the class label
Ωm,i,j,k
10 VOLUME 4, 2016
publication.
sets, a training set (which is used for the training

M N Whm Hhm purpose), and a test set (which is used to evaluate
(16) our proposed method). We create our test set by
randomly selecting 10% of the total number of LC,
The proposed CLossBG and CLossFG are designed 5% of LCE, and 5% of LCF images. Then, we use the
to stop penalizing the model to predict the rest of the images as the training set. Table 1
intensity of a pixel as soon as the model correctly shows the detail of our generated training and test
classifies it. Consequently, the model puts more set. Moreover, in Table 2, Table 3, and Table 4 we
effort into predicting the intensity value of the provide a detailed information about the year
pixels located at CF more accurately. We show in range of X-Rays, the gender and the age of the
Sec.IV-E that using the proposed categorical subjects respectively.
CLossBG and CLossFG compared to their base To obtain the dataset of X-Rays, 5 chiropractic
models LossBG and LossFG improves the accuracy of clinics were included that are met the basic criteria
the final landmark point detection task. of being certified in a chiropractic technique called
Chiropractic BioPhysics. This was done to make
E. PROPOSED LOSS FUNCTION sure the digitization of anatomical landmarks was
As Eq.17 shows we define our proposed loss consistent for the dataset. The offices then signed
function as the sum of the loss functions and agreed to data sharing and business associate
proposed for the BG, FG, and CF regions and call it agreements with the company PostureCo, Inc. to
Intensity-aware Categorical Loss, IC-Loss. be compliant with HIPAA guidelines. All X-Ray data
was anonymized and selected from the years
IC-Loss = CLossBG + CLossFG + LossCF 2008-2021. From the dataset, the patient
population broke down to males making up 3.72%
Our proposed loss function is designed to guide of lateral cervical extension (LCE) X-Rays, 3.56% of
the model focus on predicting the intensity values lateral cervical flexion X-Rays (LCF), and 31.23% of
of the pixels belonging to the CF region more neutral lateral cervical radiographs (LC). Females
accurately compared to the other regions. We contributed 6.49% LCE, 5.96% LCF, and 49.03% LC.
show in Sec.IV-D that the proposed loss function Collectively 80.26% of the XRays were LC, with
can perform landmark point detection task more 10.22% LCE and 9.52% LCF respectively. Age
accurately compared to the standard loss demographics revealed that age 5-20 comprised
functions such as L2 and L1 loss. 11.16% of the X-Rays included with the 20-40yrs
IV. EXPERIMENTAL RESULTS
age bracket making up 36.66% and >40 yrs
In this section, we first provide a detailed comprising the remaining
description of our dataset. Afterward, we explain 52.18%.
the implementation detail of our proposed
method. Then, we illustrate the evaluation metrics
which are used to assess the performance of our
proposed method for the sagittal cervical spine TABLE 1. Total number of X-Ray images in the training and test set.
landmark detection task. Finally, we provide our
LC LCE LCF Total
results.
Dataset 19,599 2,495 2,325 24,419
A. DATASET Training Set 17,639 2,371 2,209 22,219
Test Set 1,960 124 116 2,200
Our dataset contains 24,419 sagittal cervical spine
X-Ray images that are labeled by expert humans.
As Fig. 5 depicts, each sagittal cervical spine image
in the dataset can be categorized to 3 different
types with respect to its pose including normal
(LC), extension (LCE), and flexion (LCF). As Table 1
shows, the number of LC images (19,599) is by far
greater than the LCE (2,495) and LCF (2,325)
images. We split our dataset into 2 independent
VOLUME 4, 2016 11
9 Access
TABLE 4. X-Ray age per view type. All numbers are in % format.
Age LC LCE LCF Total

<20 8.69 1.28 1.20 11.16
20-40 28.83 4.02 3.80 36.66
>40 42.74 4.92 4.52 52.18
Input Image
Equalized Gray-scale Inverted
white-foreground white-background
FIGURE 6. An example of a white-foreground and a white-background X-

Ray image.
1) Preprocessing
Since the sagittal cervical spine X-Ray images in
our dataset have been captured with different
devices and technologies, there exists 2 different
types of X-Rays in the dataset: whiteforeground
images and white-background (see Fig. 6), while
we consider the bones as the foreground and
anything else as the background. We introduce
Original Flipped
the white-foreground X-Rays as the images in
which the intensity value of the pixels in the FIGURE 7. To deal with the huge diversity of the images in the dataset,
foreground segments are greater than the other we proposed a novel preprocessing method. We stack the equalized, the
gray-scale, and the inverted version of the input image together with 2
segments, while in the white-background X-Rays, different orders and save both the image and the flipped version of it.
the intensity value of the pixels in the bone
segments is less than the other segments. As the
discrepancy in types of X-Ray images can impact channels. Then, we normalize the generated gray-
the accuracy of the model negatively, we propose scale image such that the intensity value of each
a preprocessing algorithm to deal with this issue pixel be ∈ [0,1] and call the output img gs. Next, to
and eventually improve the accuracy of the improve the quality of the input X-Ray image, we
model. use the histogram-equalization technique and
As Fig. 7 shows, we first convert each input X- create the equalized version of the image from
Ray image to a gray-scale image by calculating the imggs, and call it imgeq. After that, we create the
mean of the RGB inverted version of imggs and call it imginv.
We create the output image called imgout, a 3-
TABLE 2. X-Ray date range per view type. All numbers are in % format. channel image generated by stacking img gs, imgeq,
Range LC LCE LCF Total and imginv in two different orders to make the
2008-2010 4.46 0.97 0.97 6.40 model learn to deal with both white-foreground,
2010-2014 18.65 2.45 2.41 23.51 and white-background X-Ray images. In the first
2010-2018 29.91 3.28 2.90 36.08
order, we stack imggs, imgeq, and imginv. In the
2018-2021 27.24 3.52 3.24 34
second order, we flip the images horizontally and
TABLE 3. X-Ray gender per view type. All numbers are in % format. then stack imggs, imginv, and imgeq (See Fig. 7). For
the training sample, we keep both of the
Gender LC LCE LCF Total
generated images, while for the test set, we
Male 31.23 3.27 3.56 38.52
randomly select only one.
Female 49.03 6.49 5.96 61.48
12 VOLUME 4, 2016
publication.
Using this preprocessing method, the model will normalizing factor is selected as the distance
be capable of dealing with both white-foreground between the pupils of the human eyes (or the
and white-background X-Ray images. The number distance between the outer corner of the eyes),
of whitebackground X-Ray images is lower we choose d as the distance between the
compared to the whitebackground X-Ray images. landmark point number 0 and 4 (see Fig 11). In
Hence, training the model without using the addition, we calculate NME for each vertebra to
proposed preprocessing method, the accuracy of better analyze the accuracy of the model.
the sagittal cervical spine landmark point detection Furthermore, we calculate failure rate (FR),
on the white-background X-Ray images is very low. defined as the proportion of failed detected
sagittal cervical spines, for a maximum error of
B. IMPLEMENTATION DETAILS 0.08 and 0.1. Cumulative Errors Distribution (CED)
Since the width and the height of the input X-Ray curve and the areaunder-the-curve (AUC) [73] are
images are large, we crop each image with respect reported as well.
to the minimum and maximum coordination of the
landmarks points in both X and Y axes. Then, we D. RESULTS ANALYSIS
resize each X-Ray image to have width and height In this section, we first evaluate the accuracy of
of 128, and 256 respectively. We augmented the our proposed network architecture, PoseNet.
XRay images in terms of rotation (from -45 to 45 Then, we evaluate the accuracy of our proposed
degrees), and contrast, brightness and color loss function. Moreover, we study the
modification to add robustness to the model. performance of our proposed PoseNet.
Furthermore, since generating heatmaps
having the same width and height as the input 1) Evaluation of PoseNet
cropped images is memory and processor- Since this is the first work in the field, we first
consuming, we created our heatmaps to be four compare the accuracy of our PoseNet trained
times smaller than the cropped input images, with the widely used L2 loss with 2 very popular
having width and height equal to 32 and 64 baseline models, MobileNetV2 [74] called mnv2,
respectively. We used the Adam optimizer [72] and ResNet50 [75], called res50. Moreover,
TABLE 5. Comparison of the NME (in %), the FR (in %), and the AUC of
for training the networks while choosing the different models versus PoseNet. All the models are trained with L2 loss.
learning rate 10−2, β1 = 0.9, β2 = 0.999, and decay Model NME(↓) FR(↓) AUC(↑)
= 10−5. We then trained the networks for about
150 epochs with a batch size of 60. We LC LCE LCF LC LCE LCF LC LCE LCF
implemented our networks using the TensorFlow mnv2 6.47 7.35 11.04 7.49 9.16 41.96 0.6100 0.5270 0.2069
res50 6.12 6.88 10.75 7.02 8.41 39.95 0.6375 57.21 0.2192
library and ran them on an NVidia 1080Ti GPU.
mnv2-hm 5.60 6.42 10.56 3.84 6.66 36.60 0.6971 0.6227 0.2490
res50-hm 4.79 5.20 7.68 2.92 2.5 16.07 0.7375 0.7527 0.4862
C. EVALUATION METRICS PoseNet 4.75 5.21 7.48 2.77 3.33 9.82 0.7602 0.7385 0.5067
We follow landmark localization tasks and employ since PoseNet is a heatmap-based regression, for
normalized mean error (NME), in Eq.18, to better comparison we create 2 encoder-decoder
measure our model’s accuracy. models using MobileNetV2 [74], and ResNet50
[75] as the encoders respectively, and introduce 3
sets of DeConv2D layers, with filters=256,
kernel=3, and stride=2, followed by a ReLU and a
batch-normalization layer, as the decoder. We call
where m is the number of samples in the test set,
the former model mnv2-hm, and the latter model
and n is number of the landmark point, Pxij, Pyij
res50-hm.
are the predicted points (jth landmark point of the
As Table 5 shows, the accuracy of the sagittal
ith sample) while Gxij, Gyij are the corresponding
cervical spine landmark point detection using
Ground-truths. We also define d as the
heatmap-based regression models is much better
normalizing factor for presenting the results in a
than the coordinate-based regression models.
unified manner (the accuracy of the model will
Moreover, while the NME for the mnv2hm is
not be affected by the size of the image, the
5.60%, 6.42%, and 10.56% for the LC, LCE, and LCF
cropping margin, etc.). Inspired by previously
respectively, these values are reduced to 4.79%
proposed work in facial alignment where the
(about
VOLUME 4, 2016 13
9 Access
0.81% reduction), 5.20% (about 1.22% reduction), model and train the model using L1, L2, and IC-
and 7.68% (about 2.88% reduction) respectively Loss. As Fig 8 shows, the performance of the
using resn50-hm as the model. Using the PoseNet, landmark localization is much better when we
the NME reduces to 4.75%, 5.21%, and 7.48% train the model using IC-Loss compared to L1, and
indicating about 0.85%, 1.21%, 3.08% compared to L2 loss.
mnv2-hm for the LC, LCE, and LCF respectively. The
accuracy of PoseNet is better compared to E. ABLATION STUDY
resn50hm both in LC, and LCF subsets, while it is In this section, we first study the effect of the
slightly worse for the LCE subset. However, as hyperparameters defined in Equations 10 and 11.
Table 6 shows, the number of the model Then, we conduct experiments to measure how
parameters and the number of the floating-point positively the proposed CLoss can improve the
operations (FLOPs) of PoseNet is much smaller sagittal cervical spine landmark point detection
than that of resn50-hm. task.
2) Evaluation of IC-Loss 1) Studying the Effects of β and ϕ

In order to evaluate the accuracy of the proposed As mentioned in Sec. III, different values of the
loss, we train PoseNet with 3 different loss hyperparameters β and ϕ2CF in Equations 10 and
functions including L1, L2, and our proposed IC- 11 can affect the accuracy of sagittal cervical
Loss function. As Table 7 shows, the NME of the spine landmark point detection task. Accordingly,
model trained using L2 loss is 4.75%, 5.21%, and we conduct 2 independent experiments to study
7.48% for the LC, LCE, and LCF subsets the effect of each hyperparameter. Needless to
respectively. The model performs slightly better say, finding the best combination of these 2
being trained using L1 loss and NME reduces to hyperparameters is almost impossible.
4.69%, 5.20%, and 7.25% for the LC, LCE, and LCF In the first experiment, we assign 5 different
subsets respectively. The lowest NME is achieved values to β and calculate the NME of the
when we train the model using our proposed landmark point detection task. As Fig 9 shows
ICLoss where the NME reduces to 4.38%, 4.76%, choosing β = 1 results NME to be 4.82%, 5.37%
and 6.50% for LC, LCE, and LCF subsets and 7.62% for LC, LCE, and LCF subsets
respectively. respectively. These values are almost the same
As we discussed in Sec. III, the influence of L2 choosing β = 2. Then, choosing β = 3 reduces the
loss depends on the magnitude of the error, and NME more and we achieve 4.66%, 5.19%, and
consequently, it is very sensitive to large errors 7.19% for LC, LCE, and LCF subsets respectively.
while it is very insensitive TABLE 6. Comparison of the We followed the trend and increase β to be 4, and
number of the model parameters and FLOPs.
achieved the least NME. However, choosing β = 5
Model # parameters #FLOPs increases the NME again, and accordingly, we
mnv2-hm 6,398,045 683,025,408 choose the hyperparameter β = 4 in our
res50-hm 29,497,245 3,066,262,528 experiments.
PoseNet 23,226,269 596,375,552 In the second experiment where we define β =
TABLE 7. Comparison of the NME (in %), the FR (in %), and the AUC of 4, we assign 3 different values to the
PoseNet trained using different loss functions.
hyperparameter ϕ2CF and calculate the NME of the
Model NME(↓) FR(↓) AUC(↑)
landmark point detection task. As Fig 10 shows,
the value of ϕ2CF affects the accuracy of the model
LC LCE LCF LC LCE LCF LC LCE
L2 4.75 5.21 7.48 2.77 3.33 9.82 0.7602 0.7385 very slightly and the lowest NME is achieved by
L1 4.69 5.20 7.25 2.61 3.33 12.50 0.7654 0.7395 choosing ϕ2CF = 0.05.
IC-Loss 4.38 4.76 6.50 2.51 3.33 6.25 0.7882 0.7760
to small errors. Conversely, the influence of L1 2) Studying the Effects of the Categorical CLoss To
loss is a constant value. Our provided better understand the effect of our proposed
experiments in Table 7 also shows empirically Categorical Loss, we conducts 2 different
that L1 loss performs better than L2 loss. experiments. In the first experiment, called IC-
Since NME and FR are sensitive to outliers, we LossNo-CAT, we train the model without using the
depict the CED curve of the sagittal cervical spine
landmark point detection using PoseNet as the
14 VOLUME 4, 2016
publication.
Categorical Loss. In the second experiment, the accurate values for the pixels in CF are vital for
called the ultimate goal of the model which is the
All LC LCE LCF

FIGURE 8. CED curve generated using PoseNet as the model and IC-Loss, L1, and L2 loss functions.
NME (%)
FIGURE 9. The NME (in %) of the PoseNet using different values for the localization of the sagittal cervical spine landmark
hyper parameter β.
point.
TABLE 8. Studying the effect of using the proposed categorical loss on

the NME (in %), the FR (in %), and the AUC of PoseNet.
Model NME(↓) FR(↓) AUC(↑)
LC LCE LCF LC LCE LCF LC LCE

IC-LossNo-CAT 4.75 5.31 7.40 2.71 4.16 11.60 0.7599 0.7290
IC-LossAll-CAT 8.18 9.96 13.96 16.77 32.50 75.89 0.4547 0.2807
IC-Loss 4.38 4.76 6.50 2.51 3.33 6.25 0.7882 0.7760
IC-LossAll-CAT,t we apply the CLoss to all the 3

NME (%)
regions, BG, FG and CF.

As Table 8 shows, removing the proposed CLoss
from ICLoss increases the value of NME about
0.37% (from 4.38% to 4.75%), 0.55% (from 4.76%
to 5.31%), and 0.90% (from 6.50% to 7.40%) for LC,
LCE, and LCF subsets respectively. In addition,
using IC-LossAll-CAT, where we use the CLoss for all
the regions devastate the accuracy of the model
and NME increases dramatically by about 3.8%
(from 4.38% to 8.18%), 5.2% (from 4.76% to
9.96%), and 7.46% (from 6.50% to 13.96%) for LC, FIGURE 10. The NME (in %) of the PoseNet using different values for the
hyper parameter ϕ2CF .
LCE, and LCF subsets respectively. As discussed in
Sec III, using CLoss for the pixels belonging to the
CF region stops penalizing the model as soon as
the model classifies the pixels correctly. However,
VOLUME 4, 2016 15
9 Access
V. CONCLUSION landmark point localization task. Moreover, both

PoseNet and IC-Los are capable of being used in
In this paper, we proposed PoseNet, a CNN similar landmark localization tasks too.
inspired by UNet [28] for detecting sagittal REFERENCES
cervical spine landmark points in X-Ray images. [1] D. D. Harrison, S. J. Troyanovich, D. E. Harrison, T. J. Janik, and
We discussed that the widely used L1 and L2 loss D. J. Murphy, “A normal sagittal spinal configuration: a
desirable clinical outcome.” Journal of manipulative and
functions are not the best options for physiological therapeutics, vol. 19, no. 6, pp. 398–405, 1996.
heatmapbased regression landmark point [2] D. D. Harrison, T. J. Janik, S. J. Troyanovich, D. E. Harrison, and
detection. Consequently, we proposed IC-Loss, an C. J. Colloca, “Evaluation of the assumptions used to derive an
intensity aware categorical-based loss function. ideal normal cervical spine model.” Journal of manipulative
and physiological therapeutics, vol. 20, no. 4, pp. 246–256,
IC-Loss set the magnitude of the loss and its 1997.
influence according to our defined BG, FG and CF [3] D. D. Harrison, T. J. Janik, S. J. Troyanovich, and B. Holland,
regions. In addition, our novel IC-Loss simplifies “Comparisons of lordotic cervical spine curvatures to a
theoretical ideal model of the static sagittal cervical spine,”
the regression task for the pixels located in BG, Spine, vol. 21, no. 6, pp. 667–675, 1996.
FG to a classification task, which improves the [4] D. D. Harrison, D. E. Harrison, T. J. Janik, R. Cailliet, J. R.
accuracy of the sagittal cervical spine landmark Ferrantelli, J. W. Haas, and B. Holland, “Modeling of the
point detection task. Our proposed experiments sagittal cervical spine as a method to discriminate
hypolordosis: results of elliptical and circular modeling in 72
show that PoseNet trained using IC-Loss can asymptomatic subjects, 52 acute neck pain subjects, and 70
perform a very accurate sagittal cervical spine chronic neck pain subjects,” Spine, vol. 29, no. 22, pp. 2485–
2492, 2004.
Ground-Truth Predicted Ground-Truth Predicted Ground-Truth Predicted
LC
LCE
LCF
16 VOLUME 4, 2016
FIGURE 11. Examples of the sagittal cervical spine landmark point detection using PoseNet trained with IC-Loss.
publication.
[5] H. Y. Seong, M. K. Lee, S. R. Jeon, S. W. Roh, S. C. Rhim, and J. Conference on Applications of Computer Vision, 2020, pp.
H. Park, “Prognostic factor analysis for management of 330–340.
chronic neck pain: can we predict the severity of neck pain [21] X. Wang, L. Bo, and L. Fuxin, “Adaptive wing loss for robust
with lateral cervical curvature?” Journal of Korean face alignment via heatmap regression,” in Proceedings of the
Neurosurgical Society, vol. 60, no. 4, p. 456, 2017. IEEE International Conference on Computer Vision, 2019, pp.
[6] K. Han, C. Lu, J. Li, G.-Z. Xiong, B. Wang, G.-H. Lv, and Y.-W. 6971–6981.
Deng, “Surgical treatment of cervical kyphosis,” European [22] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X.
spine journal, vol. 20, no. 4, pp. 523–536, 2011. Wang, W. Liu, and J. Wang, “High-resolution representations
[7] C. Fernandez-de Las-Penas, C. Alonso-Blanco, M. Cuadrado, for labeling pixels and regions,” arXiv preprint
and J. Pareja, “Forward head posture and neck mobility in arXiv:1904.04514, 2019.
chronic tensiontype headache: a blinded, controlled study,” [23] J. Su, Z. Wang, C. Liao, and H. Ling, “Efficient and accurate face
Cephalalgia, vol. 26, no. 3, pp. 314–319, 2006. alignment by global regression and cascaded local
[8] A. Nagasawa, T. Sakakibara, and A. Takahashi, refinement,” in Proceedings of the IEEE Conference on
“Roentgenographic findings of the cervical spine in tension- Computer Vision and Pattern Recognition Workshops, 2019,
type headache,” Headache: The Journal of Head and Face pp. 0–0.
Pain, vol. 33, no. 2, pp. 90–95, 1993. [24] J. Yang, Q. Liu, and K. Zhang, “Stacked hourglass network for
[9] M. M. BRAAF and S. ROSNER, “Trauma of cervical spine as robust facial landmark localisation,” in Proceedings of the IEEE
cause of chronic headache,” Journal of Trauma and Acute Conference on Computer Vision and Pattern Recognition
Care Surgery, vol. 15, no. 5, pp. 441–446, 1975. Workshops, 2017, pp. 79–87.
[10] H. Vernon, I. Steiman, and C. Hagino, “Cervicogenic [25] Z. Ruan, C. Zou, L. Wu, G. Wu, and L. Wang, “Sadrnet: Self-
dysfunction in muscle contraction headache and migraine: a aligned dual face regression networks for robust 3d dense
descriptive study.” Journal of manipulative and physiological face alignment and reconstruction,” IEEE Transactions on
therapeutics, vol. 15, no. 7, pp. 418– 429, 1992. Image Processing, 2021.
[11] G. N. Ferracini, T. C. Chaves, F. Dach, D. Bevilaqua-Grossi, C. [26] Z. Shao, Z. Liu, J. Cai, and L. Ma, “Jaa-net: joint facial action
Fernándezde Las-Peñas, and J. G. Speciali, “Analysis of the unit detection and face alignment via adaptive attention,”
cranio-cervical curvatures in subjects with migraine with and International Journal of Computer Vision, vol. 129, no. 2, pp.
without neck pain,” Physiotherapy, vol. 103, no. 4, pp. 392– 321–340, 2021.
399, 2017. [27] H. Lee, M. Park, and J. Kim, “Cephalometric landmark
[12] T. J. Buell, A. L. Buchholz, J. C. Quinn, C. I. Shaffrey, and J. S. detection in dental x-ray images using convolutional neural
Smith, “Importance of sagittal alignment of the cervical spine networks,” in Medical Imaging 2017: Computer-Aided
in the management of degenerative cervical myelopathy,” Diagnosis, vol. 10134. International Society for Optics and
Neurosurgery Clinics of North America, vol. 29, no. 1, pp. 69– Photonics, 2017, p. 101341W.
82, 2018. [28] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional
[13] M. F. Shamji, C. P. Ames, J. S. Smith, J. M. Rhee, J. R. networks for biomedical image segmentation,” in
Chapman, and M. G. Fehlings, “Myelopathy and spinal International Conference on Medical image computing and
deformity: relevance of spinal alignment in planning surgical computer-assisted intervention. Springer, 2015, pp. 234–241.
intervention for degenerative cervical myelopathy,” Spine, [29] Z. Zhong, J. Li, Z. Zhang, Z. Jiao, and X. Gao, “An attention-
vol. 38, no. 22S, pp. S147–S148, 2013. guided deep regression model for landmark detection in
[14] I. M. Moustafa, A. Youssef, A. Ahbouch, M. Tamim, and D. E. cephalograms,” in International Conference on Medical Image
Harrison, “Is forward head posture relevant to autonomic Computing and Computer-Assisted Intervention. Springer,
nervous system function and cervical sensorimotor control? 2019, pp. 540–548.
cross sectional study,” Gait & posture, vol. 77, pp. 29–35, [30] J. Qian, M. Cheng, Y. Tao, J. Lin, and H. Lin, “Cephanet: An
2020. improved faster r-cnn for cephalometric landmark detection,”
[15] P. A. Oakley, J. M. Cuttler, and D. E. Harrison, “X-ray imaging is in 2019 IEEE 16th International Symposium on Biomedical
essential for contemporary chiropractic and manual therapy Imaging (ISBI 2019). IEEE, 2019, pp. 868–871.
spinal rehabilitation: radiography increases benefits and [31] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards
reduces risks,” Dose-Response, vol. 16, no. 2, p. real-time object detection with region proposal networks,”
1559325818781437, 2018. IEEE transactions on pattern analysis and machine
[16] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, “Look intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
at boundary: A boundary-aware face alignment algorithm,” in [32] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl,
Proceedings of the IEEE Conference on Computer Vision and “Leveraging uncertainty information from deep neural
Pattern Recognition, 2018, pp. 2129–2138. networks for disease detection,” Scientific reports, vol. 7, no.
[17] Z. Tong and J. Zhou, “Face alignment using two-stage 1, pp. 1–14, 2017.
cascaded pose regression and mirror error correction,” [33] J.-H. Lee, H.-J. Yu, M.-j. Kim, J.-W. Kim, and J. Choi,
Pattern Recognition, p. 107866, 2021. “Automated cephalometric landmark detection with
[18] Y. Ge, J. Zhang, C. Peng, M. Chen, J. Xie, and D. Yang, “Deep confidence regions using bayesian convolutional neural
shape constrained network for robust face alignment,” networks,” BMC oral health, vol. 20, no. 1, pp. 1–10, 2020.
Pattern Recognition Letters, vol. 138, pp. 587–593, 2020. [34] X. Dai, H. Zhao, T. Liu, D. Cao, and L. Xie, “Locating anatomical
[19] Y. Xiong, Z. Zhou, Y. Dou, and Z. Su, “Gaussian vector: An landmarks on 2d lateral cephalograms through adversarial
efficient solution for facial landmark detection,” in encoder-decoder networks,” IEEE Access, vol. 7, pp. 132738–
Proceedings of the Asian Conference on Computer Vision, 132747, 2019.
2020. [35] M. Zeng, Z. Yan, S. Liu, Y. Zhou, and L. Qiu, “Cascaded
[20] S. M. Iranmanesh, A. Dabouei, S. Soleymani, H. Kazemi, and N. convolutional networks for automatic cephalometric
Nasrabadi, “Robust facial landmark detection via aggregation landmark detection,” Medical Image Analysis, vol. 68, p.
on geometrically manipulated faces,” in The IEEE Winter 101904, 2021.
VOLUME 4, 2016 17
9 Access
[36] J. Zhang, M. Liu, L. Wang, S. Chen, P. Yuan, J. Li, S. G.-F. Shen, [52] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Robust
Z. Tang, K.-C. Chen, J. J. Xia et al., “Context-guided fully discriminative response map fitting with constrained local
convolutional networks for joint craniomaxillofacial bone models,” in Proceedings of the IEEE conference on computer
segmentation and landmark digitization,” Medical image vision and pattern recognition, 2013, pp.
analysis, vol. 60, p. 101621, 2020. 3444–3451.
[37] Z. Zhong, J. Li, Z. Zhang, Z. Jiao, and X. Gao, “A coarse-to-fine [53] T. Baltrušaitis, P. Robinson, and L.-P. Morency, “3d
deep heatmap regression method for adolescent idiopathic constrained local model for rigid and non-rigid facial
scoliosis assessment,” in International Workshop and tracking,” in 2012 IEEE conference on computer vision and
Challenge on Computational Methods and Clinical pattern recognition. IEEE, 2012, pp. 2610–2617.
Applications for Spine Imaging. Springer, 2019, pp. 101–106. [54] J. M. Saragih, S. Lucey, and J. F. Cohn, “Deformable model
[38] T. Ma, A. Gupta, and M. R. Sabuncu, “Volumetric landmark fitting by regularized landmark mean-shift,” International
detection with a multi-scale shift equivariant neural network,” journal of computer vision, vol. 91, no. 2, pp. 200–215, 2011.
in 2020 IEEE 17th International Symposium on Biomedical [55] Y. Wang, S. Lucey, and J. F. Cohn, “Enforcing convexity for
Imaging (ISBI). IEEE, 2020, pp. 981–985. improved alignment with constrained local models,” in 2008
[39] B. Chen, Q. Xu, L. Wang, S. Leung, J. Chung, and S. Li, “An IEEE Conference on Computer Vision and Pattern Recognition.
automated and accurate spine curve analysis system,” IEEE IEEE, 2008, pp. 1–8.
Access, vol. 7, pp. 124596–124605, 2019. [56] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face
[40] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks landmark estimation under occlusion,” in Proceedings of the
for human pose estimation,” in European conference on IEEE International Conference on Computer Vision, 2013, pp.
computer vision. Springer, 2016, pp. 483–499. 1513–1520.
[41] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. [57] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps
Mu, M. Tan, X. Wang et al., “Deep high-resolution via regressing local binary features,” in Proceedings of the
representation learning for visual recognition,” IEEE IEEE Conference on Computer Vision and Pattern Recognition,
transactions on pattern analysis and machine intelligence, vol. 2014, pp. 1685–1692.
43, no. 10, pp. 3349–3364, 2020. [58] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S.
[42] Y. Huang, F. Fan, C. Syben, P. Roser, L. Mill, and A. Maier, Zafeiriou, “Mnemonic descent method: A recurrent process
“Cephalogram synthesis and landmark detection in dental applied for end-to-end face alignment,” in Proceedings of the
cone-beam ct systems,” Medical Image Analysis, vol. 70, p. IEEE Conference on Computer Vision and Pattern Recognition,
102028, 2021. 2016, pp. 4177–4187.
[43] D. Zhang, J. Wang, J. H. Noble, and B. M. Dawant, [59] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, “Wing
“Headlocnet: loss for robust facial landmark localisation with convolutional
Deep convolutional neural networks for accurate neural networks,” in Proceedings of the IEEE Conference on
classification and multilandmark localization of head cts,” Computer Vision and Pattern Recognition, 2018, pp. 2235–
Medical image analysis, vol. 61, p. 101659, 2020. 2245.
[44] M. Urschler, T. Ebner, and D. Štern, “Integrating geometric [60] J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou, “A deep
configuration and appearance information into a unified regression architecture with two-stage re-initialization for
framework for anatomical landmark localization,” Medical high performance facial landmark detection,” in Proceedings
image analysis, vol. 43, pp. 23–36, 2018. of the IEEE Conference on Computer Vision and Pattern
[45] S. Poltaretskyi, J. Chaoui, M. Mayya, C. Hamitouche, M. Recognition, 2017, pp. 3317–3326.
Bercik, P. Boileau, and G. Walch, “Prediction of the pre- [61] J. Zhang and H. Hu, “Exemplar-based cascaded stacked auto-
morbid 3d anatomy of the proximal humerus based on encoder networks for robust face alignment,” Computer
statistical shape modelling,” The bone & joint journal, vol. 99, Vision and Image Understanding, vol. 171, pp. 95–103, 2018.
no. 7, pp. 927–933, 2017. [62] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “Face
[46] M. Brehler, G. Thawait, J. Kaplan, J. Ramsay, M. J. Tanaka, S. alignment using a 3d deeply-initialized ensemble of regression
Demehri, J. H. Siewerdsen, and W. Zbijewski, “Atlas-based trees,” Computer Vision and Image Understanding, vol. 189,
algorithm for automatic anatomical measurements in the p. 102846, 2019.
knee,” Journal of Medical Imaging, vol. 6, no. 2, p. 026002, [63] A. P. Fard, H. Abdollahi, and M. Mahoor, “Asmnet: A
2019. lightweight deep neural network for face alignment and pose
[47] J. Negrillo-Cárdenas, J.-R. Jiménez-Pérez, H. Cañada-Oya, F. R. estimation,” in Proceedings of the IEEE/CVF Conference on
Feito, and A. D. Delgado-Martínez, “Automatic detection of Computer Vision and Pattern Recognition, 2021, pp. 1521–
landmarks for the analysis of a reduction of supracondylar 1530.
fractures of the humerus,” Medical Image Analysis, vol. 64, p. [64] A. P. Fard and M. H. Mahoor, “Facial landmark points
101729, 2020. detection using knowledge distillation-based neural
[48] T. Cootes, E. Baldock, and J. Graham, “An introduction to networks,” Computer Vision and Image Understanding, p.
active shape models,” Image processing and analysis, pp. 103316, 2021.
223–248, 2000. [65] A. P. Fard and M. Mahoor, “Acr loss: Adaptive coordinate-
[49] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active based regression loss for face alignment,” arXiv preprint
appearance models,” in European conference on computer arXiv:2203.15835, 2022.
vision. Springer, 1998, pp. 484–498. [66] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “A
[50] P. Martins, R. Caseiro, and J. Batista, “Generative face deeplyinitialized coarse-to-fine ensemble of regression trees
alignment through 2.5 d active appearance models,” for face alignment,” in Proceedings of the European
Computer Vision and Image Understanding, vol. 117, no. 3, Conference on Computer Vision (ECCV), 2018, pp. 585–601.
pp. 250–268, 2013. [67] J. Deng, G. Trigeorgis, Y. Zhou, and S. Zafeiriou, “Joint multi-
[51] D. Cristinacce and T. F. Cootes, “Feature detection and view face alignment in the wild,” IEEE Transactions on Image
tracking with constrained local models.” in Bmvc, vol. 1, no. 2. Processing, vol. 28, no. 7, pp. 3636–3648, 2019.
Citeseer, 2006, p. 3.
18 VOLUME 4, 2016
publication.
[68] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen, ANNE-LISE DUPUIS received her
“The menpo facial landmark localisation challenge: A step Bachelor of Science degree in
towards the solution,” in Proceedings of the IEEE Conference Computing Science from the
on Computer Vision and Pattern Recognition Workshops, University of Alberta in 1980. She was
2017, pp. 170–179. awarded a substantial scholarship
[69] S. Mahpod, R. Das, E. Maiorana, Y. Keller, and P. Campisi, from the National Research Council
“Facial landmarks localization using cascaded neural of Canada in 1980 towards a master’s
networks,” Computer Vision and Image Understanding, p. degree which she declined after
103171, 2021. working as a computer consultant for
[70] A. P. Fard and M. H. Mahoor, “Ad-corre: Adaptive correlation- large firms. Having worked in various scientific, process
based loss for facial expression recognition in the wild,” IEEE control, business, and computer environments as a Project
Access, vol. 10, pp. 26756–26768, 2022. leader, Database Administrator, Systems Analyst,
[71] Y. Wu, S. K. Shah, and I. A. Kakadiaris, “Godp: Globally and Programmer, Anne-Lise has a rich background to draw
optimized dual pathway deep network architecture for facial upon. Currently, she resides as the lead programming
landmark localization in-thewild,” Image and Vision architect for application development at PostureCo.
Computing, vol. 73, pp. 1–16, 2018.
[72] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” arXiv preprint arXiv:1412.6980, 2014.
[73] H. Yang, X. Jia, C. C. Loy, and P. Robinson, “An empirical study
of recent face alignment methods,” arXiv preprint
arXiv:1511.05049, 2015. MOHAMMAD H. MAHOOR received
[74] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, the MS degree in Biomedical
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Engineering from Sharif University of
Proceedings of the IEEE Conference on Computer Vision and Technology, Iran, in 1998, and the
Pattern Recognition, 2018, pp. 4510–4520. Ph.D. degree in Electrical and
[75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for Computer Engineering from the
image recognition,” in Proceedings of the IEEE conference on University of Miami, Florida, in 2007.
computer vision and pattern recognition, 2016, pp. 770–778. Currently, he is a professor of
ALI POURRAMEZAN FARD received Electrical and Computer Engineering
the MSc degree in Computer at the University of Denver. He does
Engineering - from Iran University of research in the area of computer vision and machine learning
Science and Technology, Tehran, Iran, including visual object recognition, object tracking, affective
in 2015. He is currently pursuing his computing,
Ph.D. degree in Electrical & Computer and human-robot interaction (HRI) such as humanoid social
engineering and is a graduate robots for interaction and intervention of children with autism
teaching assistant in the Department and older adults with depression and dementia. He has
of Electrical and Computer received over $7M in research funding from state and federal
Engineering at the University of agencies including the National Science Foundation and the
Denver. His research interests include Computer Vision, National Institute of Health. He is a Senior Member of IEEE
Machine Learning, and Deep and has published over 158 conference and journal papers.
Neural Networks, especially in face alignment,
and facial expression analysis.
JOSEPH R. FERRANTELLI graduated

with honors in 1995 from Florida
State University with a Bachelor of
Science in Biology. In 1999 Dr.
Ferrantelli graduated with honors
from Life University College of
Chiropractic with his Doctor of
Chiropractic (D.C.) degree. In 2010 he
co-founded PostureCo, Inc, a
technology company focusing on Posture and Movement
assessment software as well as spinal x-ray biomechanical
mensuration
EMR products for healthcare
professionals. Since 1999, he continues to serve on the board
for CBP Non-Profit, a non-profit spinal research foundation
where he has co-authored manuscripts published in numerous
health care journals.
VOLUME 4, 2016 19

Sagittal Cervical Spine Landmark Point Detection in X-Ray Using Deep Convolutional Neural Networks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Sagittal Cervical Spine Landmark Point Detection in X-Ray Using Deep Convolutional Neural Networks

Uploaded by

Copyright:

Available Formats

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final

Sagittal Cervical Spine Landmark Point

Corresponding author: Ali Pourramezan Fard (e-mail: ali.pourramezanfard@du.edu) This

I. INTRODUCTION Typically, in a medical or chiropractic practice, X-Rays are

• We propose a new network architecture, automated cephalometric landmark localization

FIGURE 1. The architecture of the PoseNet.

FIGURE 2. The Head, Down-Sampling, Up-Sampling, and Keep Foreground

2 followed by a ReLU and a batch-normalization Bakcground

if: Im,i,j,k ∈ [lb,hb)

Loss Function Gradient of the Loss Function

to CF more accurately. we define LossFG as a has a negative consequence on the accuracy of

3) ILoss for the CF Region

FIGURE 5. Examples of the X-Ray images from the dataset.

sets, a training set (which is used for the training

Age LC LCE LCF Total

Equalized Gray-scale Inverted

FIGURE 6. An example of a white-foreground and a white-background X-

2) Evaluation of IC-Loss 1) Studying the Effects of β and ϕ

All LC LCE LCF

TABLE 8. Studying the effect of using the proposed categorical loss on

LC LCE LCF LC LCE LCF LC LCE

IC-LossAll-CAT,t we apply the CLoss to all the 3

regions, BG, FG and CF.

V. CONCLUSION landmark point localization task. Moreover, both

JOSEPH R. FERRANTELLI graduated

You might also like