
Bone Abnormality Classification using Deep Learning on Radiographic Images

Amalia Irma Nurwidya
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
amal006@brin.go.id

Rifki Firdaus
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
rifk005@brin.go.id

Riyanto
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
riya006@brin.go.id

Rony Febryarto
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
rony003@brin.go.id

Peni Laksmita Widati
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
peni004@brin.go.id

Faizurrahman `Allam Majid
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
faiz007@brin.go.id

Arky Astasari
Research Center for Electronics
National Research and Innovation Agency
Tangerang Selatan, Indonesia
arky001@brin.go.id

Miftahul Donny Sanjaya
Department of Electrical Engineering
Sumatera Institute of Technology
Lampung, Indonesia
miftahul.119130144@student.itera.ac.id

Afrias Sarotama
Research Center for Smart Mechatronics
National Research and Innovation Agency
Bandung, Indonesia
afri002@brin.go.id

ABSTRACT

Radiologists inspect bone abnormalities visually by examining X-ray samples. However, minor bone fractures are hard to detect manually. In recent years, intelligent fracture detection systems have been widely proposed to minimize the false detection rate. In this paper, we apply several deep learning models to detect and classify bone fractures: DenseNet201, DenseNet169, DenseNet121, ResNet, Inception, and VGG. We also use a Convolutional Neural Network (CNN) to classify bone types. The dataset used in this paper comes from the MURA standard dataset, which contains a large number of X-ray images of several types of bones that have been labelled positive (abnormal) or negative (normal). This paper aims to determine which deep learning model gives the best result in detecting bone abnormalities.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence → Computer Vision
• Computing methodologies → Machine learning → Machine learning approaches → Neural networks

KEYWORDS

Bone fracture, classification, CNN, Deep Learning.

ACM Reference format:

Amalia Irma Nurwidya, et al. 2022. Bone Abnormality Classification using Deep Learning on Radiographic Images. In The 2022 International Conference on Computer, Control, Informatics, and Its Applications (IC3INA 2022), November 22-23, 2022, Virtual/online conference, Indonesia. ACM, New York, NY, USA, 5 pages. https://doi.org/XXXXX.XXXXXX

1 Introduction

Interpreting a radiographic sample is an important task for radiologists to diagnose whether abnormalities are present or not. This process can be time-consuming as it is still done manually by visual examination. Moreover, the shortage of expert radiologists might lead to increasing workloads. Misdiagnosis often occurs in Emergency Departments, which can have serious consequences because of delays in treatment and resulting long-term disability [1].

Radiographic images can be difficult to interpret, yet even with the advancement of technology, this process is still done manually. In recent years, deep learning algorithms have become popular due to their capability to extract a higher number of features from unstructured data. High-performance computing also facilitates the use of several layers as a more advanced use of machine learning [2].

Detecting bone fractures has been attempted before by Dimililer [3], who developed the Intelligent Bone Fracture Detection System (IBFDS) to detect and classify bone fractures. IBFDS was carried out using two conventional stages of a 3-layer Back Propagation Neural Network (BPNN) with 1024 input neurons. They used 30 training images and 70 testing images.
IC3INA’22, November, 2022, Bandung, Indonesia A.I. Nurwidya et al.

The BPNN achieved an accuracy of 94.3%, with 66 of the 70 images of fractured bones detected.

Chung et al. [4] also demonstrated the high performance of a Deep Convolutional Neural Network (DCNN) in detecting and classifying proximal humerus fractures. They used a total of 1891 images, of which 1376 were fractured and 515 were normal. The training dataset had 1702 images, while the testing dataset had 189 images. The DCNN reached an accuracy of 96% with a sensitivity of 99% and a specificity of 97%.

Another Convolutional Neural Network approach was implemented by Yahalomi et al. [5], who used a machine vision neural network called Faster R-CNN to detect and classify wrist bone fractures. The dataset comprised 55 images of anteroposterior fractures, 40 normal images without anteroposterior fractures, and 25 additional images of other bone types, which served as the negative label. The pretraining dataset had 96 images and the training dataset had 24 images. Faster R-CNN achieved an accuracy of 96%.

This paper proposes and evaluates a bone abnormality detection system with a larger dataset to help radiologists minimize the risk of misdiagnosis and severe harm. Using a Convolutional Neural Network for bone type classification and six deep learning models for abnormality classification, we measure and compare the performance of each model to select the best one for the bone abnormality detection system.

2 Proposed Bone Abnormalities Classification System

The workflow in Figure 1 is the proposed bone abnormality classification system for radiographic images.

Figure 1: Bone abnormalities detection system workflow

In the first stage, the bone image input is received by the system. The machine learning model classifies the type of bone using a Convolutional Neural Network (CNN). Seven possible types of bone are classified: Humerus, Elbow, Wrist, Shoulder, Finger, Forearm and Hand. The predicted bone type determines which model processes the image further. Several machine learning models are used in our research, namely DenseNet (201, 169, 121), ResNet, Inception and VGG. Each architecture is trained as seven separate models, one for each type of bone.

The input image is processed by the models to determine whether an abnormality is present. The system then outputs the bone type and its abnormality status.

3 Bone Abnormalities Classification Method

3.1 Dataset

Our research used the MURA dataset, which is publicly available and was released for a deep learning competition held by the Machine Learning Group of Stanford University. This dataset has not been reviewed by authoritative sources or organizations and is used for research purposes only. MURA consists of 14863 studies from 12173 patients with a total of 40561 X-ray images with two labels, normal and abnormal. The types of bone images are humerus, hand, forearm, finger, wrist, shoulder, and elbow. The MURA dataset was used for two-phase classification: bone type classification and abnormality classification [6].

In addition to the MURA dataset, this study also used ImageNet pre-trained weights for each model. ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images [7].

3.2 Image Pre-processing

First, the images were labelled based on their abnormalities (positive for abnormal and negative for normal), so the output has two classes. Using the OpenCV library, the images were then read as RGB matrices and resized to 224 x 224 pixels. The pixel values were normalized by dividing them by 255.

The MURA dataset provides training and validation datasets, but we used its validation dataset as the test dataset instead. The training dataset was split randomly into training and validation sets with a ratio of 85% and 15% respectively.

3.3 Bone types classification

To classify the type of bone in the image before classifying its abnormalities, a standard Convolutional Neural Network was used. In this method, the image is passed through multiple convolutions to obtain its features.

A Convolutional Neural Network (CNN) is a neural network consisting of an input layer, an output layer, and multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers [8]. The convolutional layer computes the convolution of the input image with a kernel filter to get features of the input image [9]. The kernel filter is a matrix that contains constant parameters and is smaller than the input image. The pooling layer reduces the dimensionality of the extracted features of the input image while retaining the important information. The normalization layer (ReLU) changes all negative values of the filtered image to zero [8]. The convolution, pooling, and normalization layers are stacked, so the output of one layer becomes the input of the next layer, and this stack can be repeated. A fully connected layer connects each node in one layer to every node in the next layer. At the end of the network, a classifier determines the image classification. The CNN architecture is displayed in Figure 2.

Figure 2: CNN Architecture [10]
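The convolution, ReLU, and pooling steps described in Section 3.3 can be illustrated with a minimal pure-Python sketch. The 5x5 patch and the 2x2 kernel below are hypothetical values chosen for illustration; in a trained CNN the kernel values are learned, and the study itself used the TensorFlow Keras library rather than hand-written loops.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 5x5 grayscale patch with a bright vertical stripe.
raw = [[0, 0, 255, 255, 0] for _ in range(5)]
image = [[p / 255 for p in row] for row in raw]  # normalize pixel values to [0, 1]

# Hypothetical 2x2 kernel that responds to dark-to-bright transitions.
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)  # 4x4 feature map, smaller than the input
activated = relu(features)        # negative responses set to zero
pooled = max_pool(activated)      # 2x2 map after 2x2 max pooling
```

The kernel is smaller than the image, so the feature map shrinks from 5x5 to 4x4, and pooling reduces it further to 2x2 while keeping the strongest response in each window.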

3.4 Abnormalities classification

Bone images were classified using deep learning models: VGG-16, InceptionV3, ResNet50, and DenseNet. The base models were taken from the TensorFlow Keras library, with the input shape matching the pre-processed image, ImageNet pre-trained weights, and Rectified Linear Unit and Softmax as the activation functions. The model was compiled using the RMSprop optimizer with a learning rate of 0.0001, the accuracy metric, and Binary Cross Entropy as the loss. Training used a batch size of 256, 50 epochs, and a dropout rate of 0.4. The model outputs two classes, positive and negative, indicated with binary 1 and 0 respectively. The model training process is shown in the flowchart in Figure 3.

Figure 3: The model training process

3.4.1 VGG-16

VGG-16 is one of the most popular CNN models used for image classification. This model was proposed by Karen Simonyan and Andrew Zisserman in 2014. The VGG-16 model consists of 16 layers: 13 convolutional layers using a 3x3 kernel filter with a pooling layer at the end of each stack, 2 fully connected layers, and one classifier [11]. The architecture of the VGG-16 model used to classify bone fractures in this work is shown in Figure 4.

Figure 4: VGG-16 model architecture

3.4.2 Inception V3

InceptionV3 is the third version of the Inception model. The difference between Inception and a standard CNN is that the layers of the Inception model are wider. More than one kernel filter filters the input image while a pooling layer shrinks it in parallel. All of the outputs are concatenated and used as input for the next step [12]. This parallel process is called the “Inception Module”, as can be seen in Figure 5.

Figure 5: Inception Module

The InceptionV3 model uses three different inception modules. Each uses a combination of kernel filters of different sizes, and thus receptive fields of different sizes, to extract features at different scales; the extracted features are then spliced and merged [13]. This model uses label smoothing, factorized 7x7 convolutions, and an auxiliary classifier [14].

3.4.3 ResNet50

Residual Network (ResNet) is a CNN model with a residual building block (RBB) that helps in solving complicated tasks and increases detection accuracy [15]. This model was proposed by Kaiming He et al. in 2015. The idea of the RBB is to skip blocks of convolutional layers by using a shortcut connection. An RBB consists of several convolutional and batch normalization (BN) layers [16]. The RBB is shown in Figure 6.

Figure 6: Residual Building Block (RBB)

ResNet50 has 50 layers: 16 RBBs, each containing three convolutional and BN layers, two pooling layers, and one classifier (softmax).

3.4.4 DenseNet

DenseNet, or Dense Convolutional Network, was proposed by Huang et al. [17] in 2017. The idea of DenseNet is a hierarchical network structure where the input of the current layer is the feature maps from previous layers, so that every layer in the network is connected to the layers before it. Compared with other deep learning networks, DenseNet has relatively fewer parameters, deeper layers, and an ability to avoid overfitting, as there is a connection from the front layers to the final layer [18].
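The dense connectivity described for DenseNet can be made concrete by counting channels and layers. The sketch below is an illustration only: the growth rate of 32, the 64 initial channels, and the per-block layer counts are taken from the DenseNet paper [17] as assumptions, not from the configuration reported in this study, and the helper names are hypothetical.

```python
GROWTH_RATE = 32        # feature maps added by each dense layer (assumed, from [17])
INITIAL_CHANNELS = 64   # channels entering the first dense block (assumed, from [17])

# Number of dense layers in each of the four dense blocks (assumed, from [17]).
BLOCK_CONFIGS = {
    "DenseNet121": (6, 12, 24, 16),
    "DenseNet169": (6, 12, 32, 32),
    "DenseNet201": (6, 12, 48, 32),
}


def dense_block_input_channels(in_channels, num_layers, growth_rate=GROWTH_RATE):
    """Channels seen by each layer: the block input plus the outputs of all earlier layers."""
    return [in_channels + growth_rate * i for i in range(num_layers)]


def network_depth(block_config):
    """Count weighted layers: two convolutions per dense layer, one initial
    convolution, one convolution per transition between blocks, one classifier."""
    transitions = len(block_config) - 1
    return 2 * sum(block_config) + 1 + transitions + 1


first_block = dense_block_input_channels(INITIAL_CHANNELS, 6)
depths = {name: network_depth(cfg) for name, cfg in BLOCK_CONFIGS.items()}
```

Under these assumptions, the input to each layer of the first block grows by 32 channels per layer because earlier feature maps are concatenated rather than replaced, and the layer counts reproduce the 121, 169, and 201 in the model names.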

Figure 7: DenseNet Model

The DenseNet networks used are DenseNet121, DenseNet169, and DenseNet201. Each network contains four dense blocks. The difference between the three networks is the number of convolution layers in the dense blocks. Every convolution layer consists of Convolution, Batch Normalization, and ReLU activation.

4 Evaluation

4.1 Confusion Matrix

A confusion matrix is used to measure the performance of the trained model by comparing the number of correct and wrong classifications of the two classes, fractured and normal.
● True positives: the output and the ground truth show a 'fractured' class.
● True negatives: the output and the ground truth show 'normal'.
● False positives: the output shows 'fractured' while the ground truth labels otherwise.
● False negatives: the output shows 'normal' while the ground truth labels otherwise.
We aim for the lowest false negative rate, because a missed abnormality directly affects patient health.

4.2 Accuracy

Accuracy is the ratio of the correctly predicted images (the sum of true positives and true negatives) to the total number of images.

5 Result

All results of this model evaluation for bone fracture detection were produced with the TensorFlow 2.0 platform with the Keras API and Scikit-Learn under Python 3.

5.1 Dataset

Table 1. Datasets Distribution

Object     Total Data   Training   Validation   Testing
Humerus    1272         850        150          272
Elbow      4931         3825       675          431
Wrist      9739         8278       1461         656
Shoulder   8364         7109       1255         563
Finger     5106         4340       766          461
Forearm    1825         1551       247          301
Hand       5543         4711       832          460

For bone type classification, CNN training used 28050 images for training and 4950 for validation, while testing used 3808 images. The number of images used for bone abnormality classification for each bone type is given in Table 1. They are divided into positive and negative classes.

5.2 Testing Result

5.2.1 Bone Type Classification

The trained CNN model produced 4554 correct outputs for bone type classification on the validation set, achieving an accuracy of 92%. The testing dataset gave a similar accuracy of 92.1%.

5.2.2 Bone Abnormality Classification

Figure 8: Model testing results comparison

The testing accuracy of each model varied by bone type. The comparison is displayed in Figure 8. The hand type has the lowest average accuracy of all the bone types, with the highest accuracy of 66.96% from the DenseNet121 model and the lowest accuracy of 58% from Inception.

Other than for the hand bone type, the VGG model shows noticeably good performance in classifying bone abnormalities. The model achieves the highest accuracy on every other bone type, with the humerus type having the highest accuracy of 81.62%. The average accuracy of the VGG model across all bone types is 73.29%. This is notably higher than the other models, which average less than 70%.

All DenseNet networks and ResNet show relatively similar performance. They average around 65% across all bone types, but their peak accuracies occur on different bone types. In comparison to the other models, Inception always gives the fewest correct classifications across all bone types.
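The accuracy figures in this section follow from the confusion-matrix counts defined in Section 4. As a minimal sketch, the function below computes accuracy together with one common definition of the false negative and false positive rates. The counts passed to it are hypothetical and are not results from this study; the last line reproduces the bone type validation accuracy reported above, 4554 correct outputs from 4950 validation images.

```python
def evaluate(tp, tn, fp, fn):
    """Summarize binary classification counts (positive = abnormal)."""
    total = tp + tn + fp + fn
    return {
        # Share of all images that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of truly abnormal images that were predicted normal.
        "false_negative_rate": fn / (fn + tp),
        # Share of truly normal images that were predicted abnormal.
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=40, tn=35, fp=15, fn=10)

# Bone type validation accuracy from Section 5.2.1.
bone_type_validation_accuracy = 4554 / 4950
```

Accuracy alone can hide an imbalance between the two error types, which is why the false negative and false positive rates are reported separately here.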

The average accuracy the Inception model achieved across the bone types is only 50.5%.

Figure 9: Evaluation of the model with confusion matrices

Figure 9 shows the evaluation of the model with confusion matrices. Each matrix combines the abnormality classification results of all bone types, so that it shows the overall performance of the model. Judging from the matrix colors, all models perform similarly except for the Inception model. The Inception model has the most false positives, which we aim to avoid. From the previous accuracy result, we see that VGG shows the best result, but DenseNet169 has the fewest false positives. However, all the models still have over 20% false positives, so they still need to be improved.

6 Conclusion

The proposed bone abnormality classification system was able to produce output in the form of classification results for bone type and bone abnormality. The bone type classification using CNN produces a good result, while the bone abnormality classification shows varied results. The comparison of the bone abnormality classification models shows that the VGG model has the highest accuracy in almost all bone types, while the Inception model has the lowest accuracy in all bone types. However, all models still produce a high number of false positives and therefore need to be improved.

The datasets used in this research comprised different types of abnormalities. Therefore, in the future, more studies are necessary to evaluate the optimal integration of this model and other deep learning models in a clinical setting. Also, attention maps may need to be implemented so that the network can focus on the abnormal features.

ACKNOWLEDGMENTS

This study is a part of the Smart Direct Digital Radiography research at the Research Center for Electronics, National Research and Innovation Agency.

REFERENCES
[1] H. R. Guly, “Diagnostic errors in an accident and emergency department,” Emergency Medicine Journal, vol. 18, no. 4, pp. 263–269, Jul. 2001, doi: 10.1136/emj.18.4.263.
[2] A. Mathew, P. Amudha, and S. Sivakumari, “Deep Learning Techniques: An Overview,” 2021, pp. 599–608. doi: 10.1007/978-981-15-3383-9_54.
[3] K. Dimililer, “IBFDS: Intelligent bone fracture detection system,” Procedia Comput Sci, vol. 120, pp. 260–267, 2017, doi: 10.1016/j.procs.2017.11.237.
[4] S. W. Chung et al., “Automated detection and classification of the proximal humerus fracture by using deep learning algorithm,” Acta Orthop, vol. 89, no. 4, pp. 468–473, Jul. 2018, doi: 10.1080/17453674.2018.1453714.
[5] E. Yahalomi, M. Chernofsky, and M. Werman, “Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN,” Dec. 2018.
[6] P. Rajpurkar et al., “MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs,” Dec. 2017.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.
[8] M. Hussain, J. J. Bird, and D. R. Faria, “A Study on CNN Transfer Learning for Image Classification,” 2019, pp. 191–202. doi: 10.1007/978-3-319-97982-3_16.
[9] S. A. Singh, T. G. Meitei, and S. Majumder, “Short PCG classification based on deep learning,” in Deep Learning Techniques for Biomedical and Health Informatics, Elsevier, 2020, pp. 141–164. doi: 10.1016/B978-0-12-819061-6.00006-9.
[10] The MathWorks, Inc., “What is a Convolutional Neural Network? - MATLAB & Simulink,” Sep. 30, 2022.
[11] S. Tammina, “Transfer learning using VGG-16 with Deep Convolutional Neural Network for Classifying Images,” International Journal of Scientific and Research Publications (IJSRP), vol. 9, no. 10, p. p9420, Oct. 2019, doi: 10.29322/IJSRP.9.10.2019.p9420.
[12] C. Szegedy et al., “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1–9. doi: 10.1109/CVPR.2015.7298594.
[13] Z. Dongmei, W. Ke, G. Hongbo, W. Peng, W. Chao, and P. Shaofeng, “Classification and identification of citrus pests based on InceptionV3 convolutional neural network and migration learning,” in 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), Nov. 2020, pp. 1–7. doi: 10.1109/ITIA50152.2020.9312359.
[14] M. Tripathi, “Analysis of Convolutional Neural Network based Image Classification Techniques,” Journal of Innovative Image Processing, vol. 3, no. 2, pp. 100–117, Jun. 2021, doi: 10.36548/jiip.2021.2.003.
[15] I. Z. Mukti and D. Biswas, “Transfer Learning Based Plant Diseases Detection Using ResNet50,” in 2019 4th International Conference on Electrical Information and Communication Technology (EICT), Dec. 2019, pp. 1–6. doi: 10.1109/EICT48899.2019.9068805.
[16] L. Wen, X. Li, and L. Gao, “A transfer convolutional neural network for fault diagnosis based on ResNet-50,” Neural Comput Appl, vol. 32, no. 10, pp. 6111–6124, May 2020, doi: 10.1007/s00521-019-04097-w.
[17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2261–2269. doi: 10.1109/CVPR.2017.243.
[18] J. Zhang, C. Lu, X. Li, H.-J. Kim, and J. Wang, “A full convolutional network based on DenseNet for remote sensing scene classification,” Mathematical Biosciences and Engineering, vol. 16, no. 5, pp. 3345–3367, 2019, doi: 10.3934/mbe.2019167.
