
Computers in Biology and Medicine 157 (2023) 106683


Deep learning based classification of multi-label chest X-ray images via dual-weighted metric loss
Yufei Jin, Huijuan Lu *, Wenjie Zhu, Wanli Huo
College of Information Engineering, China Jiliang University, Hangzhou, China

Keywords: Deep metric learning, Multi-label learning, ConvNeXt, Chest X-ray

Abstract

Thoracic disease, like many other diseases, can lead to complications. Existing multi-label medical image learning problems typically include rich pathological information, such as images, attributes, and labels, which are crucial for supplementary clinical diagnosis. However, the majority of contemporary efforts exclusively focus on regression from input to binary labels, ignoring the relationship between visual features and semantic vectors of labels. In addition, there is an imbalance in data amount between diseases, which frequently causes intelligent diagnostic systems to make erroneous disease predictions. Therefore, we aim to improve the accuracy of the multi-label classification of chest X-ray images. Chest X-ray14 images were utilized as the multi-label dataset for the experiments in this study. By fine-tuning the ConvNeXt network, we obtained visual vectors, which we combined with semantic vectors encoded by BioBert to map the two different forms of features into a common metric space, and we made the semantic vectors the prototype of each class in the metric space. The metric relationship between images and labels is then considered from the image level and the disease category level, respectively, and a new dual-weighted metric loss function is proposed. Finally, the average AUC score achieved in the experiment reached 0.826, and our model outperformed the comparison models.

1. Introduction

Multi-label classification is frequently used in realistic scenes for intent detection [1], medical image recognition [2], and image labeling [3]. Chest X-ray (CXR) images typically contain one or more pathologies, making their classification a typical multi-label problem. Multi-label classification is more challenging than single-label classification because it must not only address the imbalance between positive and negative labels but also extract multiple object features from a single sample. Even for human observers, identifying and differentiating various thoracic anomalies on chest X-rays is a difficult process. With the introduction of artificial intelligence technology in medical imaging [4–6], computer vision technology for rapid reading and intelligent diagnosis of medical images can give physicians supplemental medical support.

CXR is a commonly used screening method and diagnostic tool in radiological investigations; it can be used to diagnose thoracic disorders such as nodules, cardiomegaly, and effusion [7]; and it is widely employed in clinical practice. Because CXR serves as an initial screening tool for disease examination, the large number of CXR images produced every day has resulted in a shortage of radiologists, which has made interpretation tasks challenging. By automatically analyzing pathological images, Computer Aided Diagnosis (CAD) systems can reduce the workload of radiologists and improve the accuracy of clinical diagnosis [8]. Mondol et al. [9] introduced a new computer-aided diagnosis system based on a deep convolutional neural network (DCNN) to assist physicians in identifying musculoskeletal abnormalities on X-rays. However, accurate disease diagnosis from chest X-ray images is challenging due to the complex relationships between different pathologies.

In terms of complex pathological characteristics, the majority of previous works [10,11] treated all pathologies equally. This means that all categories are given equal weight when predicting an image's label set. However, there are relationships between labels that are inherently concealed. For instance, pneumonia is also associated with a high risk of pulmonary effusion. Consequently, exploiting the correlations between the labels can improve the accuracy of model detection. Additionally, the proportion of various pathological characteristics varies between images. For instance, the pathological characteristics of cardiomegaly are significantly more extensive than those of pulmonary nodules. The features of the final few layers of deep neural networks

* Corresponding author.
E-mail addresses: s20030812005@cjlu.edu.cn (Y. Jin), hjlu@cjlu.edu.cn (H. Lu), zhwj@cjlu.edu.cn (W. Zhu), huowl@mail.ustc.edu.cn (W. Huo).

https://doi.org/10.1016/j.compbiomed.2023.106683
Received 2 June 2022; Received in revised form 17 October 2022; Accepted 6 November 2022
Available online 15 February 2023
0010-4825/© 2023 Published by Elsevier Ltd.

have receptive fields. If the receptive field is significantly larger than the size of the pathological feature, then diseases with inconspicuous features can be easily ignored [12]. In this regard, we augment the image feature extraction module with an additional 1 × 1 convolution layer and a global max pooling layer to amplify subtle features and receive responses from small objects, respectively.

In multi-label datasets, a long-tail distribution of data and an extreme imbalance between positive and negative samples are common occurrences. The multi-label Chest X-ray14 dataset we utilized has both of these problems. In addition, this dataset contains a small number of samples with incomplete or incorrect labels, resulting in uncertain prior information. We propose a deep metric learning framework to address the multi-label chest X-ray image disease classification task. First, the state-of-the-art pre-trained ConvNeXt [13] model is used as an image feature extractor to improve diagnostic accuracy with limited prior information. Second, we introduce disease label vectors to calculate the Euclidean distance between labels and images. We propose a dual-weighted metric loss. To address the problem of sample imbalance, it assigns different weights to different disease types. This loss takes two constraints into consideration. The first level is the image level, which requires related labels to be close together and unrelated labels to be far apart; the second level is the label level, which requires images of the same category to be close together. In this way, a nonlinear metric between multi-labels is learned while the correlation between labels is learned simultaneously. Experiments indicate that our model yields excellent performance.

Our main contributions are summarized as follows.

• We introduced a unique classification framework for multi-label chest X-ray images, which uses a semantic vector as the prototype of each class in metric space and eliminates the need to extract class features from multi-label images.
• A new dual-weighted metric loss function was proposed. The metric relationship between images and labels was evaluated at the level of the image and the disease category, respectively.
• We compared various experiments on the multi-label classification of the Chest X-ray14 dataset, and the experimental results demonstrated that our method outperforms the compared methods on average.

The remainder of the paper is structured as follows: Section 2 introduces the related work on the classification of chest X-ray images, the research status of multi-label learning, and multi-label classification based on metric learning. Our proposed model and the dual-weighted metric loss function are described in Section 3. Section 4 describes the experimental setup and results. Section 5 is a summary of the entire work.

2. Related work

2.1. Deep learning on chest X-ray images

Deep learning has made significant progress in medical image analysis, including classification [14,15], lesion segmentation [16], and image registration [17]. Convolutional Neural Networks are currently the most popular deep learning models and have been demonstrated to improve the accuracy and efficiency of medical diagnostic analysis. Berrimi et al. [18] proposed a deep learning model based on transfer learning to detect COVID-19. Both CXR and CT datasets were utilized. Ismael et al. [19] extracted features from COVID-19 and normal (healthy) chest X-ray images with a fine-tuned pre-trained convolutional neural network and classified them using a Support Vector Machine (SVM). Saberi-Movahed et al. [20] performed feature selection using several matrix factorization (MF) based methods and a random forest algorithm for COVID-19 prognostic classification. Sakib et al. [21] utilized Generative Adversarial Networks (GAN) to synthesize chest X-ray images with input from actual chest radiography into the DL-CRC framework to accurately differentiate COVID-19 cases from other abnormalities (e.g., pneumonia) and normal cases.

The single-label tasks of medical images have been partially resolved, whereas the multi-label classification problem dominates the majority of radiology tasks. Guan et al. [7] proposed a framework for Categorical Residual Attention Learning (CRAL) consisting of a CNN module and an attention mechanism module to enhance relevant features and suppress pathologically irrelevant features by assigning different weights. CRAL generates feature maps with the same number of categories for each image, which are then sent to their respective classifiers to obtain the results. When there are numerous sample types, it is not desirable to have a classifier for each category. Ei-fiky et al. [22] proposed a classifier for automatic multi-label classification of chest X-ray images based on transfer learning.

2.2. Multi-label learning

Each instance in multi-label learning corresponds to a set of labels. Problem transformation is one solution to the multi-label problem. The objective of Binary Relevance (BR) [23] is to convert the multi-label problem into multiple binary classification problems; however, this method not only ignores the correlation between labels, but may also suffer from positive and negative sample imbalance and poor generalization ability. Therefore, Yao et al. [24] utilized the Long Short-Term Memory (LSTM) network to predict disease types based on correlations between target labels. Algorithm Adaptation (AA), a way to modify an existing single-label algorithm and then apply it to multiple labels, is a second classic technique for solving the multi-label problem. K-Nearest Neighbor (KNN) [25], Decision Tree (DT) [26], Support Vector Machine [27], and Neural Network (NN) [28] are popular models for constructing AA.

In addition to the traditional machine learning methods mentioned above, deep learning has been applied to multi-label learning in recent works. Hassanin et al. [29] proposed a deep neural network based on ResNet-101 to learn the discriminative features of multi-label tasks, that is, to extract different category features from multi-label images and to distinguish the different classes by increasing the inter-class distance and decreasing the intra-class difference. In multi-label learning, exploiting the correlation between labels correctly is crucial. Wang et al. [30] proposed a new label correlation modeling method, which superimposed the knowledge graph onto the statistical graph to build the relation between labels. In addition, extreme learning machines enable new categorization algorithms for multiple labels. For multi-label classification, Rezaei-Ravari et al. [31] utilized a linear combination of base kernels in each layer of a multilayer kernel extreme learning machine. In another article [32], they employed two neural network architectures, Multi-Label Radial Basis Function (ML-RBF) and Multi-Label Multi-Layer Extreme Learning Machine (ML-ELM), to develop a feature flow-based model. Regularized multi-label learning methods based on shape learning and two-manifold learning have also been presented.

2.3. Multi-label classification based on metric learning

The objective of metric learning is to extract high-quality, high-level features from image pixels using neural networks. Traditional metric learning methods have been proposed for analyzing CT [33] and Magnetic Resonance Imaging (MRI) [34] images. Zhong et al. [35] created a novel CXR image retrieval model based on deep metric learning, proving that deep metric learning is highly effective for CXR image retrieval and diagnosis.

The correlation constraint of the sample-label category can improve the performance of the single-label classification algorithm, but this method is ineffective for the multi-label classification problem. Sun et al. [36] proposed a novel multi-label metric learning method that uses a


combined distance metric and semantic similarity in label space to generate component libraries and triplet constraints. Zhang et al. [37] discovered an optimal distance metric for each view by constructing multiple views and then used this metric to link all view features. A further distance metric was employed to discover the correlation between tags.

In short, the purpose of metric learning is to construct a new distance function d(x, y) that is closer to the true distance than the original function.

3. The proposed method

In this section, we introduce the details of the proposed model. ConvNeXt is utilized as a feature extractor, followed by the application of a dual-weighted metric loss to enhance model diagnostic performance. We begin by introducing the relevant notation. Then, we describe the overall framework of the model. Finally, the proposed loss function is introduced.

3.1. Notation

Assume that the dataset is denoted as $D = \{(x_i, y_i), i = 1, 2, \ldots, N\}$, where $N$ represents the total number of samples. The label $y_i$ comes from the set $Y = [y_1, y_2, \ldots, y_N]^T \in \{0, 1\}^{N \times l}$. For every sample, $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}^l$, where $l$ represents the number of categories. $y_{ij} = 1$ means that the $j$-th label is related to the $i$-th sample, and $y_{ij} = 0$ indicates that the label is irrelevant to the sample. $\chi = [x_1, x_2, \ldots, x_N]^T \in \mathbb{R}^{N \times d}$, where $x_i$ comes from $f(X_i; \theta)$, $f(\cdot)$ is $f_{cnn}$, and $\theta$ is the parameter of the CNN model. The set of label embeddings for images can be represented as $\gamma = [a_1, a_2, \ldots, a_l]^T \in \mathbb{R}^{l \times m}$, where $a_i$ comes from $g(A_i; \varphi)$, $g(\cdot)$ is $g_{bert}$, and $\varphi$ is the parameter of the BioBert [38] model. The goal of this experiment is to make chest disease predictions more accurate: $\mathbb{R}^d \rightarrow \lim_{pre \rightarrow true} \{0, 1\}^l$.

3.2. Architecture of model

The overall structure of the model is shown in Fig. 1, which consists of two branches, one for image and another for label.

Table 1
The network architecture of ConvNeXt-T. b represents the batch size and N indicates the number of categories.

Layer name     Kernel size                                     Output
stem           4 × 4, 96, stride 4                             (b, 96, 56, 56)
Conv_block1    [7 × 7, 96; 1 × 1, 384; 1 × 1, 96] × 3          (b, 96, 56, 56)
Conv_block2    [7 × 7, 192; 1 × 1, 768; 1 × 1, 192] × 3        (b, 192, 28, 28)
Conv_block3    [7 × 7, 384; 1 × 1, 1536; 1 × 1, 384] × 9       (b, 384, 14, 14)
Conv_block4    [7 × 7, 768; 1 × 1, 3072; 1 × 1, 768] × 3       (b, 768, 7, 7)
AVG            –                                               (b, 768)
FC             –                                               N

ConvNeXt is the latest proposed CNN architecture that exhibits outstanding performance on various visual tasks. In this paper, the pretrained ConvNeXt-T, which consists of four ConvNeXt Blocks of varying sizes, is used as a feature extractor. Table 1 displays the specific structure. A transition layer, which serves to downsample the feature map, is positioned between the different blocks. We fine-tuned the model by removing the final linear layer and adding a 1 × 1 convolution layer prior to the global average pooling layer. This is because, in a deep neural network, the features of the final few layers have a larger receptive field, and if the receptive field is much larger than the disease size, it is easy to ignore the features of small lesions. The 1 × 1 convolution enables learning the cross-channel correlations between feature maps. It can amplify subtle characteristics without compromising the expressiveness of the model, as well as reduce the number of parameters and computation time. We also added a global max pooling layer because the global average pooling layer will disregard responses from small objects when the resolution of the feature maps is high. The global max pooling layer, on the other hand, will select the maximum

Fig. 1. Multi-label chest disease diagnosis network model. The input image is first fed into the backbone convolutional neural network, with an additional 1× 1
convolution layer added after the final convolution layer. Next, image-level features are obtained by combining Global Max Pooling (GMP) and Global Average
Pooling (GAP). Based on these features, classification predictions are then made. The other branch feeds the category labels into the language model to obtain the
semantic vector, which together with the image feature vector performs the embedding operation. The objective for images is to bring the relevant labels close
together and irrelevant labels far apart. The purpose of labels is to bring images belonging to the same category closer together.


corresponding to the global representation of the response feature map. It does not disregard the corresponding tail or small objects. As shown in Fig. 2, the features extracted from GAP and GMP were combined to form the feature $x_i$ of the input image. Given a chest X-ray image $I$ as input to the model, the feature blocks are obtained after Conv_block1, Conv_block2, Conv_block3, Conv_block4 and the 1 × 1 convolution:

$$F(x) = f_{cnn}(I; \theta_{cnn}), \quad F(x) \in \mathbb{R}^{H \times W \times C} \tag{1}$$

where $\theta_{cnn}$ is the parameter of the feature embedding module, $F(x)$ is the feature block after the last 1 × 1 convolution, $H \times W \times C$ is the feature block size, and $C$ is the number of feature channels.

The image features are obtained from $F(x)$ after GAP and GMP:

$$f(x) = \mathrm{GAP}(F(x)) \oplus \mathrm{GMP}(F(x)) \tag{2}$$

where $f(x)$ is the feature vector of image $I$ and $\oplus$ denotes vector addition.

We used a large biomedical pre-trained BioBert as a semantic encoder to encode labels for each disease category. During training, the BioBert network does not update its parameters; only image-level supervision exists in the model. The word vectors and visual features obtained at this point are both high-dimensional vectors. High dimensionality significantly complicates calculations and does not imply that the data has discriminant ability [39–41]. Therefore, we designed two mapping functions, each consisting of three fully connected layers with respective dimensions of 512, 256, and 128. A Leaky ReLU function exists between each layer. Another function of the mapping functions is to map two features with distinct shapes and sizes into a potential common space. As shown in (3)-(4), $M_1(\cdot)$ and $M_2(\cdot)$ represent the two mapping functions:

$$f_{mapping}(x) = M_1(f(x)) \tag{3}$$

$$g_{mapping}(A) = M_2(g(A)) \tag{4}$$

3.3. Loss function

1) Dual-weighted metric loss

The purpose of the model is to develop a metric space appropriate for multi-label classification. For a given chest radiograph, the relevant labels are close and the irrelevant labels are far away, and there is no loss for correlation between labels. We measured the relationship between images and labels using the Euclidean distance, as shown in (5):

$$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} \tag{5}$$

where $k$ represents the length of the feature vector after the mapping function.

The similarity of images and labels is derived from the Euclidean distance, as shown in (6):

$$sim(x, y) = e^{-d(x, y)} \tag{6}$$

where $sim(\cdot)$ represents the similarity and $d(\cdot)$ is the Euclidean distance between image $x$ and label $y$.

We assume that $d_{yp}$ and $d_{yn}$ are the distances of the positive and negative labels to the corresponding image, respectively. We set the margin $\delta$, and the relationship between $d_{yn}$ and $d_{yp}$ is required to satisfy (7):

$$d_{yn} \geq d_{yp} + \delta \tag{7}$$

Our ultimate goal is that the distance between the relevant label and the image is at least one $\delta$ smaller than the distance between the irrelevant label and the image. Therefore, the following metric relationship needs to be satisfied:

$$L_{Metric\_I} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ \delta + \log\left(1 + e^{d_{yp_i}}\right) - \log\left(1 + e^{d_{yn_i}}\right)\right) \tag{8}$$

where $N$ is the total number of samples and $\max(\cdot, \cdot)$ denotes taking the larger of the two values. Considering the extreme imbalance between the positive and negative samples of each disease, we added adaptive weights to equalize the positive and negative samples, as shown in (9):

$$L_{Metric\_I} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ \delta + d_{yp_i}^{\alpha} \log\left(1 + e^{d_{yp_i}}\right) - d_{yn_i}^{\beta} \log\left(1 + e^{d_{yn_i}}\right)\right) \tag{9}$$

In the experiment, both $\alpha$ and $\beta$ were set to 2, and $\delta$ was 0.3.

As shown in Table 2, excluding normal samples, single-label samples account for about 60% of the total number of experimental samples. Therefore, we proposed a metric loss on labels that brings images belonging to the same category closer together, thereby enabling the model to extract more robust disease features from scarce positive samples, as indicated by (10):

$$L_{Metric\_L} = \frac{1}{l} \sum_{i=1}^{l} \max\left(0,\ \delta + \log\left(1 + e^{d_{xp_i}}\right) - \log\left(1 + e^{d_{xn_i}}\right)\right) \tag{10}$$

Table 2
Number of samples with different numbers of diseases.

Number of diseases    Number of samples    Proportion (%)
1                     30,963               59.82
2                     14,306               27.64
3                     4856                 9.38
4                     1247                 2.41
5                     301                  0.58
6                     67                   0.13
7                     16                   0.03
8                     2                    3.86 × 10^-3
9                     1                    1.93 × 10^-3

The similarity of images and labels is transformed by Euclidean

Fig. 2. Fine-tuned ConvNeXt model structure.
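The pooling step of the fine-tuned backbone, Eq. (2), can be illustrated on a toy feature block. The sketch below assumes a channel-first nested-list layout (a list of C channels, each an H × W grid) purely for readability, and the function names are hypothetical; following the text, ⊕ is taken as element-wise vector addition.

```python
def gap(feature_block):
    """Global average pooling: the mean activation of each channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_block
    ]

def gmp(feature_block):
    """Global max pooling: the strongest activation of each channel."""
    return [max(max(row) for row in channel) for channel in feature_block]

def pooled_feature(feature_block):
    """Eq. (2): element-wise sum of the GAP and GMP vectors."""
    return [a + m for a, m in zip(gap(feature_block), gmp(feature_block))]

# Channel 0 holds one small but strong response; channel 1 is uniform.
toy_block = [
    [[0.0, 0.0], [0.0, 8.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_feature = pooled_feature(toy_block)
```

In the toy block, averaging dilutes the single strong response in channel 0 to 2.0, while max pooling keeps it at 8.0, which is the motivation the paper gives for adding GMP alongside GAP.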


Here, $l$ is the number of disease categories, $d_{xp}$ represents the distance between the disease label and an image with this type of disease, and $d_{xn}$ represents the distance between the disease label and an image without such disease. Similarly, an adaptive weight is added to the metric loss to compensate for the imbalance of positive and negative samples, with the following formula:

$$L_{Metric\_L} = \frac{1}{l} \sum_{i=1}^{l} \max\left(0,\ \delta + d_{xp_i}^{\alpha} \log\left(1 + e^{d_{xp_i}}\right) - d_{xn_i}^{\beta} \log\left(1 + e^{d_{xn_i}}\right)\right) \tag{11}$$

In the experiment, both $\alpha$ and $\beta$ were set to 2, and $\delta$ was kept at 0.3.

2) Weighted multi-label loss

As a loss function, binary cross-entropy is commonly used in traditional multi-label classification. The general form is depicted in (12):

$$Loss(y_i, f(X_i; \theta)) = \sum_{j=1}^{l} \left[ -y_{ij} L_{+} - (1 - y_{ij}) L_{-} \right], \quad \begin{cases} L_{+} = \log\left(P(y_{ij} = 1 \mid f(X_i; \theta))\right) \\ L_{-} = \log\left(1 - P(y_{ij} = 1 \mid f(X_i; \theta))\right) \end{cases} \tag{12}$$

Here, $l$ is the number of disease categories, and $P(\cdot)$ is the predicted value of the neural network model. $L_{+}$ represents the discrepancy between the predicted probability and the actual label when the label is 1. Conversely, $L_{-}$ represents the error between the predicted probability and the actual label when the label is 0. However, treating each label equally is not a wise choice for solving the problem of multi-label classification.

In medical images, only a small number of positive samples (disease samples) contain significant pathological information. However, too many negative samples (samples without disease) prevent the network from fully learning disease characteristics. Therefore, we added an adaptive weight $W_{adp}$ to the original binary cross-entropy, as shown in (13):

$$Loss(y_i, f(X_i; \theta)) = -\sum_{j=1}^{l} \left[ W_{adp+}\, y_{ij} L_{+} + W_{adp-}\, (1 - y_{ij}) L_{-} \right], \quad \begin{cases} W_{adp+} = P(y_{ij} = 1 \mid f(X_i; \theta)) \\ W_{adp-} = 1 - P(y_{ij} = 1 \mid f(X_i; \theta)) \end{cases} \tag{13}$$

After merging into one equation, with $P_i$ abbreviating the predicted probability $P(y_{ij} = 1 \mid f(X_i; \theta))$, we obtain (14):

$$Loss(y_i, P_i) = -\sum_{j=1}^{l} \left[ P_i\, y_{ij} \log(P_i) + (1 - P_i)(1 - y_{ij}) \log(1 - P_i) \right] \tag{14}$$

In summary, the overall loss function of the model is:

$$L = L_{MC} + \lambda L_{Metric} \tag{15}$$

$L_{Metric}$ represents the dual-weighted metric loss, $L_{MC}$ is the weighted multi-label classification loss, and $\lambda$ takes a value of 1 in the experiment.

4. Experiments

In this section, we perform experiments with the proposed model and losses. All of the experiments in this paper were run on an NVIDIA GeForce RTX 3060 12 GB GPU with the PyTorch framework.

4.1. Dataset

We evaluated our model using Chest X-ray14, a large chest X-ray dataset. It contains 112,120 frontal X-ray images of 30,805 patients with a total of 14 typical chest diseases. This experiment is designed to assist in the diagnosis of various chest diseases. Consequently, healthy samples (i.e., samples labeled "No Finding") are not used during the experiment, and each image is assigned one or more pathologies. We follow the dataset splitting method provided by Ref. [2], in which there is no patient overlap between the training set and the test set.

As shown in Fig. 4, for each disease, all samples other than the positive samples with that pathology are negative. The number of positive and negative samples in the dataset is seriously unbalanced. In addition, there is a large disparity between the number of samples in different categories, which poses significant classification challenges.

4.2. Training details and evaluation metrics

The original size of the chest X-ray image is 1024 × 1024. During preprocessing, the image is randomly cropped to different sizes and aspect ratios, the cropped image is then scaled to the requested size, i.e. 224 × 224, and finally the input to the network is normalized. In the training process, random horizontal flipping was used as the data augmentation method. We optimized the network using Adam with a batch size of 32. The initial learning rate was 0.0001, the weight decay was 0.0003, and a total of 40 epochs were trained.

We used the Area Under Curve (AUC) value as the evaluation standard of the model. AUC is defined as the area enclosed by the Receiver Operating Characteristic (ROC) curve and the coordinate axis. A better classifier has a higher AUC score; the closer the AUC value is to 1, the more accurate the prediction.

4.3. Experimental results

We evaluated our method on the Chest X-ray14 dataset. With ConvNeXt as the image feature extractor, BioBert as the semantic embedding module, and our proposed dual-weighted metric loss, we obtained the corresponding AUC values and ROC curves (Fig. 6).

As shown in Table 3, we first compared our method with other advanced methods. The AUC values obtained through model training in this study were compared to those from previous research [2,7,42–44]. Each of these methods proposes its own strategy for classifying multi-label chest X-ray images. Wang et al. [2] presented a large chest X-ray dataset and proposed a framework for the weakly supervised categorization of multi-label images and disease localization. The primary architecture is a traditional pre-trained model, the loss function is a weighted cross-entropy loss, and the predicted labels use the one-hot form. Li et al. [42] addressed simultaneous disease prediction and localization using a small amount of location annotation information. In the proposed network architecture, a residual network is used for feature extraction, followed by a fully convolutional network for classification. A CNN-based Attention-Guided Curriculum Learning (AGCL) framework was proposed by Tang et al. [43] and experimentally validated using the publicly available ChestXray14 dataset. For the problem of insufficient data and data imbalance in multi-label X-ray images, Albahli et al. [44] suggested a synthetic data augmentation approach using three different deep convolutional neural network (CNN) architectures. Guan et al. [7] introduced the Categorical Residual Attention Learning paradigm to address the problem that the classification of multi-label thoracic disorders is frequently impeded by target-independent pathology. Second, it can be seen from the final results that the average identification accuracy of our method over the 14 diseases exceeds that of the comparison methods, and the average AUC value reaches 0.826. The AUC values for Pneumothorax, Consolidation, Pleural Thickening, and Hernia were lower overall than those of methods proposed in the previous two years. This result is primarily due to the limited number of positive disease samples, particularly Hernia, and the inability of the network to fully learn the characteristics of each disease through


Table 3
Comparison results of different methods on the Chest X-ray14 dataset. We compared the AUC scores for each category and the average AUC scores for the 14 diseases. For each column, the best result is indicated in bold. Pne1 denotes Pneumonia and Pne2 denotes Pneumothorax.
Method Atel Card Effu Infi Mass Nodu Pne1 Pne2 Cons Edem Emph Fibr PT Hern Avg

Wang et al. [2] 0.720 0.810 0.780 0.610 0.710 0.670 0.630 0.810 0.710 0.830 0.810 0.770 0.710 0.770 0.739
Li et al. [42] 0.727 0.836 0.789 0.672 0.776 0.696 0.649 0.808 0.720 0.806 0.888 0.771 0.737 0.693 0.755
Tang et al. [43] 0.756 0.887 0.819 0.689 0.814 0.755 0.729 0.850 0.728 0.848 0.906 0.818 0.765 0.875 0.803
Albahli et al. [44] 0.780 0.900 0.820 0.700 0.820 0.760 0.730 0.870 0.700 0.860 0.880 0.810 0.790 – 0.801
Guan et al. [7] 0.781 0.880 0.829 0.702 0.834 0.773 0.729 0.857 0.754 0.850 0.908 0.830 0.778 0.917 0.816
Ours 0.797 0.911 0.844 0.724 0.836 0.802 0.739 0.869 0.725 0.860 0.933 0.849 0.753 0.916 0.826
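Per-disease AUC scores such as those in Table 3 are the input to the Friedman test that this section applies to the compared frameworks (Eq. (16)). The following is a minimal sketch of that statistic under stated assumptions: the function names and the toy table are hypothetical, methods are ranked within each disease with rank 1 for the highest AUC, tied scores share the average rank, and no tie correction is applied because Eq. (16) does not include one. Rows with a missing score (such as the "–" entry in Table 3) would have to be handled before calling it; how the paper treated them is not stated.

```python
def rank_row(scores):
    """Rank one row of scores: 1 = highest, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # mean of 1-based positions
        for j in order[pos:end + 1]:
            ranks[j] = average_rank
        pos = end + 1
    return ranks

def friedman_statistic(table):
    """Eq. (16): table[i][j] is the score of method j on disease i."""
    n = len(table)      # number of disease categories
    k = len(table[0])   # number of compared methods
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(rank_row(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Toy table: three methods ranked identically on three diseases.
toy_table = [
    [0.70, 0.80, 0.90],
    [0.60, 0.75, 0.85],
    [0.65, 0.70, 0.95],
]
toy_chi2 = friedman_statistic(toy_table)
```

The statistic is compared with a chi-square critical value at k − 1 degrees of freedom; with six compared methods this gives the 5 degrees of freedom quoted in the text.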

training. Using our method, the AUC values for the remaining 10 diseases were the highest. Nodule, Emphysema, Infiltration, and Fibrosis all increased significantly, with respective AUC values of 0.802, 0.933, 0.724, and 0.849. They improved by at least 2.9%, 2.5%, 2.2%, and 1.9%, respectively, in comparison to all other comparison experiments. The AUC value measured by our method for the diagnosis of diseases with relatively small lesion areas, such as Mass, is 0.836. Although the disease features are small, accuracy is still improved. The identification of diseases with more intricate pathological characteristics, such as pneumonia, has also been enhanced by at least 0.9%. Other diseases, including Atelectasis, Cardiomegaly, and Effusion, have increased recognition accuracy by at least 1.1%–1.6%. In conclusion, our algorithm can learn the characteristics of diverse disease types more effectively. In general, this method improves the average AUC value of chest disease recognition and the overall classification performance.

Because it was not possible to collect all the data for model testing, an aggregate comparison was needed to determine whether there was indeed a difference in performance between models, so we used the Friedman test to measure significant differences between the various frameworks. Here, the significance level α is 0.05 and the degree of freedom is 5. From the Chi-square table, the critical value is 11.07. The formula for calculating the Chi-square value is shown in (16). The calculated Chi-square value is 39.755, which is greater than the critical value of 11.07. Consequently, there were significant differences between the frameworks.

$$\chi_R^2 = \frac{12}{nk(k + 1)} \sum R^2 - 3n(k + 1) \tag{16}$$

where $k$ denotes the number of frameworks (comparison algorithms), $n$ represents the number of disease categories, and $R$ denotes the rank sum of each framework.

We visualize some feature heatmaps and classification results in Figs. 3 and 5, respectively. According to the heatmap of image features, our model can focus on the lesion area and thus accurately identify

Fig. 3. Heatmap. For each sample, the bounding box provided in Ref. [2] was first annotated on the original image, and a heatmap was then generated from the learned features. Note that because the image is cropped during feature extraction, the heatmap may deviate somewhat from the annotation box on the original image.
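
Eq. (16) has the form of the Friedman test statistic, and the calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `average_ranks` and `friedman_chi_square` and the toy AUC table are hypothetical, ranks are assigned within each disease category with the highest AUC ranked 1, and ties share the mean rank. Only the critical value 11.07 and the statistic 39.755 are taken from the text; 11.07 is the chi-square critical value for 5 degrees of freedom at the 0.05 level.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores so the highest gets rank 1; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi_square(auc_table: Sequence[Sequence[float]]) -> float:
    """Eq. (16): rows are disease categories (n), columns are frameworks (k)."""
    n = len(auc_table)
    k = len(auc_table[0])
    rank_sums = [0.0] * k
    for row in auc_table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Hypothetical AUCs: 3 categories x 3 frameworks, third framework always best.
toy = [
    [0.70, 0.75, 0.80],
    [0.60, 0.65, 0.90],
    [0.81, 0.82, 0.83],
]
stat = friedman_chi_square(toy)

CRITICAL_VALUE = 11.07  # quoted in the text
reported = 39.755       # statistic reported in the text
significant = reported > CRITICAL_VALUE
print(stat, significant)
```

In the toy table the third framework is best in every category, so the rank sums are 9, 6, and 3 and the statistic is 6, the largest value possible for n = 3 and k = 3. With the values reported in the text, 39.755 exceeds 11.07, which is the comparison behind the conclusion that the frameworks differ significantly.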


the disease category. In Fig. 5, the similarity scores of each test sample to each category are shown, with the true categories of the sample highlighted in red. The similarity scores of the real pathologies are all at the top. When a sample has few labels, almost all of them are predicted correctly, with a few exceptions. For example, for the sample in column 2 of row 1, the disease type with the highest similarity score is "Pleural_Thickening," while the real labels of this sample rank second and third in terms of similarity score. When a sample contains more labels, such as column 2, column 6 in row 2, and column 7 in row 2, some unrelated diseases receive the highest similarity scores, but the related diseases are generally ranked in the first half of the scores, and the more distinctive diseases are scored. This is because the adaptive weighting in the loss function corrects for sample imbalance, so that the scores of related diseases are generally higher, which resolves the issue of individual diseases present in the actual label being ranked last by similarity score.

Fig. 4. Distribution of disease numbers in the Chest X-ray14 dataset. The 14 pathologies from left to right are Infiltration, Effusion, Atelectasis, Nodule, Mass, Pneumothorax, Consolidation, Pleural_Thickening, Cardiomegaly, Emphysema, Edema, Fibrosis, Pneumonia, and Hernia.

4.4. Ablation experiments

To evaluate the effectiveness of each model module and loss function, we performed independent ablation experiments on the same dataset for the model structure and the loss function. Four experiments were conducted to verify the effectiveness of our proposed model, and four more to demonstrate the validity of our proposed dual-weighted metric loss. The experimentally measured AUC values for chest disease classification with different settings were

Fig. 5. Examples of predictions on test set samples. We present the similarity scores of each sample to each category, sorted in descending order by score; the real labels are shown in red.
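
The ranking shown in Fig. 5 can be illustrated with a short sketch. It assumes cosine similarity between an image embedding and one prototype vector per disease; the similarity measure, the function names, and the three-dimensional toy vectors are all hypothetical stand-ins, not the model's actual embeddings or scoring function.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_labels(
    image_vec: Sequence[float], prototypes: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every category against the image and sort by descending score."""
    scored = [(name, cosine(image_vec, proto)) for name, proto in prototypes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def true_label_positions(
    ranking: List[Tuple[str, float]], true_labels: Sequence[str]
) -> Dict[str, int]:
    """1-based position of each true label in the ranking (the red bars in Fig. 5)."""
    names = [name for name, _ in ranking]
    return {label: names.index(label) + 1 for label in true_labels}


# Hypothetical prototypes and image embedding in a 3-D common space.
prototypes = {
    "Effusion": [1.0, 0.0, 0.0],
    "Mass": [0.0, 1.0, 0.0],
    "Hernia": [0.0, 0.0, 1.0],
}
image_vec = [1.0, 0.2, 0.0]

ranking = rank_labels(image_vec, prototypes)
positions = true_label_positions(ranking, ["Effusion", "Mass"])
print(ranking)
print(positions)
```

A sample whose true labels all occupy the leading positions corresponds to the well-predicted cases described in the text; a true label with a large position corresponds to the cases where an unrelated disease is scored above it.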


Fig. 6. ROC curves of different diseases on the test set.

compared with the complete model, as shown in Table 4 and Table 5. For the model structure, the 1 × 1 convolution module and the global max pooling layer were removed in turn, and the average AUC value of the model decreased by 0.5% and 1.0%, respectively, indicating that both parts improve the feature extraction ability of the network. In Table 5, L_{M_I} refers to L_{Metric_I} and L_{M_L} refers to L_{Metric_L}. When the image-level metric loss L_{Metric_I} and the label-level metric loss L_{Metric_L} were both removed from the loss function of the model, the average AUC value dropped to 0.512, a drop of 31.4%, indicating that our proposed metric loss L_{Metric} is significant for the multi-label chest X-ray diagnosis problem. Finally, we conducted experiments that removed the image-level metric loss L_{Metric_I} and the label-level metric loss L_{Metric_L} separately. The results show that both L_{Metric_I} and L_{Metric_L} improve the model and that the best model is obtained when they are used together. These comparisons demonstrate that each module and loss function integrated into our model is effective in enhancing model performance.

5. Discussion

In recent years, researchers have proposed various multi-label classification algorithms. Few studies have focused on the link between visual features and label semantic vectors, whereas the majority of recent work has focused on regression from input to binary labels. In this paper, we address the multi-label classification problem for chest diseases based on the similarity between label semantic vectors and image features. Our experiments demonstrate the effectiveness of the proposed algorithm and its classification accuracy, indicating that it is possible to create reliable predictive models. It is worth noting that the problem involved is medical image disease classification, and in addition to the number of samples per disease, lesion size is also critical in influencing the features learned by the model; for example, Hernia has a relatively large lesion and obtained a high AUC value even with only a few samples. Secondly, although clinical tests produce a large amount of sample data, making it accessible to the algorithm without standardization and structured processing can pose serious problems. In addition, medical image data frequently contain private patient information, and it is difficult to obtain reasonably valid medical data. Consequently, multi-label few-shot medical image diagnosis will be a challenging and practical research task.

6. Conclusion

This study further optimized the multi-label classification of chest X-rays. We proposed a novel classification framework for chest X-ray diseases with multiple labels. Before obtaining the image's

Table 4
Ablation experiments on the model structure (GMP: global max pooling; 1×1: 1 × 1 convolution). For each column, the best result is indicated in bold.
GMP  1×1  Atel   Card   Effu   Infi   Mass   Nodu   Pne1   Pne2   Cons   Edem   Emph   Fibr   PT     Hern   Avg
-    -    0.782  0.904  0.835  0.714  0.822  0.792  0.728  0.867  0.720  0.849  0.918  0.835  0.730  0.906  0.815
✓    -    0.794  0.909  0.837  0.719  0.821  0.798  0.711  0.869  0.718  0.859  0.928  0.849  0.757  0.918  0.821
-    ✓    0.790  0.908  0.840  0.718  0.825  0.786  0.712  0.858  0.729  0.859  0.922  0.847  0.742  0.887  0.816
✓    ✓    0.797  0.911  0.844  0.724  0.836  0.802  0.739  0.869  0.725  0.860  0.933  0.849  0.753  0.916  0.826
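
The Avg column of Table 4 can be cross-checked against the per-class values. The sketch below is an illustration rather than the authors' code: it recomputes the unweighted mean of the 14 per-class AUCs for each row and the drop relative to the full model. The row labels are an assumption, with the two single-module rows assigned from the drops stated in the text (0.5% without the 1 × 1 convolution, 1.0% without global max pooling). Because the table values are rounded to three decimals, the recomputed means can only be expected to agree with the reported averages to within about 0.001.

```python
from statistics import mean

# Per-class AUCs (14 diseases) and the reported average, copied from Table 4.
# Row labels are an assumption inferred from the drops described in the text.
TABLE4 = {
    "neither": (
        [0.782, 0.904, 0.835, 0.714, 0.822, 0.792, 0.728,
         0.867, 0.720, 0.849, 0.918, 0.835, 0.730, 0.906],
        0.815,
    ),
    "GMP only": (
        [0.794, 0.909, 0.837, 0.719, 0.821, 0.798, 0.711,
         0.869, 0.718, 0.859, 0.928, 0.849, 0.757, 0.918],
        0.821,
    ),
    "1x1 conv only": (
        [0.790, 0.908, 0.840, 0.718, 0.825, 0.786, 0.712,
         0.858, 0.729, 0.859, 0.922, 0.847, 0.742, 0.887],
        0.816,
    ),
    "GMP + 1x1 conv": (
        [0.797, 0.911, 0.844, 0.724, 0.836, 0.802, 0.739,
         0.869, 0.725, 0.860, 0.933, 0.849, 0.753, 0.916],
        0.826,
    ),
}

# Macro average: every disease counts equally, regardless of sample count.
recomputed = {name: mean(aucs) for name, (aucs, _) in TABLE4.items()}

# Drop in reported average AUC relative to the full model.
full = TABLE4["GMP + 1x1 conv"][1]
drops = {name: round(full - reported, 3) for name, (_, reported) in TABLE4.items()}

for name, (_, reported) in TABLE4.items():
    print(f"{name:15s} recomputed={recomputed[name]:.4f} "
          f"reported={reported:.3f} drop={drops[name]:.3f}")
```

The drops of 0.005 and 0.010 for the two single-module rows are absolute differences in AUC, which is how the 0.5% and 1.0% figures in the text are read here.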


Table 5
Ablation Experiments. For Each Column, The Best Result is Indicated in Bold.
LM L LM I Atel Card Effu Infi Mass Nodu Pne1 Pne2 Cons Edem Emph Fibr PT Hern Avg

0.538 0.511 0.539 0.451 0.519 0.526 0.446 0.445 0.466 0.486 0.484 0.565 0.525 0.663 0.512
✓ 0.780 0.909 0.834 0.703 0.821 0.782 0.719 0.861 0.717 0.848 0.921 0.847 0.750 0.927 0.816
✓ 0.791 0.907 0.840 0.716 0.819 0.796 0.722 0.863 0.729 0.849 0.918 0.841 0.752 0.900 0.817
✓ ✓ 0.797 0.911 0.844 0.724 0.836 0.802 0.739 0.869 0.725 0.860 0.933 0.849 0.753 0.916 0.826
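
The 31.4% drop quoted for removing both metric losses appears to be an absolute difference in average AUC, that is, percentage points, rather than a relative change. A worked check using only the Avg values of the first and last rows of Table 5 (the symbols Δ_abs and Δ_rel are introduced here for illustration):

```latex
\Delta_{\mathrm{abs}} = 0.826 - 0.512 = 0.314,
\qquad
\Delta_{\mathrm{rel}} = \frac{0.826 - 0.512}{0.826} \approx 0.380
```

Under this reading, the absolute drop is 31.4 percentage points, while the reduction relative to the full model is about 38%.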

feature block, the ConvNeXt feature extraction model was fine-tuned, and a 1 × 1 convolution was added. To obtain the visual feature vector, the global max pooling layer was then added. The semantic label vectors were obtained from the trained BioBert model. To project the two feature vectors, which differ in form and dimension, into a common potential space, we used two mapping functions. In addition, a dual-weighted metric loss was proposed to consider the metric relationship between image and label at the image level and the disease category level, respectively. For images, the objective is to keep related labels close together and unrelated labels far apart. For labels, the purpose is to group images of the same category together. The proposed dual-weighted metric loss can enhance the model's performance. In experimental comparisons, the proposed model demonstrates higher recognition accuracy than existing models. In the future, considering the small number of positive samples in medical images, we will combine the Transformer idea with multi-label, few-sample learning to study the classification of chest X-ray diseases.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61272315), the Natural Science Foundation of Zhejiang Province (LY21F020028), the Funded Project of Zhejiang Province University Student Scientific Research and Innovation Activities Plan (No.2021R409054), Zhejiang Provincial Natural Science Foundation of China (Grant No.LQ21A050003), Anhui Province Key Research and Development Plan (Grant No.202104h04020038) and the Fundamental Research Funds of Zhejiang Province (Grant No.2021YW52).

References

[1] Y. Hou, Y. Lai, Y. Wu, et al., Few-shot Learning for Multi-Label Intent Detection, 2020 arXiv preprint arXiv:2010.05256.
[2] X. Wang, Y. Peng, L. Lu, et al., Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2097–2106.
[3] X. Shen, W. Liu, I.W. Tsang, et al., Multilabel prediction via cross-view search, IEEE Transact. Neural Networks Learn. Syst. 29 (9) (2017) 4324–4338.
[4] K. Hu, L. Zhao, S. Feng, et al., Colorectal polyp region extraction using saliency detection network with neutrosophic enhancement, Comput. Biol. Med. 147 (2022), 105760.
[5] H. Su, D. Zhao, H. Elmannai, et al., Multilevel threshold image segmentation for COVID-19 chest radiography: a framework using horizontal and vertical multiverse optimization, Comput. Biol. Med. (2022), 105618.
[6] A. Qi, D. Zhao, F. Yu, et al., Directional mutation and crossover boosted ant colony optimization with application to COVID-19 X-ray image segmentation, Comput. Biol. Med. 148 (2022), 105810.
[7] Q. Guan, Y. Huang, Multi-label chest X-ray image classification via category-wise residual attention learning, Pattern Recogn. Lett. 130 (2020) 259–266.
[8] M. Firmino, G. Angelo, H. Morais, et al., Computer-aided detection (CADe) and diagnosis (CADx) system for lung cancer with likelihood of malignancy, Biomed. Eng. Online 15 (1) (2016) 1–17.
[9] T.C. Mondol, H. Iqbal, M.M.A. Hashem, Deep CNN-Based Ensemble CADx Model for Musculoskeletal Abnormality Detection from Radiographs, in: 2019 5th International Conference on Advances in Electrical Engineering (ICAEE), IEEE, 2019, pp. 392–397.
[10] J. Wang, L. Yang, Z. Huo, et al., Multi-label classification of fundus images with efficientnet, IEEE Access 8 (2020) 212499–212508.
[11] B. Chen, J. Li, G. Lu, et al., Lesion location attention guided network for multi-label thoracic disease classification in chest X-rays, IEEE J. Biomed. Health Inform. 24 (7) (2019) 2016–2027.
[12] Z. Yan, W. Liu, S. Wen, et al., Multi-label image classification by feature attention network, IEEE Access 7 (2019) 98005–98013.
[13] Z. Liu, H. Mao, C.Y. Wu, et al., A ConvNet for the 2020s, 2022 arXiv preprint arXiv:2201.03545.
[14] M. Anthimopoulos, S. Christodoulidis, L. Ebner, et al., Lung pattern classification for interstitial lung diseases using a deep convolutional neural network, IEEE Trans. Med. Imag. 35 (5) (2016) 1207–1216.
[15] P. Kumar, M. Grewal, M.M. Srivastava, Boosted Cascaded Convnets for Multilabel Classification of Thoracic Diseases in Chest radiographs, in: International Conference Image Analysis and Recognition, Springer, Cham, 2018, pp. 546–552.
[16] G. Gilanie, M. Attique, S. Naweed, et al., Object extraction from T2 weighted brain MR image using histogram based gradient calculation, Pattern Recogn. Lett. 34 (12) (2013) 1356–1363.
[17] R. Liao, S. Miao, P. de Tournemire, et al., An artificial agent for robust image registration, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1).
[18] M. Berrimi, S. Hamdi, R.Y. Cherif, et al., Covid-19 detection from xray and ct scans using transfer learning, in: 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), IEEE, 2021, pp. 1–6.
[19] A.M. Ismael, A. Şengür, Deep learning approaches for COVID-19 detection based on chest X-ray images, Expert Syst. Appl. 164 (2021), 114054.
[20] F. Saberi-Movahed, M. Mohammadifard, A. Mehrpooya, et al., Decoding clinical biomarker space of covid-19: exploring matrix factorization-based feature selection methods, Comput. Biol. Med. 146 (2022), 105426.
[21] S. Sakib, T. Tazrin, M.M. Fouda, et al., DL-CRC: deep learning-based chest radiograph classification for COVID-19 detection: a novel approach, IEEE Access 8 (2020) 171575–171589.
[22] A. El-Fiky, M.A. Shouman, S. Hamada, et al., Multi-label transfer learning for identifying lung diseases using chest X-rays, in: 2021 International Conference on Electronic Engineering (ICEEM), IEEE, 2021, pp. 1–6.
[23] E.A. Tanaka, S.R. Nozawa, A.A. Macedo, et al., A multi-label approach using binary relevance and decision trees applied to functional genomics, J. Biomed. Inf. 54 (2015) 85–95.
[24] L. Yao, E. Poblenz, D. Dagunts, et al., Learning to Diagnose from Scratch by Exploiting Dependencies Among Labels, 2017 arXiv preprint arXiv:1710.10501.
[25] E. Oliveira, P.M. Ciarelli, C. Badue, et al., A comparison between a KNN based approach and a PNN algorithm for a multi-label classification problem, in: 2008 Eighth International Conference on Intelligent Systems Design and Applications vol. 2, IEEE, 2008, pp. 628–633.
[26] C. Vens, J. Struyf, L. Schietgat, et al., Decision trees for hierarchical multi-label classification, Mach. Learn. 73 (2) (2008) 185–214.
[27] S. Koda, A. Zeggada, F. Melgani, et al., Spatial and structured SVM for multilabel image classification, IEEE Trans. Geosci. Rem. Sens. 56 (10) (2018) 5948–5960.
[28] J. Du, Q. Chen, Y. Peng, et al., ML-Net: multi-label classification of biomedical texts with deep neural networks, J. Am. Med. Inf. Assoc. 26 (11) (2019) 1279–1285.
[29] M. Hassanin, I. Radwan, S. Khan, et al., Learning discriminative representations for multi-label image recognition, J. Vis. Commun. Image Represent. 83 (2022), 103448.
[30] Y. Wang, D. He, F. Li, et al., Multi-label classification with label graph superimposing, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12265–12272, 34(07).
[31] M. Rezaei-Ravari, M. Eftekhari, F. Saberi-Movahed, ML-CK-ELM: an efficient multi-layer extreme learning machine using combined kernels for multi-label classification, Sci. Iran. 27 (6) (2020) 3005–3018.
[32] M. Rezaei-Ravari, M. Eftekhari, F. Saberi-Movahed, Regularizing extreme learning machine by dual locally linear embedding manifold learning for training multi-label neural network classifiers, Eng. Appl. Artif. Intell. 97 (2021), 104062.
[33] G. Wei, P. Huang, C. Xu, et al., Experimental study on the radiative properties of open-cell porous ceramics, Sol. Energy 149 (2017) 13–19.
[34] J. Cheng, W. Yang, M. Huang, et al., Retrieval of brain tumors by adaptive spatial pooling and Fisher vector representation, PLoS One 11 (6) (2016), e0157112.
[35] A. Zhong, X. Li, D. Wu, et al., Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in COVID-19, Med. Image Anal. 70 (2021), 101993.
[36] Y.P. Sun, M.L. Zhang, Compositional metric learning for multi-label classification, Front. Comput. Sci. 15 (5) (2021) 1–12.
[37] M. Zhang, C. Li, X. Wang, Multi-view Metric Learning for Multi-Label Image Classification, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 2134–2138.


[38] J. Lee, W. Yoon, S. Kim, et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (4) (2020) 1234–1240.
[39] Y. Fan, J. Liu, P. Liu, et al., Manifold learning with structured subspace for multi-label feature selection, Pattern Recogn. 120 (2021), 108169.
[40] H. Lu, H. Chen, T. Li, et al., Multi-label Feature Selection Based on Manifold Regularization and Imbalance Ratio, Applied Intelligence, 2022, pp. 1–20.
[41] A. Mehrpooya, F. Saberi-Movahed, N. Azizizadeh, et al., High dimensionality reduction by matrix factorization for systems pharmacology, Briefings Bioinf. 23 (1) (2022) bbab410.
[42] Z. Li, C. Wang, M. Han, et al., Thoracic Disease Identification and Localization with Limited Supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8290–8299.
[43] Y. Tang, X. Wang, A.P. Harrison, et al., Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs, in: International Workshop on Machine Learning in Medical Imaging, Springer, Cham, 2018, pp. 249–258.
[44] S. Albahli, H.T. Rauf, A. Algosaibi, et al., AI-driven deep CNN approach for multi-label pathology classification using chest X-Rays, PeerJ. Comp. Sci. 7 (2021) e495.
