Multi-Label Local to Global Learning: A Novel Learning Paradigm for Chest X-Ray Abnormality Classification

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 27, NO. 9, SEPTEMBER 2023, p. 4409
Zean Liu, Yuanzhi Cheng, Member, IEEE, and Shinichi Tamura, Fellow, IEEE
Abstract—Deep neural network (DNN) approaches have shown remarkable progress in automatic chest X-ray classification. However, existing methods use a training scheme that simultaneously trains all abnormalities without considering their learning priority. Inspired by the clinical practice of radiologists progressively recognizing more abnormalities and the observation that existing curriculum learning (CL) methods based on image difficulty may not be suitable for disease diagnosis, we propose a novel CL paradigm, named multi-label local to global (ML-LGL). This approach iteratively trains DNN models on gradually increasing abnormalities within the dataset, i.e., from fewer abnormalities (local) to more ones (global). At each iteration, we first build the local category by adding high-priority abnormalities for training, and the abnormality's priority is determined by our three proposed clinical knowledge-leveraged selection functions. Then, images containing abnormalities in the local category are gathered to form a new training set. The model is lastly trained on this set using a dynamic loss. Additionally, we demonstrate the superiority of ML-LGL from the perspective of the model's initial stability during training. Experimental results on three open-source datasets, PLCO, ChestX-ray14 and CheXpert, show that our proposed learning paradigm outperforms baselines and achieves comparable results to state-of-the-art methods. The improved performance promises potential applications in multi-label chest X-ray classification.

Index Terms—Chest radiography abnormality classification, multi-label, clinical knowledge, curriculum learning, local to global learning.

I. INTRODUCTION

CHEST diseases are common but serious health problems afflicting many people across the world. For instance, the COVID-19 pandemic has caused more than 194 million infections and 4.1 million deaths globally by July 2021 [1]. Due to low cost and easy access, chest X-ray (CXR) is the most common and important radiological examination used for early screening and diagnosis. However, these radiographs are typically interpreted by radiologists, which is labor-intensive and poses challenges in consistently maintaining high interpretation accuracy. Therefore, developing automatic and accurate CXR classification algorithms is highly demanded in clinical applications, and it has become an extensively studied research area in the medical image community.
With the growing availability of large-scale datasets, deep neural network (DNN) based methods have delivered the best results among existing methods [3]. These methods typically address a multi-label classification scenario, attempting to make a set of binary predictions, one for each disease. However, the diseases are learned jointly at once during training, ignoring their learning priority. In clinical scenarios, radiologists' training process exhibits a distinctive characteristic: they are trained with gradually more abnormalities, mirroring the natural human learning process. For example, as described in [2] and shown in Fig. 1(a), student trainees prefer to first learn common abnormalities (effusion and infiltration), and later advance to rarer ones (cardiomegaly and pneumonia). Such clinical practice motivates us to train a DNN model with gradually increasing abnormalities, i.e., implement a curriculum in the abnormality space.
Aside from clinical inspiration, another persuasive rationale exists for adopting a curriculum in the abnormality space. While the curriculum is extensively employed in medical image diagnostics, a commonality is that they schedule the presentation order of training samples for DNN model. For example, [4] fed images into the model in order of difficulty based on severity levels mined from radiology reports. However, we argue that implementing such a curriculum in the sample space may not be appropriate for disease diagnosis, as the errors made by a model often arise from confusion between diseases rather than the absolute difficulty of an image independent of its class. As exemplified by the confusion matrix in Fig. 1(b), the model's mistakes are not uniformly distributed across all pairs of diseases, and certain diseases, such as fluid and card, are mainly confused with only a few other diseases. Thus, it might be more effective to consider the difficulty of disease instead of that of the images to perform curriculum learning (CL).

Manuscript received 3 December 2022; revised 26 March 2023 and 17 May 2023; accepted 25 May 2023. Date of publication 30 May 2023; date of current version 6 September 2023. The work was supported in part by the Grants-in-Aid for Scientific Research of Exploratory Research under Grants JP21656100 and JP17K20029. (Corresponding author: Yuanzhi Cheng)
Zean Liu is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (e-mail: zaliuhit@gmail.com).
Yuanzhi Cheng is with the School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China (e-mail: yzcheng2007@163.com).
Shinichi Tamura is with the Department of Radiology, Graduate School of Medicine, Osaka University, Osaka 565-0871, Japan, also with the NBL Technovator Company, Ltd., Sennan 590-0522, Japan, and also with the Rm I-271, ISIR, Osaka University, Osaka 567-0047, Japan (e-mail: tamuras@nblmt.jp).
Digital Object Identifier 10.1109/JBHI.2023.3281466
2168-2194 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Fig. 1. (a) An illustration of radiologists' learning routine, where abnormalities are gradually learned from common diseases to rare ones [2]; (b) Confusion matrix of DenseNet121 trained on the PLCO dataset. The position (i, j) indicates the ratio of times the model wrongly classifies an image as class j instead of the correct class i. The diagonal is removed to make small differences among the other elements more visible.

Recently, [5] advocated a novel CL, called local to global learning (LGL), which trains a model from fewer categories (local) to more categories (global) gradually within the entire dataset. The learning paradigm is inspired by the way humans memorize gradually from fewer categories to more categories. By simulating such a process, it has been empirically verified that LGL can avoid local minima and achieve better classification performance in many tasks [5]. LGL is therefore the kind of learning paradigm that can be exploited to realize our motivation.
However, LGL is developed for multi-class classification where a single one-hot prediction is made per image. As more than one abnormality can be observed on a CXR, we must operate in a multi-label setting. To this end, we extend LGL to the multi-label case and propose multi-label local to global learning (ML-LGL) to iteratively train the DNN model on gradually increasing abnormalities. We introduce three types of abnormality selection functions, correlation-based, similarity-based and frequency-based, to determine each abnormality's learning priority and create learning orders that resemble radiologists' progression. In summary, our proposed approach makes the following contributions:
• We propose ML-LGL, an extension of LGL for multi-label CXR classification. To the best of our knowledge, our work is the first attempt to investigate the application of category-based CL for medical image diagnosis, opening up a myriad of possibilities to explore category-based CL for improving disease diagnosis. We consider this pioneering aspect a significant advantage of our approach.
• We introduce three abnormality selection functions to construct radiologist-like learning order.
• We explain why our ML-LGL works well from the perspective of the model's initial stability during training.
• Experiments are performed on three public datasets, PLCO, CXR14 and CheXpert. The applicability is demonstrated and comparable performance is achieved compared with state-of-the-art approaches.

II. RELATED WORK

A. Multi-Label CXR Classification
We review the literature of DNN-based methods for CXR classification, dividing it into the following four groups.
1) General Network-Based Approach: Methods in this part use vanilla networks, simple variants and generic deep learning techniques for direct recognition. The first truly large-scale CXR classification work was started by [3], who published a dataset and introduced a unified framework. Some useful techniques were subsequently proposed to boost performance, such as dual architecture [6]. However, these networks still face challenges related to abnormality complexity, including disease-irrelevant regions, variable scales, positions, and similarities with surrounding tissue or other abnormalities.
2) Attention-Guided Approach: To alleviate these issues, attention-guided methods are proposed to extract more discriminative representation. Most methods use attention maps in different ways to capture disease-critical regions, such as grad-cam [7], threshold on feature maps [8] and sequential convolutions [9]. Particularly, channel attention can be used along with space attention. For example, [10] and [11] utilized extra channel attention to emphasize the critical disease-specific features. Moreover, [12] propose channel attention, element attention and scale attention to simultaneously address these challenges.
3) Domain Knowledge Integrated Approach: Integrating domain knowledge can also yield more powerful representation. Methods can be grouped into four families based on the type of integrated knowledge. 1) Report-based methods incorporate information from radiological reports, such as [13] and [14], which both embedded the text context. 2) Additional image-based methods utilize multiple related images as input. For example, [15] and [16] used GCN to model image-level and semantic-level similarities respectively to optimize the visual feature embedding. 3) Medical knowledge-based methods utilize certain medical knowledge of diseases. For instance, [17] and [18] proposed a conditional training regime to embed abnormality taxonomy. 4) Extra data-based methods use extra data for joint training, such as [19] and [20], which use a combination of ChestX-ray14 and PLCO to perform joint training.
4) Multi-Label Reasoning Approach: These methods are proposed to capture disease dependency. The dependency can be constructed in the label space in various ways, such as LSTM-decoder [21], cascade of binary classifiers [22] and GCN [23]. The dependency can also be modeled in the feature space. For example, [11] proposed a spatial and channel encoding module to strengthen semantic dependencies of multi-disease features. Recently, [24] introduced a novel semantic-interactive graph convolutional network (SI-GCN) that integrates concept correlations within each scene and semantic similarities across different scenes for multilabel image recognition. Note that our proposed selection function likewise utilizes the label correlation in each scene and incorporates it into the curriculum for improved performance. However, we emphasize that our ML-LGL, as a CL method, is not specifically designed for addressing the multi-label issue.

B. Related Learning Paradigms for Diagnosis
Our work most directly relates to the fields of curriculum learning (CL), self-paced learning (SPL), progressive learning
(PL) and brain-like computing. We closely review them in the medical diagnosis area and explain how they differ from ours.
1) CL: Using a curriculum in the context of machine learning was first proposed by [25], before [26] coined the term curriculum learning and trained deep neural network models gradually from easier to harder data within a dataset. CL has been extensively applied across various domains, demonstrating improved performance [27]. In medical image diagnosis, CL methods fall into two categories based on how the curriculum is generated. 1) Predefined curriculum-based methods manually design the curriculum using human prior knowledge. For example, in histopathology image classification, [28] and [29] leverage annotator agreement and disease severity as a proxy for image difficulty, respectively. 2) By contrast, transfer teacher-based methods invite a teacher model and measure training samples' difficulty based on its output. For example, [30] develop a teacher model with an evidence identification algorithm, so as to explore prior knowledge of training bias about diagnosis difficulty and local features for curriculum generation.
Our ML-LGL conducts training by progressive aggregation of disease categories within the entire dataset, which makes it a form of CL. A key difference between our approach and other CL methods is the space in which the curriculum is implemented. While others derive the curriculum over samples in the sample space, our method adopts a distinct type of curriculum, established over diseases in the category space.
2) SPL: SPL is the extension of CL that measures the difficulty of training examples according to their losses automatically and embeds the curriculum design into the loss function. For example, [31] exploit SPL to handle class imbalance of skin disease, by combining the sample number of each class and the difficulty of the sample. In multi-modal Alzheimer's disease classification [32], SPL is applied to dynamically estimate the contribution of each sample to the fusion model to avoid the influence of noise samples and outliers, and the sample significance is also helpful to capture the relevance across different modalities in the multi-modal fusion process. Our proposed ML-LGL is out of the scope of SPL, and more importantly, the curriculum of these methods also takes place in the sample space.
3) Progressive Learning: "Progressive Learning" is an overloaded term used with different meanings. In some literature, it means a kind of CL method where the curriculum is not related to the difficulty of each single sample but is configured instead as a progressive mutation of model capacity or architecture. For example, [33] progressively decreases the dropout probability during training and [34] progressively grows the capacity of GANs to obtain high-quality results. Another meaning of PL is related to continual or lifelong learning [35], which learns on increasing tasks using an infinite stream of data.
As a novel kind of CL method, ML-LGL does not belong to the scope of PL and differs from the aforementioned PL in two aspects: 1) ML-LGL performs curriculum in the category space and the model remains unchanged, whereas PL uses a changeable architecture during the learning process. 2) ML-LGL uses a fixed dataset while PL uses increasing and large-scale data flow.
4) Brain-Like Computing: Brain-like computing mimics the information processing mode and structure of the biological nervous system, proposing new computing theory, computer architecture and learning algorithms [36]. In medical diagnosis, the methods can be roughly divided into two families. 1) The network architecture-based methods primarily use bio-inspired spiking neural networks (SNNs) [37] in various tasks. For example, [38] proposes an SNN-based framework to mitigate the class imbalance problem in medical image classification, while [39] uses SNNs for COVID-19 detection. 2) The learning paradigm-based methods aim to mimic how humans learn in the real world. Representative works include transfer learning, few-shot learning [40] and the above-mentioned CL. Our proposed ML-LGL method is also inspired by how people learn, but it specifically focuses on the mechanism of how individuals learn from a few categories to more categories.

III. PRELIMINARIES: LGL

Consider the problem of multi-class classification on categories $\mathcal{Y}$ using a DNN model, where $w$ denotes the model weights and $L$ denotes the loss function. The definition of LGL introduced in [5] can be given as follows:
Definition 1: The LGL methodology is to iteratively train the DNN model by adding a new category's samples to the training set at each time.
LGL is implemented by iteratively minimizing the loss function on gradually increasing training sets. At the $k$-th iteration, let $\mathcal{Y}_k^l$ denote the local category, i.e., the categories that have already been trained, and let $w_k^*$ denote the convergent weights. At each iteration, the implementation consists of three steps:
1) Update the local category: Use the category selection function $f$ to select a new category $y_k^s$ from the remaining untrained categories, and push it into the local category to form a new set:

$$
\begin{aligned}
y_k^s &= f(\mathcal{Y} \setminus \mathcal{Y}_{k-1}^l) \\
\mathcal{Y}_k^l &= \mathcal{Y}_{k-1}^l \cup \{y_k^s\}
\end{aligned}
\tag{1}
$$

2) Build a new training set: Select samples whose labels are in the new local category to build the new training set $t_k$:

$$
t_k = \{(x_n, y_n) : y_n \in \mathcal{Y}_k^l\} \tag{2}
$$

3) Training: Train the model on the new training set $t_k$ using the loss function $L$ until convergence:

$$
w_k^* = \arg\min_{w} L(w; t_k, w_{k-1}^*) \tag{3}
$$

IV. METHOD

Building on LGL, we propose ML-LGL for multi-label CXR classification. Given the dataset $D_{cxr} = \{(x_n, Y_n)\}_{n=1}^{N}$ with $K$ abnormalities $\mathcal{Y} = \{y_1, y_2, \ldots, y_K\}$, $x_n$ is an input image, and $Y_n$ is its corresponding label, which is a subset of $\mathcal{Y}$. The flowchart of ML-LGL is shown in Fig. 2, and Algorithm 1 outlines the process. ML-LGL follows the general steps of LGL: selecting new disease patterns to update the local category, building a new training set, and training on the new set using dynamic loss. We will now delve into the details.

A. Selection Function for Local Category Updating
The core of updating the local category at each iteration is to use the selection function $f$ to choose diseases of high training priority. To enable the model to learn like a radiologist, $f$ must
characterize radiologists' learning patterns. Accordingly, we propose three selection functions that leverage clinical knowledge to establish radiologist-like learning order.

Fig. 2. ML-LGL for multi-label CXR classification, which iteratively trains the DNN model on gradually increasing abnormalities. It shows the three steps involved in each iteration: select new disease to update the local category, build a new training set, and train on the set using dynamic loss.

Fig. 3. Illustration of three clinical knowledge-leveraged selection functions: (a) co-occurrence frequency matrix of 12 disease patterns from the PLCO dataset; (b) multi-label conditional entropy computing; (c) frequency of disease pattern from PLCO. The 12 diseases are nodule, mass, pleural-based mass, granuloma, fluid in pleural space, hilar, infiltration, scarring, pleural fibrosis, bone lesion, cardiac abnormality, and COPD.

1) Correlation Function: Radiologists typically determine the learning order based on prior knowledge of comorbidity (Fig. 3(a)). They tend to prioritize learning diseases that are strongly related to other ones, as this enables them to leverage as much relevant experience gained from previously studied diseases as possible when diagnosing subsequent ailments. Therefore, we aim to assign higher training priority to strongly correlated diseases.
To achieve this, we use the co-occurrence frequency of two diseases to measure their correlation and use it to compute the total correlation for each disease. The disease with higher total correlation is selected first to be pushed into the local category for training. The correlation function is given by:

$$
f_{cor} = \arg\max_{y_i \in \mathcal{Y} \setminus \mathcal{Y}_{k-1}^l} \sum_{j=1, j \neq i}^{K} \frac{2 \times \#(y_i, y_j)}{\#(y_i) + \#(y_j)} \tag{4}
$$

where $\#(y_i)$ represents the number of samples where the disease $y_i$ exists, and $\#(y_i, y_j)$ represents the number of samples where both diseases $y_i$ and $y_j$ co-exist.
2) Similarity Function: Radiologists tend to prioritize learning about similar diseases based on their historical learning experiences, which motivates us to assign a higher training priority to diseases more similar to those already learned.
To achieve this, we propose multi-label conditional entropy (MLCE) as a similarity metric. As shown in Fig. 3(b), given the model weights $w$ and image dataset $X$, the MLCE can be
computed as follows:

$$
\begin{aligned}
H_w(O \mid X) &= \sum_{x \in X} p(x)\, H_w(O \mid X = x) \\
&= -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left[ o_n^k \log_2 o_n^k + (1 - o_n^k) \log_2 (1 - o_n^k) \right]
\end{aligned}
\tag{5}
$$

where $O$ is the output of the classification layer and $o_n^k$ is the sigmoid output of the $k$-th node for image $x_n$.

Algorithm 1: ML-LGL for CXR Classification.

We quantify the similarity between each untrained disease and the local category. The one with the highest similarity value is selected first. Let $X_{y_i}$ represent the images labeled solely with $y_i$; the similarity function is given as:

$$
f_{sim} = \arg\max_{y_i \in \mathcal{Y} \setminus \mathcal{Y}_{k-1}^l} H_{w_{k-1}}(O \mid X_{y_i}) \tag{6}
$$
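As a rough illustration of the scores behind the correlation function (4) and the MLCE (5), the following minimal Python sketch computes both from plain lists. The function names (`total_correlation`, `select_by_correlation`, `mlce`), the representation of labels as Python sets, and the clamping constant `eps` are assumptions made for this example and do not come from the paper.

```python
import math


def total_correlation(label_sets, candidate, all_diseases):
    """Inner sum of Eq. (4): sum over j != i of 2*#(yi, yj) / (#(yi) + #(yj))."""
    def count(*diseases):
        # Number of samples whose label set contains every given disease.
        return sum(1 for labels in label_sets if all(d in labels for d in diseases))

    score = 0.0
    for other in all_diseases:
        if other == candidate:
            continue
        denom = count(candidate) + count(other)
        if denom:  # skip pairs that never occur (assumption: they contribute 0)
            score += 2 * count(candidate, other) / denom
    return score


def select_by_correlation(label_sets, all_diseases, local_category):
    """arg max of Eq. (4) over diseases not yet in the local category."""
    remaining = [d for d in all_diseases if d not in local_category]
    return max(remaining, key=lambda d: total_correlation(label_sets, d, all_diseases))


def mlce(sigmoid_outputs, eps=1e-12):
    """Eq. (5): mean over images of the summed binary entropies (base 2).

    sigmoid_outputs is a list of per-image lists of K sigmoid outputs.
    Values are clamped to [eps, 1 - eps] to avoid log2(0) (an assumption).
    """
    total = 0.0
    for outputs in sigmoid_outputs:
        for o in outputs:
            o = min(max(o, eps), 1 - eps)
            total += o * math.log2(o) + (1 - o) * math.log2(1 - o)
    return -total / len(sigmoid_outputs)
```

In this sketch, maximally uncertain outputs (all 0.5) give an MLCE of K bits per image, while confident outputs give a value near zero, which matches the reading of MLCE as an uncertainty measure over the sigmoid outputs.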
Note that the initial disease is chosen based on weights transferred from ImageNet.
3) Frequency Function: As mentioned in the introduction, radiologists typically start by learning commonly occurring disease patterns and gradually move on to rare ones over time. To emulate this approach, we introduce the frequency function:

$$
f_{fre} = \arg\max_{y_i \in \mathcal{Y} \setminus \mathcal{Y}_{k-1}^l} \#(y_i) \tag{7}
$$

This function prioritizes disease patterns with high frequency.

B. Build New Training Set
The updated local category is then used to construct the training set for the current iteration. In the multi-class case, the dataset is obtained by selecting samples whose labels are in the local category (see (2)). However, in our multi-label case, this operation is not applicable because some samples have multiple labels, part of which do not belong to the local category. As an alternative, we select images whose labels intersect with the local category to create a new training set:

$$
t_k = \{(x_n, Y_n) : Y_n \cap \mathcal{Y}_k^l \neq \emptyset\} \tag{8}
$$

C. Training on Dynamic Loss
We use DenseNet121 as the backbone to train the DNN model on the new training set. The original fully connected and classification layers are removed, and a fully-connected layer of $K$ neurons and a classification layer with sigmoid function are added. Each output represents the likelihood of belonging to the corresponding disease pattern, and the cross-entropy loss is used for each disease pattern. To train the model only on the local category, we control the loss function using the characteristic function, resulting in a dynamic loss function:

$$
L(w; t_k, w_{k-1}) = \sum_{y \in \mathcal{Y}} CE(z_y, z(y)) \cdot \mathbb{1}_{y \in \mathcal{Y}_k^l} \tag{9}
$$

where $CE(\cdot, \cdot)$ denotes the cross-entropy loss, $y$ denotes the disease pattern, $z(y) \in \{0, 1\}$ denotes its ground truth, and $z_y$ is the model's corresponding sigmoid output.
To speed up training, multiple disease patterns can be added simultaneously in each iteration, referred to as a "bucket". The number of buckets is denoted as $n_b$, and the number of disease patterns added in the $k$-th iteration can be calculated using:

$$
add(k) =
\begin{cases}
K / n_b & k < n_b \\
K - (n_b - 1) \times K / n_b & k = n_b
\end{cases}
\tag{10}
$$

This is more efficient than adding disease patterns one by one, which is time-consuming due to the linear relationship between the training time and the number of disease patterns ($K$) as proven in [5], as well as the time required for abnormality selection, especially for the dynamic similarity function that involves feedforward estimation over images.

D. Difference With LGL
Although our proposed ML-LGL shares similarities with the original LGL, there are several key differences that highlight the novelty of our method. First, LGL was originally developed for single-label classification and tested on single-label natural image datasets. However, the multi-label setting of CXR makes LGL unsuitable for direct use. To achieve our objective of developing a DNN model for CXR abnormality classification, we must consider the multi-label setting in our implementation, such as in the construction of the new training set and in the loss function. Consequently, we utilize LGL as a solid foundation for our ML-LGL, adapting it to accommodate the specific requirements of the multi-label context.
Second, our ML-LGL provides more elaborate selection functions that embody the inherent characteristics of CXR datasets. For instance, the correlation function considers label relations, yielding improved performance in experiments. In contrast, the original LGL focused on proposing the overall learning paradigm and was only tested with two simple and universal category selection functions (random and similarity).
Third, our motivation for developing ML-LGL is more profound. As presented in the introduction, besides being inspired by radiologists' training processes, our ML-LGL is also motivated by the observation that model errors often stem from confused disease classes rather than independent samples, which motivates us to implement category-based CL in medical image diagnosis for the first time. In contrast, LGL is only inspired by the human learning mechanism of gradually learning more categories.
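The per-iteration mechanics of Section IV (the frequency function (7), the training-set filter (8), the masked loss (9) and the bucket schedule (10)) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: all function names are hypothetical, labels are represented as Python sets, the quotient K/n_b in (10) is assumed to be rounded down, and the loss is written for a single image with probabilities stored in a dict rather than as a batched tensor operation.

```python
import math


def bucket_sizes(num_diseases, num_buckets):
    """Eq. (10), assuming K / n_b is rounded down (an assumption of this sketch)."""
    base = num_diseases // num_buckets
    last = num_diseases - (num_buckets - 1) * base
    return [base] * (num_buckets - 1) + [last]


def build_training_set(dataset, local_category):
    """Eq. (8): keep (image, labels) pairs whose label set intersects the local category."""
    return [(x, labels) for x, labels in dataset if labels & local_category]


def dynamic_loss(probs, labels, local_category, eps=1e-12):
    """Eq. (9) for one image: binary cross-entropy summed over local diseases only.

    probs maps disease name -> sigmoid output; labels is the set of positive diseases.
    """
    loss = 0.0
    for disease, p in probs.items():
        if disease not in local_category:
            continue  # indicator term in Eq. (9) is zero
        p = min(max(p, eps), 1 - eps)
        target = 1.0 if disease in labels else 0.0
        loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return loss


def frequency_schedule(dataset, all_diseases, num_buckets):
    """Grow the local category bucket by bucket, most frequent diseases first (Eq. (7))."""
    freq = {d: sum(1 for _, labels in dataset if d in labels) for d in all_diseases}
    ordered = sorted(all_diseases, key=lambda d: -freq[d])  # stable for ties
    local, schedule, start = set(), [], 0
    for size in bucket_sizes(len(all_diseases), num_buckets):
        local = local | set(ordered[start:start + size])
        schedule.append(local)
        start += size
    return schedule
```

Because the frequency counts do not depend on the model, the whole order can be fixed up front in this sketch; the similarity function in (6) would instead have to be re-evaluated with the current weights at every iteration.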
Overall, our ML-LGL approach is a novel and effective solution for multi-label CXR abnormality classification, with unique features that distinguish it from LGL.

V. EXPLANATION OF MODEL STABILITY PERSPECTIVE

To understand why ML-LGL works well, we give an explanation from the perspective of model stability during training. First, we define model stability as follows:
Definition 2: Given the image dataset $X$ and model weights $w$, we use the multi-label conditional entropy (MLCE) defined in (5) to quantify the model stability. A lower MLCE signifies a more stable model.
We then devise a series of comparative experiments using the PLCO dataset. A non-ML-LGL model is established, which shares the same DenseNet121 backbone as the ML-LGL model but employs random initialization for training. The ML-LGL model is trained using $n_b = 12$ and the correlation function, while the non-ML-LGL model is trained on a series of updated datasets established by the ML-LGL model (see (8)). We report the stability at the initial moment and the AUC performance at the convergent moment for each training session of the two models. Both metrics are evaluated on the corresponding subset of the validation set, i.e., the image-label pairs whose labels intersect with the currently trained diseases. Consequently, we obtain the metric values on various disease numbers.

Fig. 4. (a) Initial stability across varying disease numbers; (b) AUC performance at the convergent state for different disease numbers.

As shown in Fig. 4, the stability value (MLCE) of the ML-LGL model tends to remain unchanged once the disease number exceeds 8, whereas that of the non-ML-LGL model demonstrates sustained growth with a slope of 1. More importantly, the stability gap and the AUC gap between the two models synchronously expand as the disease number increases, indicating a positive correlation between the reduced initial stability value achieved by ML-LGL and the AUC performance. Therefore, we infer that ML-LGL benefits the training process by lowering the initial stability value (MLCE), ultimately attaining improved performance. Note that our explanation is provided from the perspective of initial stability, grounded in experimental validation. More theoretical explanations could be given from the continuation method [41] or data distribution [42].

VI. EXPERIMENT AND RESULTS

A. Dataset
We conduct experiments on three publicly available CXR datasets: PLCO [43], ChestX-ray14 [3] and CheXpert [44]. Table I gives an overview.

TABLE I
OVERVIEW OF THE THREE DATASETS

1) PLCO: PLCO is a CXR arm of the National Cancer Institute's screening trial for prostate, lung, colorectal, and ovarian cancer [43]. There are 198,000 images in total, but only 84,182 images from 25k participants are publicly distributed, with 13 disease patterns labeled by radiologists' visual observation. We exclude major atelectasis due to its low prevalence and use the remaining 12 for experiments. As no official split is provided, we split the data into three patient-level subgroups: 70% for training, 10% for validation, and 20% for testing, maintaining disease pattern prevalence in each subgroup.
2) ChestX-Ray14: ChestX-ray14 (CXR14), published by the US National Institute of Health, is the most widely used benchmark in the field of CXR analysis. The 14 labels are mined from radiological reports by using natural language processing (NLP). To establish a consistent benchmark, we follow the official patient-wise split standard [3]. During training, we use 10% of images from the training set for validation.
3) CheXpert: CheXpert is a large-scale dataset released by Stanford Hospital, and here we use the 1.0 version dataset with downsampled resolution. The labels of the training set are automatically extracted from radiology reports using a rule-based labeler, while the validation and test sets are manually annotated by the consensus of board-certified radiologists. 12 common disease patterns are labeled as positive, negative or uncertain. The test set is not publicly available, so we evaluate our method on the validation set. We adopt two commonly utilized policies [44] for handling uncertain labels: 1) U-Zeros, which replaces all the uncertain labels with "zero", and 2) U-Ones, which replaces all the uncertain labels with "one".
4) Remarks: Label Noise: The CXR14 and CheXpert datasets use NLP to extract labels from reports, which can result in label noise for two main reasons. First, NLP has high levels of error and uncertainty [44]. Second, text reports cannot replace visual examination of the image due to insufficient or overly descriptive information (e.g., lab tests or prior radiological studies). Label noise is severe in CXR14, with positive predictions mostly 10% to 30% lower than original values [45]. As a result, PLCO is the only large-scale CXR dataset with labels produced by radiologists' visual observation of CXR images.

B. Experimental Setup
1) Evaluation: The per-class area under the ROC curve (AUC) is adopted to measure the performance of each disease pattern, and the average AUC is employed to evaluate overall performance. In our experiments, differences between the two compared methods are assessed using the one-sided DeLong statistical test [46] and the p-value < 0.05 is considered as the
significance level for all statistical tests. The statistical test is performed using a Matlab implementation.1
1 https://github.com/PamixSun/DeLongUI
TABLE II
THE AUC SCORES OF VARIOUS MODELS ON PLCO
2) Training Details: During training, the raw images of PLCO and CXR14 are resized to 512 × 512 via down-sampling and random cropping, while images of CheXpert are directly cropped to 320 × 320. The processed images are normalized with the mean and standard deviation of ImageNet images. We only use horizontal flipping to augment the training data.
The framework is trained with an Adam optimizer with default β parameters (β1 = 0.9, β2 = 0.99), a weight decay of 0.01, and an adaptive learning rate. The strategy proposed in [47] is used to find the proper initial learning rate at each iteration, and the learning rate is reduced by a factor of 5 when the mean AUC on the validation set plateaus for more than 5 epochs. The mean AUC on the validation set is also used to terminate the training procedure when it plateaus for more than 10 epochs, and the model with the highest mean AUC on the validation set is taken as the final model. For validation and inference, we crop 512 × 512 (PLCO and CXR14) and 320 × 320 (CheXpert) sub-images (a central one and four corner ones) as the network input. Our experiments are implemented with PyTorch on two 2080Ti GPUs with 12 GB of memory, and the batch size is set to 32 (CheXpert) and 16 (CXR14 and PLCO).
C. Parameter Analysis for Bucket Number
We investigated the impact of the bucket number on the performance of ML-LGL by testing different nb values, namely nb = 2, 3, 4, 5, using the three selection functions. The experiments were conducted on PLCO, which provides more reliable results due to its cleanliness. The results presented in Fig. 5 demonstrate that increasing nb leads to a continuous improvement in the mean AUC of all three functions. However, the growth rate is sublinear. As mentioned in Section IV-C, a larger nb requires more training time, indicating a need to balance performance and training time. In subsequent experiments, we set nb = 4 for PLCO and CXR14 and nb = 3 for CheXpert.
Fig. 5. Comparison of mean AUC evaluated on PLCO with different nb.
D. Comparison With the Baseline
The mean AUC of the three functions outperforms the baseline on all three datasets, as shown in Tables II–IV. Per-class AUC indicates significant improvement on over half of the disease patterns on their respective datasets, except for the frequency function on PLCO. These findings demonstrate the effectiveness of ML-LGL in enhancing CXR classification performance. Abnormality heatmaps, generated by Grad-CAM, visually illustrate this effectiveness. In the single-label sample (Fig. 7(a)), all three functions increase the likelihood of mass and improve the match between the identified abnormality area and the bounding box, particularly for the correlation function. In the two-label sample (Fig. 7(b)), the similarity and correlation functions correctly diagnose cardiomegaly and indicate well-matched abnormality areas, while the baseline predicts no cardiomegaly and misdiagnoses effusion with a likelihood of 0.730. However, not all disease predictions show significant improvement, such as infiltration, fibrosis and pneumonia on CXR14, as depicted in Fig. 7(c). For infiltration, the three functions demonstrate a likelihood as low as the baseline and no high activation values in the corresponding bounding box, despite higher likelihood and more precise disease localization for its co-existing diseases (atelectasis and effusion).
E. Comparison of Three Selection Functions
Results vary across the three functions according to Tables II–IV. On PLCO, the correlation and similarity functions have comparable mean AUC and both significantly outperform the frequency function on most diseases, with the correlation function showing notable improvement in 7 out of 12 diseases. On CXR14, the mean AUC difference among the three functions is not noticeable, with only 4 out of 14 diseases showing a statistically significant difference between the frequency function and the correlation/similarity function. On CheXpert, the difference is significant for 13 diseases but not for 5. We attribute the performance difference for three functions
not only to inherent variations among datasets but also to label noise, which we will explore in Section VII-D.
TABLE III
THE AUC SCORES OF VARIOUS MODELS ON CXR14
TABLE IV
THE AUC SCORES OF VARIOUS MODELS ON CHEXPERT
F. Comparison With State-of-the-Art Methods
We compare our method against state-of-the-art methods on the three datasets, and Fig. 6 shows the ROC curves of our best results.
1) Results on PLCO: We compared our method only with HMLC [17] due to limited studies on this dataset. [20] also experimented on PLCO but used a previously published dataset with up to 198000 images, which is now not available due to changes in the data release policy by the organizers. Moreover, this method reported results using training data that included samples from CXR14. It is also not possible to make a direct comparison with HMLC regarding the per-class AUC score, as they use different data splits. Thus, to ensure fairness, we only compared the overall performance. As shown in Table II, our correlation function outperforms HMLC, and the similarity function achieves comparable results, demonstrating the effectiveness of ML-LGL.
2) Results on CXR14: We select 15 typical and well-performing methods that are evaluated on the official patient-wise split. To provide more comprehensive comparisons, we attempt to compare our proposed method with SI-GCN by [24], which
has shown superior performance on multi-label natural images. However, the results are obtained from different datasets and thus cannot be used for reliably comparing the performance of the two methods. Fortunately, we note that CheXGCN by [23] and SSGE by [16] are two special cases of SI-GCN tailored for CXR classification. Specifically, CheXGCN and SSGE both mirror the corresponding components of SI-GCN to build concept correlations of the same scene and the semantic similarities of different scenes, respectively. Therefore, we use the results reported in [16], [23] for our comparison.
As shown in Table III, our similarity function achieves the second-highest mean AUC, narrowly beaten by [19], who used CheXpert as an external dataset for joint training. Notably, our similarity function performed comparably with the previous best results, [48] and SSGE [16], both of which used only CXR14. [48] proposed squeeze-and-excitation blocks, multi-map transfer and max-min pooling to learn disease-specific features, thus achieving good performance. SSGE [16] explored the semantic similarities of in-batch images to optimize the feature embedding, also leading to good results. As the previous best work that exploits label correlations, CheXGCN [23] compares fairly against our frequency function, but is inferior to our correlation and similarity functions. These observations demonstrate the superiority of our proposed ML-LGL.
Compared to the multi-modal learning method TieNet [13], which uses additional reports, and the multi-view method ImageGCN [15], which uses multi-relational images (e.g., from the same age and person), the three functions we proposed consistently achieve the highest mean AUC with higher per-class AUC on almost all diseases. Furthermore, they outperform AG-CL [4], which also uses CL for training, by a significant margin. These comparisons again demonstrate the advantage of our proposed ML-LGL.
Fig. 6. The ROC curve of our best results from the three datasets.
3) Results on CheXpert: For a fair comparison, we selected state-of-the-art methods that were evaluated on the official validation set. It should be noted that although the training dataset consists of 14 observations, the validation set only contains 13 diseases, with no samples of fracture, and some methods only evaluated the 5 clinically important and prevalent diseases. Therefore, we report the average AUC for both the 5 and the 13 disease types.
We first examine the performance of the correlation function on the 5 diseases. Table IV shows that the mean AUC achieved by our correlation function is comparable to the previous best method, CheXpert [44], in the U-zeros setting, despite CheXpert using an ensemble of 30 models. Moreover, our correlation function ranks first in the U-ones setting.
Moving on to the performance on the 13 diseases, we find that our correlation function achieves the best overall performance and shows the top performance on more than half of the diseases in both label settings, outperforming the other two multi-view learning methods: MVC-Net [49] and ImageGCN [15], which take multi-view images (i.e., frontal and lateral) and multi-relational images as input, respectively.
Note: The original results of ImageGCN on both CXR14 and CheXpert datasets, as well as MVC-Net on CheXpert, were reported on their respective data splits. We have reported the results using the released code.2 In this comparison, ImageGCN employed DenseNet121 as the backbone model.
2 https://github.com/fzfs/Multi-view-Chest-X-ray-Classification; https://github.com/mocherson/ImageGCN
VII. ABLATION STUDY
A. Is Radiologists-Like Learning Order Beneficial?
To examine how much the radiologist-like learning order contributes to ML-LGL, we introduce a new selection function for comparison, called the random function, which selects abnormalities randomly at each iteration. Table II shows that the random function also achieves a higher mean AUC than the baseline, demonstrating that ML-LGL can naturally benefit CXR classification regardless of learning order. This finding supports the motivation of LGL [5] and our explanation in Section V.
We compared the random function with the three clinical knowledge-leveraged functions in terms of mean AUC and per-class AUC. We found that the random function achieved results similar to the frequency function, but was outperformed by the correlation and similarity functions. This suggests that the correlation and similarity orders can benefit the training process, while the frequency order cannot. The similarity order makes the model start training from the most stable state at each iteration, as demonstrated in Section V, which may explain the superior performance. The improvement brought by the correlation function may be attributed to the exploitation of label relations, which is consistent with previous methods [11], [23] that have demonstrated the boosting of performance through label relation modeling.
B. Deep Analysis of Radiologists-Like Learning Order
To aetiologically understand the radiologists-like learning order, we tested the reverse versions of the correlation and similarity functions. The anti-correlation function learns more weakly correlated disease patterns first, while the anti-similarity function learns dissimilar disease patterns first. The reverse version of the frequency function is not tested as it does not improve performance.
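To make the orderings compared in this section concrete, the sketch below shows how a priority order over abnormalities (original, reversed, or random) could be cut into nb cumulative training stages, from fewer abnormalities (local) to all of them (global). It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy priority scores and the static (non-iterative) scoring are hypothetical, and the near-equal bucket sizes follow the paper's remark that an equal number of diseases is added at each iteration.

```python
import random


def order_labels(scores, mode="original", seed=0):
    """Return label names sorted by priority.

    scores: dict mapping label -> hypothetical priority score
            (higher means "learn earlier"); a stand-in for any selection function.
    mode:   "original" (high score first), "reverse" (low score first),
            or "random" (scores ignored).
    """
    labels = sorted(scores)  # deterministic base order
    if mode == "random":
        rng = random.Random(seed)
        rng.shuffle(labels)
        return labels
    return sorted(labels, key=lambda name: scores[name], reverse=(mode == "original"))


def cumulative_stages(ordered_labels, n_buckets):
    """Split the ordered labels into n_buckets near-equal buckets and return
    the growing label set used at each stage (local to global)."""
    n = len(ordered_labels)
    if not 1 <= n_buckets <= n:
        raise ValueError("n_buckets must be between 1 and the number of labels")
    stages = []
    for b in range(1, n_buckets + 1):
        end = (b * n + n_buckets - 1) // n_buckets  # ceiling of b * n / n_buckets
        stages.append(ordered_labels[:end])
    return stages


# Toy scores (assumed values, for illustration only).
toy_scores = {"effusion": 0.9, "atelectasis": 0.7, "mass": 0.4,
              "nodule": 0.3, "fibrosis": 0.1, "hernia": 0.05}

for mode in ("original", "reverse", "random"):
    stages = cumulative_stages(order_labels(toy_scores, mode), n_buckets=3)
    print(mode, [len(stage) for stage in stages], stages[0])
```

Each stage contains all labels of the previous stage plus one more bucket, so the last stage always covers the full label set regardless of the ordering mode.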
Fig. 7. Classification results visualization from CXR14. Images with one label (a), two labels (b) and three labels (c) are shown. From top to bottom: heatmaps generated by the baseline and the three selection functions. The five highest sigmoid values are shown, with the ground truth highlighted in red.
Comparing the results in Table II, the reverse versions had significantly different outcomes compared to their original counterparts. Surprisingly, the anti-similarity function shows a higher average AUC and improves 5 disease patterns compared to its original version. This may be because the similarity function prioritizes diseases with greater similarity, which are more informative and further away from the currently trained images. Training on these images first can rapidly reduce the error and lead to better outcomes. Conversely, the anti-correlation function decreased the mean AUC drastically and approached the baseline, indicating that it is crucial to learn strongly correlated disease patterns first and that an incorrect learning order may adversely affect the training process.
C. Generalizability on Different DNN Models
To validate ML-LGL’s generalizability across models, another three DNN models are involved for evaluation: VGG16, ResNet50 and AG-CNN [8]. The experiments were conducted on PLCO using the correlation and similarity functions, and AG-CNN was implemented using code3 provided by the authors. As shown in Table V, the performance increased with the model’s representation power (from VGG16 to AG-CNN). Both selection functions yielded mean AUC improvement for all models compared to their respective baselines, indicating the effectiveness of our approach for DNN models with different levels of complexity.
3 https://github.com/Ien001/AG-CNN/blob/master/
TABLE V
RESULTS OF DIFFERENT DNN MODELS ON THE PLCO DATASET
D. Limitation
Our approach is susceptible to label noise because the learning order is built on the labels, causing the clinical knowledge-leveraged function to degrade into a random function and hindering learning ability. Our experimental results in Tables III and IV, which show no significant difference in the mean AUC of the three functions, seem to support this hypothesis.
To test our hypothesis, we perform experiments that simulate different levels of label noise severity and monitor the performance change. Since PLCO has greater label reliability, we regard it as a base dataset to simulate the label noise scenario that CXR14 and CheXpert face. Label noise specifically refers to three cases: missed finding, over-labeling and mislabeling. Missed finding means an abnormality is observed on a CXR but is not annotated. Over-labeling means an abnormality does not exist on a CXR but is labeled as positive. Mislabeling means an abnormality is misidentified and labeled under another name. Accordingly, we corrupt PLCO to produce the scenario using the following controlled scheme:
• Choose a base probability α ∈ [0, 0.25].
• For each sample with abnormality, we randomly select one of its labels and delete it with a probability of α.
• For each sample with abnormality, we replace one of its labels with another label (not originally included in the sample) with a probability of α.
• For each sample, we add an extra label (not originally included in the sample) with a probability of α.
• The previous operation is overlaid by the subsequent one on each sample.
We tested α = 0, 0.05, 0.1, 0.15, 0.2, 0.25 using the original and reverse versions of the correlation and similarity functions, with the random function also tested for comparison. To allow comparisons across α, we ensured that if a label was operated on at a certain α value, it would also be operated on at all larger α values.
As noise severity increases, overall performance drops for all three functions as shown in Fig. 8, confirming the negative impact of label noise. The anti-similarity function consistently outperforms the similarity function, and the correlation function also performs consistently better than the anti-correlation function. Notably, as the severity of noise increases, the performance gap between the original, reverse and random functions shrinks, with the original and reverse functions approaching the performance of the random function when α = 0.25. These results provide further evidence for our hypothesis.
Fig. 8. Mean AUC with 95% confidence interval at different levels of label noise severity using various selection functions.
VIII. DISCUSSION AND CONCLUSION
In this paper, we propose ML-LGL, a novel learning paradigm tailored for multi-label CXR classification. It trains a DNN model with a curriculum of gradually increasing abnormalities, which is generated by clinical knowledge-leveraged selection functions. ML-LGL benefits the training process by initiating models from a more stable state. Our evaluations demonstrate comparable performance with state-of-the-art methods on the PLCO, CXR14, and CheXpert datasets. The proposed learning orders of correlation and similarity improved training, and their reverse orders yielded different results. Our experiments with different levels of label noise severity indicated that ML-LGL lacks robustness to label noise.
Our ML-LGL is a category space-based CL method and is model-agnostic, making it applicable to all existing multi-modal CXR classification methods, such as ImageGCN [15], TieNet [13] and MVC-Net [49], irrespective of the fusion techniques used. Concerning its extendability to multimodal learning beyond medical image diagnosis, two limitations should be noted. First, while multimodal learning involves various tasks (segmentation, translation and alignment), ML-LGL is currently only extendable to classification models. Further research will explore the possibility of extending to other tasks. Second, ML-LGL can only be applied in the category space, and not in the sample space [32] or modality space [50].
There are several interesting avenues for future work. First, we could explore other radiologist-like learning orders, such as a hierarchical curriculum based on abnormality taxonomy [17]. Second, instead of adding an equal number of diseases at each iteration, we could use clustering algorithms to automatically generate more compact buckets to improve performance, as described in [51]. Third, while this paper only introduces radiologist-like curricula, ML-LGL is flexible enough to support curricula that extend beyond mimicking radiologists. For example, the confusion matrix depicted in Fig. 1(b) can be used to develop a hierarchical curriculum where diseases are trained from coarse to fine levels. Specifically, the model that is directly trained on the training set is first used to estimate the matrix. Then, the matrix is input into a hierarchical clustering algorithm to construct the disease hierarchy, where diseases that are more frequently confused are gathered at lower levels. Such a heuristic curriculum does not require the expertise of radiologists and may yield even greater improvements.
Considering the potential applicability of ML-LGL in clinical settings, it is crucial to pay attention to situations where the model can be attacked by imperceptible and carefully engineered adversarial examples. To prepare for this, several aspects can be investigated in the future. First, we need to evaluate the adversarial robustness of the model. Unlike assessments on natural images and other medical diagnosis tasks as discussed in [52], our evaluation must consider the factors of label co-occurrence and label noise. We also need to perform the evaluation on different datasets using various threat models (white box, gray box and black box) and attacks of different intensities. Second, we need to explore how adversarial examples attack the model and identify its vulnerability. Basic exploration techniques could include analyzing intermediate features and estimating label correlations, for both original/clean images and adversarial ones, as introduced by [53]. Third, based on the above analysis, we need to design an easy-to-deploy and effective defense method to handle adversarial examples. Doing so will improve the robustness of the model against adversarial attacks, and therefore increase its utility in clinical settings.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions that greatly improved the paper’s quality.
REFERENCES
[1] M. Chavez-MacGregor, X. Lei, H. Zhao, P. Scheet, and S. H. Giordano, “Evaluation of COVID-19 mortality and adverse outcomes in US patients with or without cancer,” JAMA Oncol., vol. 8, no. 1, pp. 69–78, 2022.
[2] S. Sait and M. Tombs, “Teaching medical students how to interpret chest X-rays: The design and development of an e-learning resource,” Adv. Med. Educ. Pract., vol. 12, 2021, Art. no. 123.
[3] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2097–2106.
[4] Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers, “Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs,” in Proc. Int. Workshop Mach. Learn. Med. Imag., Springer, 2018, pp. 249–258.
[5] H. Cheng, D. Lian, B. Deng, S. Gao, T. Tan, and Y. Geng, “Local to global learning: Gradually adding classes for training deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4748–4756.
[6] B. Chen, J. Li, X. Guo, and G. Lu, “DualCheXNet: Dual asymmetric feature learning for thoracic disease classification in chest X-rays,” Biomed. Signal Process. Control, vol. 53, 2019, Art. no. 101554.
[7] H. Wang, H. Jia, L. Lu, and Y. Xia, “Thorax-Net: An attention regularized deep neural network for classification of thoracic diseases on chest radiography,” IEEE J. Biomed. Health Inform., vol. 24, no. 2, pp. 475–485, Feb. 2020.
[8] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang, “Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification,” 2018, arXiv:1801.09927.
[9] Q. Guan and Y. Huang, “Multi-label chest X-ray image classification via category-wise residual attention learning,” Pattern Recognit. Lett., vol. 130, pp. 259–266, 2018.
[10] B. Chen, J. Li, G. Lu, and D. Zhang, “Lesion location attention guided network for multi-label thoracic disease classification in chest X-rays,” IEEE J. Biomed. Health Inform., vol. 24, no. 7, pp. 2016–2027, Jul. 2020.
[11] Q. Guan, Y. Huang, Y. Luo, P. Liu, M. Xu, and Y. Yang, “Discriminative feature learning for thorax disease classification in chest X-ray images,” IEEE Trans. Image Process., vol. 30, pp. 2476–2487, 2021.
[12] H. Wang, S. Wang, Z. Qin, Y. Zhang, R. Li, and Y. Xia, “Triple attention learning for classification of 14 thoracic diseases using chest radiography,” Med. Image Anal., vol. 67, 2021, Art. no. 101846.
[13] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers, “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9049–9058.
[14] G. Jacenków, A. Q. O’Neil, and S. A. Tsaftaris, “Indication as prior knowledge for multimodal disease classification in chest radiographs with transformers,” in Proc. IEEE 19th Int. Symp. Biomed. Imag., 2022, pp. 1–5.
[15] C. Mao, L. Yao, and Y. Luo, “ImageGCN: Multi-relational image graph convolutional networks for disease identification with chest X-rays,” IEEE Trans. Med. Imag., vol. 41, no. 8, pp. 1990–2003, Aug. 2022.
[16] B. Chen, Z. Zhang, Y. Li, G. Lu, and D. Zhang, “Multi-label chest X-ray image classification via semantic similarity graph embedding,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2455–2468, Apr. 2022.
[17] H. Chen, S. Miao, D. Xu, G. D. Hager, and A. P. Harrison, “Deep hiearchical multi-label classification applied to chest X-ray abnormality taxonomies,” Med. Image Anal., vol. 66, 2020, Art. no. 101811.
[18] H. H. Pham, T. T. Le, D. Q. Tran, D. T. Ngo, and H. Q. Nguyen, “Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels,” Neurocomputing, vol. 437, pp. 186–194, 2021.
[19] L. Luo et al., “Deep mining external imperfect data for chest X-ray disease screening,” IEEE Trans. Med. Imag., vol. 39, no. 11, pp. 3583–3594, Nov. 2020.
[20] S. Gündel et al., “Robust classification from noisy labels: Integrating additional knowledge for chest radiography abnormality assessment,” Med. Image Anal., vol. 72, 2021, Art. no. 102087.
[21] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman, “Learning to diagnose from scratch by exploiting dependencies among labels,” 2017, arXiv:1710.10501.
[22] P. Kumar, M. Grewal, and M. M. Srivastava, “Boosted cascaded convnet for multilabel classification of thoracic diseases in chest radiographs,” in Proc. Int. Conf. Image Anal. Recognit., Springer, 2018, pp. 546–552.
[23] B. Chen, J. Li, G. Lu, H. Yu, and D. Zhang, “Label co-occurrence learning with graph convolutional networks for multi-label chest X-ray image classification,” IEEE J. Biomed. Health Inform., vol. 24, no. 8, pp. 2292–2302, Aug. 2020.
[24] B. Chen, Z. Zhang, Y. Lu, F. Chen, G. Lu, and D. Zhang, “Semantic-interactive graph convolutional network for multilabel image recognition,” IEEE Trans. Syst., Man, Cybern. Syst., vol. 52, no. 8, pp. 4887–4899, Aug. 2022.
[25] J. L. Elman, “Learning and development in neural networks: The importance of starting small,” Cognition, vol. 48, no. 1, pp. 71–99, 1993.
[26] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[27] X. Wang, Y. Chen, and W. Zhu, “A survey on curriculum learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 4555–4576, Sep. 2022.
[28] J. Wei et al., “Learn like a pathologist: Curriculum learning by annotator agreement for histopathology image classification,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 2473–2483.
[29] M. Yang, Z. Xie, Z. Wang, Y. Yuan, and J. Zhang, “Su-MICL: Severity-guided multiple instance curriculum learning for histopathology image interpretable classification,” IEEE Trans. Med. Imag., vol. 41, no. 12, pp. 3533–3543, Dec. 2022.
[30] R. Zhao, X. Chen, Z. Chen, and S. Li, “EGDCL: An adaptive curriculum learning framework for unbiased glaucoma diagnosis,” in Proc. Eur. Conf. Comput. Vis., Springer, 2020, pp. 190–205.
[31] J. Yang et al., “Self-paced balance learning for clinical skin disease recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 8, pp. 2832–2846, Aug. 2020.
[32] Q. Zhu, N. Yuan, J. Huang, X. Hao, and D. Zhang, “Multi-modal AD classification via self-paced latent correlation analysis,” Neurocomputing, vol. 355, pp. 143–154, 2019.
[33] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino, “Curriculum dropout,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3544–3552.
[34] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” 2017, arXiv:1710.10196.
[35] M. De Lange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3366–3385, Jul. 2022.
[36] W. Ou, S. Xiao, C. Zhu, W. Han, and Q. Zhang, “An overview of brain-like computing: Architecture, applications, and future trends,” Front. Neurorobot., vol. 16, 2022, Art. no. 1041108.
[37] S. Ghosh-Dastidar and H. Adeli, “Spiking neural networks,” Int. J. Neural Syst., vol. 19, no. 4, pp. 295–308, 2009.
[38] Q. Zhou, C. Ren, and S. Qi, “An imbalanced R-STDP learning rule in spiking neural networks for medical image classification,” IEEE Access, vol. 8, pp. 224162–224177, 2020.
[39] A. Garain, A. Basu, F. Giampaolo, J. D. Velasquez, and R. Sarkar, “Detection of COVID-19 from CT scan images: A spiking neural network-based approach,” Neural Comput. Appl., vol. 33, no. 19, pp. 12591–12604, 2021.
[40] A. Paul et al., “Generalized zero-shot chest X-ray diagnosis through trait-guided multi-view semantic embedding with self-training,” IEEE Trans. Med. Imag., vol. 40, no. 10, pp. 2642–2655, Oct. 2021.
[41] Y. Bengio, “Evolving culture versus local minima,” in Growing Adaptive Machines: Combining Development and Learning in Artificial Neural Networks. Berlin, Germany: Springer, 2014, pp. 109–138.
[42] T. Gong, Q. Zhao, D. Meng, and Z. Xu, “Why curriculum learning & self-paced learning work in big/noisy data: A theoretical perspective,” Big Data Inf. Analytics, vol. 1, no. 1, pp. 111–127, 2016.
[43] P. P. Team, J. K. Gohagan, P. C. Prorok, R. B. Hayes, and B.-S. Kramer, “The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the national cancer institute: History, organization, and status,” Controlled Clin. Trials, vol. 21, no. 6, pp. 251S–272S, 2000.
[44] J. Irvin et al., “CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 590–597.
[45] L. Oakden-Rayner, “Exploring large-scale public medical image datasets,” Academic Radiol., vol. 27, no. 1, pp. 106–112, 2020.
[46] X. Sun and W. Xu, “Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves,” IEEE Signal Process. Lett., vol. 21, no. 11, pp. 1389–1393, Nov. 2014.
[47] L. N. Smith, “Cyclical learning rates for training neural networks,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2017, pp. 464–472.
[48] C. Yan, J. Yao, R. Li, Z. Xu, and J. Huang, “Weakly supervised deep learning for thoracic disease classification and localization on chest X-rays,” in Proc. ACM Int. Conf. Bioinf., Comput. Biol., Health Inform., 2018, pp. 103–110.
[49] X. Zhu and Q. Feng, “MVC-Net: Multi-view chest radiograph classification network with deep fusion,” in Proc. IEEE 18th Int. Symp. Biomed. Imag., 2021, pp. 554–558.
[50] W. Xu, W. Liu, X. Huang, J. Yang, and S. Qiu, “Multi-modal self-paced learning for image classification,” Neurocomputing, vol. 309, pp. 134–144, 2018.
[51] N. Sarafianos, T. Giannakopoulos, C. Nikou, and I. A. Kakadiaris, “Curriculum learning of visual attribute clusters for multi-task classification,” Pattern Recognit., vol. 80, pp. 94–108, 2018.
[52] S. Ghamizi, M. Cordy, M. Papadakis, and Y. L. Traon, “On evaluating adversarial robustness of chest X-ray classification: Pitfalls and best practices,” 2022, arXiv:2212.08130.
[53] M. Xu, T. Zhang, Z. Li, M. Liu, and D. Zhang, “Towards evaluating the robustness of deep diagnostic models by adversarial attack,” Med. Image Anal., vol. 69, 2021, Art. no. 101977.
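As a supplementary illustration, the label-noise corruption scheme of Section VII-D could be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the names `corrupt_labels` and `choose`, the per-sample seeding, and the reading of "overlaid" as applying the three operations in sequence on the running label set are all assumptions. Drawing the uniform numbers once per sample, independently of α, is one simple way to obtain the nesting property described in the paper, namely that an operation triggered at some α is also triggered at every larger α.

```python
import random


def choose(candidates, r):
    """Pick one element deterministically from a uniform draw r in [0, 1)."""
    ordered = sorted(candidates)
    return ordered[min(int(r * len(ordered)), len(ordered) - 1)]


def corrupt_labels(samples, all_labels, alpha, seed=0):
    """Return a noisy copy of `samples` (a list of label sets).

    Per sample, three uniform draws decide whether each operation fires
    (draw < alpha), so the operations fired at one alpha are a subset of
    those fired at any larger alpha for the same seed.
    """
    noisy = []
    for idx, original in enumerate(samples):
        rng = random.Random(f"{seed}-{idx}")  # per-sample stream, independent of alpha
        fire = [rng.random() for _ in range(3)]
        pick = [rng.random() for _ in range(4)]
        current = set(original)
        outside = set(all_labels) - set(original)  # labels the sample never had

        # Missed finding: delete one label of an abnormal sample.
        if original and fire[0] < alpha:
            current.discard(choose(current, pick[0]))
        # Mislabeling: swap one remaining label for a label from `outside`.
        if original and current and outside and fire[1] < alpha:
            current.discard(choose(current, pick[1]))
            current.add(choose(outside, pick[2]))
        # Over-labeling: add one extra label that is not yet present.
        extra = outside - current
        if extra and fire[2] < alpha:
            current.add(choose(extra, pick[3]))
        noisy.append(current)
    return noisy


# Toy data (assumed label names, for illustration only).
labels = ["atelectasis", "cardiomegaly", "effusion", "mass", "nodule"]
clean = [{"mass"}, {"effusion", "atelectasis"}, set(), {"nodule", "mass"}]
for alpha in (0.0, 0.05, 0.25):
    print(alpha, corrupt_labels(clean, labels, alpha))
```

With α = 0 the function returns the labels unchanged. Which label is picked inside a triggered operation may still differ across α, because later operations act on the output of earlier ones.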
