
Pattern Recognition 150 (2024) 110321


Learning with incomplete labels of multisource datasets for ECG classification
Qince Li a, Yang Liu a, Ze Zhang a, Jun Liu a, Yongfeng Yuan a, Kuanquan Wang a, Runnan He b, *
a School of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin, Heilongjiang 150001, China
b Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300072, China

ARTICLE INFO

Keywords: Electrocardiogram; Multilabel classification; Incomplete labels; Multisource data mining

ABSTRACT

The shortage of annotated ECG data presents a significant impediment, hampering the overall generalization capabilities of machine learning models tailored for automated ECG classification. The collective integration of multisource datasets presents a potential remedy for this challenge. However, it is crucial to underscore that the mere addition of supplementary data does not automatically guarantee performance enhancement, given the unresolved challenges associated with multisource data. In this research, we address one such challenge, namely, the issue of incomplete labels arising from the diversity of annotations within multisource ECG datasets. First, we identify three distinct types of label missing: dataset-related label missing, supertype missing, and subtype missing. To address supertype missing effectively, we introduce a novel approach known as offline category mapping, which leverages the hierarchical relationships inherent within the categories to recover the missing supertype labels. Additionally, two complementary strategies, referred to as prediction masking and online category mapping, are proposed to mitigate the adverse effects of subtype and dataset-related label missing on model optimization. These strategies enhance the model's ability to identify missing subtypes under conditions of weak supervision. These methodologies are integrated into a deep learning-based framework designed for multilabel ECG classification. The performance of our proposed framework is rigorously evaluated using realistic multisource datasets obtained from the PhysioNet/CinC challenge 2020/2021. The proposed learning framework exhibits a notable improvement in macro-average precision, surpassing the corresponding baseline model by more than 25% on the test datasets. As a result, this research study makes a substantial contribution to the field of ECG classification by addressing the critical issue of incomplete labels in multisource datasets, ultimately enhancing the generalization capabilities of machine learning models in this domain.

* Corresponding author.
E-mail address: runnanhe@tju.edu.cn (R. He).
https://doi.org/10.1016/j.patcog.2024.110321
Received 2 October 2022; Received in revised form 18 December 2023; Accepted 5 February 2024; Available online 10 February 2024
0031-3203/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Cardiovascular diseases have become the primary cause of morbidity and mortality worldwide [1], underscoring the critical need for accurate and efficient diagnostic tools. Automatic electrocardiogram (ECG) classification has the potential to alleviate the burden on healthcare systems and improve patient outcomes. Current methods for automatic ECG classification mostly use machine learning (ML) methodology, where the training data have a critical influence on the model performance. However, current ECG datasets (such as the PhysioNet datasets [2]) are generally too small to reflect the diverse artifacts and inter-patient differences of ECG signals. These challenges lead to overfitting issues and hinder the development of models capable of reliable diagnoses for diverse patient populations. Additionally, the high cost and difficulty of annotating large ECG datasets make it impractical to construct comprehensive annotated datasets. Therefore, jointly mining datasets from multiple organizations is an effective and pragmatic approach to mitigate insufficient training data.
The research community has been pushing for advances in multisource ECG data mining. For example, the PhysioNet/CinC challenge 2020/2021 [3,4] organized datasets from multiple continents to develop models that automatically detect ECG abnormalities. The detection of ECG abnormalities is a multilabel classification problem because multiple abnormalities can appear simultaneously in a recording. The datasets collected by different organizations are diverse in data specifications and annotations. Considering data specifications, the data from different sources have different recording lengths, sampling frequencies, signal qualities, device types, and so on. These differences
have been addressed in some previous studies using methods such as unsupervised domain adaptation [5] or adversarial domain generalization [6]. Annotation practices also differ substantially between datasets, but these disparities have not received due attention in prior research. Differences in annotation methodology exert a substantial influence on the efficacy and dependability of ML models. In many earlier studies, multisource datasets have been merged into a unified dataset for model training, disregarding the plausible existence of diverse annotation manners. This oversight has an adverse impact on model performance.
The label missing problem in multisource datasets can be categorized into three types, namely dataset-related label missing, supertype missing, and subtype missing.
Dataset-Related Label Missing occurs when a label is consistently absent in the annotations of all recordings within a dataset. It primarily stems from the limited scope and specific research focus of each dataset, resulting in variations in the considered label sets among annotators. For instance, one dataset might prioritize the detection of atrial fibrillation (AF) and overlook annotations related to T-wave variations. Such label omissions vary from one dataset to another, emphasizing the label set diversity issue.
Supertype Missing refers to a scenario where a recording is assigned a label, but the broader supertype of that label is omitted. In practice, annotators often choose to label specific subtypes for conciseness, leaving out the supertypes. For instance, a recording might be labeled with "incomplete right bundle branch block" (IRBBB) but lack the label "right bundle branch block" (RBBB), which logically includes all instances of IRBBB. This situation arises due to practical annotation constraints and is related to label granularity diversity.
Subtype Missing describes a situation in which a recording is annotated with a label, but the specific subtype of that label is absent. For example, a recording may be labeled as "RBBB" without further specification, such as "incomplete right bundle branch block" (IRBBB) or "complete right bundle branch block" (CRBBB). This type of label omission is common in multisource datasets due to variations in annotation criteria and annotator backgrounds. Unlike supertype missing, subtype missing cannot be directly inferred from label hierarchies, making it a more complex issue.
These three types of label missing collectively contribute to the challenges of incomplete labels in multisource ECG datasets. Addressing these issues is essential to improve model performance and reliability in the field of pattern recognition for multilabel ECG classification within the context of multisource datasets.
The problem of multilabel learning with incomplete labels is a general issue studied in other fields, such as image annotation [7] and text classification [8]. Several methods have been proposed to complete the missing labels, including label completion using content similarity and label co-occurrence [9], matrix completion using the low-rank assumption of the label matrix [10], empirical risk minimization [8], label ranking constraints [11], and probabilistic modeling with latent variables [12]. Recently, the semantic hierarchy between labels has been explored for multilabel learning with incomplete labels [7]. However, until now, no previous work in multilabel classification has studied incomplete labels and multisource data mining simultaneously. Since data sources are correlated with missing labels, effective use of the properties of data sources can help address the incomplete label problem.
In this study, we present a deep-learning-based framework designed to harness the potential of multisource datasets with incomplete labels for training multilabel ECG classification models. Our primary aim is to address the pressing challenge posed by incomplete labels in the context of multisource ECG datasets, ultimately enhancing the performance of automated ECG classification systems. To achieve this, we introduce a comprehensive three-pronged strategy:
Recovering Missing Supertypes: Our first objective is to enhance the completeness of label sets by recovering missing supertype labels, guided by the semantic hierarchy of labels. We employ an offline category mapping mechanism, executed prior to model training, to achieve this outcome.
Screening Suspected Missing Labels: Our second objective is to mitigate the negative influence of suspected missing labels on model training. These suspected missing labels may exhibit characteristics of dataset-related or subtype missing but are not definitively absent. We introduce a prediction masking mechanism, driven by data-source properties and label hierarchy, to identify and exclude these labels from the loss calculation during training. By doing so, we aim to improve the overall robustness of the model.
Recognizing Missing Subtypes: Our third objective is to recognize and address missing subtypes effectively. To accomplish this, we develop an online category mapping mechanism, integrated into the learning model. This mechanism leverages the concept of weakly supervised learning to identify and classify missing subtypes, thus contributing to more accurate ECG classification.
Collectively, these strategies contribute to enhancing the field of pattern recognition in multilabel ECG classification when dealing with multisource datasets characterized by incomplete labels.

2. Related works

Multilabel ECG classification is essential for the automatic detection of cardiac abnormalities. It is also a challenging task because of the large number of categories and extreme class imbalance. The more general problem of multilabel classification (MLC) has been studied extensively in previous works. The MLC methods can be divided into two categories: problem transformation methods, e.g., binary relevance [13] and classifier chains [14], and algorithm adaptation methods, e.g., ML-kNN [15] and ML-DT [16]. A systematic review of these methods can be found in [17].
For multilabel ECG classification, binary relevance methods that transform MLC into binary classification problems are widely applied [18]. Recently, binary relevance methods have mostly been combined with multitask learning in a deep-learning framework where all binary classifiers share a backbone network for feature extraction [19,20]. Additionally, some studies have explored the category relationships to improve the performance of multilabel ECG classification. For example, Du et al. [21] used a recurrent neural network to generate labels one-by-one for the input, where the label relationships were implicitly learned. The above works are based on a single data source or treat multiple-source data as data from a single source.
Some studies focus on the distribution differences between the training and test sets and propose methods to improve the model performance on the test set. For example, Jin et al. [22] proposed a domain-adaptive residual neural network for AF detection. Wang et al. [5] developed a domain-adaptive arrhythmia classifier using unsupervised domain adaptation, where a cluster-aligning loss was used to align the distributions of training and test sets, and a cluster-maintaining loss was used to ensure the discriminability of the features. Besides, some researchers have explored methods for domain generalization [23], where the test data are unseen during model training. For example, Hasani et al. [6] developed a multisource domain generalization model using the adversarial domain generalization method. However, no previous study has addressed the diversity of annotations among multisource datasets and the resulting incomplete labels.
Some methods have been proposed to mitigate the label noise issue. For example, Pasolli et al. [24] proposed an optimal subset search algorithm to identify noisy ECG labels in a dataset. Li et al. [25] developed a cross-validation-based technique to identify mislabeled samples. Cai et al. [26] addressed the label inconsistency problem by adjusting labels and creating a label mask to filter potentially incorrect data. Vázquez et al. [27] adapted a self-learning multi-class label correction method typically used for image classification to learn a multi-label classifier for electrocardiogram signals. Furthermore, Antoni et al. [28] proposed a
three-value label method that aims to surmount the constraints posed by binary labels. This approach employs three values: 1 for presence, 0 for absence, and NA for unknown, addressing inconsistencies and ambiguities in label semantics across datasets. However, these approaches typically address specific facets of label noise, without providing a systematic analysis and resolution framework. Furthermore, they do not comprehensively explore the diverse annotation manners inherent in multisource data or explicitly harness these manners to address label incompleteness. The distinctive characteristics of annotation manners offer valuable insights into mitigating label noise, which serves as a primary motivation for this study.

3. Methods

The proposed learning framework addresses the challenges associated with incomplete labels in multisource ECG datasets through a comprehensive approach, as illustrated in Fig. 1. This framework encompasses data preprocessing, a deep neural network (DNN) architecture, and three key mechanisms: offline category mapping, online category mapping, and prediction masking. Data preprocessing standardizes input signals, while the DNN conducts feature extraction and multilabel classification. Offline category mapping recovers missing supertype labels, while online category mapping rectifies subtype missing labels directly within the DNN. Prediction masking reduces the impact of dataset-related and subtype missing labels on model optimization. During testing, outputs from the online category mapping are further processed for precise label generation. These combined components form an integrated solution for enhancing the robustness and reliability of pattern recognition in multilabel ECG classification using multisource datasets with incomplete labels.
The datasets used in this study were sourced from the PhysioNet/CinC challenge 2020/2021, encompassing eight datasets from various organizations across different countries and continents. These datasets consist of 12-lead ECGs with variations in sampling rates and recording lengths. The study focused on seven of these datasets for experimentation, including CPSC, CPSC-Extra, PTB, PTB-XL, G12EC, Chapman-Shaoxing, and Ningbo. The St Petersburg INCART 12-lead arrhythmia database was not included in the study due to its relatively small sample size and significantly longer recording lengths compared to the other datasets. More detailed descriptions of the datasets can be found in the Experiments section.

Fig. 1. Overview of the learning framework for multilabel classification with incomplete labels from multisource ECG datasets. The boxes with square corners indicate data, whereas the boxes with round corners indicate operations. (a) The model training process. The model first preprocesses the input ECG and makes raw predictions on it. Then the raw predictions are fixed by online category mapping to ensure that the predicted labels conform to the hierarchical relationships. Before computing the loss, the annotated labels are preprocessed by offline category mapping to recover supertypes, and prediction masking is applied to exclude the suspected missing labels from the loss computation, which forces the model to do weakly supervised learning based on incomplete labels. (b) The model testing process. Online category mapping is also applied in the inference pipeline to improve the prediction robustness. Finally, the predicted probabilities are thresholded to generate predicted labels.

3.1. Preprocessing

The specifications of signals vary in several aspects, including sampling rate, recording length, analog-to-digital converter units per physical unit, and so on. To unify the specifications, the signals are resampled to 250 Hz, and the unit of the signal amplitude is converted to mV. Then, a moving average filter (with a window size of 250) is used to estimate the signal baseline. The estimated baseline is subtracted from the original signal for baseline wander removal. To suppress the noise from muscle movement and environmental interference, a band-pass filter (0.1–50 Hz) is applied. Additionally, each ECG lead signal is normalized so that the mean and variance of the signal in each lead are zero and one, respectively.
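The preprocessing chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the `adc_per_mv` gain parameter, the Butterworth filter family and its order, the polyphase resampler, and the edge handling of the moving average are all choices made here because the text does not specify them. The window of 250 is interpreted as 250 samples (one second at 250 Hz).

```python
from math import gcd

import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import butter, resample_poly, sosfiltfilt

TARGET_FS = 250  # Hz, target sampling rate stated in the text


def preprocess(ecg, fs, adc_per_mv=1.0, window=250, band=(0.1, 50.0), order=2):
    """Standardize one recording.

    ecg: array of shape (n_leads, n_samples) in ADC units.
    fs: integer sampling rate of the recording in Hz.
    adc_per_mv: ADC units per mV (hypothetical parameter name).
    """
    x = np.asarray(ecg, dtype=float) / adc_per_mv  # ADC units -> mV

    # Resample to 250 Hz with a rational polyphase resampler (assumed method).
    g = gcd(TARGET_FS, int(fs))
    x = resample_poly(x, TARGET_FS // g, int(fs) // g, axis=1)

    # Baseline estimate: moving average over `window` samples, then subtract.
    baseline = uniform_filter1d(x, size=window, axis=1, mode="nearest")
    x = x - baseline

    # 0.1-50 Hz band-pass; zero-phase Butterworth is an assumption.
    sos = butter(order, band, btype="bandpass", fs=TARGET_FS, output="sos")
    x = sosfiltfilt(sos, x, axis=1)

    # Per-lead normalization to zero mean and unit variance.
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / np.where(std > 0, std, 1.0)


# Usage with a synthetic 12-lead, 10 s recording sampled at 500 Hz.
rng = np.random.default_rng(0)
raw = rng.normal(scale=100.0, size=(12, 5000))
clean = preprocess(raw, fs=500, adc_per_mv=1000.0)
```

The order of the steps follows the paragraph above; swapping the band-pass and baseline-removal steps would give a slightly different signal, so the sketch keeps the stated order.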

3.2. Offline category mapping

The basic idea of offline category mapping is that the missing supertypes are inferred from the label hierarchy using the known subtype. Specifically, the offline category mapping is calculated as follows:

$$l_{off} = \mathrm{clip}_{[0,1]}(l_{raw} M) \tag{1}$$

where $l_{off}$ denotes the label vector after offline category mapping, $l_{raw}$ denotes the raw label vector, $M$ denotes an $n \times n$ category mapping matrix, and $n$ is the number of considered labels in the classification task. $m_{i,j}$ denotes the element of $M$ at the intersection of the $i$-th row and $j$-th column. If $i = j$ or label $i$ is a subtype of label $j$, $m_{i,j} = 1$; otherwise, $m_{i,j} = 0$. The function $\mathrm{clip}_{[0,1]}$ clips each element of the input vector to the interval [0,1]. Fig. 2(a) illustrates the process of offline category mapping, where labels in the original label vector are mapped to their respective supertype label based on a category mapping matrix. This mapping helps address the issue of supertype missing in the labels.

3.3. Online category mapping

The idea of online category mapping is that the known supertypes, although not specific, can provide a certain amount of supervision for subtype detection, which is a weakly supervised learning method [29]. The online category mapping is designed to use weak supervision to guide the model in learning to recognize subtypes. Fig. 3 shows the procedure of online category mapping. Here the model prediction, i.e., the output of the last fully-connected layer of the DNN model, is first processed using a sigmoid function. Then, the prediction is multiplied (element-wise) with each column of the category mapping matrix, and the maximum of each resulting vector is the final prediction value for the category corresponding to the column. Thus, the online mapping mechanism rectifies the online predictions of the model. Fig. 2(b) demonstrates online category mapping, which rectifies predictions for parent categories using predictions from their child categories. The adjustment aligns predictions with the hierarchical label relationships, ensuring consistency with label hierarchies. Furthermore, suspected missing labels, such as labels 1 and 3 in this case, are masked out during loss calculation to avoid misleading model optimization. Taken together, this mechanism not only alters the model optimization direction but also facilitates weakly supervised learning if a subtype label is missing.

Fig. 2. Schematic of offline category mapping and online category mapping. The values in the blank squares of the mapping matrix are all zero. (a) An example illustrating offline category mapping, where the initial label vector contains labels 1 and 3, while the category mapping matrix reveals that labels 1 and 3 are subtypes of label 2. Consequently, through matrix multiplication, the entry corresponding to label 2 is transformed to 2, signifying the presence of label 2 in the recording. Subsequently, the values within the remapped label vector are thresholded to binary values (0 or 1). (b) An example illustrating online category mapping, where the model initially predicts "label 2" as 0.2 but has a child category "label 1" predicted as 0.9; online category mapping adjusts "label 2" to 0.9, considering its child category. Additionally, suspected missing labels, like "label 1" and "label 3", are masked out during loss calculation to prevent them from affecting model optimization. As a result, this adjustment shifts the loss gradient towards optimizing the model's prediction for "label 1".

Fig. 3. Pseudo-code of online category mapping. y′ denotes the output of the last fully connected layer of the DNN. M denotes the category mapping matrix. n is the number of categories. ⊙ denotes element-wise multiplication.

3.4. Prediction masking

As mentioned earlier, it is vital to exclude predictions for suspected missing labels from the loss calculation to prevent adverse effects on model optimization. These labels exhibit characteristics of dataset-related missing or subtype missing and may exist in the recordings but were overlooked or omitted during annotation. Treating them as negative labels would mislead the model. To address this, we introduce a prediction masking mechanism to exclude these predictions. This involves identifying suspected missing labels and computing the mask vector using two methods: dataset-based mask calculation for dataset-related label missing and hierarchy-based mask calculation for subtype missing.
The dataset-based mask calculation uses the estimation of dataset-related label missing. The idea is that if a label is absent for all recordings in a dataset, it is likely not considered by the annotator. Specifically, the mask vector for the k-th dataset is calculated as follows:

$$\mathrm{mask}_j^k = \begin{cases} 0, & \text{if } y_{i,j}^k = 0 \;\; \forall\, i = 1, \ldots, N_k \\ 1, & \text{otherwise} \end{cases} \tag{2}$$

where $\mathrm{mask}_j^k$ is the $j$-th element (corresponding to label $j$) of the mask vector for the $k$-th dataset, and $y_{i,j}^k$ denotes the annotation for label $j$ in the $i$-th recording of the $k$-th dataset. If label $j$ is present in the annotation for the $i$-th recording of the $k$-th dataset, $y_{i,j}^k = 1$; otherwise, $y_{i,j}^k = 0$. $N_k$ denotes the number of recordings in the $k$-th dataset. With this formula, labels absent in all recordings of a dataset are found, and predictions for these labels are excluded from the loss calculation. This mask vector is shared by all recordings of the dataset.
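The dataset-based mask of Eq. (2) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `dataset_mask` and the toy label matrix are hypothetical, and the text does not say whether the annotations passed in are the raw labels or the labels after offline category mapping.

```python
import numpy as np


def dataset_mask(labels):
    """Eq. (2): 0 for labels absent from every recording of one dataset, else 1.

    labels: array of shape (N_k, n) for the k-th dataset,
            1 = label annotated, 0 = label not annotated.
    """
    labels = np.asarray(labels)
    return (labels.max(axis=0) > 0).astype(float)


# Hypothetical dataset whose annotators only used labels 1, 3 and 5
# out of a universal label set of six labels.
labels_k = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
])
mask_k = dataset_mask(labels_k)  # one vector, reused for every recording
```

With labels 1, 3, and 5 present somewhere in the dataset, the mask is [1, 0, 1, 0, 1, 0], so predictions for labels 2, 4, and 6 would not contribute to the loss for any recording from this source.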
The hierarchy-based mask calculation identifies suspected missing labels by utilizing the hierarchical relationships among categories. Since missing subtypes cannot be directly inferred from their supertype labels, all subtypes of a supertype are treated as suspected missing labels. This method identifies categories that are absent from a recording's label set but have corresponding supertypes present in the annotation. Fig. 4 shows the procedure for calculating the hierarchy-based mask. The inputs of this procedure include the annotated label vector of a recording (denoted by y), the category mapping matrix (denoted by M), and the number of considered categories (denoted by n). First, the original mapping matrix is modified by assigning zero to its diagonal elements to obtain the modified mapping matrix denoted by M′. Then, the label vector y is multiplied element-wise with each column of M′, and the column vectors obtained are organized in a matrix denoted by L. Finally, the mask vector is obtained by selecting the maximum of each row of matrix L, while clipping the values to no more than 1.

Fig. 5. An example illustrating prediction masking. The left side represents the i-th dataset with three labels (indexed as 1, 3, and 5), while the universal label set for all multisource datasets comprises six labels (indexed 1 to 6). This results in a prediction mask vector for the dataset [1, 0, 1, 0, 1, 0], indicating present labels. On the right side, labels 3 and 5 are subtypes of label 1. Given that recording r is annotated with only label 1, labels 3 and 5 are considered suspected missing labels, leading to a prediction mask vector [1, 1, 0, 1, 0, 1] based on annotated labels and their relationships.

In the learning framework, the classifier's predictions for an ECG recording are element-wise multiplied with both dataset-based and hierarchy-based masks. This operation zeroes out the predictions for suspected missing labels, resulting in zero loss for these masked predictions. Consequently, prediction masking prevents suspected missing labels from affecting model optimization by eliminating any associated noise. Fig. 5 illustrates a mask vector calculation example, showcasing the computation of both dataset-based and hierarchy-based masks, along with their combination into a prediction mask vector for a sample.
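The masking step can be sketched together with the online category mapping that precedes it during training. This is an illustrative NumPy sketch under stated assumptions, not the authors' Keras code. The function names are hypothetical. `hierarchy_mask` follows the stated intent (an absent label whose supertype is annotated gets mask 0) and the Fig. 5 example rather than a literal transcription of the Fig. 4 pseudo-code, which is not reproduced in the text. The mapping matrix uses the convention of Eq. (1): M[i, j] = 1 if i = j or label i is a subtype of label j. The prediction values are made up.

```python
import numpy as np


def online_mapping(p, M):
    """Each category takes the max probability over itself and its subtypes."""
    return (p[:, None] * M).max(axis=0)


def hierarchy_mask(y, M):
    """Return 0 for absent labels that have an annotated supertype, else 1."""
    y = np.asarray(y, dtype=float)
    M_prime = M * (1.0 - np.eye(M.shape[0]))         # zero the diagonal (M')
    supertype_annotated = np.clip(M_prime @ y, 0.0, 1.0)
    suspected = supertype_annotated * (1.0 - y)      # absent, but parent present
    return 1.0 - suspected


def masked_bce(p, y, mask, eps=1e-7):
    """Binary cross-entropy on masked predictions.

    Assumes y is 0 wherever mask is 0, so a zeroed prediction costs ~0.
    """
    p = np.clip(p * mask, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))


# Setting of Fig. 5: six labels, labels 3 and 5 are subtypes of label 1.
n = 6
M = np.eye(n)
M[2, 0] = 1.0   # label 3 -> label 1
M[4, 0] = 1.0   # label 5 -> label 1

y = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])        # only label 1 annotated
d_mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # dataset-based mask
h_mask = hierarchy_mask(y, M)                        # hierarchy-based mask
mask = d_mask * h_mask                               # combined prediction mask

p_raw = np.array([0.8, 0.6, 0.7, 0.1, 0.9, 0.2])    # hypothetical sigmoid outputs
p = online_mapping(p_raw, M)
loss = masked_bce(p, y, mask)
```

In this toy case the prediction for label 1 is lifted to 0.9 by its subtype label 5, and the only non-negligible loss term comes from label 1. Because that term is driven by the strongest subtype prediction, the gradient would flow toward the subtype, which is the weak-supervision effect described for Fig. 2(b).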

Fig. 4. Pseudo-code of hierarchy-based mask calculation. y denotes the label vector (1 for presence, 0 for absence) for a recording. M denotes the category mapping matrix. n is the number of categories. ⊙ denotes element-wise multiplication.

3.5. Architecture of the backbone network

The backbone network of our proposed framework for multilabel ECG classification is based on a 1D residual neural network (ResNet) and class-wise attention, as shown in Fig. 6. The 1D ResNet consists of six residual blocks, each containing two convolutional layers and other assistant layers, including batch normalization, ReLU activation, and dropout. There are 32 filters in each convolutional layer with a kernel size of 16. The output of the last convolutional layer in each block is merged with its input by element-wise addition. The merged feature map is compressed to half of its original length by a max-pooling layer with a pool size of two.
The output of each residual block is a feature map manifesting the features of different parts of the input ECG. The feature maps from different blocks are up-sampled to the same length and concatenated along the channel dimension to form a multiscale feature map. Before the classification, the multiscale feature map is aggregated to generate a feature vector that can be easily processed by a fully-connected layer. Additionally, a class-wise attention mechanism, proposed in our previous study [30], is used to generate a proper feature vector for each category to be detected. The class-wise attention is designed for multilabel ECG classification, where multiple abnormalities can co-exist in a single ECG recording while being distributed at different parts. It extracts a dedicated feature vector for each class to avoid a single feature vector being unable to account for all abnormalities in a recording. Then, the feature vector for each class is processed by a squeeze-and-excitation (SE) block [31] to reweight the classification features. Since the features are from multiple residual blocks, they have a large degree of redundancy, especially for recognizing a specific category. The SE block reduces the weights of irrelevant features and thus alleviates the risk of overfitting. Finally, the feature vector for each category is fed into its corresponding fully-connected layer to make predictions.

Fig. 6. Diagram of prediction masking. FV = feature vector, SE = squeeze-and-excitation module, FC = fully-connected layer.

3.6. Model implementation and optimization

The models are implemented using the Keras framework and trained on a workstation with one CPU running at 3.5 GHz, an NVIDIA Quadro K6000 GPU, and 64 GB of memory. The binary cross-entropy loss function is used for model training. The optimization method is adaptive moment estimation (Adam), where β1 and β2 are 0.9 and 0.999, respectively. Warm-up and exponential decay are used to schedule the learning rate: the learning rate is initialized to 0.0001, increased to 0.001 at the second epoch, and then decayed by a factor of 0.9 each epoch. For each training session, 10% of the training data are randomly selected as the validation set. The validation set is used to monitor the degree of overfitting during training. When the loss value on the validation set does not decrease for 5 consecutive epochs, the training process is terminated.

3.7. Thresholding

The model outputs are predicted probabilities for each category. To obtain specific labels, a thresholding method, called category imbalance and cost-sensitive thresholding (CICST) [39], is applied to the predicted probabilities. CICST was proposed in our previous studies to address the category imbalance and reduce the expected cost of model predictions using the predefined costs of different misclassifications.

4. Experiments

In this section, the performance of the proposed methods is evaluated for completing labels and learning with incomplete labels. First, the datasets used in our experiments and the information source of the label relationships are introduced. Then, the effects of offline category mapping on label distribution, and those of online category mapping and prediction masking on the model performance, are evaluated.

4.1. Datasets

The datasets used in this study are from the PhysioNet/CinC challenge 2020/2021 [3,4], which contains eight datasets from organizations in different countries and continents. The recordings in each dataset are all 12-lead ECGs with varying sampling rates and lengths. Seven datasets are used in our experiments, including the CPSC dataset, the extra data from CPSC (CPSC-Extra), the Physikalisch-Technische Bundesanstalt database (PTB), the extension of the PTB database (PTB-XL), the Georgia 12-lead ECG challenge dataset (G12EC), the database from Chapman University and Shaoxing People's Hospital (Chapman-Shaoxing), and the dataset from Ningbo First Hospital (Ningbo). The remaining dataset, i.e., the St Petersburg INCART 12-lead arrhythmia database, was not used in our experiments because its sample size is relatively small and its recordings are much longer than those of the other datasets.
Table 1 summarizes the characteristics of these datasets. The number of recordings, recording length, sampling frequency, and amplitude differ considerably between datasets. The label types in these datasets are also different. Moreover, the number of different label types is not proportional to the number of recordings. For example, the CPSC dataset contains 6877 recordings involving only 9 label types, while the PTB dataset contains 516 recordings involving 17 label types. This reflects that the annotation is inconsistent among these datasets, which partly accounts for the incomplete labels of multisource datasets.

4.2. Obtaining label hierarchical relationships

The hierarchical relationships among labels are the main knowledge basis used by our methods to address incomplete labels. The definition of such a label hierarchy requires consensus among domain experts. We derive the hierarchical structure from SNOMED CT Codes [32], which provide a systematized nomenclature and semantic hierarchy of clinical terms in medicine. For the sake of clarity, Table 2 presents the ECG types and their corresponding abbreviations as referenced in this paper.

4.3. Results of offline category mapping

To evaluate the influence of offline category mapping on the labels of the datasets, the distributions of labels before and after offline category mapping are compared in Fig. 7. The results demonstrate that the offline category mapping significantly changes the label distributions of these datasets, showing that supertype missing is a common problem in multisource datasets. If this problem is not addressed, the sensitivity of the model to the involved supertypes will be reduced. Also, the effects of offline category mapping vary with datasets. For example, in the CPSC dataset, the sample sizes of only three labels changed, whereas, in other datasets, the changes involve a wide spectrum of labels. The degree of label recovery by the offline category mapping depends on the quality of the original annotation. For datasets with recordings annotated with all specific subtypes, the missing supertype can be easily recovered. However, for datasets where the annotations of subtypes are scarce, the recovery would be modest due to insufficient clues available for reconstructing the missing supertypes.

Table 1
Characteristics of ECG datasets used in this study.

| Dataset | Recording Number | Recording Lengths | Sampling Frequency | Amplitude Minimum (a) | Amplitude Maximum (a) | Label Kinds |
| --- | --- | --- | --- | --- | --- | --- |
| CPSC | 6877 | 6–144 s | 500 Hz | −1.89 ± 1.55 mV | 2.18 ± 1.42 mV | 9 |
| CPSC-Extra | 3453 | 8–98 s | 500 Hz | −1.97 ± 1.48 mV | 2.16 ± 1.65 mV | 72 |
| PTB | 516 | 32–120 s | 1000 Hz | −2.72 ± 1.41 mV | 2.71 ± 1.41 mV | 17 |
| PTB-XL | 21,837 | 10 s | 500 Hz | −1.73 ± 1.10 mV | 1.84 ± 1.04 mV | 50 |
| G12EC | 10,344 | 5–10 s | 500 Hz | −0.32 ± 0.19 mV | 0.31 ± 0.17 mV | 67 |
| Chapman-Shaoxing | 10,247 | 10 s | 500 Hz | −1.81 ± 0.99 mV | 2.19 ± 0.99 mV | 54 |
| Ningbo | 34,905 | 10 s | 500 Hz | −2.03 ± 1.92 mV | 2.41 ± 1.88 mV | 80 |

(a) Statistics of amplitude minimum and maximum are in the form of mean ± std for the recordings in each dataset.


Table 2
The ECG labels and their abbreviations mentioned in this paper.

| ECG label | Abbr. | ECG abnormality | Abbr. |
| --- | --- | --- | --- |
| atrial fibrillation | AF | nonspecific intraventricular conduction disorder | NSIVCB |
| atrial flutter | AFL | sinus rhythm | SR |
| atrial tachycardia | ATach | atrioventricular junctional rhythm | AVJR |
| bundle branch block | BBB | supraventricular premature beats | SVPB |
| bradycardia | Brady | Q wave abnormal | QAb |
| left bundle branch block | LBBB | right axis deviation | RAD |
| right bundle branch block | RBBB | sinus arrhythmia | SA |
| first degree AV block | IAVB | sinus bradycardia | SB |
| complete right bundle branch block | CRBBB | sinus tachycardia | STach |
| incomplete right bundle branch block | ICRBBB | T wave abnormal | Tab |
| left axis deviation | LAD | T wave inversion | TInv |
| left anterior fascicular block | LAnFB | acute myocardial infarction | AMI |
| low QRS voltages | LQRSV | myocardial infarction | MI |
| prolonged QT interval | LQT | premature atrial contraction | PAC |
| left ventricular high voltage | LVHV | sinus atrium to atrial wandering rhythm | SAAWR |
| left ventricular hypertrophy | LVH | ST changes | STC |

4.4. Performance evaluation of our methods

The method evaluation is performed in a cross-dataset manner, where the training and test data of the model are selected from different sources. Two training/test settings are designed in our experiments: one tests on G12EC and trains on the others; the other tests on Chapman-Shaoxing and trains on the others. These cross-dataset evaluation settings can reveal the generalization ability of the proposed methods. These datasets involve 133 categories, many of which have very small sample sizes. Following the suggestion of the PhysioNet/CinC challenge 2021 [4], we use 26 categories (excluding synonyms) for model evaluation. Other labels, though not scored, are also involved in the model training.

We employ three metrics for evaluating the model performance, namely, average precision (AP), F1 score, and cost-weighted accuracy (CWAcc). AP is computed from the precision-recall curve to comprehensively evaluate model performance without being affected by the specified thresholds. The calculation procedure of AP can be found in [33]. The F1 score is a combination of precision and recall: F1 = 2 × precision × recall/(precision + recall), where precision = TP/(TP + FP), recall = TP/(TP + FN), TP = the number of true positives, FP = the number of false positives, and FN = the number of false negatives. Since multiple labels are involved in the classification, both macro AP and macro F1 are computed by averaging the corresponding category-wise scores. CWAcc is a score proposed by the PhysioNet/CinC challenge 2020 for evaluating the model performance with respect to predefined costs of different misclassifications [3,34].

The models' evaluation results on the test sets can be found in Table 3. To facilitate comparison, we retrained and evaluated open-source algorithms originally developed for the PhysioNet/CinC challenge 2020 within our experimental setup. Additionally, we explored different configurations of our learning framework to assess the performance of the proposed mechanisms. The baseline model consists of a ResNet, a class-wise attention module, and fully-connected layers for binary classifications. We also investigated two structural extensions: multiscale feature fusion and SE module-based feature reweighting. Subsequently, we incrementally incorporated offline category mapping, prediction masking, and online category

Fig. 7. Label distribution changes after offline category mapping. For each dataset, only the 10 labels that changed the most are shown. The original sample sizes of
the labels are shown in blue bars, while added sample sizes by offline category mapping are shown in orange bars. (a) Statistics for the China Physiological Signal
Challenge (CPSC) dataset. (b) Statistics for the extra data from CPSC (CPSC-Extra). (c) Statistics for the Georgia 12-lead ECG challenge (G12EC) dataset. (d) Statistics
for the extension of the Physikalisch-Technische Bundesanstalt (PTB-XL) database. (e) Statistics for the database from the Chapman University and Shaoxing People’s
Hospital (Chapman-Shaoxing). (f) Statistics for the dataset from the Ningbo First Hospital (Ningbo). The label distribution of PTB changes only modestly after offline
category mapping and is therefore not shown in the figure.
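
The label recovery summarized in Fig. 7 can be illustrated with a minimal NumPy sketch. It assumes the hierarchy is encoded as a binary ancestor matrix and that recovery is a single matrix product, in the spirit of the matrix-based formulation the paper describes. The label subset, the `PARENT` links, and the function names are hypothetical choices for illustration; the paper derives its actual hierarchy from SNOMED CT.

```python
import numpy as np

# Hypothetical label order and child -> parent links (illustrative subset only).
LABELS = ["BBB", "LBBB", "RBBB", "CRBBB", "ICRBBB"]
PARENT = {"LBBB": "BBB", "RBBB": "BBB", "CRBBB": "RBBB", "ICRBBB": "RBBB"}


def ancestor_matrix(labels, parent):
    """Return A with A[i, j] = 1 if label j is label i itself or one of its ancestors."""
    index = {name: i for i, name in enumerate(labels)}
    a = np.eye(len(labels), dtype=np.int64)
    for child in labels:
        node = child
        while node in parent:  # walk up the hierarchy to the root
            node = parent[node]
            a[index[child], index[node]] = 1
    return a


def offline_category_mapping(y, a):
    """Add every supertype implied by an annotated subtype.

    y: (n_records, n_labels) binary label matrix.
    """
    return (y @ a > 0).astype(y.dtype)


A = ancestor_matrix(LABELS, PARENT)
Y = np.array([
    [0, 0, 0, 1, 0],  # only CRBBB annotated
    [0, 1, 0, 0, 0],  # only LBBB annotated
    [0, 0, 0, 0, 0],  # none of these labels annotated
])
Y_recovered = offline_category_mapping(Y, A)
# Number of labels added per category (the kind of count shown as orange bars).
added_per_label = (Y_recovered - Y).sum(axis=0)
```

In this toy case the CRBBB record gains RBBB and BBB, and the LBBB record gains BBB. A record without any annotated subtype gains nothing, which mirrors the limitation that recovery is modest when subtype annotations are scarce.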


mapping to assess their effectiveness in handling incomplete labels. Notably, we also applied offline category mapping to the test sets to rectify annotations affected by missing labels.

As demonstrated in Table 3, the performance of previous state-of-the-art methods and the baseline model is notably constrained on both test sets. Methods that incorporate mechanisms for handling incomplete labels, such as the work of Cai et al. [26], exhibit substantial performance improvements compared to other approaches. This underscores the importance of addressing the challenges associated with incomplete labels. However, even with these advancements, the performance of these methods still falls short of an optimal level.

The rows from Baseline onward in Table 3 present the performance of our baseline model and various combinations of the mechanisms proposed in this study. The introduction of multiscale feature fusion and the SE module to the baseline model yields performance enhancements. Specifically, multiscale feature fusion improves the macro AP from 0.364 to 0.394 on G12EC and from 0.462 to 0.494 on Chapman-Shaoxing, and adding the SE module raises it further to 0.414 and 0.505, respectively. However, the CWAcc on Chapman-Shaoxing declines slightly after the incorporation of these two structural mechanisms, which may be attributed to threshold selection.

Upon the application of offline category mapping to the training data, substantial improvements in all metric scores are observed on both test sets. The macro AP increases from 0.414 to 0.443 on G12EC and from 0.505 to 0.567 on Chapman-Shaoxing. The enhancements in CWAcc are even more pronounced, with improvements from 0.338 to 0.526 on G12EC and from 0.263 to 0.753 on Chapman-Shaoxing. The introduction of prediction masking further boosts the model's performance on both datasets.

After the incorporation of online category mapping, the full model, which uses all proposed mechanisms, achieves the best macro AP (0.472) and CWAcc (0.565) on G12EC, as well as the best macro AP (0.580) on Chapman-Shaoxing. In comparison to the baseline, the full model enhances the macro AP by 29 % on G12EC and by 25 % on Chapman-Shaoxing. Additionally, the model with all mechanisms except online category mapping attains the best F1 score on G12EC and the best F1 score and CWAcc on Chapman-Shaoxing. The full model does not achieve optimal performance in all metrics, potentially due to the selection of threshold values.

Table 3
Metric scores of different method compositions in cross-dataset evaluations. The rows above the Baseline row report previous methods re-evaluated with aligned datasets and metrics; the rows from Baseline onward report the approach introduced in this study.

| Methods | G12EC macro AP | G12EC macro F1 | G12EC CWAcc | Chapman-Shaoxing macro AP | Chapman-Shaoxing macro F1 | Chapman-Shaoxing CWAcc |
| --- | --- | --- | --- | --- | --- | --- |
| Goodfellow et al. [35] | 0.163 | 0.13 | 0.181 | 0.202 | 0.119 | 0.058 |
| Zhao et al. [20] | 0.315 | 0.226 | 0.309 | 0.378 | 0.226 | 0.235 |
| Wong et al. [18] | 0.293 | 0.203 | 0.274 | 0.430 | 0.340 | 0.552 |
| Singstad and Tronstad [36] | 0.275 | 0.206 | 0.231 | 0.391 | 0.212 | 0.394 |
| Cai et al. [26] | 0.468 | 0.375 | 0.427 | 0.568 | 0.341 | 0.321 |
| Vázquez et al. [27] | 0.369 | 0.288 | 0.346 | 0.459 | 0.278 | 0.254 |
| Baseline (a) | 0.364 | 0.350 | 0.294 | 0.462 | 0.400 | 0.308 |
| MultiScale | 0.394 | 0.381 | 0.349 | 0.494 | 0.394 | 0.288 |
| MultiScale + SE | 0.414 | 0.394 | 0.338 | 0.505 | 0.406 | 0.263 |
| MultiScale + SE + OffMap | 0.443 | 0.426 | 0.526 | 0.567 | 0.491 | 0.753 |
| MultiScale + SE + OffMap + PredMask | 0.465 | 0.436 | 0.553 | 0.578 | 0.524 | 0.816 |
| MultiScale + SE + OffMap + OnMap + PredMask | 0.472 | 0.431 | 0.565 | 0.580 | 0.519 | 0.811 |

(a) The baseline model is constructed with a residual convolutional network, class-wise attention module, and fully-connected layers for the recognition of each label. Other methods extend the baseline model in different ways. MultiScale = multiscale features, SE = squeeze-and-excitation module, OffMap = offline category mapping, OnMap = online category mapping, PredMask = prediction masking, G12EC = Georgia 12-lead ECG challenge dataset, Chapman-Shaoxing = the database from the Chapman University and Shaoxing People's Hospital, AP = average precision, CWAcc = cost-weighted accuracy.

The AP scores for each category on the test sets are shown in Fig. 8. The performances of the baseline model, the model with offline category mapping (OffMap model), the model with offline category mapping and prediction masking (PredMask model), and the model with all proposed mechanisms (the full model) are compared in this figure. The results show that the OffMap model outperforms the baseline model on most of the categories. In particular, the largest improvement in AP is achieved for BBB detection (from 0.336 to 0.760 on G12EC and from 0.361 to 0.914 on Chapman-Shaoxing), which can be attributed to offline category mapping fixing the missing supertypes, as BBB is the supertype of several categories, including LBBB, RBBB, and their subtypes. Similarly, the scores for detecting RBBB, sinus rhythm (SR), bradycardia (Brady), and supraventricular premature beats (SVPB) also improve. Compared with the OffMap model, the PredMask model further improves the scores for some categories, such as AF, CRBBB, ICRBBB, and left axis deviation (LAD). The full model has scores similar to those of the PredMask model for most categories, and improves in detecting AF, CRBBB, ICRBBB, left anterior fascicular block (LAnFB), SR, right axis deviation (RAD), and sinus arrhythmia (SA). For some categories, the best scores on the two test sets are achieved by different models. For example, the best AP of LBBB on G12EC is achieved by the full model, whereas that on Chapman-Shaoxing is achieved by the OffMap model. Overall, the full model achieves the best AP scores on most of the categories.

4.5. Instance analysis

Some instances of the model predictions are shown in Fig. 9. For each instance, the original labels and the labels recovered by offline category mapping are marked with purple and red downward triangles, respectively. In these instances, the missing supertypes indicated by red downward triangles in the charts are successfully recovered by our method. As shown in Fig. 9, the baseline model has very poor sensitivity for the missing supertypes, while the other three models have a strong ability to recognize these labels, which demonstrates the advantage of offline category mapping in solving supertype missing. For most instances, the performance of the OffMap model, the PredMask model, and the full model is similar. However, the full model outperforms the other models in detecting AF, as shown in Fig. 9(b), possibly because online category mapping together with prediction masking reduces the disruption that missing labels cause to model optimization. Example (b), however, also shows that the full model has limited ability to distinguish between AF and atrial flutter (AFL), potentially influenced by the labeling inconsistency observed in the Ningbo database.

5. Discussions

In this study, we explore methods to address learning with incomplete labels from multisource ECG datasets. The experimental results demonstrate the effectiveness of the proposed methods in improving the performance of multilabel ECG classification. Previous studies have shown that the performance of ML methods is closely associated with the size of datasets [37]. Thus, multisource datasets provide an opportunity for deep-learning classifiers to break through the bottleneck of generalization ability. However, due to diverse annotation manners, the incompleteness of labels is a serious problem in multisource ECG datasets, as shown in Fig. 7. The experimental results show that ignoring this


Fig. 8. Average precisions (AP) of different methods for each category on the test sets. The blue bars show the scores of the baseline model. The green bars show the
scores of the model with offline category mapping (OffMap). The orange bars show the scores of the model with offline category mapping and prediction masking
(PredMask). The red bars show the scores of our full learning framework (the full model). The last two are equipped with structural extensions of the multiscale
feature fusion and SE module. The left chart shows the scores on the Georgia 12-lead ECG challenge dataset (G12EC), while the right chart shows the scores on the
database from the Chapman University and Shaoxing People’s Hospital (Chapman-Shaoxing).
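
The category-wise AP scores in Fig. 8 and the macro scores in Table 3 follow the usual multilabel recipe: compute a score per category, then average across categories. The sketch below is only an illustration under stated assumptions. It uses the common non-interpolated definition of AP, which may differ in detail from the procedure the paper cites, and a fixed 0.5 threshold for F1 instead of the CICST thresholds used in the paper. All function names are hypothetical.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Non-interpolated AP: mean precision at the rank of each positive sample."""
    order = np.argsort(-y_score, kind="stable")  # highest score first
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return np.nan  # AP is undefined for a category without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), equivalent to 2PR / (P + R)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def macro_scores(Y_true, Y_score, threshold=0.5):
    """Average category-wise AP and F1 over all label columns."""
    n_labels = Y_true.shape[1]
    aps = [average_precision(Y_true[:, j], Y_score[:, j]) for j in range(n_labels)]
    f1s = [f1_score(Y_true[:, j], (Y_score[:, j] >= threshold).astype(int))
           for j in range(n_labels)]
    return float(np.nanmean(aps)), float(np.mean(f1s))


# Toy example: four recordings, two categories.
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.6, 0.4], [0.1, 0.3]])
macro_ap, macro_f1 = macro_scores(Y_true, Y_score)
```

Because AP ranks predictions instead of thresholding them, it is unaffected by threshold selection, whereas the F1 value above would change with a different `threshold`.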

problem weakens the model performance even though the training set is very large. Similar results have been reported in other fields [38,39], but this problem is rarely discussed in ECG classification.

Our study makes theoretical and practical contributions in the domain of learning with incomplete labels from multisource electrocardiogram (ECG) datasets. We categorize three distinct types of label incompleteness in multisource ECG datasets, namely dataset-related label missing, supertype missing, and subtype missing, shedding light on the complexities of this challenge. We introduce label recovery methods tailored to specific label missing scenarios. These methods include offline category mapping, which leverages the semantic hierarchy of categories, and prediction masking, designed to identify and handle suspected missing labels. Our study also presents an online category mapping approach inspired by weakly supervised learning, which improves model performance in recognizing missing subtypes. Finally, our exploration of data-source properties in addressing incomplete label learning extends the applicability of our methodologies to domains with hierarchically organized labels and diverse annotations, underscoring


Fig. 9. Instances of predictions by different methods. For each instance, the left panel shows the ECG signal (including lead II and lead V2), while the right panel
shows the corresponding predictions for the ECG. The blue bars show the predictions of the baseline model. The green bars show the predictions of the model with
offline category mapping (OffMap). The orange bars show the predictions of the model with offline category mapping and prediction masking (PredMask). The red
bars show the predictions of the model using our full learning framework (the full model). The original labels and recovered labels by OffMap are marked with purple
and red downward triangles respectively.
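
The PredMask and full models compared in Fig. 9 rely on prediction masking, which excludes suspected missing labels from the loss during training. A minimal NumPy sketch of that idea follows. The way the mask is built here (negatives count only for categories that the source dataset annotates, positives are always kept) is an assumption for illustration and not necessarily the paper's dataset-based mask calculation; `masked_bce` and `dataset_mask` are hypothetical names.

```python
import numpy as np


def masked_bce(y_true, y_prob, mask, eps=1e-7):
    """Binary cross-entropy averaged only over entries where mask == 1."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    kept = mask.sum()
    return float((loss * mask).sum() / kept) if kept > 0 else 0.0


def dataset_mask(y_true, annotated_labels):
    """Assumed dataset-based mask.

    annotated_labels: (n_labels,) binary vector, 1 if the source dataset
    annotates that category at all. Positive labels are always kept.
    """
    mask = np.broadcast_to(annotated_labels, y_true.shape).astype(float)
    return np.maximum(mask, y_true)


# Hypothetical label order [AF, SR, LAD]; the source dataset never annotates AF.
y_true = np.array([[0, 1, 0]])
y_prob = np.array([[0.9, 0.8, 0.1]])   # the model is confident that AF is present
annotated = np.array([0, 1, 1])

mask = dataset_mask(y_true, annotated)
masked = masked_bce(y_true, y_prob, mask)
unmasked = masked_bce(y_true, y_prob, np.ones_like(y_prob))
```

Without the mask, the confident AF prediction is penalized as a false positive even though AF may simply be unannotated in that source. With the mask, that entry contributes nothing to the loss, so the model is not pushed to suppress AF.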

the broader impact and relevance of our research findings.

To address supertype missing, we introduced the offline category mapping mechanism, which leverages the semantic hierarchy of categories to recover omitted labels. Although offline category mapping may appear straightforward in principle, we have introduced a method that uses matrices to precisely represent hierarchical relationships between labels. Leveraging parallelized matrix operations, this approach handles label recovery efficiently. It is particularly well suited for label recovery on large-scale datasets and supports flexible and convenient definitions of label relationships. As illustrated in Fig. 7, this mechanism successfully recovers numerous missing labels. However, the method has limitations when dealing with labels lacking annotated subtypes. For instance, the ECG depicted in Fig. 9(c) exhibits clear characteristics of T-wave inversion (TInv), yet neither the original nor the recovered labels include this label. Both dataset-related label missing and subtype missing fall into this category, making them challenging to recover solely through the application of the semantic hierarchy.

To tackle dataset-related label missing, the learning framework uses the prediction masking mechanism. Prediction masking is designed to identify and handle labels that are suspected to be missing or omitted by annotators. Suspected missing labels are unannotated labels that fit the pattern of dataset-related label missing or subtype missing and may in fact be present in the recordings. The mechanism aims to prevent these suspected missing labels from negatively influencing model optimization: it identifies them and excludes their predictions from the loss calculation during model training. Two methods are provided for computing the mask vector used in prediction masking: dataset-based mask calculation and hierarchy-based mask calculation. Experimental results demonstrate that prediction masking improves detection performance for ECG abnormalities. As depicted in Fig. 9(b), both the baseline and OffMap models exhibit reduced sensitivity to AF, mainly because the labels for AF are missing in one of their training data sources, namely Ningbo. With prediction masking, the sensitivity to AF is significantly enhanced.

To address missing subtypes, we developed an online category mapping approach inspired by weakly supervised learning. This mechanism enhances model performance in two ways. First, it adjusts model predictions to align with the semantic hierarchy of categories, leading to improved average precision (AP) scores, particularly evident in the detection of AF and SR, as illustrated in Fig. 8. Second, it leverages weak supervision derived from supertype annotations to guide the optimization of subtype classifications. Given the limited annotated samples for many subtypes, this form of weak supervision plays a pivotal role in training classifiers for these categories. Notably, online category mapping enhances the model's performance in detecting CRBBB, ICRBBB, and LAnFB, as shown in Fig. 8.

The optimization of the model's architecture also contributes to performance enhancement. Comparison against the baseline model reveals that models incorporating multiscale feature fusion and the SE module achieve superior metric scores, as presented in Table 3. Multiscale feature fusion enables the direct input of features of various scales into the classification layers. This proves crucial in multilabel ECG classification, given the varied


temporal scales of critical features related to ECG abnormalities, ranging from milliseconds (e.g., P wave morphology) to seconds (e.g., RR-interval variations). Furthermore, the SE module diminishes the importance of irrelevant features in abnormality detection, guarding against overfitting to these less relevant aspects.

Given that clinical ECG data are typically collected and annotated across multiple centers, label noise is an inherent challenge when developing practical applications. The learning framework and mechanisms introduced in this study help overcome the challenges that the diverse annotations present in multisource datasets pose to ML. Consequently, these methods offer a valuable tool for harnessing multisource datasets more effectively, establishing a foundation for the development of automated ECG classification technologies within clinical settings.

Furthermore, to the best of our knowledge, this study represents the first exploration of data-source properties in the context of incomplete label learning. The methodologies proposed here have the potential to be extended beyond ECG classification and applied to other domains characterized by labels organized in semantic hierarchies and featuring diverse annotations. This versatility underscores the broader impact and relevance of our research findings.

Nonetheless, several limitations warrant consideration. First, our methods account only for hierarchical relationships among categories, omitting other label associations such as co-occurrence and mutual exclusion. Incorporating these relationships holds potential for enhancing model robustness when dealing with incomplete labels. Second, our methods assume that data from the same source adhere to a consistent annotation manner. If a single dataset lacks such uniformity in annotation style or comprises a mixture of multisource data, our methods' performance in identifying missing (or suspected missing) labels within the dataset may suffer due to ambiguities in annotation practices or data origins.

Additionally, our work does not harness the capabilities of large pre-trained models, which have shown promise in improving the robustness and performance of classification models. Integrating such models into our framework could be a valuable avenue for further research, potentially enhancing the accuracy of label correction and classification within the context of multisource ECG datasets.

These limitations, while pertinent, do not undermine the primary contributions of this study. Instead, they present opportunities for future research. Subsequent investigations can focus on the development of novel methods that exploit more comprehensive label relationships and incorporate large pre-trained models, thus advancing the accuracy and robustness of missing label identification and classification in multisource ECG datasets.

6. Conclusion

In this work, three types of label missing arising from the diversity of annotation manners in multisource ECG datasets are identified. A series of mechanisms, including offline category mapping, online category mapping, and prediction masking, are proposed to recover the missing labels and alleviate the negative influence of missing labels on model optimization. The experimental results demonstrate the advantages of our methods in addressing incomplete annotations from multisource datasets and improving model performance for multilabel ECG classification. By mining multisource datasets more effectively, the proposed framework improves the generalization ability of ML models in automatic ECG diagnosis and will have implications for other research fields with incomplete labels. While this work has made substantial progress in addressing incomplete labels, there are areas for future improvement, such as considering label relationships like co-occurrence and mutual exclusion and developing methods to handle datasets with unclear annotation manners or mixed multisource data. Despite these limitations, the proposed framework represents a significant step forward in addressing incomplete label learning, with practical implications for the broader field of ML and healthcare applications.

Code availability

The source code of this study is available at https://github.com/sdnjly/multisource-ecg-classification.

CRediT authorship contribution statement

Qince Li: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Funding acquisition, Resources. Yang Liu: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Ze Zhang: Data curation, Formal analysis. Jun Liu: Investigation, Visualization. Yongfeng Yuan: Supervision. Kuanquan Wang: Supervision. Runnan He: Supervision, Conceptualization, Project administration, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62133009.

References

[1] V.L. Roger, A.S. Go, D.M. Lloydjones, E.J. Benjamin, J.D. Berry, W.B. Borden, D.M. Bravata, S. Dai, E.S. Ford, C.S. Fox, Heart disease and stroke statistics—2012 update: a report from the American Heart Association, Circulation 125 (2012) E2–E220.
[2] A.L. Goldberger, L.A. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, H.E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220.
[3] E.A.P. Alday, A. Gu, A.J. Shah, C. Robichaux, A.-K.I. Wong, C. Liu, F. Liu, A.B. Rad, A. Elola, S. Seyedi, Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology challenge 2020, Physiol. Meas. 41 (2020) 124003.
[4] M. Reyna, N. Sadr, A. Gu, P. Alday, E. Andres, C. Liu, S. Seyedi, A. Shah, G. Clifford, Will two do? Varying dimensions in electrocardiography: the PhysioNet/Computing in Cardiology challenge 2021, PhysioNet 48 (2021) 1–4.
[5] G. Wang, M. Chen, Z. Ding, J. Li, H. Yang, P. Zhang, Inter-patient ECG arrhythmia heartbeat classification based on unsupervised domain adaptation, Neurocomputing 454 (2021) 339–349.
[6] H. Hasani, A. Bitarafan, M.S. Baghshah, Classification of 12-lead ECG signals with adversarial multi-source domain generalization, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[7] B. Wu, F. Jia, W. Liu, B. Ghanem, S. Lyu, Multi-label learning with missing labels using mixed dependency graphs, Int. J. Comput. Vis. 126 (2018) 875–896.
[8] H.F. Yu, P. Jain, P. Kar, I.S. Dhillon, Large-scale multi-label learning with missing labels, in: Proceedings of the International Conference on Machine Learning, Beijing, China, 2014.
[9] L. Wu, R. Jin, A.K. Jain, Tag completion for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 716–727.
[10] R.S. Cabral, F. Torre, J.P. Costeira, A. Bernardino, Matrix completion for multi-label image classification, Adv. Neural Inf. Process. Syst. (2011) 190–198.
[11] S.S. Bucak, R. Jin, A.K. Jain, Multi-label learning with incomplete class assignments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, 2011, pp. 2801–2808.
[12] A. Kapoor, R. Viswanathan, P. Jain, Multilabel classification using Bayesian compressed sensing, Adv. Neural Inf. Process. Syst. (2012) 2645–2653.
[13] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognit. 37 (2004) 1757–1771.
[14] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Mach. Learn. 85 (2011) 333–359.
[15] M.L. Zhang, Z.H. Zhou, ML-KNN: a lazy learning approach to multi-label learning, Pattern Recognit. 40 (2007) 2038–2048.
[16] A. Clare, R.D. King, Knowledge discovery in multi-label phenotype data, in: Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2001, pp. 42–53.


[17] M.L. Zhang, Z.H. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (2014) 1819–1837.
[18] A.W. Wong, W. Sung, S.V. Kalmady, P. Kaul, A. Hindle, Multilabel 12-lead electrocardiogram classification using gradient boosting tree ensemble, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[19] A. Natarajan, Y. Chang, S. Mariani, A. Rahman, G. Boverman, S. Vij, J. Rubin, A wide and deep transformer neural network for 12-lead ECG classification, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[20] Z. Zhao, H. Fang, S.D. Relton, R. Yan, Y. Liu, Z. Li, J. Qin, D.C. Wong, Adaptive lead weighted ResNet trained with different duration signals for classifying 12-lead ECGs, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[21] N. Du, Q. Cao, L. Yu, N. Liu, E. Zhong, Z. Liu, Y. Shen, K. Chen, FM-ECG: a fine-grained multi-label framework for ECG image classification, Inf. Sci. 549 (2021) 164–177.
[22] Y. Jin, C. Qin, J. Liu, K. Lin, H. Shi, Y. Huang, C. Liu, A novel domain adaptive residual network for automatic atrial fibrillation detection, Knowl. Based Syst. 203 (2020).
[23] J. Wang, C. Lan, C. Liu, Y. Ouyang, W. Zeng, T. Qin, Generalizing to unseen domains: a survey on domain generalization, IEEE Trans. Knowl. Data Eng. 35 (2021) 8052–8072.
[24] E. Pasolli, F. Melgani, Genetic algorithm-based method for mitigating label noise issue in ECG signal classification, Biomed. Signal Process. Control 19 (2015) 130–136.
[25] Y. Li, W. Cui, Identifying the mislabeled training samples of ECG signals using machine learning, Biomed. Signal Process. Control 47 (2019) 168–176.
[26] W. Cai, F. Liu, B. Xu, X. Wang, S. Hu, M. Wang, Classification of multi-lead ECG with deep residual convolutional neural networks, Physiol. Meas. 43 (2022) 074003.
[27] C.G. Vázquez, A. Breuss, O. Gnarra, J. Portmann, A. Madaffari, G. Da Poian, Label noise and self-learning label correction in cardiac abnormalities classification, Physiol. Meas. 43 (2022) 094001.
[34] Y. Liu, Q. Li, K. Wang, J. Liu, R. He, Y. Yuan, H. Zhang, Automatic multi-label ECG classification with category imbalance and cost-sensitive thresholding, Biosensors 11 (2021).
[35] S.D. Goodfellow, D. Shubin, R.W. Greer, S. Nagaraj, C. McLean, W. Dixon, A.J. Goodwin, A. Assadi, A. Jegatheeswaran, P.C. Laussen, M. Mazwi, D. Eytan, Rhythm classification of 12-lead ECGs using deep neural networks and class-activation maps for improved explainability, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[36] B.J. Singstad, C. Tronstad, Convolutional neural network and rule-based algorithms for classifying 12-lead ECGs, in: Proceedings of the Computing in Cardiology Conference (CinC), Rimini, Italy, 2020.
[37] M.I. Jordan, T.M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (2015) 255–260.
[38] B. Frenay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 845–869.
[39] D. Pathak, P. Krahenbuhl, T. Darrell, Constrained convolutional neural networks for weakly supervised segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1796–1804.

Qince Li: Ph.D., Associate Professor with the School of Computer Science and Technology at Harbin Institute of Technology in China. His research fields include computational modeling in cardiovascular system and machine learning for biomedical signal and image processing.

Yang Liu: Ph.D. candidate at the Perception Computing Center of the School of Computer Science and Technology, Harbin Institute of Technology. His research interests include biomedical signal processing and computer-aided diagnosis.

Ze Zhang: Ph.D. candidate at the Perception Computing Center of the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include medical image processing and artificial intelligence.

[28] Ľ. Antoni, E. Bruoth, P. Bugata, P. Bugata Jr, D. Gajdoš, Š. Horvát, D. Hudák, Jun Liu: Ph.D. candidate at the Perception Computing Center of the School of Computer
V. Kmečová, R. Staňa, M. Staňková, Automatic ECG classification and label quality Science and Technology, Harbin Institute of Technology. His-research interests include
in training data, Physiol. Meas. 43 (2022) 064008. biomedical signal processing, medical image processing and artificial intelligence.
[29] Y. Zhou, Y. Zhu, Q. Ye, Q. Qiu, J. Jiao, Weakly supervised instance segmentation
using class peak response, in: Proceedings of the IEEE Conference on Computer Yongfeng Yuan: Ph.D., Associated Professor with the School of Computer Science and
Vision and Pattern Recognition, 2018, pp. 3791–3800. Technology at Harbin Institute of Technology in China. His-research fields include
[30] Y. Liu, K. Wang, Y. Yuan, Q. Li, Y. Li, Y. Xu, H. Zhang, Multi-label classification of computational cardiology, image processing and virtual reality.
12-lead ECGs by using residual CNN and class-wise attention, in: Proceedings of
the 2020 Computing in Cardiology Conference (CinC), 2020.
[31] J. Hu, L. Shen, G. Sun, Ieee, squeeze-and-excitation networks, in: Proceedings of Kuanquan Wang: Ph.D., Professor with School of Computer Science and Technology, and
the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition the deputy director of Research Center of Perception and Computing at Harbin Institute of
(CVPR), Salt Lake City, UT, 2018, pp. 7132–7141. Technology. Senior member of Chinese Society of Biomedical Engineering. His-main
[32] K. Donnelly, SNOMED-CT: the advanced terminology and coding system for research areas include biometrics, biocomputing, modelling and simulation.
eHealth, Stud. Health Technol. Inform. 121 (2006) 279–290.
[33] Z. Chen, Assessing sequence comparison methods with the average precision Runnan He: Ph.D., Associate Professor with Tianjin University, Tianjin, China. His-
criterion, Bioinformatics 19 (2003) 2456–2460. research fields include medical signal processing, machine learning and deep learning.
