
Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound
Subhankar Roy, Willi Menapace, Sebastiaan Oei, Ben Luijten, Enrico Fini, Cristiano Saltori, Iris Huijben,
Nishith Chennakeshava, Federico Mento, Alessandro Sentelli, Emanuele Peschiera, Riccardo Trevisan,
Giovanni Maschietto, Elena Torri, Riccardo Inchingolo, Andrea Smargiassi, Gino Soldati, Paolo Rota,
Andrea Passerini, Ruud J.G. van Sloun, Elisa Ricci, Libertario Demi

Abstract— Deep learning (DL) has proved successful in medical imaging and, in the wake of the recent COVID-19 pandemic, some works have started to investigate DL-based solutions for the assisted diagnosis of lung diseases. While existing works focus on CT scans, this paper studies the application of DL techniques for the analysis of lung ultrasonography (LUS) images. Specifically, we present a novel fully-annotated dataset of LUS images collected from several Italian hospitals, with labels indicating the degree of disease severity at a frame-level, video-level, and pixel-level (segmentation masks). Leveraging these data, we introduce several deep models that address relevant tasks for the automatic analysis of LUS images. In particular, we present a novel deep network, derived from Spatial Transformer Networks, which simultaneously predicts the disease severity score associated to an input frame and provides localization of pathological artefacts in a weakly-supervised way. Furthermore, we introduce a new method based on uninorms for effective frame score aggregation at a video-level. Finally, we benchmark state of the art deep models for estimating pixel-level segmentations of COVID-19 imaging biomarkers. Experiments on the proposed dataset demonstrate satisfactory results on all the considered tasks, paving the way to future research on DL for the assisted diagnosis of COVID-19 from LUS data.

Index Terms— COVID-19, Lung ultrasound, Deep learning

I. INTRODUCTION

The rapid global SARS-CoV-2 outbreak resulted in a scarcity of medical equipment. In addition to a worldwide shortage of mouth masks and mechanical ventilators, testing capacity has been severely limited. Priority of testing was therefore given to suspected patients and hospital staff [1]. However, extensive testing and diagnostics are of great importance in order to effectively contain the pandemic. Indeed, countries that have been able to achieve large-scale testing of possibly infected people, combined with massive citizen surveillance, reached significant containment of the SARS-CoV-2 virus [2]. The insufficient testing capacity in most countries has therefore spurred the need and search for alternative methods that enable diagnosis of COVID-19. In addition, the accuracy of the current lab test, reverse transcription polymerase chain reaction (RT-PCR) arrays, remains highly dependent on swab technique and location [3].

COVID-19 pneumonia can rapidly progress into a very critical condition. Examination of radiological images of over 1,000 COVID-19 patients showed many acute respiratory distress syndrome (ARDS)-like characteristics, such as bilateral and multi-lobar ground-glass opacifications (mainly posteriorly and/or peripherally distributed) [4], [5]. As such, chest computed tomography (CT) has been coined as a potential alternative for diagnosing COVID-19 patients [4]. While RT-PCR may take up to 24 hours and requires multiple tests for definitive results, diagnosis using CT can be much quicker. However, use of chest CT comes with significant drawbacks: it is costly, exposes patients to radiation, requires extensive cleaning after scans, and relies on interpretation by a radiologist.

Lately, ultrasound imaging, a more widely available, cost-effective, safe and real-time imaging technique, is gaining attention. In particular, lung ultrasound (LUS) is increasingly used in point-of-care settings for detection and management of acute respiratory disorders [6], [7]. In some cases, it demonstrated better sensitivity than chest X-ray in detecting pneumonia [8]. Clinicians have recently described use of LUS imaging in the emergency room for diagnosis of COVID-19 [9]. Findings suggest specific LUS characteristics and imaging biomarkers for COVID-19 patients [10]–[12], which may be used to both detect these patients and manage the respiratory efficacy of mechanical ventilation [13]. The broad range of applicability and relatively low costs make ultrasound imaging an extremely useful technique in situations when patient inflow exceeds the regular hospital imaging infrastructure capabilities. Thanks to its low costs, it is also accessible for low- and middle-income countries [14].

Manuscript submitted on April 14, 2020. S. Oei*, B. Luijten*, I. Huijben*, N. Chennakeshava*, and R.J.G. van Sloun† are with the Dept. of Electrical Engineering, Eindhoven University of Technology, The Netherlands (e-mail: r.j.g.v.sloun@tue.nl). S. Roy*, C. Saltori*, E. Fini*, W. Menapace*, P. Rota†, E. Ricci†, and A. Passerini† are with the Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy (e-mail: e.ricci@unitn.it, andrea.passerini@unitn.it). S. Roy and E. Ricci are also with FBK, Trento, Italy. F. Mento*, A. Sentelli, E. Peschiera, R. Trevisan, G. Maschietto, and L. Demi† are with ULTRa, Dept. of Information Engineering and Computer Science, University of Trento, Trento, Italy (e-mail: libertario.demi@unitn.it). E. Torri is with Bresciamed, Italy. R. Inchingolo and A. Smargiassi are with the Dept. of Cardiovascular and Thoracic Sciences, Fondazione Policlinico Universitario A. Gemelli IRCCS, Italy. G. Soldati is with the Diagnostic and Interventional Ultrasound Unit, Valle del Serchio General Hospital, Lucca, Italy. *Joint first authors, contributed equally to this work. †Joint lead authors, contributed equally to this work.


Fig. 1: Overview of the different tasks considered in this work. Given a LUS image sequence, we propose approaches for: (orange) prediction
of the disease severity score for each input frame and weakly supervised localization of pathological patterns; (pink) aggregation of frame-level
scores for producing predictions on videos; (green) estimation of segmentation masks indicating pathological artifacts.

However, interpreting ultrasound images can be a challenging task and is prone to errors due to a steep learning curve [15].

Recently, automatic image analysis by machine and deep learning (DL) methods has already shown promise for reconstruction, classification, regression and segmentation of tissues using ultrasound images [16], [17]. In this paper we describe the use of DL to assist clinicians in detecting COVID-19 associated imaging patterns on point-of-care LUS. In particular, we tackle three different tasks on LUS imaging (Fig. 1): frame-based classification, video-level grading and pathological artifact segmentation. The first task consists of classifying each single frame of a LUS image sequence into one of the four levels of disease severity, defined by the scoring system in [12]. Video-level grading aims to predict a score for the entire frame sequence based on the same scoring scale. Segmentation instead comprises pixel-level classification of the pathological artifacts within each frame.

This paper advances the state of the art in the automatic analysis of LUS images for supporting medical personnel in the diagnosis of COVID-19 related pathologies in many directions. (1) We propose an extended and fully-annotated version of the ICLUS-DB database [18]. The dataset contains labels on the 4-level scale proposed in [12], both at frame and video-level. Furthermore, it includes a subset of pixel-level annotated LUS images useful for developing and assessing semantic segmentation methods. (2) We introduce a novel deep architecture which predicts the score associated to a single LUS image, as well as identifies regions containing pathological artifacts in a weakly supervised manner. Our network leverages Spatial Transformer Networks (STN) [19] and consistency losses [20] to achieve disease pattern localization, and a soft ordinal regression loss [21] for robust score estimation. (3) We introduce a simple and lightweight approach based on uninorms [22] to aggregate frame-level predictions and estimate the score associated to a video sequence. (4) We address the problem of automatic localization of pathological artifacts, evaluating the performance of state-of-the-art semantic segmentation methods derived from fully convolutional architectures. (5) Finally, we conduct an extensive evaluation of our methods on all the tasks, showing that accurate prediction and localization of COVID-19 imaging biomarkers can be achieved with the proposed solutions. Dataset and code are available at https://iclus-web.bluetensor.ai and at https://github.com/mhug-Trento/DL4covidUltrasound.

II. RELATED WORK

DL has proven to be successful in a multitude of computer vision tasks, ranging from object recognition and detection to semantic segmentation. Motivated by these successes, more recently, DL has been increasingly used in medical applications, e.g. for biomedical image segmentation [23] or pneumonia detection from chest X-ray [24]. These seminal works indicate that, with the availability of data, DL can lead to the assistance and automation of preliminary diagnoses, which are of tremendous significance in the medical community.

In the wake of the current pandemic, recent works have focused on the detection of COVID-19 from chest CT [25], [26]. In [27], a U-Net type network is used to regress a bounding box for each suspicious COVID-19 pneumonia region on consecutive CT scans, and a quadrant-based filtering is exploited to reduce possible false positive detections. Differently, in [28] a threshold-based region proposal is first used to retrieve the regions of interest (RoIs) in the input scan and the Inception network is exploited to classify each proposed RoI. Similarly, in [29], a VNET-IR-RPN model pre-trained for pulmonary tuberculosis detection is used to propose RoIs in the input CT and a 3D version of Resnet-18 is employed to classify each RoI. However, very few works using DL on LUS images can be found in the literature [30]. A classification and weakly-supervised localization method for lung pathology is described in [17]. Based on the same idea, in [18] a frame-based classification and weakly-supervised segmentation method is applied on LUS images for COVID-19 related pattern detection. Here, Efficientnet is trained to recognize COVID-19 in LUS images, after which class activation maps (CAMs) [31] are exploited to produce a weakly-supervised segmentation map of the input image. Our work has several differences compared to all the previous works. First, while in [18] CAMs are used for localization, in this work we exploit STN to learn a weakly-supervised localization policy from the data (i.e. not exploiting explicit labelled locations but inferring them from simple frame-based classification labels). Second, while in [18] a classification problem is solved, we focus on ordinal regression, predicting not only the presence of COVID-19 related artifacts, but also a score connected to the disease severity. Third, we move a step forward compared to all previous methods by proposing a video-level prediction model


[Figure 2: per-hospital bar charts (Brescia, Lucca, Pavia, Rome, Tione, and total) of the number of frames per probe type (convex/linear) and score (0–3).]
Fig. 2: Distribution of the probes and of the frame scores, grouped by hospital, together with the overall statistics.

built on top of the frame-based method. Finally, we propose a simple yet effective method to predict segmentation masks using an ensemble of multiple state-of-the-art convolutional network architectures for image segmentation. Additionally, the model's predictions are accompanied with uncertainty estimates to facilitate interpretation of the results.

III. ICLUS-DB: DATA COLLECTION AND ANNOTATION

We here present the Italian COVID-19 Lung Ultrasound DataBase (ICLUS-DB), which currently includes a total of 277 lung ultrasound (LUS) videos from 35 patients, corresponding to 58,924 frames¹. The data were acquired within different clinical centers (BresciaMed, Brescia, Italy; Valle del Serchio General Hospital, Lucca, Italy; Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy; Fondazione Policlinico Universitario San Matteo IRCCS, Pavia, Italy; Tione General Hospital, Tione (TN), Italy) and using a variety of ultrasound scanners (Mindray DC-70 Exp, Esaote MyLab Alpha, Toshiba Aplio XV, WiFi Ultrasound Probes - ATL). Both linear and convex probes were used, depending on necessity. Of the 35 patients, 17 were confirmed positive for COVID-19 by swab testing (49%), 4 were COVID-19 suspected (11%), and 14 were healthy and symptomless individuals (40%).

A recent proposal by Soldati et al. describes how specific imaging biomarkers in LUS can be used in the management of COVID-19 patients [12]. Specifically, to evaluate the progression of the pathology, a 4-level scoring system was devised [32], with scores ranging from 0 to 3. Score 0 indicates the presence of a continuous pleural-line accompanied by horizontal artifacts called A-lines [33], which characterize a healthy lung surface. In contrast, score 1 indicates the first signs of abnormality, i.e., the appearance of alterations in the pleural-line in conjunction with vertical artifacts. Scores 2 and 3 are representative of a more advanced pathological state, with the presence of small or large consolidations, respectively. Finally, score 3 is associated with the presence of a wider hyperechogenic area below the pleural surface, which can be referred to as "white lung".

A total of 45,560 and 13,364 frames, acquired using respectively convex and linear probes, were labelled according to the scoring system defined above. Of the 58,924 LUS frames forming the dataset, 5,684 were labeled score 3 (10%), 18,972 score 2 (32%), 14,295 score 1 (24%), and 19,973 score 0 (34%). A plot showing the distribution of the scores and probes per hospital is shown in Fig. 2. To guarantee objective annotation, the labelling process was stratified into 4 levels: 1) score assigned frame-by-frame by four master students with ultrasound background knowledge, 2) validation of the assigned scores performed by a PhD student with expertise in LUS, 3) second level of validation performed by a biomedical engineer with more than 10 years of experience in LUS, and 4) third level of validation and agreement between clinicians with more than 10 years of experience in LUS.

Additionally, a subset of 60 videos sampled across all 35 patients was selected and video-level annotations were provided for them. These annotations use the same scoring system defined for the frame-level annotations. In order to address subjective biases in the evaluation of the videos, five different clinicians provided their evaluation for each sequence. We assess the complexity of this task by calculating the inter-operator agreement, comparing the evaluation of each doctor against the average prediction of the remaining four. The resulting average agreement is about 67% among the available labels.

Finally, for 33 patients, a total of 1,005 and 426 frames, respectively acquired using convex and linear probes, were semantically annotated at a pixel-level by contouring the aforementioned imaging biomarkers using the annotation tool LabelMe [34]. For the frames acquired using the linear probe, the relative pixel-level occurrences for scores 0, 1, 2, and 3 are 6.4%, 0.080%, 0.67%, and 3.7%, respectively. For the convex probe, these statistics are 1.9%, 0.074%, 1.8%, and 2.1%, respectively. Notably, a large proportion of pixels is not associated with any of these scores. These pixels do not display clear characteristics of a specific class, and are referred to as background (BG). A few images and the corresponding annotations are shown in the supplementary material.

¹https://iclus-web.bluetensor.ai

IV. DEEP LEARNING-BASED ANALYSIS OF LUS IMAGES

This paper tackles several challenges towards the development of automatic approaches for supporting medical personnel in the diagnosis of COVID-19 related pathologies (see Fig. 1). In particular, following the COVID-19 LUS scoring system in [12], we present a novel deep architecture which automatically predicts the pathological scores associated to all frames of a LUS image sequence (Section IV-A) and optimally fuses them to produce a disease severity score at video-level (Section IV-B). We also show that the proposed model automatically identifies regions in an image which are associated to pathological artifacts, without requiring pixel-level annotations. Finally, to further improve the accuracy in the automatic detection of disease-related patterns, we also


consider a scenario where frames are provided with pixel-level annotations and we propose a segmentation model derived from a state of the art convolutional network architecture (Section IV-C). In the following, we describe the proposed deep learning models.

A. Frame-based score prediction

1) Problem formulation and notation: With the purpose of supporting medical personnel in the analysis of LUS images, in this paper we introduce an approach for predicting the presence or the absence of a pathological artifact in each frame of a LUS image sequence, and for automatically assessing the severity score of the disease related to such patterns according to the COVID-19 LUS scoring system [12]. We are also interested in the spatial localisation of a pathological artifact in the frame without assuming any annotation about such artifact positions. The weak localization is achieved through the use of Spatial Transformer Networks (STN) [19]. The use of STN stems from the fact that most of the pathological artifacts are concentrated in a relatively small area of the image, and hence the entire image should not be considered by the network to make predictions. The problem can be formalized as follows. Let X denote the input space (i.e. the image space) and S the set of possible scores. During training, we are given a training set T = {(x_n, s_n)}_{n=1}^N where x_n ∈ X and s_n ∈ S.

2) Model definition: We are interested in learning a mapping Φ : X → S, which given an input LUS image outputs the associated pathological score label. We model Φ as the composition of two functions, Φ = Φcnn ∘ Φstn, where Φstn : X → X estimates an affine transformation and applies it to the input image x, and Φcnn : X → S assigns the score to the transformed image. Intuitively, Φstn learns to localize regions of interest in the input image and provides Φcnn with an image crop where information about the score is most salient. Consequently, Φstn produces as a side effect the localization of pathological artifacts in the frame. The mapping Φcnn is composed of a convolutional feature extractor and a linear layer with |S|-dimensional output logits. The model Φstn is implemented as a deep neural network derived from STN [19]. Fig. 3 shows an overview of the proposed deep architecture.

Fig. 3: Illustration of the architecture for frame-based score prediction. An STN modeled by Φstn predicts two transformations θ1 and θ2, which are applied to the input image, producing two transformed versions x1 and x2 that localize pathological artifacts. The feature extractor Φcnn is applied to x1 to generate the final prediction.

In the context of deep learning, the generalization capability of a network is of critical importance. To this end, data augmentation has been shown to be very effective [35] in improving the performance of a network. Previous works [18] showed that augmenting a dataset composed of LUS images can drastically improve the ability of the network to discriminate healthy and ill patients. Another way to achieve robust predictions is to enforce some consistency between two perturbed versions (colour jitter, dropout, etc.) of the same image [20], [36]. This makes the network produce smoothed predictions by attending to the salient features in an image. Inspired by this idea, we propose to use STN [19] to produce two different crops from a single image and enforce the predictions of the network to be similar. We name our approach Regularised Spatial Transformer Networks (Reg-STN).

STN [19] is a differentiable module that applies a learnable affine transformation to an input image, or more generally to a feature map, conditioned on the input itself. It consists of three parts: (i) a localization network that predicts the parameters of the affine transformation, (ii) a grid generator which selects the grid coordinates in the source image to be sampled from, and (iii) a sampler that warps the input image based on the transformation, producing the output map.

The localization network is trained to output a transformation matrix θ such that:

$$\begin{pmatrix} \alpha^{s} \\ \beta^{s} \end{pmatrix} = \theta \begin{pmatrix} \alpha^{t} \\ \beta^{t} \\ 1 \end{pmatrix} \qquad (1)$$

where α^s, β^s and α^t, β^t are the source and target coordinates in the input and output feature map, respectively. In principle θ can describe any affine transformation; however, keeping in mind the properties of LUS images, we restrict the space of possible transformations to rotation, translation, and isotropic scaling:

$$\theta = \begin{pmatrix} \sigma & r_{1} & \tau_{\alpha} \\ r_{2} & \sigma & \tau_{\beta} \end{pmatrix} \qquad (2)$$

In our proposed method, an input image x is processed by Φstn, which predicts two sets of transformation parameters, θ1 and θ2, instead of a single θ. Subsequently, the transformations are applied to x, generating cropped images x1 and x2, respectively. The network Φcnn is then applied to x1 and x2, producing two sets of logits for the same image under different transformations. As a side effect, the intermediate images x1 and x2 are produced and can be interpreted as the localization of the pathological artifacts in the input image x. Finally, the Φcnn(x1) branch can be trained with any standard supervised classification loss, while (Φcnn(x1), Φcnn(x2)) is trained with a consistency-enforcing loss (see below).
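For illustration, the following is a minimal PyTorch sketch of such a constrained spatial transformer: a localization network predicts isotropic scale, rotation and translation for two crops, as in Eq. (2), which are then materialized with affine_grid/grid_sample. The localization backbone, layer sizes and parameter ranges shown here are our own illustrative assumptions; the paper derives its localization network from the CNN of [17].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSTN(nn.Module):
    """Predicts two constrained affine transforms (isotropic scale, rotation,
    translation) and returns the two corresponding crops of the input."""

    def __init__(self, crop_size=(224, 224)):
        super().__init__()
        self.crop_size = crop_size
        # Toy localization network (stand-in for the paper's ConvNet).
        self.loc = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per transformation; each outputs (scale, rotation, tx, ty).
        self.heads = nn.ModuleList([nn.Linear(32, 4) for _ in range(2)])

    @staticmethod
    def make_theta(p):
        scale = torch.sigmoid(p[:, 0])            # sigma kept in (0, 1)
        rot = torch.tanh(p[:, 1]) * 0.5           # small rotation (radians)
        tx, ty = torch.tanh(p[:, 2]), torch.tanh(p[:, 3])
        cos, sin = torch.cos(rot), torch.sin(rot)
        # 2x3 matrix implementing the restricted family of Eq. (2).
        theta = torch.stack([
            torch.stack([scale * cos, -scale * sin, tx], dim=1),
            torch.stack([scale * sin,  scale * cos, ty], dim=1),
        ], dim=1)
        return theta, scale

    def forward(self, x):
        feat = self.loc(x)
        crops, scales = [], []
        for head in self.heads:
            theta, scale = self.make_theta(head(feat))
            grid = F.affine_grid(
                theta, (x.size(0), x.size(1), *self.crop_size),
                align_corners=False)
            crops.append(F.grid_sample(x, grid, align_corners=False))
            scales.append(scale)   # usable for a scale-prior loss (see below)
        return crops, scales

# Example: a batch of 8 single-channel LUS frames -> two crops each.
x = torch.randn(8, 1, 464, 400)
(x1, x2), (sigma1, sigma2) = ConstrainedSTN()(x)
```

The two crops would then feed Φcnn, and the returned scales can be tied to fixed priors as described in the loss definition below.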


3) Loss definition: As stated before, we are interested in devising a deep network Φ for automatically predicting the 4-level scores identified in [12]. While this problem can trivially be cast within a classification framework, in this paper we argue that ordinal regression [37] is more appropriate, as we are interested in predicting labels from an ordinal scale. The rationale behind the choice of ordinal regression is that there exist certain categories that are more correct than others with respect to the true label, as opposed to an independent-class scenario, in which the order of the levels does not matter. In fact, errors on low-distance levels should be penalized less than long-distance errors. For instance, predicting a severely ill patient (score 3) as healthy (score 0) should be strongly discouraged, while sometimes the difference between score 1 and score 2 can be subtle and the network should not be overly penalized.

While ordinal regression can be implemented resorting to the traditional approach of decomposing the problem assuming a |S|-rank formulation [38], following [21] we introduce a lightweight approach for Soft ORDinal regression (SORD). In practice, we implement an ordinal regression framework by using a carefully devised label smoothing mechanism. Instead of one-hot representations of labels, we encode the ground truth information into soft-valued vectors (SORD vectors) ŝ ∈ R^{|S|}, where S is the set of possible scores for a frame. Hence, for a frame x with score s ∈ S, the i-th element of the SORD vector is computed as follows:

$$\hat{s}_{i} = \frac{e^{-\delta(s,i)}}{\sum_{j \in \mathcal{S}} e^{-\delta(s,j)}} \qquad (3)$$

where δ is a manually defined distance function between scores/levels, for which we use the square distance multiplied by a constant factor. This formulation produces a smooth probability distribution over S, in which the magnitude of the elements decreases as the distance to the ground truth increases. Encoding ground truth labels as probability distributions seamlessly blends with common classification loss functions that use a softmax output. Therefore, at training time, we simply train the network Φ using cross entropy:

$$\mathcal{L}_{SORD} = -\sum_{i \in \mathcal{S}} \hat{s}_{i} \log\!\left(\frac{\exp(\Phi(x)_{i})}{\sum_{j \in \mathcal{S}} \exp(\Phi(x)_{j})}\right) \qquad (4)$$

The result is a loss function that yields a smaller cost for predictions in the neighbourhood of the ground truth label, which in turn generates smaller gradients, hence discouraging drastic updates of the network for small errors. Empirically, we found that our algorithm works best when we increase the distance of score 0 from the others. As mentioned before, this is also validated by the semantics of the scores.
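To make Eqs. (3)–(4) concrete, here is a minimal sketch of the soft-target construction and the resulting loss; the constant alpha scaling the squared distance is an illustrative assumption (the paper only states that a constant multiplier is used).

```python
import torch
import torch.nn.functional as F

def sord_targets(scores, num_classes=4, alpha=2.0):
    """Soft ordinal targets (Eq. (3)): a softmax over negative squared
    distances between the true score and every candidate level."""
    levels = torch.arange(num_classes, dtype=torch.float32)
    # alpha is an assumed constant factor multiplying the square distance.
    dist = alpha * (scores.float().unsqueeze(1) - levels.unsqueeze(0)) ** 2
    return F.softmax(-dist, dim=1)

def sord_loss(logits, scores):
    """Cross entropy against the soft targets (Eq. (4))."""
    targets = sord_targets(scores, num_classes=logits.size(1))
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Example: a score-3 frame mispredicted as 2 costs less than as 0.
logits = torch.tensor([[0.1, 0.2, 3.0, 0.5]])
print(sord_loss(logits, torch.tensor([3])))
```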
Another desirable property of the network is to extract important semantic features of the input image, in order to enable accurate frame score prediction. This can be strengthened by resorting to a regularization in the form of a consistency loss on the two branch predictions (Φcnn(x1), Φcnn(x2)), with the rationale that two different crops from the same image should have similar predictions. In our case, these two crops are produced by Φstn. In detail, the consistency loss is defined on the network representations as follows:

$$\mathcal{L}_{MSE} = \lVert \Phi_{cnn}(x_{1}) - \Phi_{cnn}(x_{2}) \rVert_{2}^{2} \qquad (5)$$

Unfortunately, L_MSE coupled with learnable affine transformations produces degenerate solutions in which the localization network of the STN learns to output identical parameters for the affine transformations. In fact, it is enough to impose θ1 = θ2 to minimize L_MSE. To prevent this pathological behaviour of the network, we enforce a prior on the parameters of the transformations. In particular, we stimulate the localization network to produce reasonably scaled patches by minimizing |σ − σp|, where σp is a fixed prior. Then, in order to push the STN towards yielding different parameters θ1 ≠ θ2, we simply choose σp1 ≠ σp2. Hence, the prior loss is defined as follows:

$$\mathcal{L}_{P} = \lvert \sigma_{1} - \sigma_{p_{1}} \rvert + \lvert \sigma_{2} - \sigma_{p_{2}} \rvert \qquad (6)$$

Finally, the proposed Reg-STN model is trained end-to-end by minimizing the following joint loss function:

$$\mathcal{L}_{TOT} = \mathcal{L}_{SORD} + \mathcal{L}_{MSE} + \mathcal{L}_{P} \qquad (7)$$

4) Training strategy: We split the ICLUS-DB dataset into a train and a test split. The test split comprises 80 videos from 11 patients, with a total of 10,709 frames. All the frames from the remaining videos are included in the train set. The split is performed at patient level, such that the sets of patients in the training and test set are disjoint. The STN is modeled by a ConvNet similar to [17]. Specifically, we removed the average pooling and the output layer and replaced them with two fully connected layers that predict the affine transformation parameters. The CNN architecture [17] is kept unchanged. The STN and CNN are jointly trained using the Adam optimizer with an initial learning rate of 10^-4 and a batch size of 64, for 120 epochs. We also used data augmentation strategies and learning rate decay similar to those suggested in [17], [18]. We set the values of the priors σp1 and σp2 to 0.50 and 0.75 respectively, leveraging the prior knowledge about LUS images that pathological artifacts roughly cover 25% to 50% of the area of the image.

B. Video-level score aggregation

1) Problem formulation and notation: The identification of potentially pathological artifacts in LUS images is a crucial step towards diagnosis support. However, frame-based predictions should be turned into a single video-based score prediction in order to assess the pathological state of a patient. The video-based score aggregation problem can be formalized as follows. Let v = {x_i}_{i=1}^M be a video, V be the set of videos of any length, and S the set of scores. The goal of video-level score prediction is learning a mapping Ψ : V → S.

In principle the mapping Ψ could be obtained by taking the maximum score assigned to any frame of the current video, because the identification of an artifact of score s in a frame implies that the patient has a severity level of at least s. This hard rule, however, is inapplicable in practice when dealing with machine-predicted scores, as even a single frame-based prediction error could harm the overall prediction. Thus, in this section we propose a more flexible aggregation mechanism devised for predicting the score associated to a video, leveraging the video-level annotations provided in the ICLUS-DB (Section III).

2) Model definition: In designing the model Ψ, we consider the fact that it needs to operate in a low-data regime, where few videos are provided with annotations, as in the current version of the ICLUS-DB. Inspired by the hard rule previously mentioned, we propose a simple strategy that combines frame-level predictions using a parameterized aggregation layer, i.e.:

$$\Psi(v) = \Psi_{U}(\Phi(x_{1}), \ldots, \Phi(x_{M})) \qquad (8)$$

Here Φ is the frame-level mapping and Ψ_U is an aggregation function based on uninorms [39], which are a principled way


to soften the hard rule. A uninorm U is a monotonically increasing, commutative and associative mapping from [0, 1] × [0, 1] to [0, 1] with neutral element e ∈ [0, 1]. This means that U(a, e) = U(e, a) = a for all a ∈ [0, 1]. If e = 1, U is fully non-compensatory (like taking the minimum between a and b), while it is fully compensatory if e = 0 (like taking the maximum). Choosing e ∈ (0, 1) allows the uninorm to have a hybrid behaviour. Note that, being associative, uninorms can be applied to an arbitrary number of inputs (e.g., U(a, b, c) = U(U(a, b), c)). Following [22], we learn the appropriate value for the neutral element e from data.

Our aggregation layer takes as input the sequence of frame-based prediction scores Φ(x), aggregates them along each dimension/score using a uninorm U, and returns the softmax of the resulting aggregation as a video-based prediction. The layer has only four parameters, namely the neutral elements for each candidate score {0, 1, 2, 3}, and it is thus amenable to training with little supervision.

Any uninorm with neutral element e can be written as [39]:

$$U_{e}(a,b) = \begin{cases} e\, T\!\left(\tfrac{a}{e}, \tfrac{b}{e}\right) & \text{if } a, b \in [0, e] \\ e + (1-e)\, S\!\left(\tfrac{a-e}{1-e}, \tfrac{b-e}{1-e}\right) & \text{if } a, b \in [e, 1] \\ \hat{U}(a, b) & \text{otherwise} \end{cases} \qquad (9)$$

for a certain choice of T, S and Û(a, b) such that min(a, b) ≤ Û(a, b) ≤ max(a, b). The functions T and S are called t-norm and t-conorm respectively, and model the non-compensatory and compensatory behaviour. Different choices for these functions lead to different uninorms. We found the product t-norm T(a, b) = ab (and corresponding t-conorm S(a, b) = a + b − ab) to be the most effective choice, as it allows the gradient to flow the most. Concerning the function Û(a, b), common choices are min(a, b) and max(a, b), producing the so-called min-uninorms and max-uninorms respectively. We found min-uninorms to be the best choice in our setting (with respect to max(a, b) but also mean(a, b)), likely because of their fully non-compensatory behaviour in the area of highest discrepancy between frame-based predictions.
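As an illustration, below is a minimal sketch of such an aggregation layer with one learnable neutral element per score, using the product t-norm/t-conorm inside Eq. (9) and min(a, b) elsewhere; the sigmoid parameterization of e and all names are our own choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UninormAggregation(nn.Module):
    """Aggregates per-frame score probabilities into a video-level
    prediction with one learnable neutral element per score (Eq. (9))."""

    def __init__(self, num_scores=4):
        super().__init__()
        # Unconstrained parameters mapped to e in (0, 1) via a sigmoid.
        self.e_raw = nn.Parameter(torch.zeros(num_scores))

    @staticmethod
    def uninorm(a, b, e):
        both_low = (a <= e) & (b <= e)
        both_high = (a >= e) & (b >= e)
        low = a * b / e                        # scaled product t-norm
        sa, sb = (a - e) / (1 - e), (b - e) / (1 - e)
        high = e + (1 - e) * (sa + sb - sa * sb)   # scaled t-conorm
        other = torch.minimum(a, b)            # min-uninorm elsewhere
        return torch.where(both_low, low, torch.where(both_high, high, other))

    def forward(self, frame_probs):
        # frame_probs: (num_frames, num_scores), each entry in [0, 1].
        e = torch.sigmoid(self.e_raw)
        agg = frame_probs[0]
        for t in range(1, frame_probs.size(0)):    # valid by associativity
            agg = self.uninorm(agg, frame_probs[t], e)
        return torch.softmax(agg, dim=-1)          # video-level prediction

# Example: 5 frames, 4 candidate scores.
video_pred = UninormAggregation()(torch.rand(5, 4))
```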
3) Loss definition: The architecture is trained using the SORD loss described in Eq. (4), computed over the video-level prediction.

4) Training strategy: The frame-based predictor outputs prediction scores with a distribution that differs between the training and the test set. In order not to overfit the video-based predictor on the training score distribution, we completely separate the training sets of the frame-based and video-based predictors. We train the frame-based predictor on all video sequences T without any video-based annotation, and evaluate it on the remaining sequences T'. We then train and evaluate the video-based predictor on T', using a k-fold cross validation procedure (k = 5) with splits made at the patient level (i.e. all videos from the same patient are in the same fold). We choose as video-level annotations the ones produced by the first annotator, the clinician with the highest expertise. We train our model using an Adam optimizer with learning rate 10^-2, without weight decay and with no learning rate scheduling. For each epoch, we compute the loss for each training video sequence and accumulate its gradients, performing a single optimization step at the end of each epoch. We train the model for a maximum of 30 epochs and use the loss on the training set to define an early stopping strategy.
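A sketch of this training loop under the stated setup (frozen frame-level predictor, gradients accumulated over all training videos, one optimizer step per epoch); aggregator, frame_model and loss_fn are placeholders for the components described above.

```python
import torch

def train_epoch(aggregator, frame_model, videos, labels, optimizer, loss_fn):
    """One epoch: accumulate gradients over every training video and
    apply a single optimizer step at the end (frame model stays frozen)."""
    optimizer.zero_grad()
    for video, label in zip(videos, labels):
        with torch.no_grad():                  # frame-level weights frozen
            frame_probs = torch.softmax(frame_model(video), dim=1)
        pred = aggregator(frame_probs)         # uninorm aggregation layer
        # loss_fn is e.g. a SORD-style loss over the aggregated prediction.
        loss = loss_fn(pred.unsqueeze(0), label.unsqueeze(0))
        loss.backward()                        # gradients accumulate
    optimizer.step()                           # single step per epoch
```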
Note that the entire architecture, including the frame-level component, could be trained end-to-end. However, this solution is not effective given the vast disproportion in the amount of supervision at the video and frame levels currently available in ICLUS-DB. We thus trained the aggregation layer after freezing the weights of the frame-based architecture. Full end-to-end training combining frame-based and video-based supervision will be investigated in future work.

C. Semantic Segmentation

1) Problem formulation and notation: Let X = R^{i×j} and Y denote the input (i.e. the image space) and output (i.e. the segmentation masks) space, respectively. In the earlier presented frameworks for image- and video-based classification, the score set was defined as S = {0, 1, 2, 3}. For semantic segmentation we however distinguish five different scores, i.e. the four scores in S, complemented by the background (BG) score, assigned to pixels that were not annotated as showing markers associated with any of the classes in S. As such, Y = {0, 1, 2, 3, BG}^{i×j}.

2) Model definition: We are interested in learning a mapping Ω : X → Y, which given an input LUS image outputs the associated pathological segmentation mask. To model Ω, we compare several network architectures for end-to-end image segmentation: the vanilla U-Net [23], and the more recently proposed U-Net++ [40] and Deeplabv3+ [41].

Our baseline U-Net model has three encoding layer blocks, each comprising two convolutional layers with ReLU activations and one maxpool layer (pooling across 2, 2, and 5 pixels in both dimensions, respectively), a latent layer, and a mirrored decoder (where pooling is replaced by nearest neighbour upsampling). We use skip connections between each layer block of the encoder and decoder. To mitigate overfitting we apply dropout (p = 0.5) during training at the latent bottleneck of the model. The U-Net++ variant leverages the first four encoder blocks of the ResNet50 model [42] to construct a latent space. The latent space is upsampled in the decoder stage by means of transposed 2D convolutional layers. The decoder contains residual blocks, and also exploits skip connections between (same-sized) hidden layer outputs in the ResNet50 encoder and the decoder. The Deeplabv3+ model similarly employs an encoder-decoder structure, where features are extracted using spatial pyramid pooling (i.e. pooling at different grid scales) and atrous convolutions, resulting in decoded segmentation maps with detailed object boundaries.

3) Loss definition: We adopt a pixel-wise categorical cross-entropy loss between the segmentation masks g(y_n) and the model predictions ŷ_n = Ω(h(x_n)). Functions g(·) and h(·) are pre-processing transformations applied prior to training. Function h(·) comprises the resizing of all acquired B-mode images to 260 × 200 pixels, preserving the original aspect ratio of the scans by appropriate zero padding, and subsequent normalization between -1 and 1.
scheduling. For each epoch, we compute the loss for each 4) Training strategy: Due to the larger (and more represen-
train video sequence and accumulate its gradients, performing tative) set of pixel-level annotations for the convex probe,
a single optimization step at the end of each epoch. We train compared to the linear probe acquisitions (1,005 and 426



Fig. 4: Examples of the image crops produced by the Reg-STN network. The first column shows input images acquired with linear and convex sensors, respectively. The second column reports the heatmaps produced by GradCam [44] and the bounding boxes obtained by thresholding. The remaining columns show the original image overlaid with bounding boxes and the two respective crops (in red and green) produced when the Reg-STN models: a) only translation with a fixed scaling; b) all possible transformations, viz. translation, scaling and rotation. In each case the Reg-STN focuses on the most salient parts, which contain the pathological artifacts.

annotations, respectively), we here specifically focus on the convex acquisitions. We split our dataset into a train (70%) and test set (30%) at a patient level, i.e. all movies and frames from one patient fall into a specific set. Among the 1,005 frames, a total of 1,158 imaging biomarkers were segmented.

During training, we are given a training set of N image-label pairs T = {(x_n, y_n)}_{n=1}^N where x_n ∈ X and y_n ∈ Y. The model parameters are learned by back-propagating the earlier defined categorical cross-entropy using the Adam optimizer (default settings), with a learning rate of 10^-5. Training was stopped upon convergence of the training loss.

Each training batch consists of 32 B-mode images and their corresponding segmentation masks, which are balanced across patients and scores to avoid biases resulting from the length of the ultrasound scan (number of frames in a single video) and the population-level distribution of scores. While these biases generally aid the overall accuracy, they hamper patient-level decision making across demographics.

To promote invariance to common LUS image transformations and thereby improve generalization at inference, each image-label pair is heavily manipulated on-line during training by a set of augmentation functions, each activated on the image-label pair with a probability of 0.33. The set of augmentation functions, each applied with a randomly sampled strength bounded by a set maximum, consists of: affine transformations (translation (max. ±15%), rotation (max. ±15°), scaling (max. ±45%), and shearing (max. ±4.5°)), multiplication with a constant (max. ±45%), Gaussian blurring (σmax = 3/4), contrast distortion (max. ±45%), horizontal flipping (p = 0.5), and additive white Gaussian noise (σmax = 0.015).
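As one possible realization of this augmentation pipeline (the paper does not name its implementation), here is a sketch using the imgaug library, applying identical geometric transforms to image and mask; the mapping of the noise strength to 8-bit intensities is an assumption.

```python
import imgaug.augmenters as iaa
from imgaug.augmentables.segmaps import SegmentationMapsOnImage

# Each augmenter fires with probability 0.33; strengths are sampled up to
# the maxima listed above. Geometric ops affect image and mask alike.
p = lambda aug: iaa.Sometimes(0.33, aug)
augmenter = iaa.Sequential([
    p(iaa.Affine(translate_percent=(-0.15, 0.15), rotate=(-15, 15),
                 scale=(0.55, 1.45), shear=(-4.5, 4.5))),
    p(iaa.Multiply((0.55, 1.45))),
    p(iaa.GaussianBlur(sigma=(0.0, 0.75))),
    p(iaa.LinearContrast((0.55, 1.45))),
    iaa.Fliplr(0.5),
    p(iaa.AdditiveGaussianNoise(scale=(0.0, 0.015 * 255))),  # assumes uint8
])

def augment_pair(image, mask):
    """Apply the same randomly sampled transformations to a B-mode image
    and its segmentation mask."""
    segmap = SegmentationMapsOnImage(mask, shape=image.shape)
    img_aug, seg_aug = augmenter(image=image, segmentation_maps=segmap)
    return img_aug, seg_aug.get_arr()
```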
5) Inference: To further boost robustness and performance, we apply model ensembling and calculate the unweighted average over the predicted softmax logits of the U-Net, U-Net++, and Deeplabv3+ models (all trained with data augmentation). To allow for qualitative assessment of the uncertainty of the predictions, we produce pixel-level estimates of model uncertainty by using Monte-Carlo (MC) dropout [43]. During inference, we stochastically apply dropout in the latent space, yielding multiple point estimates of our class predictions. The amount of variation in the resulting predictions ultimately provides an indication of uncertainty for every pixel.
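A minimal sketch of this ensembled MC-dropout inference, assuming PyTorch segmentation models; dropout layers are kept in train mode so that sampling remains stochastic at test time.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_ensemble(models, x, num_samples=40):
    """Average softmax predictions over an ensemble and over stochastic
    MC-dropout passes; return per-pixel mean and standard deviation."""
    samples = []
    for model in models:
        model.eval()
        # Re-enable dropout only, so MC sampling stays active at test time.
        for m in model.modules():
            if isinstance(m, (nn.Dropout, nn.Dropout2d)):
                m.train()
        for _ in range(num_samples):
            samples.append(torch.softmax(model(x), dim=1))
    stacked = torch.stack(samples)          # (models * samples, B, C, H, W)
    return stacked.mean(0), stacked.std(0)  # prediction and uncertainty map
```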
weighted F1 score, Precision and Recall. These are obtained
V. EXPERIMENTAL RESULTS

A. Frame-based score prediction

To evaluate the performance of our proposed frame-based scoring method and its constituent components, we consider the following baselines: i) CNN trained with Cross Entropy loss (CE), ii) CNN trained with SORD, iii) Resnet-18 trained with SORD, iv) STN-based CNN trained with SORD, v) CNN+Random Crop+SORD, a CNN trained with SORD on random crops rather than the bounding boxes extracted by the STN, and vi) our proposed Reg-STN model.

In Table I, we evaluate the performance of our method in terms of F1-score. Since the annotations in LUS images are quite subjective (see below), we also report results under two additional evaluation settings, defined as Setting 2 and Setting 3: i) Setting 1 considers the F1 score computed on the entire test set; ii) Setting 2 considers the F1 score computed on a modified version of the test set obtained by dropping, for each video, the K frames before and after each transition between two different ground truth scores, potentially removing ambiguous frames that present characteristics at the boundary between two classes and thereby allowing us to identify the impact of noisy labeling on the performance of the model; and iii) in Setting 3, we drop the most challenging videos by using the inter-doctor agreement between the 5 independent video-level annotations. In practice, we only keep in the test set the videos with at least A doctors agreeing on the video-level annotations. For completeness, under Setting 3 we also report the scores obtained on the complete portion of the test set containing video-level annotations (Video Ann.).

As shown in Table I, our proposed Reg-STN trained with SORD beats the baseline models in most of the settings and is the second best in the remaining ones. On average, Reg-STN performs the best amongst all baselines. This demonstrates the effectiveness of our proposed method for frame-based pathology detection in LUS images. Our experiments were run on an NVIDIA RTX-2080 GPU. As for computational complexity, it takes ∼11 hours to train a CNN+Reg-STN+SORD model on this hardware.

B. Video-based score prediction

We evaluate video-based score prediction in terms of weighted F1 score, Precision and Recall. These are obtained by first computing the metric for each score (zero to three), and then computing the weighted average over scores, where the weight is the fraction of instances having that score. Note that weighted recall corresponds to (multiscore) accuracy, i.e., the fraction of correctly predicted scores over the total number of predictions.
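With scikit-learn, these weighted metrics can be reproduced as follows (a toy check, not the authors' evaluation code); the assertion verifies the identity between weighted recall and accuracy mentioned above.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 1, 2, 3, 2, 0]   # toy video-level ground truth scores
y_pred = [0, 2, 2, 3, 1, 0]   # toy predictions

f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted")
# Weighted recall equals multi-class accuracy.
assert abs(rec - accuracy_score(y_true, y_pred)) < 1e-12
```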


TABLE I: F1 scores (%) for the frame-based classification under different evaluation settings. Setting 1 represents evaluation on the full test set, Setting 2 represents the analysis on the test set with dropped transition frames, and Setting 3 represents the analysis accounting for inter-doctor agreement; the baseline for Setting 3 is the evaluation on the set of test sequences with video-level annotations (Video Ann.). Best and second best F1 scores are in bold and underlined, respectively.

Model                     Setting 1   Setting 2 (Drop Transition Frames)   Setting 3 (Inter-doctor Agreement)   Avg
                                      K=1    K=3    K=5    K=7             Video Ann.   A=3    A=4
CNN+CE                    61.6        63.1   64.9   66.3   67.6            74.8         78.0   77.0             69.2
CNN+SORD                  63.2        64.8   66.3   67.8   68.9            73.0         76.8   75.8             69.6
Resnet-18+SORD            62.2        63.9   65.5   66.9   67.8            74.5         77.4   76.4             69.3
CNN+STN+SORD              61.0        62.6   63.8   64.8   65.6            78.4         82.2   81.4             70.0
CNN+Random Crop+SORD      61.8        63.0   64.2   65.1   65.9            71.9         74.6   73.5             67.5
CNN+Reg-STN+SORD (Ours)   65.1        66.7   68.3   69.5   70.3            75.4         78.4   77.5             71.4

Table II reports the averages and standard deviations of these metrics over the five folds of the cross validation procedure. We compare our video-level predictor with two standard aggregation methods, max argmax and argmax mean. The former implements the hard rule described in Section IV-B: it labels each frame with the most probable score according to the frame-level predictor, and takes the maximal score along the video. The latter averages frame-level predictions over the video and returns the score with the maximal average. The proposed method outperforms both baselines in terms of F1-score, precision and recall.

TABLE II: Mean and standard deviation of weighted F1 score, precision and recall computed over the five cross validation folds, for the proposed video-based classification method and baselines.

Method        F1 (%)    Precision (%)   Recall (%)
max argmax    46 ± 21   55 ± 27         49 ± 18
argmax mean   51 ± 12   56 ± 19         53 ± 09
uninorms      61 ± 12   70 ± 19         60 ± 07
Table III shows confusion matrices for the three methods, obtained by concatenating the predictions for all folds. As expected, the max argmax hard rule is strongly biased towards predicting the highest score, resulting in bad performance on all other scores. On the other hand, the argmax mean baseline has the best performance in predicting score zero, but performs poorly on the other scores (under-predicting scores one and three and over-predicting score two). The uninorm-based aggregation is more balanced, outperforming each of the baselines on three out of four scores.

TABLE III: Confusion matrices (%) for the proposed video-based classification method and baselines. Rows are ground-truth scores, columns are predicted scores.

      max argmax         argmax mean        uninorms
      0   1   2   3      0   1   2   3      0   1   2   3
 0    7   9   7   3      16  10  0   0      10  16  0   0
 1    0   3   12  5      3   9   9   0      2   17  2   0
 2    0   0   17  9      0   3   21  2      0   5   16  5
 3    0   0   5   22     0   2   17  9      0   5   5   17

C. Semantic Segmentation

Fig. 5 shows several illustrative examples of semantic segmentation results of our ensemble network, along with their ground-truth annotations. A quantitative assessment and comparison of the segmentation performance of the U-Net, U-Net++, Deeplabv3+, and ensemble models are provided in Table IV. We observe that using on-line augmentation of images and annotations in combination with model ensembling yields a strong performance gain over a baseline U-Net, increasing the Dice coefficient from 0.64 to 0.75 for the union of COVID-19 markers. The ensemble model yields a categorical Dice score of 0.65 (mean across the segmentations for scores 0, 2 and 3). This metric was 0.47 for our baseline U-Net.
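For reference, a small sketch of the Dice metrics reported here, under the assumption that scores 1–3 constitute the COVID-19-related classes for the union Dice (the paper does not spell out this grouping).

```python
import numpy as np

def dice(pred_mask, true_mask, eps=1e-8):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred_mask, true_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + true_mask.sum() + eps)

def categorical_dice(pred, target, classes=(0, 2, 3)):
    """Mean Dice over the listed score classes (score 1 excluded,
    mirroring Table IV)."""
    return np.mean([dice(pred == c, target == c) for c in classes])

def covid_union_dice(pred, target, covid_scores=(1, 2, 3)):
    """Dice for the union of the assumed COVID-19-related scores."""
    return dice(np.isin(pred, covid_scores), np.isin(target, covid_scores))
```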
In Fig. 6 we provide a visualization of uncertainty in the predicted segmentations for two example images, by plotting the pixel-wise standard deviation yielded by MC dropout across 40 samples. Arrows in (A) indicate a region displaying COVID-19 markers for which ambiguity in the exact shape and extent is well reflected in the pixel-level uncertainty. Arrows in (B) indicate a seemingly false-positive region which was assessed as a high-grade COVID-19 marker by the deep network, yet not annotated as such. Interestingly, the network output was retrospectively judged as a true positive by the annotators, showing an area of hyperechogenic lung below the pleural surface [12], which characterizes a high permeability and an advanced disease state.

VI. DISCUSSION AND CONCLUSIONS

A. Frame-based score prediction evaluation

In Table I we ablate the contribution of the building blocks of our model for frame-based prediction. The replacement of the traditional cross-entropy (CE) with the SORD loss for ordinal regression clearly improves the performance. On the other hand, we found that the addition of the STN alone leads to a drop in the F1-score, because of the additional trainable parameters (as many as the CNN) introduced by the STN and the absence of a regularisation. However, the STN comes with two positive side effects: (i) it provides weakly supervised localizations without using fine-grained supervision; and (ii) it enables the use of consistency-based regularization, which is very beneficial in terms of performance. Our full model, which embeds the STN module, the SORD loss and the proposed consistency loss, achieves an F1-score of 65.1, outperforming all the baselines by a large margin. To further investigate whether the boost comes from the consistency term or the STN, we conducted an experiment using two sufficiently overlapping random crops and enforced the consistency loss between the two. Unsurprisingly, the F1-score for CNN+Random Crop+SORD stays well below that of our proposed method. We hypothesize that the consistency loss is only useful when the crops cover the area of the artefact.



Fig. 5: Four examples of B-mode input image frames (first column), their annotations (second column) including COVID-19 biomarkers (moderate / score 2: orange; severe / score 3: red) and signs of healthy lung (blue). The corresponding semantic segmentations and contours of COVID-19 markers by deep learning are given in the third and fourth column, respectively.

In contrast to the previous work [18], we found that the use degree of agreement between doctors (A = 3 vs. A = 4).
of more complex architectures like ResNet18 does not bring This is probably caused by the fact that imposing stronger
any positive improvement in performance. We reason that this agreement makes the test set smaller, yielding less statistically
is due to the low intrinsic complexity of the task. Conversely, significant results.
we suggest that most of the confusion of the model is caused Finally, we visualize the crops yielded by the STN and
by the noise in both frames and labels. In turn, we believe illustrate them in Fig. 4. We considered two kind of affine
that this noisiness is due to the subjectivity of the annotation transformations modeled by the Reg-STN in our experiments:
and the presence of ambiguous frames. In fact, frame labels i) learnable translation with fixed scaling; and ii) learnable
do not take into account that multiple artifacts can be present translation, scaling and rotation. We compute an F1-score of
at a time. This happens mostly when the sensor is moving, 65.9 when the STN models a learnable translation with fixed
causing a transitions from one score to another. In order to scaling. In both the cases the STN produces highly localized
highlight the concentration of the errors of our models around crops that mostly hinges around the area of pathological
transitions, we devise the experimental Setting 2, as shown in artifact. Interestingly, for both convex and linear sensors acqui-
Table I, in which we drop frames close to transition points. The sitions, the Reg-STN learns to ignore the area above the pleura,
results in the Table I show that removing ambiguous frames which is essentially irrelevant for the prediction of a frame.
from the test set dramatically reduces the amount of errors of This validates the usefulness of incorporating STN blocks
the model, regardless of the architecture, empirically validating in our frame-based predictor. We also report the heatmaps
our hypothesis about noisy labeling. produced by GradCam [44] for the same images. Qualitatively,
In Table I we also measured how the subjectivity of the annotated scores affects the performance of the model in Setting 3 and discovered that when there is a strong agreement among doctors (more than 2 doctors agree on a score) our network performs notably better, increasing the F1-score by almost 3 points. This suggests that some videos are intrinsically more ambiguous than others. In addition, we found that, on this matter, the network seems to behave similarly to human annotators, which is a desirable property. Moreover, although it seems counter-intuitive, our experiments point out that the performance of the model does not change much beyond a certain degree of agreement between doctors (A = 3 vs. A = 4). This is probably caused by the fact that imposing stronger agreement makes the test set smaller, yielding less statistically significant results.

Finally, we visualize the crops yielded by the STN and illustrate them in Fig. 4. We considered two kinds of affine transformations modeled by the Reg-STN in our experiments: i) learnable translation with fixed scaling; and ii) learnable translation, scaling and rotation. We compute an F1-score of 65.9 when the STN models a learnable translation with fixed scaling. In both cases the STN produces highly localized crops that mostly center on the areas of pathological artifacts. Interestingly, for both convex and linear sensor acquisitions, the Reg-STN learns to ignore the area above the pleura, which is essentially irrelevant for the prediction of a frame. This validates the usefulness of incorporating STN blocks in our frame-based predictor. We also report the heatmaps produced by GradCam [44] for the same images. Qualitatively, GradCam does not always focus on the relevant areas of the image. For example, for the linear probe image displayed in the figure, attention is given to the intercostal tissue layers and not to the areas of the image below the pleural line, which are the areas of interest for the analysis of LUS data. Also, we noticed that the quality of the heatmaps deteriorates when the prediction of the network is incorrect. Moreover, we found it hard to produce reasonable boxes from the heatmaps produced by GradCam, since doing so requires thresholding. For these reasons, we believe that the STN produces superior localizations.
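To make variant i) concrete, the following is a minimal sketch of a spatial transformer [19] whose affine matrix is restricted to a learnable translation with a fixed scale. The localization head, the scale value, and all names are illustrative assumptions, not the exact Reg-STN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationSTN(nn.Module):
    """Spatial transformer restricted to a learnable translation with a
    fixed scale: the network chooses *where* to crop, not how large the
    crop is. A sketch for illustration; the localization head and the
    fixed scale of 0.5 are assumptions."""

    def __init__(self, scale=0.5):
        super().__init__()
        self.scale = scale
        # Tiny localization network regressing the 2D crop offset (tx, ty).
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2), nn.Tanh(),  # offsets bounded to [-1, 1]
        )

    def forward(self, x):
        t = self.loc(x)                           # (B, 2) translation
        theta = torch.zeros(x.size(0), 2, 3, device=x.device)
        theta[:, 0, 0] = self.scale               # fixed zoom, no rotation
        theta[:, 1, 1] = self.scale
        theta[:, :, 2] = t                        # learnable (tx, ty)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # cropped view

# Example: crop a batch of single-channel LUS frames.
crops = TranslationSTN()(torch.randn(4, 1, 224, 224))
print(crops.shape)  # torch.Size([4, 1, 224, 224])
```

Because the scale is frozen, the network can only decide where to place a fixed-size crop, which encourages the highly localized behaviour described above.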


Fig. 6: Two examples (A, B) of class uncertainty in the segmentations, showing B-mode input image frames (first column), annotations (second
column), including COVID-19 biomarkers (moderate / score 2: orange, severe / score 3: red), the corresponding semantic segmentations by
deep learning (third column), and pixel-level COVID-19 class uncertainty by MC-dropout (fourth column).
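The per-pixel class uncertainty shown in Fig. 6 is estimated with MC-dropout [43]: dropout is kept active at inference and the spread over repeated stochastic passes is taken as uncertainty. A minimal sketch, where the sample count and helper name are assumptions:

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, image, n_samples=20):
    """Run repeated stochastic forward passes with dropout active and
    return the mean softmax map and a per-pixel standard deviation."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()  # keep dropout stochastic at test time
    probs = torch.stack(
        [torch.softmax(model(image), dim=1) for _ in range(n_samples)]
    )                                           # (S, B, classes, H, W)
    return probs.mean(dim=0), probs.std(dim=0)  # prediction, uncertainty
```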
TABLE IV: Segmentation performance in terms of the mean categorical accuracy across all pixels and scores (Acc.), the Dice coefficient for the union of COVID-19-related scores (Dice), and the mean Dice across scores 0, 2, and 3 (Cat. Dice). Score 1 was excluded due to the low number of annotations.

Network and training strategy               Acc.   Dice   Cat. Dice
(1) U-Net                                   0.94   0.64   0.47
(2) U-Net with on-line augmentation         0.95   0.69   0.55
(3) U-Net++ with on-line augmentation       0.97   0.72   0.64
(4) Deeplab v3+ with on-line augmentation   0.95   0.71   0.62
Ensemble of (2,3,4)                         0.96   0.75   0.65

B. Video-based score prediction evaluation

When trained on the annotations by the most expert clinician, video-based classification achieves an F1-score of 61%, a precision of 70%, and a recall of 60%. It is noticeable that these values are in line with the low inter-annotator agreement reported in Section III, which, together with the small number of samples with video-level annotations, can explain the high variance of the scores across folds. We expect that extending our relatively small set of video-level annotations will help counteract the labeling noise, increase model performance, and reduce its variance.
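As one concrete illustration of uninorm-based frame aggregation [22], [39], the sketch below uses the classical cross-ratio ("3-Pi") uninorm, whose neutral element is 0.5; this specific operator and the function name are assumptions for the example, not necessarily the learned aggregator of our experiments.

```python
import numpy as np

def uninorm_aggregate(frame_probs, eps=1e-6):
    """Aggregate per-frame probabilities of one class into a video-level
    score with the cross-ratio uninorm (neutral element 0.5): frames with
    p > 0.5 push the video score up, frames with p < 0.5 push it down,
    and frames near 0.5 leave it roughly unchanged."""
    p = np.clip(np.asarray(frame_probs, dtype=np.float64), eps, 1.0 - eps)
    num = np.prod(p)
    return num / (num + np.prod(1.0 - p))

# Example: a video whose frames mostly support the class.
print(uninorm_aggregate([0.9, 0.8, 0.55, 0.4]))  # ~0.97, i.e. > 0.5
```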
C. Segmentation evaluation

Our segmentation model is able to segment and discriminate between areas in B-mode LUS images that contain background, healthy markers, and (different stages of) COVID-19 biomarkers at a pixel level, reaching a pixel-wise accuracy of 96% and a binary Dice score of 0.75. Alongside these segmentations, we provide spatial uncertainty estimates that may be used to interpret model predictions.
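For reference, a minimal sketch of the metrics reported above and in Table IV, where we assume (for the example only) that the COVID-19-related scores are 2 and 3:

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted score matches the annotation."""
    return float((pred == target).mean())

def dice(pred_mask, target_mask, eps=1e-8):
    """Dice coefficient between two boolean masks: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(pred_mask, target_mask).sum()
    return float(2.0 * inter / (pred_mask.sum() + target_mask.sum() + eps))

def covid_dice(pred, target):
    """Binary Dice over the union of (assumed) COVID-19-related scores."""
    return dice(np.isin(pred, [2, 3]), np.isin(target, [2, 3]))
```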
Interestingly, and importantly, none of the highest (and most severe) score index annotations in the test set were missed by our model, judged by visual assessment of the resulting segmentations and by analysing the relative image-level intersections among the corresponding predicted and annotated regions. Moreover, we observed model predictions of COVID-19-positive regions that had, however, not been annotated as such. Fig. 6B shows a representative example of such a case. After re-evaluating some of these examples from the test set together with the annotators, we learned that the annotators were sometimes unsure whether to annotate a region as, e.g., score 2 or 3, and therefore decided that the marker was not clear enough to annotate the region at all, leading to the aforementioned discrepancy.

Segmentation performance and extraction of semantics could be further boosted by leveraging temporal structure among frames in a sequential model. Such models could learn from annotations across full videos, or through partial annotations and weak supervision. We leave these extensions of the present method to future work.

D. Limitations of the dataset

In order to unravel the specific characteristics of this disease, researchers needed to gather as much data from patients as possible. However, due to the enormous impact and rapid spread of infected patients, data gathering in an organized manner proved a challenge. As a result, the precise demographics of the patient group in our database remain unknown. Ideally, the dataset should be larger, more heterogeneous, and more balanced in terms of scores in order to be used for learning accurate deep models. In our case, the data has been collected in a limited set of hospitals, all of them located in Italy. Furthermore, the way data was collected is prone to certain biases: e.g., due to a high patient inflow, the most severe patients were prioritized and assessed, and ultrasound diagnosis was performed on patients with a high clinical suspicion. No subsequent testing was done, resulting in the possible inclusion of false positive cases.

Labels in the ICLUS-DB turned out to be noisy. Furthermore, for the frame-based classification and segmentation tasks the inter-operator agreement was not available. The noise can be indirectly observed in Table I, where using only a selection of training samples improves performance by almost 5%. Extending the database to obtain frame-level labels from multiple annotators would surely lead to more robust models. Finally, the included LUS videos with score 0 are all of healthy patients, and therefore we by no means claim to distinguish between COVID-19 patients and those with different pathologies.

E. Possible applications

A benefit of using ultrasound is the low risk of cross-infection when using a plastic disposable cover and individually packaged ultrasound gel on a portable handheld machine [45].


This is in contrast with the use of CT, for which rooms and systems need to be rigorously cleaned to prevent contamination (and preferably reserved for patients with a high COVID-19 suspicion). LUS can be performed inside the patient's room without need of transportation, making it a superior method for point-of-care assessment of patients.

Moreover, ultrasound renders real-time images and, combined with our DL methods, provides results instantly. It may also directly assist in the triage of patients: a first-look estimation of the disease's severity and of the urgency with which a patient needs to be addressed. In addition, low- and middle-income countries, where diagnosis through RT-PCR or CT may not always be available, can particularly benefit from low-cost ultrasound imaging as well [46]. However, a lack of training in the interpretation of these LUS images [47] could still limit its use in practice. Our proposed DL method may therefore facilitate ultrasound imaging in these countries.

REFERENCES

[1] WHO, "Laboratory testing strategy recommendations for COVID-19: Interim guidance," Tech. Rep., 2020. [Online]. Available: https://apps.who.int/iris/bitstream/handle/10665/331509/WHO-COVID-19-lab testing-2020.1-eng.pdf
[2] R. Niehus, P. M. D. Salazar, A. Taylor, and M. Lipsitch, "Quantifying bias of COVID-19 prevalence and severity estimates in Wuhan, China that depend on reported cases in international travelers," medRxiv, p. 2020.02.13.20022707, feb 2020.
[3] Y. Yang et al., "Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections," medRxiv, p. 2020.02.11.20021493, feb 2020.
[4] S. Salehi, A. Abedi, S. Balakrishnan, and A. Gholamrezanezhad, "Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients," Am J Roentgenol, pp. 1–7, mar 2020.
[5] A. Bernheim et al., "Chest CT Findings in Coronavirus Disease-19 (COVID-19): Relationship to Duration of Infection," Radiology, p. 200463, feb 2020. [Online]. Available: http://pubs.rsna.org/doi/10.1148/radiol.2020200463
[6] F. Mojoli, B. Bouhemad, S. Mongodi, and D. Lichtenstein, "Lung ultrasound for critically ill patients," pp. 701–714, mar 2019.
[7] R. Raheja, M. Brahmavar, D. Joshi, and D. Raman, "Application of Lung Ultrasound in Critical Care Setting: A Review," Cureus, vol. 11, no. 7, jul 2019.
[8] Y. Amatya, J. Rupp, F. M. Russell, J. Saunders, B. Bales, and D. R. House, "Diagnostic use of lung ultrasound compared to chest radiograph for suspected pneumonia in a resource-limited setting," International Journal of Emergency Medicine, vol. 11, no. 1, dec 2018.
[9] E. Poggiali et al., "Can Lung US Help Critical Care Clinicians in the Early Diagnosis of Novel Coronavirus (COVID-19) Pneumonia?" Radiology, p. 200847, mar 2020.
[10] Q. Y. Peng et al., "Findings of lung ultrasonography of novel corona virus pneumonia during the 2019–2020 epidemic," Intensive Care Medicine, no. 87, pp. 6–7, mar 2020.
[11] G. Soldati et al., "Is there a role for lung ultrasound during the covid-19 pandemic?" J Ultrasound Med, 2020.
[12] G. Soldati et al., "Proposal for international standardization of the use of lung ultrasound for COVID-19 patients; a simple, quantitative, reproducible method," J Ultrasound Med, 2020.
[13] K. Stefanidis et al., "Lung sonography and recruitment in patients with early acute respiratory distress syndrome: A pilot study," Critical Care, vol. 15, no. 4, p. R185, aug 2011.
[14] K. A. Stewart et al., "Trends in Ultrasound Use in Low and Middle Income Countries: A Systematic Review," International Journal of MCH and AIDS, vol. 9, no. 1, pp. 103–120, 2020.
[15] L. Tutino, G. Cianchi, F. Barbani, S. Batacchi, R. Cammelli, and A. Peris, "Time needed to achieve completeness and accuracy in bedside lung ultrasound reporting in Intensive Care Unit," SJTREM, vol. 18, no. 1, p. 44, aug 2010.
[16] R. J. van Sloun, R. Cohen, and Y. C. Eldar, "Deep learning in ultrasound imaging," Proceedings of the IEEE, vol. 108, no. 1, pp. 11–29, jul 2019. [Online]. Available: http://arxiv.org/abs/1907.02994
[17] R. J. van Sloun and L. Demi, "Localizing b-lines in lung ultrasonography by weakly-supervised deep learning, in-vivo results," J-BHI, 2019.
[18] G. Soldati et al., "Towards computer aided lung ultrasound imaging for the management of patients affected by covid-19," under submission.
[19] M. Jaderberg et al., "Spatial transformer networks," in NIPS, 2015.
[20] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci, "Unsupervised domain adaptation using feature-whitening and consensus loss," in CVPR, 2019.
[21] R. Diaz and A. Marathe, "Soft labels for ordinal regression," in CVPR, 2019.
[22] V. Melnikov and E. Hüllermeier, "Learning to aggregate using uninorms," in ECML, 2016.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in MICCAI, 2015.
[24] P. Rajpurkar et al., "Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning," arXiv preprint arXiv:1711.05225, 2017.
[25] D. Dong, Z. Tang, S. Wang, H. Hui, L. Gong, Y. Lu, Z. Xue, H. Liao, F. Chen, F. Yang et al., "The role of imaging in the detection and management of covid-19: a review," IEEE Reviews in Biomedical Engineering, 2020.
[26] F. Shi, J. Wang, J. Shi, Z. Wu, Q. Wang, Z. Tang, K. He, Y. Shi, and D. Shen, "Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19," IEEE Reviews in Biomedical Engineering, 2020.
[27] J. Chen et al., "Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study," medRxiv, 2020.
[28] S. Wang et al., "A deep learning algorithm using ct images to screen for corona virus disease (covid-19)," medRxiv, 2020.
[29] X. Xu et al., "Deep learning system to screen coronavirus disease 2019 pneumonia," arXiv preprint arXiv:2002.09334, 2020.
[30] S. Liu et al., "Deep learning in medical ultrasound analysis: a review," Engineering, 2019.
[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in CVPR, 2016.
[32] G. Soldati et al., "Simple, Safe, Same: Lung Ultrasound for COVID-19 (LUSCOVID19)," ClinicalTrials.gov Identifier: NCT04322487, 2020.
[33] G. Soldati, M. Demi, R. Inchingolo, A. Smargiassi, and L. Demi, "On the physical basis of pulmonary sonographic interstitial syndrome," J Ultrasound Med, vol. 35, no. 10, pp. 2075–2086, oct 2016.
[34] K. Wada, "labelme: Image Polygonal Annotation with Python," https://github.com/wkentaro/labelme, 2016.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[36] M. Sajjadi, M. Javanmardi, and T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in NIPS, 2016.
[37] C. Winship and R. D. Mare, "Regression models with ordinal variables," American Sociological Review, pp. 512–525, 1984.
[38] K. Crammer and Y. Singer, "Pranking with ranking," in NIPS, 2002.
[39] R. R. Yager and A. Rybalov, "Uninorm aggregation operators," Fuzzy Sets Syst., vol. 80, no. 1, pp. 111–120, may 1996.
[40] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "Unet++: A nested u-net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[41] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in ECCV, 2018, pp. 801–818.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[43] Y. Gal and Z. Ghahramani, "Dropout as a bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016.
[44] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in ICCV, 2017, pp. 618–626.
[45] J. C.-H. Cheung and K. N. Lam, "POCUS in COVID-19: pearls and pitfalls," Tech. Rep. 0, apr 2020.
[46] S. Sippel, K. Muruganandan, A. Levine, and S. Shah, "Review article: Use of ultrasound in the developing world," International Journal of Emergency Medicine, vol. 4, no. 1, p. 72, dec 2011.
[47] S. Shah, B. A. Bellows, A. A. Adedipe, J. E. Totten, B. H. Backlund, and D. Sajed, "Perceived barriers in the use of ultrasound in developing countries," Critical Ultrasound Journal, vol. 7, no. 1, dec 2015.
