net/publication/329903092
All content following this page was uploaded by Zhibin Liao on 13 March 2019.
Abstract—Accurate detection of end-systolic (ES) and end-diastolic (ED) frames in an echocardiographic cine series can be a difficult but necessary pre-processing step for the development of automatic systems to measure cardiac parameters. The detection task is challenging due to variations in cardiac anatomy and heart rate often associated with pathological conditions. We formulate this problem as a regression problem, and propose several deep learning-based architectures that minimize a novel global extrema structured loss function to localize the ED and ES frames. The proposed architectures integrate convolution neural networks (CNNs) based image feature extraction model and recurrent neural networks (RNNs) to model temporal dependencies between each frame in a sequence. We explore two CNN architectures: DenseNet and ResNet, and four RNN architectures: long short term memory (LSTM), bi-directional LSTM (Bi-LSTM), gated recurrent unit (GRU), and Bi-GRU, and compare the performance of these models. The optimal deep learning model consisted of a DenseNet and GRU trained with the proposed loss function. On average, we achieved 0.20 and 1.43 frame mismatch for the ED and ES frames, respectively, which are within reported inter-observer variability for manual detection of these frames.

Index Terms—Deep Residual Neural Networks, Densely-connected Networks, recurrent Neural Networks, Long Short Term Memory, Gated Recurrent Unit, Bi-directional RNN, Echocardiography, Cardiac Cycle Phase Detection.

∗ The authors have contributed equally to this work.
† The corresponding authors have contributed equally to the manuscript (emails: purang@ece.ubc.ca, t.tsang@ubc.ca).
F. Taheri Dezaki, Z. Liao, N. Dhungel, A. Abdi and D. Behnami are with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
C. Luong, H. Girgis, T. Tsang, and K. Gin are with Vancouver General Hospital Echocardiography Laboratory, Division of Cardiology, Department of Medicine, The University of British Columbia, Vancouver, BC V5Z 1M9, Canada.
R. Rohling is with the Department of Electrical and Computer Engineering and the Department of Mechanical Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
T. Tsang is the Director of the Vancouver General Hospital and University of British Columbia Echocardiography Laboratories, and Principal Investigator of the CIHR-NSERC grant supporting this work.
P. Abolmaesumi is Co-Principal Investigator for the grant supporting this work and is with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

I. INTRODUCTION

Cardiovascular disease is the leading cause of morbidity and premature death worldwide. Timely diagnosis is critical for early treatment and risk factor management. Important diagnostic tests include echocardiography (echo) [1], cardiac magnetic resonance imaging (MRI), and computed tomography (CT). While MRI and CT can provide high quality cardiac images, these modalities are not routinely used due to limited availability, prolonged acquisition time, and the use of radiation for CT scans. Furthermore, many cardiac implantable electronic devices are not considered MRI compatible, precluding a significant portion of individuals who may require cardiac imaging. Given these considerations, echo remains the first line modality for cardiac imaging and provides a non-invasive, low-cost, and widely available diagnostic tool for the evaluation of cardiac structure and function.

Identification of the end-systolic (ES) and end-diastolic (ED) phases from the echo cine series is a critical step in the quantification of cardiac chamber size and function. Several measurements and calculations that rely on the accurate labelling of the ES and ED frames include left ventricular (LV) dimension, LV ejection fraction (EF), stroke volume, wall thickness, and global longitudinal strain. Fig. 1 shows an example of ES and ED frames and the corresponding electrocardiogram tracings and LV volume. The ED frame is defined as the first frame following closure of the mitral valve (MVC), representing the largest LV volume. Likewise, the ES frame is the first frame after the closure of the aortic valve (AVC), representing the smallest LV volume [2]. Accurate localization of ED and ES frames influences the estimation of LV function, particularly in patients with global or regional wall motion abnormalities [3].

Conventionally, echocardiographers manually identify the ED and ES phases by visually inspecting each frame of the echo cine series for changes in the LV dimension and the left sided valves in relation to the electrocardiogram (ECG) tracing. This process can be time consuming and depending on the expertise of the interpreter, may result in measurement variability. Recently, Zolgharni et al. [4] demonstrated that the median disagreement between five sonographers for the identification of ED and ES phases is 3 frames.

ECG can be an appropriate method to approximate the ED and ES frames by detecting the onset of the QRS complex and the end of the T-wave, respectively [3] (Fig. 1). However, there are a number of shortcomings that can reduce the accuracy and practicality of this method. First, unconventional QRS morphology that is often encountered in patients with cardiomyopathy or regional wall motion abnormalities may result in unreliable detection of the ES and ED frames [3]. Furthermore, the application of ECG electrodes may be unde-
and DenseNet [13], which shows that DenseNet is more favorable than ResNet in order to obtain higher accuracy. Second, we compare the performance of the temporal component by testing LSTM and GRU, including their bi-directional variations. We find that these different RNN configurations perform similarly in the phase detection problem.

While the ground truth defined in Eq. (1) is designed to encode the aforementioned monotonic ventricular volume change, $L_{\text{mse}}$ only aims to reduce the mean of the label-prediction difference. Therefore, it is not constructed to sustain such monotonic characteristic between the predictions of consequent frames in each cardiac phase during the training. During the test phase, it is possible that the frames around ED and
Fig. 2: An overview of the deep learning framework architecture for the detection of ED and ES frames from cine series of
echocardiograms. The framework has three components: 1) a CNN module to generate per frame image features; 2) an RNN
module for capturing temporal dependencies; and 3) a regression module for computing the per frame regression scores. The
maximum and minimum prediction scores are determined as ED and ES frames, respectively.
ES frames obtain predictions that surpass the predictions of the actual ED and ES frames, causing inaccurate ED and ES frame localization. To alleviate this issue, Kong et al. [27] also proposed a structured loss to reinforce the monotonic characteristic during the training:

$$L_{\text{mono}} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{|T_n|}\sum_{t=2}^{|T_n|}\Big[\mathbb{1}(y_{(n,t)} > y_{(n,t-1)})\,\max(0,\ \tilde{y}_{(n,t-1)} - \tilde{y}_{(n,t)}) + \mathbb{1}(y_{(n,t)} < y_{(n,t-1)})\,\max(0,\ \tilde{y}_{(n,t)} - \tilde{y}_{(n,t-1)})\Big], \tag{4}$$

where $\mathbb{1}(\cdot)$ denotes the indicator function.

During the development of this work, we found that the monotonicity in Eq. (4) does not enforce the significance of the ED and ES frames w.r.t. the surrounding frames, and it is possible that a surrounding frame is misidentified as ED or ES frames for very small margin in a test cine series. In this sense, we argue that Eq. (4) is an in-direct surrogate of the inference objective in Eq. (2), i.e., looking for the global extrema in the cine series frame predictions. Therefore, we propose a global extrema (GE) loss function to substitute $L_{\text{mono}}$, which focuses on promoting the ED and ES frames to be the global extrema during the training phase. This is achieved by imposing a margin between the ED (or ES) frame prediction and the largest (or smallest) non-critical frame predictions:

$$L_{\text{ge}} = \frac{1}{N}\sum_{n=1}^{N}\Big[\max\big((\max(\kappa_n) + \gamma) - \tilde{y}_{(n,\sigma_{ED}(T))},\ 0\big) + \max\big(\tilde{y}_{(n,\sigma_{ES}(T))} - (\min(\kappa_n) - \gamma),\ 0\big)\Big], \tag{5}$$

where $\tilde{y}_{(n,\sigma_{ED}(T))}$ and $\tilde{y}_{(n,\sigma_{ES}(T))}$ are the predictions of the true index of ED and ES frames, respectively, $\kappa_n = \{\tilde{y}_{(n,\sigma_{NC}(T))}\}$ represents the subset of predictions for the non-critical (NC) frames, and γ = 0.025 is a user-defined margin parameter.

We give an example of how $L_{\text{mono}}$ and $L_{\text{ge}}$ behave in Fig. 3. In Fig. 3-(a), $L_{\text{mono}}$ penalizes the violation of monotonicity on 15 predictions, creating 11 loss components (shown in sky-blue colored lines with arrow-shaped ends), and to be averaged by 32 (the number of frames in the cardiac cycle). In this case, the normalization factor $|T_n| = 32$ in $L_{\text{mono}}$ may generate a gradient with small step size as the training progresses. The reason is that the number of loss components will be reduced as the violations are resolved, but $|T_n|$ remains at 32. This may not be optimal because: 1) the small gradient degrades the ability of the training to escape shallow local minima in the loss landscape; and 2) the training may take longer to converge as it also needs to solve many counts of indirectly related monotonicity violations. On the other hand, in Fig. 3-(b), $L_{\text{ge}}$ only tries to optimize the four most relevant predictions by a relatively large size gradient that is always summed by the two loss components; hence, normalization is not needed for $L_{\text{ge}}$. In Fig. 3-(c) and (d), we show a training case with both ED and ES frames correctly predicted, where $L_{\text{mono}}$ further reinforces the monotonic violations. However, this additional information does not directly help with the objective of the phase detection problem; rather it acts as a regularizer, thus it can produce a gradient that drives the training away from an optimal solution. On the other hand, $L_{\text{ge}}$ continues to promote the ground truth ED and ES frames, where these margins established during the training can tolerate a certain degree of erratic volumetric
[Fig. 3 panels (a)-(d): prediction score versus frame number, with ground truth and prediction curves.]
Fig. 3: An example comparison of Lmono (left column) and Lge (right column), for a case with 4 frames error in ED localization
and 5 frames error in ES localization (top row), and a case with correct ED and ES localization (bottom row). The triangular
markers with the tip facing up (or down) indicate the ED (or ES) prediction and ground truth. The sky-blue colored lines
with arrow-shaped ends in (a) and (b) indicate individual loss components in Lmono and Lge , respectively. The transparent blue
boxes in (c) and (d) indicate the loss specific regions of interest. This figure is best viewed electronically for the details.
estimation of the surrounding frames, thus helping with the generalization ability of the model.

Finally, the training objective can be represented as:

$$L_{\text{total}} = (1 - \alpha)\,L_{\text{mse}} + \alpha\,L_{\text{struct}}, \tag{6}$$

where α is used to weigh the importance of the loss terms, and $L_{\text{struct}}$ is either $L_{\text{mono}}$ or $L_{\text{ge}}$ in our experiment.

C. Deep Learning Framework

Our proposed $f_{\text{dnn}}$ framework is similar to the RNN-based sequential prediction model introduced in [11], [19]. This framework is a composition of three sub-modules: 1) a CNN module for the purpose of image feature extraction; 2) an RNN module for learning the temporal dependencies between cine series frames; and 3) a regression module that produces the final prediction. For clarity, we use a single cine series as an example in Sec. II-C. Fig. 2 illustrates the main components of the framework, and Fig. 4 depicts the tested modules in this work.

1) CNN Module for Image Feature Extraction: The first module of $f_{\text{dnn}}$ is a CNN-based image feature extraction model that generates per-frame features $\tilde{x}_t = f_{\text{feat}}(x_t)$.

ResNet module: One of the architectures that we explored for extracting the image features $\tilde{x}_t$ is the deep residual neural network (ResNet) [16]. ResNet consists of a stack of residual layers, where each residual layer adds the input of a computation block to its own output. An individual residual layer can be expressed as the following:

$$x_{(t,l)} = x_{(t,l-1)} + f_{\text{res}}(x_{(t,l-1)};\ \theta^{l}_{\text{feat}}), \tag{7}$$
Fig. 4: The CNN modules (left), RNN units (middle), and RNN structures (right) tested in this work. The details of the LSTM
and GRU units are specified in Fig. 5.
and the output image features $\tilde{x}_t$ of a ResNet are represented by $x_{(t,L)}$:

$$x_{(t,L)} = x_{(t,0)} + \sum_{l=1}^{L} f_{\text{res}}(x_{(t,l-1)};\ \theta^{l}_{\text{feat}}), \tag{8}$$

where $x_{(t,l)}$ are the input features to the $l \in \{1, \ldots, L-1\}$th residual layer (i.e., a computation block), $x_{(t,0)} = x_t$ is the input image, $f_{\text{res}}(\cdot)$ represents a customized computation block (the computation block in the original ResNet design has two stacks of three CNN units, in the order of a convolution layer [14], [28], followed by a Batch Normalization (BN) unit [29] and a rectified linear (ReLU) activation unit [16], [30]), and $\theta^{l}_{\text{feat}}$ denotes the collection of trainable model parameters in the $l$th computation block.

DenseNet module: Another CNN architecture we explored in this work is the DenseNet [13] deep learning architecture. In comparison to ResNet, the outputs from all preceding layers in a DenseNet layer are concatenated to be the input for a succeeding layer. The output image features $\tilde{x}_t$ of a DenseNet are represented by:

$$x_{(t,L)} = [x_{(t,0)},\ x_{(t,1)},\ \ldots,\ x_{(t,L-1)}], \tag{9}$$

and the intermediate layer outputs are computed as:

$$x_{(t,l)} = f_{\text{dense}}([x_{(t,0)},\ x_{(t,1)},\ \ldots,\ x_{(t,l-1)}];\ \theta^{l}_{\text{feat}}), \tag{10}$$

where $[\cdot]$ represents the concatenation operation, and $f_{\text{dense}}$ denotes the DenseNet customizable computation block (in the original design, it contains only one stack of convolution layer, BN, and ReLU units).

2) RNN Module For Capturing Temporal Dependencies: The second module is a temporal feature model that processes the entire set of image features with the use of a recurrent neural network (RNN), i.e., $h_t = f_{\text{rnn}}(\{\tilde{x}_t\}_{t=1}^{|T_n|};\ \theta_{\text{rnn}})$. We test on two common RNN units, namely the LSTM [20] and GRU [23] units.

LSTM module: The work-flow of a single LSTM unit is depicted in Fig. 5(a). An LSTM unit consists of a memory cell and three gates: input gate, output gate, and forget gate. The hidden state of an LSTM unit $h_t$ is computed by controlling the information flow through these gates:

$$\begin{aligned}
i_t &= s(W_{\tilde{x}i}\,\tilde{x}_t + W_{hi}\,h_{t-1});\\
f_t &= s(W_{\tilde{x}f}\,\tilde{x}_t + W_{hf}\,h_{t-1});\\
o_t &= s(W_{\tilde{x}o}\,\tilde{x}_t + W_{ho}\,h_{t-1});\\
\tilde{c}_t &= \tanh(W_{\tilde{x}g}\,\tilde{x}_t + W_{hg}\,h_{t-1});\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t;\\
h_t &= o_t \odot \tanh(c_t);
\end{aligned} \tag{11}$$

where $\odot$ denotes the element-wise product operation, $s(\cdot)$ represents the Sigmoid activation function, the variables $\{W_{\tilde{x}g}, W_{hg}\}$ represent the weight parameters to compute the candidate hidden state $\tilde{c}_t$, the variables $\{W_{\tilde{x}i}, W_{hi}\}$ represent the weight parameters to compute the input gate $i_t$ (which controls the influence of $\tilde{c}_t$ to the internal memory state $c_t$), the variables $\{W_{\tilde{x}f}, W_{hf}\}$ represent the weight parameters to compute the forget gate $f_t$ (which controls the mixture of the previous memory state $c_{t-1}$ and the current memory state $c_t$), and $\{W_{\tilde{x}o}, W_{ho}\}$ represent the weight parameters to compute
Fig. 5: (a) Graphic model of LSTM, where c and c̃ denote a memory cell and a candidate memory cell state, respectively, and
i, f and o denote the input gate, forget gate, and output gate. (b) Graphic model of GRU, where h and h̃ are the hidden state
and candidate state, respectively, and z and r denote the update gate and reset gate, respectively.
the output gate $o_t$ (which controls the exposure of the current state $c_t$ to external network). The bias terms in Eq. (11) are ignored for clarity.

GRU module: The work-flow of a GRU unit is shown in Fig. 5(b). The main difference between a GRU unit and an LSTM unit is the simplified gating mechanism. The hidden state of a GRU unit is computed as:

$$\begin{aligned}
z_t &= s(W_{\tilde{x}z}\,\tilde{x}_t + W_{hz}\,h_{t-1});\\
r_t &= s(W_{\tilde{x}r}\,\tilde{x}_t + W_{hr}\,h_{t-1});\\
\tilde{h}_t &= \tanh\big(W_{\tilde{x}g}\,\tilde{x}_t + W_{hg}\,(r_t \odot h_{t-1})\big);\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t;
\end{aligned} \tag{12}$$

where $z_t$ represents an update gate that decides a composition of previous state $h_{t-1}$ and candidate state $\tilde{h}_t$ to represent the hidden state $h_t$ of the unit (this can be thought of as a simplification of the input gate and forget gate in an LSTM unit), $\{W_{\tilde{x}z}, W_{hz}\}$ are the weight parameters associated with the update gate $z_t$, $\{W_{\tilde{x}g}, W_{hg}\}$ are the weight parameters to compute $\tilde{h}_t$, and $\{W_{\tilde{x}r}, W_{hr}\}$ are the weight parameters to compute the reset gate $r_t$ (which allows the candidate state computation to optionally drop the irrelevant past information, if any, allowing for a more compact representation [31]). The bias terms in Eq. (12) are also ignored for clarity. Note that the GRU unit uses two gates in contrast to the three-gate design of an LSTM unit, meaning that a GRU requires fewer model parameters and can be computed faster (given that the same number of units is used).

Bidirectional LSTM/GRU modules: A conventional RNN unit takes into account the past state (information) as a part of current state computation in each time step, while the bidirectional RNN variant also maintains a separate “backward” state on a secondary controller in addition to the “forward” state on the first controller. From an implementation point of view, the second controller reads the input sequence in reverse (i.e., the CNN computed image features in our case) to compute the backward state. The two controller outputs are combined by element-wise sum operation to represent the output of bidirectional RNN for each time step. In our experiment, we test the bidirectional variation on both LSTM and GRU units for a comprehensive comparison.

3) Regression Module: Finally, a regression model is used to produce the final prediction of each frame: $\tilde{y}_t = f_{\text{reg}}(\{h_t\}_{t=1}^{|T_n|};\ \theta_{\text{reg}})$. During the training of the framework, the parameters of the framework are updated by using the Back-propagation Through Time (BPTT) [32] method.

III. EXPERIMENTS

A. Dataset

The echocardiography dataset used in this work was collected from the picture archiving and communication system (PACS) server of the Vancouver General Hospital with ethics approval from the Institutional Medical Research Ethics Board in coordination with the Information Privacy Office. The collected echocardiography studies are archived data acquired between 2011 and 2015. The dataset consists of 3,087 patient studies. Each study is a 2D echo AP4 view cine series gathered from one patient and stored in DICOM format. These clinical echos included various pathological conditions, a variety of heart rates (i.e., from 47 to 104 beats per minute), and a variable number of frames (i.e., from 36 to 64). All identifiable patient information in the DICOM file was anonymized according to the conditions of the ethics approval. Each cine series contains frames of a complete cardiac cycle with variable number of frames, with a minimum of 29 frames and a maximum of 55 frames, and an average of 42 frames. All studies in this dataset were acquired using the same type of ultrasound machine (Philips iE33) and contained labels of the ES and ED frames recorded by the expert sonographer for clinical estimation of LV Ejection Fraction. Given that these studies were of appropriate quality for segmentation, we assume that this is a high quality dataset with adequate visualization of endocardial borders across the cardiac cycle.
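As a concrete reading of the GRU update in Eq. (12) and the element-wise sum used for the bidirectional variant in Sec. II-C, the following is a minimal NumPy sketch. It is not the authors' Lasagne implementation: the function names, the weight dictionary layout, and the random initialization are assumptions made only for illustration, and bias terms are omitted as in the text.

```python
import numpy as np

def sigmoid(a):
    """Logistic function s(.) used for the gates."""
    return 1.0 / (1.0 + np.exp(-a))

def make_weights(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights for the z, r and g paths."""
    rng = np.random.default_rng(seed)
    weights = {}
    for gate in "zrg":
        weights["x" + gate] = rng.normal(0.0, 0.1, (hidden_size, input_size))
        weights["h" + gate] = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
    return weights

def gru_step(x_t, h_prev, w):
    """One GRU update following Eq. (12), without bias terms."""
    z = sigmoid(w["xz"] @ x_t + w["hz"] @ h_prev)              # update gate
    r = sigmoid(w["xr"] @ x_t + w["hr"] @ h_prev)              # reset gate
    h_cand = np.tanh(w["xg"] @ x_t + w["hg"] @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand

def run_gru(features, w, hidden_size):
    """Run the GRU over per-frame features; returns one hidden state per frame."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in features:
        h = gru_step(x_t, h, w)
        states.append(h)
    return np.stack(states)

def run_bigru(features, w_fwd, w_bwd, hidden_size):
    """Bidirectional variant: the backward pass reads the sequence in reverse,
    and the two outputs are combined by element-wise sum."""
    fwd = run_gru(features, w_fwd, hidden_size)
    bwd = run_gru(features[::-1], w_bwd, hidden_size)[::-1]
    return fwd + bwd
```

Because each new state is a gated mix of the previous state and a tanh output, every forward state stays strictly inside (-1, 1) when the initial state is zero.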
[Fig. 6 panels (a) ED and (b) ES: test error µ versus initial learning rate (log scale) for the ResNet and DenseNet models with 2-GRU, Bi-GRU, 2-LSTM, and Bi-LSTM modules.]
Fig. 6: Deep learning architecture comparison shown by the error measurement µ on the test set. The tested initial learning rates range from 1e−4 to 1e−1 in powers of 10. Note that each configuration is offset slightly along the x-axis from its exact learning rate for legibility.
TABLE I: Deep learning architecture comparison shown by the error measurement µ on the test set. The lowest test error in
each comparison group is highlighted.
| CNN module | RNN module | No. of Param. | R2 | µED | µES |
|---|---|---|---|---|---|
| ResNet | 2-LSTM [33] | 1.18M | 0.92 | 0.78 ± 1.02 | 1.45 ± 1.28 |
| ResNet | Bi-LSTM | 1.10M | 0.93 | 0.70 ± 0.99 | 1.57 ± 1.35 |
| ResNet | 2-GRU | 1.12M | 0.92 | 0.76 ± 1.00 | 1.42 ± 1.29 |
| ResNet | Bi-GRU | 1.10M | 0.92 | 0.83 ± 1.17 | 1.47 ± 1.36 |
| DenseNet | 2-LSTM | 1.08M | 0.94 | 0.57 ± 0.88 | 1.34 ± 1.18 |
| DenseNet | Bi-LSTM | 1.19M | 0.93 | 0.66 ± 1.08 | 1.45 ± 1.31 |
| DenseNet | 2-GRU | 0.98M | 0.93 | 0.49 ± 0.78 | 1.36 ± 1.18 |
| DenseNet | Bi-GRU | 1.07M | 0.93 | 0.64 ± 1.06 | 1.33 ± 1.21 |
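The group-level comparisons that Table I supports can be checked with a few lines of arithmetic. The sketch below copies the mean frame errors from the table (standard deviations are left out) and computes the average ResNet-minus-DenseNet gap for each frame type, along with the DenseNet configuration with the lowest combined error. The variable and function names are illustrative only.

```python
# Mean frame errors (µED, µES) copied from Table I, keyed by RNN module.
resnet = {
    "2-LSTM": (0.78, 1.45),
    "Bi-LSTM": (0.70, 1.57),
    "2-GRU": (0.76, 1.42),
    "Bi-GRU": (0.83, 1.47),
}
densenet = {
    "2-LSTM": (0.57, 1.34),
    "Bi-LSTM": (0.66, 1.45),
    "2-GRU": (0.49, 1.36),
    "Bi-GRU": (0.64, 1.33),
}

def mean_gap(a, b, idx):
    """Average of a[k][idx] - b[k][idx] over the shared RNN modules."""
    gaps = [a[k][idx] - b[k][idx] for k in a]
    return sum(gaps) / len(gaps)

ed_gap = mean_gap(resnet, densenet, 0)  # average ED advantage of DenseNet
es_gap = mean_gap(resnet, densenet, 1)  # average ES advantage of DenseNet

# DenseNet configuration with the lowest combined µED + µES.
best_densenet = min(densenet, key=lambda k: sum(densenet[k]))

print(f"ED gap: {ed_gap:.4f}, ES gap: {es_gap:.4f}, best: {best_densenet}")
```

The gaps come out at roughly 0.18 frames for ED and 0.11 frames for ES in favour of DenseNet, and the 2-GRU variant has the lowest combined error among the DenseNet rows.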
B. Data Preparation and Experimental Setup

In the data preparation step, the ultrasound imaging beam in the raw DICOM file is cropped and then re-sized to 120 × 120 pixels with bi-cubic interpolation method. The prepared cine series are divided into three mutually exclusive sets: 60% as training set, 20% as validation set, and the remaining 20% as test set. The recorded sonographers’ labels (i.e., the frame index of ES and ED frames) are used to generate the regression ground truth value for each frame, according to Eq. (1).

All experiments were deployed on a PC with the following configuration: Intel Core i7-2600k 3.40GHz (8 cores), 8GB of RAM, and an NVIDIA GeForce GTX 980Ti Video Card. Lasagne deep learning library with Theano backend [34] was used to train and test the models.

The optimal initial learning rate was determined as 1e−1 by the experiment shown in Sec. IV-A. Throughout the experiments, the learning rate is reduced by a factor of 10 at the 31st epoch and again at the 61st epoch, where the maximum training epoch is 100. The weight decay method is used to regularize the model parameters, and is set to 1e−5.

As mentioned in Sec. II-C1, we consider ResNet and DenseNet as two candidates for spatial feature extraction module. The deployed ResNet and DenseNet are designed to have the same number of layers and similar number of model parameters.

ResNet: The ResNet feature module is an off-the-shelf Lasagne implementation that starts with a convolution layer of sixteen 3 × 3 filters with 2 × 2 stride, followed by 30 residual layers grouped into three meta blocks. The first residual layer of the second and the third meta blocks double the number of convolution filters from the previous meta block, and the rest of the residual layers within the meta block share the same number of filters. As mentioned in Sec. II-C1, each $f_{\text{res}}$ computation block contains two batch-normalized convolution layers; hence, the total number of convolution layers in the ResNet module is 61 (1 + 30 × 2). As for the residual skip-connection, we do not use the projection shortcut (i.e., a skip-connection link composed by a trainable CNN layer) since it will introduce extra model parameters. The batch-normalized convolution layers use 3 × 3 filter size, full zero padding, 1 × 1 stride, and ReLU nonlinearity. In addition, the weights of the convolution layers are initialized by the default Xavier Initialization method [35]. Finally, a 2 × 2 average pooling is used in-between the meta blocks, and a global average pooling
layer is used at the end of the last meta block.

DenseNet: The DenseNet feature module starts with a convolution layer of 16 3 × 3 filters with 2 × 2 stride (identical to the first layer of the ResNet module), followed by five dense-blocks with 12 batch-normalized convolution layers in each block, also resulting in a 61-layer model (1 + 5 × 12). The growth rate hyper-parameter is set to 6, meaning each $f_{\text{dense}}$ in a dense-block adds six more batch-normalized convolution layers to the block. The specification of the batch-normalized convolution layer is the same as the one used in the ResNet. Finally, a 2 × 2 average pooling layer is featured after each dense-block except the last one, where a global average pooling layer is used instead.

The extracted spatial features are then passed to the temporal module, which can be one of four options: 1) two-layer LSTM; 2) Bi-LSTM; 3) two-layer GRU; 4) Bi-GRU. The reason for using two-layer RNN is that a single Bi-RNN layer is essentially made from two paralleled RNN layers; therefore, compared to one-layer RNN, the number of parameters of a Bi-RNN layer is much closer to a two-layer RNN. Throughout our experiments, each RNN layer uses 128 units.

C. Evaluation Metrics

We use the R2 score as one of the evaluation methods for the regression performance. The R2 score is defined by

$$R^2 = 1 - \frac{\sum (y_{(n,t)} - \tilde{y}_{(n,t)})^2}{\sum (y_{(n,t)} - \bar{y})^2},$$

where $\bar{y}$ is the mean of true labels, and $\tilde{y}$ is the predicted label. The best possible R2 score is 1 and it can be negative as the model can be arbitrarily worse. Nevertheless, R2 itself is an indirect performance indicator of the ED and ES frames detection problem. To quantify the ED and ES frame detection performance, we compute the average error of the prediction as

$$\mu_e = \frac{1}{N}\sum_{i=1}^{N} \left| q_e^{i} - \tilde{q}_e^{i} \right|, \quad e \in \{ED, ES\},$$

where $q^{i}_{ED/ES}$ is the ground truth ED/ES frame in the $i$th echo cine, i.e., $q_{ED} = \sigma_{ED}(T)$ and $q_{ES} = \sigma_{ES}(T)$ (see Sec. II-A for the explanation of $\sigma(\cdot)$), and the predicted ED/ES frame is indicated by $\tilde{q}_{ED/ES}$ (see Sec. II-B for the respective definition).

IV. RESULT AND DISCUSSION

A. Model and Hyper-parameter Selection Experiment

Our first experiment is designed to determine the performance of the selected CNN and RNN modules on the phase detection problem. The performance of the architectures is evaluated at four initial learning rates ranging from 1e−4 to 1e−1 in powers of 10. We observe that by further increasing the learning rate, training convergence is hampered. This experiment explores the best architectural configuration with the use of the monotonic loss function $L_{\text{mono}}$ (see details in Sec. II-B), where the weighting parameter α in Eq. (6) is set to 0.3 for all the trained models in this experiment. The reference model of the experiment is the ResNet + 2-LSTM model, originally proposed in [33]. The combination of CNN modules, RNN modules, and learning rates results in 32 different models. Finally, the number of parameters of each model is within a 10% variation from 1.1 million.

The experiment result is shown in Fig. 6, and it is noticeable that the dominant factor for achieving low error is the learning rate compared to the architecture. In Table I, we show the performance of the examined architectures at learning rate 1e−1. It can be seen that the ED localization error is always lower than the ES localization error for all tested architectures. From the clinical point of view, the primary indicator to identify the ED frame is the closure of the mitral valve, which is clearly visible in the AP4 view, whereas the dominant indicator of the ES frame (i.e., aortic valve’s closure), is not visible in this view. We believe that the lack of the ES characteristic information in the AP4 view is likely the reason for the relatively larger deviation in the ES frame localization error.

By comparing the architectures that use the same type of RNN module but use either ResNet or DenseNet module, the DenseNet-based models are favourable as they result in 0.18 lower ED frame error and 0.11 lower ES frame error on average. With the use of the same CNN module, the LSTM-based and GRU-based modules produce comparable results, so the results do not support recommending a preferred module for this phase detection problem. Nevertheless, the observation is consistent with the findings in [21], [37] that when similar number of LSTMs and GRUs are used, the performance difference is minor. In addition, it can be observed that the two-layer and bi-directional variants of both RNN units yield similar test performance.

B. State-of-the-art Comparison

Based on the above results, we choose the DenseNet + 2-GRU architecture for testing the proposed $L_{\text{ge}}$ loss function because it has the lowest combined µED and µES error, and least number of model parameters.

From Table II, we can see that the proposed $L_{\text{ge}}$ loss function reduces the ED localization error by 0.29 frame on average. This improvement from $L_{\text{mono}}$ to $L_{\text{ge}}$ is statistically significant according to a t-test. As for the ES localization error, $L_{\text{ge}}$ is 0.09 frame on average behind $L_{\text{mono}}$. Nevertheless, the t-test does not reject the null hypothesis at 5% significance level, suggesting comparable ES performance. In this experiment, the α value for $L_{\text{ge}}$ is set to 0.3 to be comparable to $L_{\text{mono}}$, and the γ hyper-parameter of $L_{\text{ge}}$ is validated in the range between 0 and 0.8, where the optimal value is 0.025. We illustrate the sensitivity of $L_{\text{ge}}$ to γ in Fig. 7. It is observed that the ED error is more sensitive compared to the ES error with the change of γ.

In Fig. 8, we show the sample-wise frame error distribution computed for the respective $L_{\text{mono}}$ and $L_{\text{ge}}$ trained DenseNet + 2-GRU models. It can be observed that $L_{\text{mono}}$ loss has slight tendency towards a late prediction, while $L_{\text{ge}}$ loss is biased towards an earlier prediction for the ES frame. On the other hand, $L_{\text{ge}}$ loss shows more correct ED predictions compared to $L_{\text{mono}}$ loss.

For the completeness of the study, the performance of the $L_{\text{mono}}$ model has been examined across several cine series properties, i.e., frame rate, patient heart rate, and number of frames, but we did not observe clear functional relationship between the model performance and these properties.
TABLE II: A close comparison between the monotonic loss Lmono and the proposed global extrema loss Lge .
| Method | No. of Param. | µED | µES |
|---|---|---|---|
| TempReg-Net [27] | 5.37M | 0.91 ± 1.16 | 1.75 ± 1.51 |
| DMTRL-Net [36] | 0.77M | 0.65 ± 0.88 | 1.80 ± 1.55 |
| DenseNet + 2-GRU + Lmono [27] | 0.98M | 0.49 ± 0.78 | 1.34 ± 1.17 |
| DenseNet + 2-GRU + Lge (proposed) | 0.98M | 0.20 ± 0.67 | 1.43 ± 1.30 |
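For reference, the two structured losses compared in Table II can be written out for a single cine series. The sketch below follows Eq. (4), Eq. (5), and Eq. (6) in NumPy; it is not the authors' Theano code. It assumes that the non-critical set contains every frame other than the labelled ED and ES frames, and it leaves the average over the N series in a batch to the caller. Function names and default values other than γ = 0.025 and α = 0.3 are illustrative.

```python
import numpy as np

def mono_loss(y, y_hat):
    """Eq. (4) for one series: penalise predictions that move against the
    direction of the ground truth between consecutive frames."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    dy = np.diff(y)      # y[t] - y[t-1]
    dp = np.diff(y_hat)  # y_hat[t] - y_hat[t-1]
    rising = (dy > 0) * np.maximum(0.0, -dp)   # label rises, prediction falls
    falling = (dy < 0) * np.maximum(0.0, dp)   # label falls, prediction rises
    return float(np.sum(rising + falling) / len(y))  # normalised by |T_n|

def ge_loss(y_hat, ed, es, gamma=0.025):
    """Eq. (5) for one series: hinge margins that push the ED prediction above,
    and the ES prediction below, all non-critical predictions."""
    y_hat = np.asarray(y_hat, dtype=float)
    non_critical = np.ones(len(y_hat), dtype=bool)
    non_critical[[ed, es]] = False   # assumption: NC = all frames except ED and ES
    kappa = y_hat[non_critical]
    ed_term = max((kappa.max() + gamma) - y_hat[ed], 0.0)
    es_term = max(y_hat[es] - (kappa.min() - gamma), 0.0)
    return float(ed_term + es_term)

def total_loss(l_mse, l_struct, alpha=0.3):
    """Eq. (6): weighted sum of the regression and structured losses."""
    return (1.0 - alpha) * l_mse + alpha * l_struct
```

Note that `ge_loss` never has more than two active terms per series, whereas `mono_loss` sums up to |T_n| − 1 terms and divides by |T_n|, which is the normalization behaviour discussed around Fig. 3.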
[Fig. 8 panels (a) ES and (b) ED: histogram count versus frame error (−7 to 7) for the Lmono and Lge models.]
Fig. 8: Histogram plot comparison of test error distribution of the Lmono and Lge trained DenseNet + 2-GRU models listed as
the last two entries of Table II.
[Fig. 9 panels (a) ES and (b) ED: histogram count versus frame error (−9 to 9).]
Fig. 9: Histogram plot of frame error distribution of the Lmono and Lge trained DenseNet + 2-GRU models on the PLAX view
dataset. Examples of ES and ED PLAX view sample images are overlaid on the respective figures.
“Automatic detection of end-diastolic and end-systolic frames in 2d echocardiography,” Echocardiography, vol. 34, no. 7, pp. 956–967, 2017.
[5] N. Kachenoura, A. Delouche, A. Herment, F. Frouin, and B. Diebold, “Automatic detection of end systole within a sequence of left ventricular echocardiographic images using autocorrelation and mitral valve motion detection,” in IEEE International Conference on Engineering in Medicine and Biology Society. IEEE, 2007, pp. 4504–4507.
[6] A. Shalbaf, Z. AlizadehSani, and H. Behnam, “Echocardiography without electrocardiogram using nonlinear dimensionality reduction methods,” Journal of Medical Ultrasonics, vol. 42, no. 2, pp. 137–149, 2015.
[7] U. Barcaro, D. Moroni, and O. Salvetti, “Automatic computation of left ventricle ejection fraction from dynamic ultrasound images,” Pattern Recognition and Image Analysis, vol. 18, no. 2, p. 351, 2008.
[8] S. Darvishi, H. Behnam, M. Pouladian, and N. Samiei, “Measuring left ventricular volumes in two-dimensional echocardiography image sequence using level-set method for automatic detection of end-diastole and end-systole frames,” Research in Cardiovascular Medicine, vol. 2, no. 1, p. 39, 2013.
[9] A. A. Abboud, R. W. Rahmat, S. B. Kadiman, M. Z. B. Dimon, L. Nurliyana, M. I. Saripan, and H. H. Khaleel, “Automatic detection of the end-diastolic and end-systolic from 4d echocardiographic images,” Journal of Computer Science, vol. 11, no. 1, pp. 230–240, 2015.
[10] S. A. Aase, S. R. Snare, H. Dalen, A. Støylen, F. Orderud, and H. Torp, “Echocardiography without electrocardiogram,” European Heart Journal-Cardiovascular Imaging, vol. 12, no. 1, pp. 3–10, 2011.
[25] B. Kong, X. Wang, Z. Li, Q. Song, and S. Zhang, “Cancer metastasis detection via spatially structured deep network,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 236–248.
[26] W. Xue, I. B. Nachum, S. Pandey, J. Warrington, S. Leung, and S. Li, “Direct estimation of regional wall thicknesses via residual recurrent neural network,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 505–516.
[27] B. Kong, Y. Zhan, M. Shin, T. Denny, and S. Zhang, “Recognizing end-diastole and end-systole frames via deep temporal regression network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2016, pp. 264–272.
[28] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 1995, p. 3361.
[29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[30] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.
[31] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[32] P. J. Werbos, “Backpropagation through time: what it does and how to
[11] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
for sequence prediction with recurrent neural networks,” in Advances in [33] F. T. Dezaki, N. Dhungel, A. H. Abdi, C. Luong, T. Tsang, J. Jue,
Neural Information Processing Systems, 2015, pp. 1171–1179. K. Gin, D. Hawley, R. Rohling, and P. Abolmaesumi, “Deep residual
[12] A. Graves and J. Schmidhuber, “Framewise phoneme classification recurrent neural networks for characterisation of cardiac cycle phase
with bidirectional lstm and other neural network architectures,” Neural from echocardiograms,” in Deep Learning in Medical Image Analysis
Networks, vol. 18, no. 5, pp. 602–610, 2005. and Multimodal Learning for Clinical Decision Support. Springer,
2017, pp. 100–108.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
[34] S. Dieleman, J. Schlter, C. Raffel, E. Olson, S. K. Snderby, D. Nouri
connected convolutional networks.” in Proceedings of the IEEE Confer-
et al., “Lasagne: First release.” 2015.
ence on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2,
[35] X. Glorot and Y. Bengio, “Understanding the difficulty of training
2017, p. 3.
deep feedforward neural networks,” in Proceedings of the Thirteenth
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
International Conference on Artificial Intelligence and Statistics, 2010,
with deep convolutional neural networks,” in Advances in Neural Infor-
pp. 249–256.
mation Processing Systems, 2012, pp. 1097–1105.
[36] W. Xue, G. Brahm, S. Pandey, S. Leung, and S. Li, “Full left ventricle
[15] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi,
quantification via deep multitask relationships learning,” Medical Image
M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez,
Analysis, vol. 43, pp. 54–65, 2018.
“A survey on deep learning in medical image analysis,” Medical Image
[37] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and
Analysis, vol. 42, pp. 60–88, 2017.
rnn for natural language processing,” arXiv preprint arXiv:1702.01923,
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image 2017.
recognition,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016, pp. 770–778.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp.
6645–6649.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in Neural Information Processing
Systems, 2014, pp. 3104–3112.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A
neural image caption generator,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156–
3164.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[21] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling,” arXiv preprint
arXiv:1412.3555, 2014.
[22] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependen-
cies with gradient descent is difficult,” IEEE Transactions on Neural
Networks, vol. 5, no. 2, pp. 157–166, 1994.
[23] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn
encoder–decoder for statistical machine translation,” in Proceedings of
the learning on Empirical Methods in Natural Language Processing
(EMNLP). Association for Computational Linguistics, 2014, pp. 1724–
1734.
[24] H. Chen, Q. Dou, D. Ni, J.-Z. Cheng, J. Qin, S. Li, and P.-A. Heng,
“Automatic fetal ultrasound standard plane detection using knowledge
transferred recurrent neural networks,” in International learning on Med-
ical Image Computing and Computer-Assisted Intervention (MICCAI).
Springer, 2015, pp. 507–514.