Cardiac Phase Detection in Echocardiograms With Densely Gated Recurrent Neural Networks and Global Extrema Loss

Article  in  IEEE Transactions on Medical Imaging · December 2018


DOI: 10.1109/TMI.2018.2888807


All content following this page was uploaded by Zhibin Liao on 13 March 2019.


Cardiac Phase Detection in Echocardiograms with Densely Gated Recurrent Neural Networks and Global Extrema Loss
Fatemeh Taheri Dezaki∗ , Zhibin Liao∗ , Christina Luong∗ , Hany Girgis∗ , Neeraj Dhungel, Amir Abdi, Delaram
Behnami, Ken Gin, Robert Rohling, Purang Abolmaesumi† , Teresa Tsang†

Abstract—Accurate detection of end-systolic (ES) and end-diastolic (ED) frames in an echocardiographic cine series can be a difficult but necessary pre-processing step for the development of automatic systems to measure cardiac parameters. The detection task is challenging due to variations in cardiac anatomy and heart rate often associated with pathological conditions. We formulate this problem as a regression problem, and propose several deep learning-based architectures that minimize a novel global extrema structured loss function to localize the ED and ES frames. The proposed architectures integrate a convolutional neural network (CNN)-based image feature extraction model and recurrent neural networks (RNNs) to model temporal dependencies between the frames in a sequence. We explore two CNN architectures: DenseNet and ResNet, and four RNN architectures: long short term memory (LSTM), bi-directional LSTM (Bi-LSTM), gated recurrent unit (GRU), and Bi-GRU, and compare the performance of these models. The optimal deep learning model consisted of a DenseNet and GRU trained with the proposed loss function. On average, we achieved 0.20 and 1.43 frame mismatch for the ED and ES frames, respectively, which are within reported inter-observer variability for manual detection of these frames.

Index Terms—Deep Residual Neural Networks, Densely-connected Networks, Recurrent Neural Networks, Long Short Term Memory, Gated Recurrent Unit, Bi-directional RNN, Echocardiography, Cardiac Cycle Phase Detection.

∗ The authors have contributed equally to this work.
† The corresponding authors have contributed equally to the manuscript (emails: purang@ece.ubc.ca, t.tsang@ubc.ca).
F. Taheri Dezaki, Z. Liao, N. Dhungel, A. Abdi and D. Behnami are with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
C. Luong, H. Girgis, T. Tsang, and K. Gin are with Vancouver General Hospital Echocardiography Laboratory, Division of Cardiology, Department of Medicine, The University of British Columbia, Vancouver, BC V5Z 1M9, Canada.
R. Rohling is with the Department of Electrical and Computer Engineering and the Department of Mechanical Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
T. Tsang is the Director of the Vancouver General Hospital and University of British Columbia Echocardiography Laboratories, and Principal Investigator of the CIHR-NSERC grant supporting this work.
P. Abolmaesumi is Co-Principal Investigator for the grant supporting this work and is with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

I. INTRODUCTION

Cardiovascular disease is the leading cause of morbidity and premature death worldwide. Timely diagnosis is critical for early treatment and risk factor management. Important diagnostic tests include echocardiography (echo) [1], cardiac magnetic resonance imaging (MRI), and computed tomography (CT). While MRI and CT can provide high quality cardiac images, these modalities are not routinely used due to limited availability, prolonged acquisition time, and the use of radiation for CT scans. Furthermore, many cardiac implantable electronic devices are not considered MRI compatible, precluding a significant portion of individuals who may require cardiac imaging. Given these considerations, echo remains the first line modality for cardiac imaging and provides a non-invasive, low-cost, and widely available diagnostic tool for the evaluation of cardiac structure and function.

Identification of the end-systolic (ES) and end-diastolic (ED) phases from the echo cine series is a critical step in the quantification of cardiac chamber size and function. Several measurements and calculations that rely on the accurate labelling of the ES and ED frames include left ventricular (LV) dimension, LV ejection fraction (EF), stroke volume, wall thickness, and global longitudinal strain. Fig. 1 shows an example of ES and ED frames and the corresponding electrocardiogram tracings and LV volume. The ED frame is defined as the first frame following closure of the mitral valve (MVC), representing the largest LV volume. Likewise, the ES frame is the first frame after the closure of the aortic valve (AVC), representing the smallest LV volume [2]. Accurate localization of ED and ES frames influences the estimation of LV function, particularly in patients with global or regional wall motion abnormalities [3].

Fig. 1: End-diastolic (ED) and End-systolic (ES) frames with corresponding electrocardiogram (ECG) tracings, left ventricular (LV) volume, and the synthesized LV volumes computed by using Eq. (1) (Section II) with different η values. Note that the synthesized volumes are only estimated from the ED and ES labels, and they do not necessarily correspond to the actual LV function.

Conventionally, echocardiographers manually identify the ED and ES phases by visually inspecting each frame of the echo cine series for changes in the LV dimension and the left sided valves in relation to the electrocardiogram (ECG) tracing. This process can be time consuming and, depending on the expertise of the interpreter, may result in measurement variability. Recently, Zolgharni et al. [4] demonstrated that the median disagreement between five sonographers for the identification of ED and ES phases is 3 frames.

ECG can be an appropriate method to approximate the ED and ES frames by detecting the onset of the QRS complex and the end of the T-wave, respectively [3] (Fig. 1). However, there are a number of shortcomings that can reduce the accuracy and practicality of this method. First, unconventional QRS morphology that is often encountered in patients with cardiomyopathy or regional wall motion abnormalities may result in unreliable detection of the ES and ED frames [3]. Furthermore, the application of ECG electrodes may be undesirable in emergency settings when there is limited time and an urgent need for image-guided management.

With these considerations, there is a demand for an accurate and automated image-driven method to detect the ED and ES frames in an echo cine series.

Numerous attempts have been made to automatically detect the ED and ES frames from echo images. One of the earliest methods, proposed by Kachenoura et al. [5], compares the correlation coefficient of the intensities between the ED frame and the remaining frames, with the assumption that the ED frame has been identified manually. The frame that has minimal correlation with respect to the ED frame is regarded as the ES frame. The main disadvantage of this method is that it is a semi-automated method and requires manual detection of the ED frame. In another approach, Shalbaf et al. [6] applied a linear embedding algorithm to map the relationship between the frames of one cardiac cycle to a two-dimensional manifold. The ED and ES frames are then detected by computing the Euclidean distance between each of the frames in the projected space. However, the linear embedding approach does not consider the temporal relationship among the frames in the cine series and was also tested on a relatively low number of patients.

Other methods for ED and ES frame detection use either segmentation or speckle tracking-based approaches for tracking the changes in LV volume [7], [8], [9]. Segmentation-based approaches characterize the ED and ES frames with the assumption that the largest and smallest LV segmented cross-sections in a cardiac cycle correspond to the ED and ES frames. Due to the inherent noisy nature of echocardiography images and discontinuous edges, these methods are prone to significant errors. On the other hand, segmentation-based methods that depend on active contours and deformable models normally require initialization from user input. Aase et al. [10] proposed an algorithm for estimation of the ED frame from echo sequences using speckle tracking. In that method, they performed a statistical analysis of the intensities among the frames for estimating the cardiac cycle length, which was followed by the speckle tracking of a point near the mitral annulus using a deformable curve fitting approach for estimating the starting point of the cardiac cycle. One of the important considerations of this approach noted by the authors is that the algorithm was limited to heart rates between 45-90 bpm, whereas the heart rate may fall outside of this range in the presence of pathology.

Recently, fully end-to-end deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have experienced an enormous growth and have produced state-of-the-art results in most of the problems related to computer vision including medical image analysis [11], [12], [13], [14], [15].

The success of CNN-based methods lies in their ability to encode mid-level and high-level spatial features of single frames, whereas RNNs have excellent capability to encode the temporal dependencies between sequential data. It has been shown that increasing the depth of CNNs can substantially increase the classification accuracy [13], [16]. However, as pointed out in [16], degradation is one of the issues preventing efficient training of deep models. Variants of deep CNNs such as Residual nets (ResNet) [16] and Densely connected nets (DenseNet) [13] overcome this limitation by adding by-pass links to allow better propagation of the gradient from the loss layer to deep network layers. Specifically, ResNet uses skip-connections to forward the features from one layer to the next layer, whereas DenseNet utilizes all the information from preceding layers by concatenating their output features and passing them to the next layer as its input features.

Several RNN variants have shown promising results in problems related to speech recognition and sequence modelling containing variable input/output lengths [17], [18], [19]. Vanilla RNNs can be considered as a traditional feed-forward neural network containing loops, which can retain temporal information in order to model variable-length sequential data [20], [21]. However, empirical observations show that traditional vanilla RNNs can be difficult to train to capture the long-term dependencies in the sequential data because the gradients tend to vanish in such networks [22]. This problem can be mitigated by controlling the information flow within the RNN unit. Two representative RNN units are the Long Short Term Memory (LSTM) [20] unit and the Gated Recurrent Unit (GRU) [23].

A number of research approaches have been proposed to solve the medical image analysis problems with the use of CNN- and RNN-based deep learning models. Chen et al. [24] proposed to learn the spatio-temporal feature with a deep hierarchical CNN and an LSTM for automated fetal US plane detection problem. That model was jointly trained with
knowledge transfer across multi-tasks to address the common issue of insufficient training data in the medical image analysis domain. Kong et al. [25] integrated a CNN with an LSTM for automated metastasis detection of lymph nodes in Whole-slide Images (WSIs). In addition to estimating regional wall thicknesses (RWT) of LV myocardium with the use of a CNN network as a preliminary assessment, Xue et al. [26] proposed to refine the CNN assessment by adding another residual information passage to allow the temporal and spatial dependencies to be captured by two stacked Circle-RNNs.

In this paper, we aim to develop an accurate technique for automatic detection of ED and ES frames from echo cine series. In particular, we implement a deep learning framework (depicted in Fig. 2) consisting of a CNN image feature module for extracting image features, an RNN module for capturing the temporal dependencies, and a regression module to predict the ED and ES frames. With the use of a proposed structured loss function, we demonstrate that the same deep learning framework is able to significantly improve its prediction error from an earlier loss function proposed by Kong et al. [27] in detection of ED and ES frames in cardiac MRI. To the best of our knowledge, this paper is the first application of deep learning in echocardiography for detection of ED and ES frames.

Our contributions in this work are summarized as follows:
• We demonstrate that it is possible to accurately estimate the ES and ED frames from an echocardiography cine series without LV segmentation, other types of ventricular measurements, or the use of electrocardiograms (ECG).
• We propose a global extrema (GE) loss, which accurately interprets the objective of the phase detection problem as significance promotion of the ED and ES frames. Our loss function is able to help a DenseNet-based model to achieve substantial improvement for the ED localization task and remain comparable for the ES localization task, compared to the same model trained with the structured loss proposed in [27].
• We validate our proposed method on a large echocardiography cine series dataset consisting of 3,087 patient studies of apical four-chamber (AP4) view, which encompass a variety of heart rates and cardiac conditions. Since the loss function is not explicitly designed for a specific cardiac view, it inherently lends itself to a more generalizable solution. We demonstrate this property of the loss by presenting results on a separate dataset of 1,382 patient studies using the parasternal long-axis (PLAX) view.
• We fairly compare the performance of a set of network structures with a similar model complexity to maintain a similar computational efficiency. First, we compare two state-of-the-art deep learning models: ResNet [16] and DenseNet [13], which shows that DenseNet is more favorable than ResNet in order to obtain higher accuracy. Second, we compare the performance of the temporal component by testing LSTM and GRU, including their bi-directional variations. We find that these different RNN configurations perform similarly in the phase detection problem.

II. METHODOLOGY

In this section, we present a detailed explanation of our proposed framework and its main components for localizing the ED and ES frames from echo cine series.

A. Ground Truth Definition

Let us assume that a training set is represented by $\mathcal{D} = \{z_n\}_{n=1}^{N}$, i.e., a collection of $N$ cardiac echo cine series, where each cine series $z_n = \{x_t\}_{t=1}^{|T_n|}$ represents recorded heart motion throughout a cardiac cycle. Each echo frame $x_t$ is a 2D image paired with label $l_t$ from a discrete set $\mathcal{L} = \{-1, 0, 1\}$, where $-1$ indicates $x_t$ is the ES frame, $1$ indicates the ED frame, and $0$ denotes the remaining "non-critical" frames in $z_n$. Kong et al. [27] proposed to use the synthesized LV volume as the ground truth for labelling individual cardiac frames. The LV volume is estimated by a function that simulates monotonic decrease of volume during the systole phase and monotonic increase during the diastole phase. For each cine series $z_n$, the training ground truth label $y_t$ can be computed as:

$$
y_t =
\begin{cases}
\left(\dfrac{|t-\sigma_{ED}(T)|}{|\sigma_{ES}(T)-\sigma_{ED}(T)|}\right)^{\tau}, & \text{if } \sigma_{ED}(T) \le t < \sigma_{ES}(T) \\[2ex]
\eta\left(\dfrac{|t-\sigma_{ES}(T)|}{\sigma_{ES}(T)-\sigma_{ED}(T)+1}\right)^{\frac{1}{\tau}}, & \text{if } \sigma_{ES}(T) \le t < \sigma_{ED}(T)+1,
\end{cases}
\tag{1}
$$

where $\tau = 3$ and $\eta = 0.8$ are constants, $\sigma_{ED}(T)$ and $\sigma_{ES}(T)$ are selection functions that present the sequence index $t$ of the ED frame and ES frame for $l_t = \pm 1$, respectively, and $T = \{1, \dots, |T_n|\}$ is the collection of sequence indices. Note that $\eta$ is set to 1 in the original formulation [27], whereas we reduce $\eta$ in order to highlight the ED and ES frames. By using this ground truth definition, the learning problem is formulated as a regression problem. An example usage of Eq. (1) can be found in Fig. 1.

B. Loss Function

During the training phase, a deep learning framework $f_{\text{dnn}}$ is learned to solve the above regression problem, i.e., $\tilde{y}_{(n,t)} = f_{\text{dnn}}(x_{(n,t)}, \theta_{\text{dnn}})$. During the test phase, the inference of ED and ES frames in a cine series is computed by selecting the frame with the maximum and minimum prediction values:

$$
\tilde{q}_{ED} = \arg\max_{t}\{\tilde{y}_{(n,t)}\}_{1}^{|T_n|}, \quad \text{and} \quad
\tilde{q}_{ES} = \arg\min_{t}\{\tilde{y}_{(n,t)}\}_{1}^{|T_n|}.
\tag{2}
$$

The training loss function has two components. The first component is the mean squared error (MSE) loss function:

$$
L_{\text{mse}} = \sum_{n=1}^{N}\sum_{t=1}^{|T_n|} \| y_{(n,t)} - \tilde{y}_{(n,t)} \|^2.
\tag{3}
$$

While the ground truth defined in Eq. (1) is designed to encode the aforementioned monotonic ventricular volume change, $L_{\text{mse}}$ only aims to reduce the mean of the label-prediction difference. Therefore, it is not constructed to sustain such monotonic characteristic between the predictions of consecutive frames in each cardiac phase during the training. During the test phase, it is possible that the frames around ED and
Fig. 2: An overview of the deep learning framework architecture for the detection of ED and ES frames from cine series of
echocardiograms. The framework has three components: 1) a CNN module to generate per frame image features; 2) an RNN
module for capturing temporal dependencies; and 3) a regression module for computing the per frame regression scores. The
maximum and minimum prediction scores are determined as ED and ES frames, respectively.
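
As a rough illustration of the last step in Fig. 2, the sketch below picks the ED and ES frames from a vector of per-frame regression scores by taking the maximum and the minimum, as in Eq. (2). It is a minimal NumPy sketch and not the authors' implementation: the function names `locate_ed_es` and `frame_mismatch`, the 0-based frame indices, and the example scores are assumptions made for illustration only.

```python
import numpy as np


def locate_ed_es(scores):
    """Return (ed_index, es_index) for one cine series.

    scores: 1-D sequence with one regression output per frame.
    ED is taken as the frame with the maximum score and ES as the
    frame with the minimum score (0-based indices, an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 1 or scores.size == 0:
        raise ValueError("scores must be a non-empty 1-D sequence")
    return int(np.argmax(scores)), int(np.argmin(scores))


def frame_mismatch(pred_index, true_index):
    """Absolute distance, in frames, between a predicted and a labelled frame."""
    return abs(int(pred_index) - int(true_index))


# Hypothetical scores for an 8-frame cine series (illustrative values only).
example_scores = [0.95, 0.70, 0.40, 0.10, 0.05, 0.30, 0.60, 0.85]
ed, es = locate_ed_es(example_scores)
```

Because only the positions of the two extrema matter at test time, any training loss that keeps the labelled ED and ES frames as the global maximum and minimum gives the same result from this selection step.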

ES frames obtain predictions that surpass the predictions of the actual ED and ES frames, causing inaccurate ED and ES frame localization. To alleviate this issue, Kong et al. [27] also proposed a structured loss to reinforce the monotonic characteristic during the training:

$$
L_{\text{mono}} = \frac{1}{N}\frac{1}{|T_n|}\sum_{n=1}^{N}\sum_{t=2}^{|T_n|}\Big(
\mathbb{1}(y_{(n,t)} > y_{(n,t-1)})\max(0,\ \tilde{y}_{(n,t-1)} - \tilde{y}_{(n,t)})
+ \mathbb{1}(y_{(n,t)} < y_{(n,t-1)})\max(0,\ \tilde{y}_{(n,t)} - \tilde{y}_{(n,t-1)})
\Big),
\tag{4}
$$

where $\mathbb{1}(\cdot)$ denotes the indicator function.

During the development of this work, we found that the monotonicity in Eq. (4) does not enforce the significance of the ED and ES frames w.r.t. the surrounding frames, and it is possible that a surrounding frame is misidentified as the ED or ES frame by a very small margin in a test cine series. In this sense, we argue that Eq. (4) is an indirect surrogate of the inference objective in Eq. (2), i.e., looking for the global extrema in the cine series frame predictions. Therefore, we propose a global extrema (GE) loss function to substitute $L_{\text{mono}}$, which focuses on promoting the ED and ES frames to be the global extrema during the training phase. This is achieved by imposing a margin between the ED (or ES) frame prediction and the largest (or smallest) non-critical frame predictions:

$$
L_{\text{ge}} = \frac{1}{N}\sum_{n=1}^{N}\Big(
\max\big((\max(\kappa_n) + \gamma) - \tilde{y}_{(n,\sigma_{ED}(T))},\ 0\big)
+ \max\big(\tilde{y}_{(n,\sigma_{ES}(T))} - (\min(\kappa_n) - \gamma),\ 0\big)
\Big),
\tag{5}
$$

where $\tilde{y}_{(n,\sigma_{ED}(T))}$ and $\tilde{y}_{(n,\sigma_{ES}(T))}$ are the predictions at the true indices of the ED and ES frames, respectively, $\kappa_n = \{\tilde{y}_{(n,\sigma_{NC}(T))}\}$ represents the subset of predictions for the non-critical (NC) frames, and $\gamma = 0.025$ is a user-defined margin parameter.

We give an example of how $L_{\text{mono}}$ and $L_{\text{ge}}$ behave in Fig. 3. In Fig. 3-(a), $L_{\text{mono}}$ penalizes the violation of monotonicity on 15 predictions, creating 11 loss components (shown in sky-blue colored lines with arrow-shaped ends), which are then averaged over 32 (the number of frames in the cardiac cycle). In this case, the normalization factor $|T_n| = 32$ in $L_{\text{mono}}$ may generate a gradient with small step size as the training progresses. The reason is that the number of loss components will be reduced as the violations are resolved, but $|T_n|$ remains at 32. This may not be optimal because: 1) the small gradient degrades the ability of the training to escape shallow local minima in the loss landscape; and 2) the training may take longer to converge as it also needs to solve many counts of indirectly related monotonicity violations. On the other hand, in Fig. 3-(b), $L_{\text{ge}}$ only tries to optimize the four most relevant predictions with a relatively large gradient that is always the sum of the two loss components; hence, normalization is not needed for $L_{\text{ge}}$. In Fig. 3-(c) and (d), we show a training case with both ED and ES frames correctly predicted, where $L_{\text{mono}}$ continues to penalize the remaining monotonicity violations. However, this additional information does not directly help with the objective of the phase detection problem; rather it acts as a regularizer, thus it can produce a gradient that drives the training away from an optimal solution. On the other hand, $L_{\text{ge}}$ continues to promote the ground truth ED and ES frames, where these margins established during the training can tolerate a certain degree of erratic volumetric
[Fig. 3 image: panels (a)-(d), each plotting Ground Truth and Prediction curves; x-axis: Frame #; y-axis: Prediction]
Fig. 3: An example comparison of Lmono (left column) and Lge (right column), for a case with 4 frames error in ED localization
and 5 frames error in ES localization (top row), and a case with correct ED and ES localization (bottom row). The triangular
markers with the tip facing up (or down) indicate the ED (or ES) prediction and ground truth. The sky-blue colored lines
with arrow-shaped ends in (a) and (b) indicate individual loss components in Lmono and Lge , respectively. The transparent blue
boxes in (c) and (d) indicate the loss specific regions of interest. This figure is best viewed electronically for the details.
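
To make the comparison in Fig. 3 concrete, the following NumPy sketch evaluates both structured losses for a single cine series (N = 1), following Eqs. (4) and (5) as written in Sec. II-B, and combines a structured term with the MSE term as in Eq. (6). The function names, the 0-based frame indices, and the toy values are assumptions for illustration; the default margin of 0.025 is the value of γ stated in the text, and no value of α is assumed.

```python
import numpy as np


def mse_loss(y_true, y_pred):
    """Sum of squared label-prediction differences, Eq. (3), one series."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(diff ** 2))


def mono_loss(y_true, y_pred):
    """Monotonicity loss of Eq. (4) for one cine series (N = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    d_true = np.diff(y_true)  # y_t - y_{t-1} for t = 2..|T|
    d_pred = np.diff(y_pred)  # prediction differences for the same t
    # Ground truth rises but the prediction falls.
    rising = (d_true > 0) * np.maximum(0.0, -d_pred)
    # Ground truth falls but the prediction rises.
    falling = (d_true < 0) * np.maximum(0.0, d_pred)
    # Normalised by the number of frames |T|, as in Eq. (4).
    return float((rising + falling).sum() / y_true.size)


def ge_loss(y_pred, ed_index, es_index, gamma=0.025):
    """Global extrema loss of Eq. (5) for one cine series (N = 1)."""
    y_pred = np.asarray(y_pred, dtype=float)
    non_critical = np.ones(y_pred.size, dtype=bool)
    non_critical[[ed_index, es_index]] = False
    kappa = y_pred[non_critical]  # predictions of non-critical frames
    if kappa.size == 0:
        raise ValueError("need at least one non-critical frame")
    ed_term = max((kappa.max() + gamma) - y_pred[ed_index], 0.0)
    es_term = max(y_pred[es_index] - (kappa.min() - gamma), 0.0)
    return float(ed_term + es_term)


def total_loss(mse, struct, alpha):
    """Weighted objective of Eq. (6); alpha is left to the caller."""
    return (1.0 - alpha) * mse + alpha * struct
```

With toy labels `[1.0, 0.5, 0.0, 0.5, 0.9]` (ED at index 0, ES at index 2) and predictions `[0.9, 0.95, 0.1, 0.05, 0.8]`, both extrema are mislocalized by one frame: `mono_loss` sees two small violations divided by five frames, while `ge_loss` charges the full margin shortfall at the ED and ES frames without any normalization.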

estimation of the surrounding frames, thus helping with the generalization ability of the model.

Finally, the training objective can be represented as:

$$
L_{\text{total}} = (1 - \alpha)L_{\text{mse}} + \alpha L_{\text{struct}},
\tag{6}
$$

where $\alpha$ is used to weigh the importance of the loss terms, and $L_{\text{struct}}$ is either $L_{\text{mono}}$ or $L_{\text{ge}}$ in our experiment.

C. Deep Learning Framework

Our proposed $f_{\text{dnn}}$ framework is similar to the RNN-based sequential prediction model introduced in [11], [19]. This framework is a composition of three sub-modules: 1) a CNN module for the purpose of image feature extraction; 2) an RNN module for learning the temporal dependencies between cine series frames; and 3) a regression module that produces the final prediction. For clarity, we use a single cine series as an example in Sec. II-C. Fig. 2 illustrates the main components of the framework, and Fig. 4 depicts the tested modules in this work.

1) CNN Module for Image Feature Extraction: The first module of $f_{\text{dnn}}$ is a CNN-based image feature extraction model that generates per-frame features $\tilde{x}_t = f_{\text{feat}}(x_t)$.

ResNet module: One of the architectures that we explored for extracting the image features $\tilde{x}_t$ is the deep residual neural network (ResNet) [16]. ResNet consists of a stack of residual layers, where each residual layer adds the input of a computation block to its own output. An individual residual layer can be expressed as the following:

$$
x_{(t,l)} = x_{(t,l-1)} + f_{\text{res}}(x_{(t,l-1)};\ \theta^{l}_{\text{feat}}),
\tag{7}
$$
Fig. 4: The CNN modules (left), RNN units (middle), and RNN structures (right) tested in this work. The details of the LSTM
and GRU units are specified in Fig. 5.
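
The two CNN modules in Fig. 4 differ mainly in how the input of each layer is formed (Eqs. (7)-(10)). The sketch below contrasts the two connectivity patterns on plain 1-D feature vectors. It is only an illustration under simplifying assumptions: the names `residual_stack` and `dense_stack` are hypothetical, and the computation blocks are arbitrary Python callables rather than the convolution, BN, and ReLU stacks used in the actual networks.

```python
import numpy as np


def residual_stack(x0, blocks):
    """Residual connectivity, Eq. (7): x_l = x_{l-1} + f_l(x_{l-1}).

    blocks: list of callables; each must return an array with the same
    shape as its input so that the addition is valid.
    """
    x = np.asarray(x0, dtype=float)
    for block in blocks:
        x = x + block(x)
    return x


def dense_stack(x0, blocks):
    """Dense connectivity, Eqs. (9)-(10).

    Each block receives the concatenation of the input and all earlier
    block outputs; the final feature is the concatenation of everything.
    """
    features = [np.asarray(x0, dtype=float)]
    for block in blocks:
        out = block(np.concatenate(features))
        features.append(np.asarray(out, dtype=float))
    return np.concatenate(features)


# Toy blocks (assumptions, chosen so the arithmetic is easy to follow).
def halve(v):
    return 0.5 * v


def total(v):
    return np.array([v.sum()])


res_out = residual_stack([1.0, 2.0], [halve, halve])
dense_out = dense_stack([1.0, 2.0], [total, total])
```

In the residual stack the feature size stays fixed and each block only adds a correction, whereas in the dense stack the feature vector grows with every block because earlier outputs are kept and concatenated.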

and the output image features $\tilde{x}_t$ of a ResNet are represented by $x_{(t,L)}$:

$$
x_{(t,L)} = x_{(t,0)} + \sum_{l=1}^{L} f_{\text{res}}(x_{(t,l-1)};\ \theta^{l}_{\text{feat}}),
\tag{8}
$$

where $x_{(t,l)}$ are the input features to the $l$-th residual layer, $l \in \{1, \dots, L-1\}$ (i.e., a computation block), $x_{(t,0)} = x_t$ is the input image, $f_{\text{res}}(\cdot)$ represents a customized computation block (the computation block in the original ResNet design has two stacks of three CNN units, in the order of a convolution layer [14], [28], followed by a Batch Normalization (BN) unit [29] and a rectified linear (ReLU) activation unit [16], [30]), and $\theta^{l}_{\text{feat}}$ denotes the collection of trainable model parameters in the $l$-th computation block.

DenseNet module: Another CNN architecture we explored in this work is the DenseNet [13] deep learning architecture. In comparison to ResNet, the outputs from all preceding layers in a DenseNet layer are concatenated to be the input for a succeeding layer. The output image features $\tilde{x}_t$ of a DenseNet are represented by:

$$
x_{(t,L)} = [x_{(t,0)}, x_{(t,1)}, \dots, x_{(t,L-1)}],
\tag{9}
$$

and the intermediate layer outputs are computed as:

$$
x_{(t,l)} = f_{\text{dense}}([x_{(t,0)}, x_{(t,1)}, \dots, x_{(t,l-1)}];\ \theta^{l}_{\text{feat}}),
\tag{10}
$$

where $[\cdot]$ represents the concatenation operation, and $f_{\text{dense}}$ denotes the DenseNet customizable computation block (in the original design, it contains only one stack of convolution layer, BN, and ReLU units).

2) RNN Module For Capturing Temporal Dependencies: The second module is a temporal feature model that processes the entire set of image features with the use of a recurrent neural network (RNN), i.e., $h_t = f_{\text{rnn}}(\{\tilde{x}_t\}_{t=1}^{|T_n|};\ \theta_{\text{rnn}})$. We test two common RNN units, namely the LSTM [20] and GRU [23] units.

LSTM module: The work-flow of a single LSTM unit is depicted in Fig. 5(a). An LSTM unit consists of a memory cell and three gates: input gate, output gate, and forget gate. The hidden state of an LSTM unit $h_t$ is computed by controlling the information flow through these gates:

$$
\begin{aligned}
i_t &= s(W_{\tilde{x}i}\tilde{x}_t + W_{hi}h_{t-1});\\
f_t &= s(W_{\tilde{x}f}\tilde{x}_t + W_{hf}h_{t-1});\\
o_t &= s(W_{\tilde{x}o}\tilde{x}_t + W_{ho}h_{t-1});\\
\tilde{c}_t &= \tanh(W_{\tilde{x}g}\tilde{x}_t + W_{hg}h_{t-1});\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t;\\
h_t &= o_t \odot \tanh(c_t);
\end{aligned}
\tag{11}
$$

where $\odot$ denotes the element-wise product operation, $s(\cdot)$ represents the Sigmoid activation function, the variables $\{W_{\tilde{x}g}, W_{hg}\}$ represent the weight parameters to compute the candidate memory state $\tilde{c}_t$, the variables $\{W_{\tilde{x}i}, W_{hi}\}$ represent the weight parameters to compute the input gate $i_t$ (which controls the influence of $\tilde{c}_t$ on the internal memory state $c_t$), the variables $\{W_{\tilde{x}f}, W_{hf}\}$ represent the weight parameters to compute the forget gate $f_t$ (which controls the mixture of the previous memory state $c_{t-1}$ and the current memory state $c_t$), and $\{W_{\tilde{x}o}, W_{ho}\}$ represent the weight parameters to compute
[Fig. 5 image: unit diagrams for (a) LSTM and (b) GRU]

Fig. 5: (a) Graphic model of LSTM, where c and c̃ denote a memory cell and a candidate memory cell state, respectively, and
i, f and o denote the input gate, forget gate, and output gate. (b) Graphic model of GRU, where h and h̃ are the hidden state
and candidate state, respectively, and z and r denote the update gate and reset gate, respectively.
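
The units in Fig. 5 can be written out directly from Eqs. (11) and (12). The NumPy sketch below implements one LSTM step, one GRU step, and the bidirectional combination by element-wise sum described in Sec. II-C2. It is a minimal sketch under stated assumptions rather than the Lasagne code used in the experiments: biases are omitted as in the equations, the dictionary keys for the weight matrices (for example `"xz"` and `"hz"`) are a hypothetical layout, and initial states are set to zero.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x, h_prev, c_prev, W):
    """One LSTM update following Eq. (11), biases omitted."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev)       # input gate
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev)       # forget gate
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev)       # output gate
    c_cand = np.tanh(W["xg"] @ x + W["hg"] @ h_prev)  # candidate memory
    c = f * c_prev + i * c_cand
    h = o * np.tanh(c)
    return h, c


def gru_step(x, h_prev, W):
    """One GRU update following Eq. (12), biases omitted."""
    z = sigmoid(W["xz"] @ x + W["hz"] @ h_prev)       # update gate
    r = sigmoid(W["xr"] @ x + W["hr"] @ h_prev)       # reset gate
    h_cand = np.tanh(W["xg"] @ x + W["hg"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand


def bidirectional_gru(xs, W_fwd, W_bwd, hidden_size):
    """Forward and backward GRU passes, summed element-wise per time step."""

    def run(seq, W):
        h = np.zeros(hidden_size)  # zero initial state (an assumption)
        states = []
        for x in seq:
            h = gru_step(x, h, W)
            states.append(h)
        return states

    forward = run(xs, W_fwd)
    # The second controller reads the sequence in reverse; its outputs are
    # flipped back so both lists are aligned on the same time steps.
    backward = run(xs[::-1], W_bwd)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Each weight matrix applied to the input has shape (hidden size, input size) and each recurrent matrix has shape (hidden size, hidden size). The GRU step needs six matrices against eight for the LSTM step, which matches the remark in the text that a GRU uses fewer parameters for the same number of units.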

the output gate $o_t$ (which controls the exposure of the current state $c_t$ to the external network). The bias terms in Eq. (11) are ignored for clarity.

GRU module: The work-flow of a GRU unit is shown in Fig. 5(b). The main characteristic difference between a GRU unit and an LSTM unit is the simplified gating mechanism. The hidden state of a GRU unit is computed as:

$$
\begin{aligned}
z_t &= s(W_{\tilde{x}z}\tilde{x}_t + W_{hz}h_{t-1});\\
r_t &= s(W_{\tilde{x}r}\tilde{x}_t + W_{hr}h_{t-1});\\
\tilde{h}_t &= \tanh(W_{\tilde{x}g}\tilde{x}_t + W_{hg}(r_t \odot h_{t-1}));\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t;
\end{aligned}
\tag{12}
$$

where $z_t$ represents an update gate that decides a composition of previous state $h_{t-1}$ and candidate state $\tilde{h}_t$ to represent the hidden state $h_t$ of the unit (this can be thought of as a simplification of the input gate and forget gate in an LSTM unit), $\{W_{\tilde{x}z}, W_{hz}\}$ are the weight parameters associated with the update gate $z_t$, $\{W_{\tilde{x}g}, W_{hg}\}$ are the weight parameters to compute $\tilde{h}_t$, and $\{W_{\tilde{x}r}, W_{hr}\}$ are the weight parameters to compute the reset gate $r_t$ (which allows the candidate state computation to optionally drop the irrelevant past information, if any, allowing for a more compact representation [31]). The bias terms in Eq. (12) are also ignored for clarity. Note that the GRU unit uses two gates in contrast to the three-gate design of an LSTM unit, meaning that a GRU requires fewer model parameters and can be computed faster (given that the same number of units is used).

Bidirectional LSTM/GRU modules: A conventional RNN unit takes into account the past state (information) as a part of the current state computation in each time step, while the bidirectional RNN variant also maintains a separate "backward" state on a secondary controller in addition to the "forward" state on the first controller. From an implementation point of view, the second controller reads the input sequence in reverse (i.e., the CNN computed image features in our case) to compute the backward state. The two controller outputs are combined by an element-wise sum operation to represent the output of the bidirectional RNN for each time step. In our experiment, we test the bidirectional variation on both LSTM and GRU units for a comprehensive comparison.

3) Regression Module: Finally, a regression model is used to produce the final prediction of each frame: $\tilde{y}_t = f_{\text{reg}}(\{h_t\}_{t=1}^{|T_n|};\ \theta_{\text{reg}})$. During the training of the framework, the parameters of the framework are updated by using the Back-propagation Through Time (BPTT) [32] method.

III. EXPERIMENTS

A. Dataset

The echocardiography dataset used in this work was collected from the picture archiving and communication system (PACS) server of the Vancouver General Hospital with ethics approval from the Institutional Medical Research Ethics Board in coordination with the Information Privacy Office. The collected echocardiography studies are archived data acquired between 2011 and 2015. The dataset consists of 3,087 patient studies. Each study is a 2D echo AP4 view cine series gathered from one patient and stored in DICOM format. These clinical echos included various pathological conditions, a variety of heart rates (i.e., from 47 to 104 beats per minute), and a variable number of frames (i.e., from 36 to 64). All identifiable patient information in the DICOM file was anonymized according to the conditions of the ethics approval. Each cine series contains frames of a complete cardiac cycle with a variable number of frames, with a minimum of 29 frames, a maximum of 55 frames, and an average of 42 frames. All studies in this dataset were acquired using the same type of ultrasound machine (Philips iE33) and contained labels of the ES and ED frames recorded by the expert sonographer for clinical estimation of LV Ejection Fraction. Given that these studies were of appropriate quality for segmentation, we assume that this is a high quality dataset with adequate visualization of endocardial borders across the cardiac cycle.
[Fig. 6 image: panels (a) ED and (b) ES; x-axis: Initial Learning Rate (log scale); y-axis: error µ, ranging from 0 to 10; legend: resnet_bigru, resnet_2gru, resnet_bilstm, resnet_2lstm, densenet_bigru, densenet_2gru, densenet_bilstm, densenet_2lstm]
Fig. 6: Deep learning architecture comparison shown by the error measurement µ on the test set. The tested learning rates are
from 1e−4 up to 1e−1 in an interval of power of 10. Note that each configuration has its x-axis position deviated from the
exact value for better representational purpose.
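
The grid of configurations behind Fig. 6 can be enumerated as a quick check on the number of trained models. This is only an illustrative sketch: the configuration names follow the figure legend, while the variable names and the list layout are assumptions rather than the authors' code.

```python
from itertools import product

# Module choices and learning rates as given in the Fig. 6 legend and caption.
cnn_modules = ["resnet", "densenet"]
rnn_modules = ["2lstm", "bilstm", "2gru", "bigru"]
learning_rates = [10.0 ** p for p in range(-4, 0)]  # 1e-4, 1e-3, 1e-2, 1e-1

# One entry per trained model: (legend-style name, initial learning rate).
configs = [
    (f"{cnn}_{rnn}", lr)
    for cnn, rnn, lr in product(cnn_modules, rnn_modules, learning_rates)
]

print(len(configs))  # number of models in the grid
```

Two CNN modules, four RNN modules, and four learning rates give the 32 models mentioned in Sec. IV-A.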

TABLE I: Deep learning architecture comparison shown by the error measurement µ on the test set. The lowest test error in each comparison group is highlighted.

CNN module   RNN module    No. of Param.   R²     µED           µES
ResNet       2-LSTM [33]   1.18M           0.92   0.78 ± 1.02   1.45 ± 1.28
ResNet       Bi-LSTM       1.10M           0.93   0.70 ± 0.99   1.57 ± 1.35
ResNet       2-GRU         1.12M           0.92   0.76 ± 1.00   1.42 ± 1.29
ResNet       Bi-GRU        1.10M           0.92   0.83 ± 1.17   1.47 ± 1.36
DenseNet     2-LSTM        1.08M           0.94   0.57 ± 0.88   1.34 ± 1.18
DenseNet     Bi-LSTM       1.19M           0.93   0.66 ± 1.08   1.45 ± 1.31
DenseNet     2-GRU         0.98M           0.93   0.49 ± 0.78   1.36 ± 1.18
DenseNet     Bi-GRU        1.07M           0.93   0.64 ± 1.06   1.33 ± 1.21
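
The average gap between ResNet-based and DenseNet-based models quoted in Sec. IV-A (0.18 frames for ED, 0.11 frames for ES) can be reproduced from the mean errors in Table I. The snippet below is a small arithmetic check; the dictionary layout and the helper name `mean_gap` are illustrative assumptions.

```python
# (mu_ED, mu_ES) mean errors copied from Table I, keyed by RNN module.
resnet = {
    "2-LSTM": (0.78, 1.45),
    "Bi-LSTM": (0.70, 1.57),
    "2-GRU": (0.76, 1.42),
    "Bi-GRU": (0.83, 1.47),
}
densenet = {
    "2-LSTM": (0.57, 1.34),
    "Bi-LSTM": (0.66, 1.45),
    "2-GRU": (0.49, 1.36),
    "Bi-GRU": (0.64, 1.33),
}

def mean_gap(index):
    """Average (ResNet - DenseNet) error over models sharing the same RNN module."""
    gaps = [resnet[rnn][index] - densenet[rnn][index] for rnn in resnet]
    return sum(gaps) / len(gaps)

ed_gap = mean_gap(0)
es_gap = mean_gap(1)

# DenseNet variant with the lowest combined mu_ED + mu_ES.
best_densenet = min(densenet, key=lambda rnn: sum(densenet[rnn]))

print(f"ED gap: {ed_gap:.2f}, ES gap: {es_gap:.2f}, best DenseNet RNN: {best_densenet}")
```

The same table also shows that DenseNet + 2-GRU has the lowest combined error among the DenseNet variants, which is the selection criterion stated in Sec. IV-B.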

B. Data Preparation and Experimental Setup
In the data preparation step, the ultrasound imaging beam in the raw DICOM file is cropped and then resized to 120 × 120 pixels with bi-cubic interpolation. The prepared cine series are divided into three mutually exclusive sets: 60% as the training set, 20% as the validation set, and the remaining 20% as the test set. The recorded sonographers’ labels (i.e., the frame indices of the ES and ED frames) are used to generate the regression ground truth value for each frame, according to Eq. (1).

All experiments were run on a PC with the following configuration: Intel Core i7-2600k 3.40GHz (8 cores), 8GB of RAM, and an NVIDIA GeForce GTX 980Ti video card. The Lasagne deep learning library with Theano backend [34] was used to train and test the models.

The optimal initial learning rate was determined as 1e−1 by the experiment shown in Sec. IV-A. Throughout the experiments, the learning rate decays by a factor of 10 at the 31st and 61st epochs, and the maximum number of training epochs is 100. Weight decay is used to regularize the model parameters, with the coefficient set to 1e−5.

As mentioned in Sec. II-C1, we consider ResNet and DenseNet as two candidates for the spatial feature extraction module. The deployed ResNet and DenseNet are designed to have the same number of layers and a similar number of model parameters.

ResNet: The ResNet feature module is an off-the-shelf Lasagne implementation that starts with a convolution layer of sixteen 3 × 3 filters with 2 × 2 stride, followed by 30 residual layers grouped into three meta blocks. The first residual layer of the second and the third meta blocks doubles the number of convolution filters from the previous meta block, and the rest of the residual layers within the meta block share the same number of filters. As mentioned in Sec. II-C1, each fres computation block contains two batch-normalized convolution layers; hence, the total number of convolution layers in the ResNet module is 61. As for the residual skip-connection, we do not use the projection shortcut (i.e., a skip-connection link composed of a trainable CNN layer) since it would introduce extra model parameters. The batch-normalized convolution layers use 3 × 3 filter size, full zero padding, 1 × 1 stride, and ReLU nonlinearity. In addition, the weights of the convolution layers are initialized by the default Xavier initialization method [35]. Finally, a 2 × 2 average pooling is used in-between the meta blocks, and a global average pooling
layer is used at the end of the last meta block.

DenseNet: The DenseNet feature module starts with a convolution layer of sixteen 3 × 3 filters with 2 × 2 stride (identical to the first layer of the ResNet module), followed by five dense-blocks with 12 batch-normalized convolution layers in each block, also resulting in a 61-layer model. The growth rate hyper-parameter is set to 6, meaning each fdense in a dense-block adds six more batch-normalized convolution layers to the block. The specification of the batch-normalized convolution layer is the same as the one used in the ResNet. Finally, a 2 × 2 average pooling layer is placed after each dense-block except the last one, where a global average pooling layer is used instead.

The extracted spatial features are then passed to the temporal module, which can be any of four options: 1) two-layer LSTM; 2) Bi-LSTM; 3) two-layer GRU; 4) Bi-GRU. The reason for using a two-layer RNN is that a single Bi-RNN layer is essentially made from two parallel RNN layers; therefore, the number of parameters of a Bi-RNN layer is much closer to that of a two-layer RNN than to that of a one-layer RNN. Throughout our experiments, each RNN layer uses 128 units.

C. Evaluation Metrics
We use the $R^2$ score as one of the evaluation methods for the regression performance. The $R^2$ score is defined by

$R^2 = 1 - \frac{\sum (y_{(n,t)} - \tilde{y}_{(n,t)})^2}{\sum (y_{(n,t)} - \bar{y})^2}$,

where $\bar{y}$ is the mean of the true labels, and $\tilde{y}$ is the predicted label. The best possible $R^2$ score is 1, and it can be negative since the model can be arbitrarily worse. Nevertheless, $R^2$ itself is an indirect performance indicator for the ED and ES frame detection problem. To quantify the ED and ES frame detection performance, we compute the average error of the prediction as

$\mu_e = \frac{1}{N} \sum_{i=1}^{N} |q_e^i - \tilde{q}_e^i|$, $e \in \{\mathrm{ED}, \mathrm{ES}\}$,

where $q^i_{\mathrm{ED/ES}}$ is the ground truth ED/ES frame in the ith echo cine (i.e., $q_{\mathrm{ED}} = \sigma_{\mathrm{ED}}(T)$ and $q_{\mathrm{ES}} = \sigma_{\mathrm{ES}}(T)$; see Sec. II-A for the explanation of $\sigma(\cdot)$), and $\tilde{q}_{\mathrm{ED/ES}}$ is the predicted ED/ES frame (see Sec. II-B for the respective definition).

IV. RESULT AND DISCUSSION
A. Model and Hyper-parameter Selection Experiment
Our first experiment is designed to determine the performance of the selected CNN and RNN modules on the phase detection problem. The performance of the architectures is evaluated at four initial learning rates ranging from 1e−4 to 1e−1 in steps of a power of 10. We observe that training convergence is hampered when the learning rate is increased further. This experiment explores the best architectural configuration with the use of the monotonic loss function Lmono (see details in Sec. II-B), where the weighting parameter α in Eq. (6) is set to 0.3 for all the trained models in this experiment. The reference model of the experiment is the ResNet + 2-LSTM model, originally proposed in [33]. The combination of CNN modules, RNN modules, and learning rates results in 32 different models. Finally, the number of parameters of each model is within a 10% variation from 1.1 million.

The experiment result is shown in Fig. 6, and it is noticeable that the dominant factor for achieving low error is the learning rate rather than the architecture. In Table I, we show the performance of the examined architectures at learning rate 1e−1. It can be seen that the ED localization error is always lower than the ES localization error for all tested architectures. From the clinical point of view, the primary indicator to identify the ED frame is the closure of the mitral valve, which is clearly visible in the AP4 view, whereas the dominant indicator of the ES frame (i.e., the aortic valve’s closure) is not visible in this view. We believe that the lack of ES characteristic information in the AP4 view is likely the reason for the relatively larger deviation in the ES frame localization error.

Comparing the architectures that use the same type of RNN module but either the ResNet or the DenseNet module, the DenseNet-based models are favourable, as they result in 0.18 lower ED frame error and 0.11 lower ES frame error on average. With the same CNN module, the LSTM-based and GRU-based modules produce comparable results, so no preferred module can be recommended for this phase detection problem. Nevertheless, the observation is consistent with the findings in [21], [37] that when a similar number of LSTMs and GRUs are used, the performance difference is minor. In addition, it can be observed that the two-layer and bi-directional variants of both RNN units yield similar test performance.

B. State-of-the-art Comparison
Based on the above results, we choose the DenseNet + 2-GRU architecture for testing the proposed Lge loss function because it has the lowest combined µED and µES error and the smallest number of model parameters.

From Table II, we can see that the proposed Lge loss function reduces the ED localization error by 0.29 frames on average. This improvement from Lmono to Lge is statistically significant according to a t-test. As for the ES localization error, Lge is 0.09 frames on average behind Lmono. Nevertheless, the t-test does not reject the null hypothesis at the 5% significance level, suggesting comparable ES performance. In this experiment, the α value for Lge is set to 0.3 to be comparable to Lmono, and the γ hyper-parameter of Lge is validated in the range between 0 and 0.8, where the optimal value is 0.025. We illustrate the sensitivity of Lge to γ in Fig. 7. It is observed that the ED error is more sensitive than the ES error to changes in γ.

In Fig. 8, we show the sample-wise frame error distribution computed for the respective Lmono and Lge trained DenseNet + 2-GRU models. It can be observed that the Lmono loss has a slight tendency towards a late prediction, while the Lge loss is biased towards an earlier prediction for the ES frame. On the other hand, the Lge loss shows more correct ED predictions compared to the Lmono loss.

For the completeness of the study, the performance of the Lmono model has been examined across several cine series properties, i.e., frame rate, patient heart rate, and number of frames, but we did not observe a clear functional relationship between the model performance and these properties.
TABLE II: A close comparison between the monotonic loss Lmono and the proposed global extrema loss Lge.

Method                                No. of Param.   µED           µES
TempReg-Net [27]                      5.37M           0.91 ± 1.16   1.75 ± 1.51
DMTRL-Net [36]                        0.77M           0.65 ± 0.88   1.80 ± 1.55
DenseNet + 2-GRU + Lmono [27]         0.98M           0.49 ± 0.78   1.34 ± 1.17
DenseNet + 2-GRU + Lge (proposed)     0.98M           0.20 ± 0.67   1.43 ± 1.30
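
The µ columns report the mean absolute frame error defined in Sec. III-C. A minimal sketch of that metric and of the R² score follows, assuming that the "±" term is a standard deviation over cine series (the population form is used here as a guess); the function names and toy frame indices are hypothetical.

```python
import statistics

def frame_error_stats(true_frames, pred_frames):
    """Mean and population standard deviation of |q - q_hat| over cine series."""
    errors = [abs(q - q_hat) for q, q_hat in zip(true_frames, pred_frames)]
    return statistics.mean(errors), statistics.pstdev(errors)

def r2_score(y_true, y_pred):
    """R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)."""
    y_bar = statistics.mean(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical labeled and predicted ED frame indices for four cine series.
labeled_ed = [10, 12, 15, 11]
predicted_ed = [10, 13, 13, 11]
mu_ed, sigma_ed = frame_error_stats(labeled_ed, predicted_ed)
print(f"mu_ED = {mu_ed:.2f} +/- {sigma_ed:.2f}")
```

Because µ is computed on the detected frame indices while R² is computed on the per-frame regression values, a model can score well on R² and still miss the extrema, which is why the text treats R² as an indirect indicator.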

[Figure 7 plot: x-axis: γ (0, 0.0125, 0.025, 0.05, 0.2, 0.8); y-axis: error measurement µ; series: ES and ED.]
Fig. 7: The sensitivity of Lge loss function w.r.t. the mean localization error µ, shown as a function of the margin hyper-parameter γ.

In Table II, we also include the performance of the TempReg-Net [27] and the deep multitask relationship learning network (DMTRL-Net) [36]. Both of these networks were designed to solve the phase detection problem from MRI data. The TempReg-Net consists of a pre-trained AlexNet feature model (five convolution layers and two fully-connected layers), of which only the features of the first fully-connected layer are used as its CNN module, and a single layer of LSTM as its RNN module. The original DMTRL-Net contains two parallel RNNs, one for the LV indices estimation task and another for the cardiac phase detection task. Our implementation of the DMTRL-Net includes the CNN module (three batch-normalized convolution layers and two FC layers) and the phase detection RNN (LSTM) branch. It can be observed that TempReg-Net has about seven times as many parameters as DMTRL-Net and more than five times as many as the proposed deep learning architecture. Nevertheless, the TempReg-Net and DMTRL-Net perform similarly on ES localization, while DMTRL-Net has a lower ED localization error. The performance of both methods is behind that of the proposed deep learning architecture.

C. Extension of the Proposed Model to PLAX View
Although we experimented on the AP4 view to this point, the proposed deep learning approach does not make any a priori assumption on a specific echo view. We conduct a feasibility study with the use of 1,382 patient studies on the Parasternal Long Axis (PLAX) view, obtained under the same ethics approval. We partitioned the data into three mutually exclusive sets of training (60%), validation (20%), and test (20%) based on unique patients, and the training hyper-parameters were adopted from the above AP4 view experiment. The proposed model was trained on the combined training and validation sets and the performance is reported on the test set.

The ED and ES frame localization error distributions are depicted in Fig. 9. The localization error of Lge for the ED frame is 0.71 ± 1.34 frames and the error for the ES frame is 1.92 ± 1.71 frames. In comparison, the localization error of Lmono for the ED frame is 0.84 ± 1.25 frames, and the error for the ES frame is 2.02 ± 1.84 frames. Hence, the proposed Lge loss function reduces the ED localization error by 0.13 frames on average. This improvement from Lmono to Lge is statistically significant according to a t-test (p value is 0.044). As for the ES localization error, Lge is 0.10 frames on average better than Lmono. The t-test does not reject the null hypothesis at the 5% significance level, suggesting comparable ES performance. This experiment confirms the performance consistency of the proposed model across another clinically important cardiac view.

V. CONCLUSION
In this paper, we proposed a method to solve the ES and ED frame detection problem directly, without the need for LV segmentation. We reported results that were within inter-observer variability error for detecting ED and ES frames (i.e., median disagreement of 3 frames [4]). We demonstrated the performance of several deep learning architectures, based on the combination of state-of-the-art CNN and RNN modules, and evaluated these architectures on a large dataset of echocardiography cine series. The results demonstrate that DenseNets perform better than ResNets on the given phase detection task. On the other hand, LSTMs, GRUs, and their bi-directional variants yielded comparable performance. We identified a key issue in the previous method proposed by Kong et al. [27] for phase detection in cardiac MRI, which was that the training objective was an indirect surrogate of the test objective. Therefore, we proposed a new structured loss function (Eq. (5)) that directly optimized towards this objective. We also observed that the ES frame localization error was always higher than the ED frame localization error, which can be a result of lacking ES characteristic visual information in the AP4 view. To test the generalizability of the proposed loss and deep learning architecture, we also experimented with the PLAX view and obtained comparable results.

[Figure 8 plots: panels (a) ES and (b) ED. x-axis: Frame Error Distribution (−7 to 7); y-axis: Histogram Count; series: Lmono and Lge.]
Fig. 8: Histogram plot comparison of test error distribution of the Lmono and Lge trained DenseNet + 2-GRU models listed as
the last two entries of Table II.
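
Histograms of this kind can be tallied from signed per-cine frame errors. The sketch below is illustrative only: the sign convention (predicted minus labeled frame, so that positive values mean a late prediction) is an assumption, and the frame indices are made up.

```python
from collections import Counter

def signed_error_histogram(true_frames, pred_frames):
    """Count signed frame errors (predicted - labeled); positive means late."""
    return Counter(p - q for q, p in zip(true_frames, pred_frames))

# Hypothetical labeled and predicted ES frame indices for five cine series.
labeled_es = [10, 12, 15, 11, 20]
predicted_es = [10, 13, 13, 11, 21]

hist = signed_error_histogram(labeled_es, predicted_es)
for error in sorted(hist):
    print(error, hist[error])
```

A distribution with more mass on the positive side would correspond to the late-prediction tendency described for Lmono, and mass on the negative side to an early-prediction bias.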

[Figure 9 plots: panels (a) ES and (b) ED. x-axis: Frame Error Distribution (−9 to 9); y-axis: Histogram Count; series: Lmono and Lge.]
Fig. 9: Histogram plot of frame error distribution of the Lmono and Lge trained DenseNet + 2-GRU models on the PLAX view
dataset. Examples of ES and ED PLAX view sample images are overlaid on the respective figures.

A limitation of this study is the lack of a comprehensive evaluation of the generalizability of the proposed method w.r.t. unseen data, such as data from other cardiac views, data from various ultrasound vendors, and data of low quality, since our current data were acquired by expert ultrasound operators.

ACKNOWLEDGMENT

The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institutes of Health Research (CIHR) for funding this project. We would also like to thank Dale Hawley for helping with access to the data used in this research.

REFERENCES

[1] D. D. McManus, S. J. Shah, M. R. Fabi, A. Rosen, M. A. Whooley, and N. B. Schiller, “Prognostic value of left ventricular end-systolic volume index as a predictor of heart failure hospitalization in stable coronary artery disease: data from the heart and soul study,” Journal of the American Society of Echocardiography, vol. 22, no. 2, pp. 190–197, 2009.
[2] R. M. Lang, L. P. Badano, V. Mor-Avi, J. Afilalo, A. Armstrong, L. Ernande, F. A. Flachskampf, E. Foster, S. A. Goldstein, T. Kuznetsova et al., “Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the american society of echocardiography and the european association of cardiovascular imaging,” Journal of the American Society of Echocardiography, vol. 28, no. 1, pp. 1–39, 2015.
[3] R. O. Mada, P. Lysyansky, A. M. Daraban, J. Duchenne, and J.-U. Voigt, “How to define end-diastole and end-systole?: Impact of timing on strain measurements,” JACC: Cardiovascular Imaging, vol. 8, no. 2, pp. 148–157, 2015.
[4] M. Zolgharni, M. Negoita, N. M. Dhutia, M. Mielewczik, K. Manoharan, S. Sohaib, J. A. Finegold, S. Sacchi, G. D. Cole, and D. P. Francis, “Automatic detection of end-diastolic and end-systolic frames in 2d echocardiography,” Echocardiography, vol. 34, no. 7, pp. 956–967, 2017.
[5] N. Kachenoura, A. Delouche, A. Herment, F. Frouin, and B. Diebold, “Automatic detection of end systole within a sequence of left ventricular echocardiographic images using autocorrelation and mitral valve motion detection,” in IEEE International Conference on Engineering in Medicine and Biology Society. IEEE, 2007, pp. 4504–4507.
[6] A. Shalbaf, Z. AlizadehSani, and H. Behnam, “Echocardiography without electrocardiogram using nonlinear dimensionality reduction methods,” Journal of Medical Ultrasonics, vol. 42, no. 2, pp. 137–149, 2015.
[7] U. Barcaro, D. Moroni, and O. Salvetti, “Automatic computation of left ventricle ejection fraction from dynamic ultrasound images,” Pattern Recognition and Image Analysis, vol. 18, no. 2, p. 351, 2008.
[8] S. Darvishi, H. Behnam, M. Pouladian, and N. Samiei, “Measuring left ventricular volumes in two-dimensional echocardiography image sequence using level-set method for automatic detection of end-diastole and end-systole frames,” Research in Cardiovascular Medicine, vol. 2, no. 1, p. 39, 2013.
[9] A. A. Abboud, R. W. Rahmat, S. B. Kadiman, M. Z. B. Dimon, L. Nurliyana, M. I. Saripan, and H. H. Khaleel, “Automatic detection of the end-diastolic and end-systolic from 4d echocardiographic images,” Journal of Computer Science, vol. 11, no. 1, pp. 230–240, 2015.
[10] S. A. Aase, S. R. Snare, H. Dalen, A. Støylen, F. Orderud, and H. Torp, “Echocardiography without electrocardiogram,” European Heart Journal-Cardiovascular Imaging, vol. 12, no. 1, pp. 3–10, 2011.
[11] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
[12] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 3.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[15] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156–3164.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[21] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[22] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[23] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014, pp. 1724–1734.
[24] H. Chen, Q. Dou, D. Ni, J.-Z. Cheng, J. Qin, S. Li, and P.-A. Heng, “Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 507–514.
[25] B. Kong, X. Wang, Z. Li, Q. Song, and S. Zhang, “Cancer metastasis detection via spatially structured deep network,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 236–248.
[26] W. Xue, I. B. Nachum, S. Pandey, J. Warrington, S. Leung, and S. Li, “Direct estimation of regional wall thicknesses via residual recurrent neural network,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 505–516.
[27] B. Kong, Y. Zhan, M. Shin, T. Denny, and S. Zhang, “Recognizing end-diastole and end-systole frames via deep temporal regression network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2016, pp. 264–272.
[28] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. MIT Press, 1995, p. 3361.
[29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[30] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.
[31] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[32] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[33] F. T. Dezaki, N. Dhungel, A. H. Abdi, C. Luong, T. Tsang, J. Jue, K. Gin, D. Hawley, R. Rohling, and P. Abolmaesumi, “Deep residual recurrent neural networks for characterisation of cardiac cycle phase from echocardiograms,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 100–108.
[34] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri et al., “Lasagne: First release,” 2015.
[35] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[36] W. Xue, G. Brahm, S. Pandey, S. Leung, and S. Li, “Full left ventricle quantification via deep multitask relationships learning,” Medical Image Analysis, vol. 43, pp. 54–65, 2018.
[37] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and rnn for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
