
AUTOMATIC LABELING OF CONTINUOUS WAVE DOPPLER IMAGES BASED ON

COMBINED IMAGE AND SENTENCE NETWORKS

Mehdi Moradi∗, Yufan Guo, Yaniv Gur, Tanveer Syeda-Mahmood

IBM Research - Almaden Research Center, San Jose, CA
∗ Corresponding author: mmoradi@us.ibm.com

ABSTRACT

As medical imaging datasets grow, we are approaching the era of big data for radiologic decision support systems. This requires renewed efforts in dataset curation and labeling. We propose a methodology for weak labeling of medical images for attributes such as anatomy and disease that relies on image to sentence transformation. The methodology consists of three models: a convolutional neural network that is trained on a coarse classification task and acts as an image feature generator, a language model that maps sentences to a fixed-length space, and a multi-layer perceptron that acts as a function approximator mapping images to the sentence space. The transform model is trained on matched image-sentence pairs from a dataset of echocardiography studies. For a given image, labels are extracted from the sentences closest to the output of the image-sentence transform. We show that the resulting solution has a 78.2% accuracy in labeling Doppler images with aortic stenosis. We also show that the retrieved sentences are consistent with the true sentences in terms of meaning, with an average BLEU score of 0.34, matching the current highly performing machine translation solutions.

Index Terms— Image labeling, multimodal classification.

1. INTRODUCTION

The promise of big data also brings the challenge of curating large datasets for training machine learning algorithms. The success of learning approaches in medical imaging is slowed by the fact that most available large datasets fall within the category of dark data: unorganized data, without clinical labels, collected in the course of routine clinical practice. The true labeling and annotation of this data can be performed by clinicians. However, this is very expensive. As such, we require methods for weak initial labeling of medical data to reduce the amount of time clinicians spend on such images. Crowd-sourcing has been proposed as a solution in this domain, but it is limited by privacy concerns and also by the difficulty of radiology tasks.

Another potential solution lies in the domain of multimodal analysis of text and images, where one can use a large dataset of images and text to train a model that predicts labels or composes text for a new image. These models are trained on very large datasets for natural images and domains such as travel blogs [1, 2]. However, in the medical imaging domain, there is limited work in this area. In a recent work, we proposed a cross-modality image transform for this purpose [3]. The essence of this methodology is to separate the process of quantifying text and images from the process of training a transformation that maps the image space to the text space. Unlike the work in the automatic image captioning community, which uses hybrid models, our approach limits the need for pairs of matched image and text. Once the transformation is trained, one can utilize a large volume of non-parallel images and text segments to generate labels, by transforming a given image and extracting keywords from the nearest neighbors of the output vector.

This work reports an improved version of the methodology in [3] and results on an image dataset that is 20 times larger. The main methodological novelties of this work include: 1) In [3], we used a pre-trained convolutional neural network (CNN) as a source of features for images. That network was trained for the ImageNet challenge and was not altered [4]. This was due to the very limited size of the dataset and the lack of diagnostic labels to train a classifier in a supervised fashion. In this work, given a much larger dataset, we propose to train a CNN on our imaging data that performs a coarser-level classification task on the images of interest, for which labels are typically available. This removes the need to use a network with irrelevant categories as a source of features. We implement this idea in the domain of echocardiography Doppler image analysis by training, as a feature generator, a model that classifies images by the cardiac valve imaged, as opposed to disease. 2) The language model used in [3] transformed a full paragraph to a fixed-length feature vector. In the current work we use sentences as the language unit. A paragraph can consist of many sentences irrelevant to a new given image. For example, in a clinical report written for an echocardiography study, there is usually a paragraph written for the aortic valve. This includes sentences that describe both the anatomy and the function of the valve. The function is relevant to continuous wave (CW) Doppler images, while the anatomy is visible in B-mode images. Text modeling at the sentence level allows us to retrieve more relevant sentences for images.



2. METHODS

Our proposed method of image labeling uses a transformation that maps an image to a space defined by sentences written for similar images by clinicians. The disease labels for an image, in our case for aortic stenosis, are extracted from the sentences that are closest to the output of this transformation given the image. This solution requires three different components that need training: 1) a language model that can represent a given sentence in a semantically consistent fashion; 2) a method for extracting features from a given image that best characterizes the images of interest, in this case cardiac CW Doppler images; and 3) a transformation that maps the image vector to the space formed by the text vectors, trained to minimize the distance of the transformed image from the sentence actually written for that image by a clinician. This last step requires training data in which image and text are real matches of each other.

In the current paper, we describe the methodology with a focus on labeling images for the existence and severity of aortic stenosis. The choice to limit the scope is driven by the difficulty of quantifying the performance of the method for any given disease, as the true diagnosis is only available to us for aortic stenosis. The method, however, is not theoretically limited to this specific application.

Fig. 1. Samples of continuous wave Doppler images of (counterclockwise): aortic, mitral, pulmonic, and tricuspid valves.

2.1. Data

The data used in our work consisted of cardiac echocardiographic studies of over 3300 patients. Each study comes with a single text report, with paragraphs that are each labeled with an anatomy or condition. There are a total of 35,150 sentences. Among these we found 10,978 sentences that mention “aortic valve”. On the imaging side, we have a total of 9,025 CW Doppler images of the aortic, pulmonic, mitral, and tricuspid valves (Fig. 1). The images have been labeled for the imaged valve, and consist of 3843 CW images of the aortic valve, 1725 of the mitral valve, 1112 of the pulmonic valve, and 2345 of the tricuspid valve. The valve labels are available from the characters on the image or from the DICOM headers that come with the images. However, in the absence of these pieces of information, labeling for the valve itself is challenging, as the wave patterns in Figure 1 show.

A total of 3251 images and text segments were matched with each other for training of the image-sentence transform. The results are reported for five-fold cross validation.

2.2. Image valve classifier/feature extractor

Our proposed image-sentence transform requires a quantitative representation of images as input. This could be a mix of traditional hand-crafted features. However, the recent success of deep learning in image classification, including our previous work [5], shows that deep learners can be effectively used as image classifiers and quantifiers. In fact, in situations with limited data, medical imaging experts have successfully used the principles of transfer learning to reuse a convolutional neural network trained on real-world data [6].

Training a supervised classifier for disease requires a large dataset of images labeled for disease. This is generally difficult. Instead, we train a deep neural network that performs a different classification task for which the labels are typically available, namely valve detection. Specifically, we train a convolutional neural network that labels a given CW image as aortic, mitral, pulmonic, or tricuspid valve. All 9,025 CW images are used here. We use 70% of the images for training, 25% for validation to avoid overfitting, and 5% of the data for final testing to report the performance of this model in valve classification.

The typically successful CNNs in the computer vision literature use millions of images to optimize both the architecture and the weights for classification. Given the size of our data, we opt to use the architecture of AlexNet, which has proved successful in previous work [7], and train only the weights. This network consists of five convolutional layers, two fully connected layers each with 4096 nodes, and a softmax layer with 1000 nodes for the ImageNet classification task. Our only change to the architecture is to replace this output layer with four neurons for the four valve types.

The network weights are trained using back-propagation with a stochastic gradient descent solver and exponential decay. The error on the validation set is monitored, and training is stopped when this error starts to rise while the training error stays steady. The resulting classifier is used as a source of features. Specifically, we use the output of the second fully connected layer of the network, after training, as a 4096-dimensional feature vector for a given image [8].
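The paper does not name an implementation framework; the following minimal sketch, assuming PyTorch/torchvision, illustrates the two decisions above: swapping AlexNet's 1000-way output layer for four valve classes, and reading out the second fully connected layer as a 4096-dimensional descriptor. The learning rate, momentum, and decay values are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# AlexNet: five conv layers, two 4096-node fully connected layers, 1000-way output.
net = models.alexnet()                   # weights trained from scratch on the CW images
net.classifier[6] = nn.Linear(4096, 4)   # swap the 1000-way ImageNet layer for 4 valve classes

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)           # SGD solver
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)      # exponential decay

# ... train with back-propagation, stopping when the validation error starts to rise ...

def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Return the 4096-d activations of the second fully connected layer."""
    net.eval()
    with torch.no_grad():
        x = torch.flatten(net.avgpool(net.features(images)), 1)
        for layer in net.classifier[:6]:   # stop before the final 4-way layer
            x = layer(x)
    return x                               # shape: (batch, 4096)
```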

2.3. Sentence model

As described earlier, in this work we have chosen to use the sentence as the unit of text. For our work, it is critical that words and sentences with similar meanings have a similar contribution to anchoring a textual instance in the feature space. Bag of words is a traditional method of converting a text segment to a fixed-length vector. However, in models such as bag of words, words such as “narrowing” and “stenosis” are as distant from each other as “normal” and “stenosis”, regardless of meaning. Given the complexity and flexibility of natural languages, features such as bag of words or word sequences usually result in a high-dimensional vector, which may cause data sparsity issues when the size of the training data is small compared to the number of features.

In this work, we used an unsupervised method to create a fixed-length vector for a given sentence. This is based on the neural network language model proposed in [9] to generate distributed representations of texts. This network is often referred to as Doc2Vec in the literature (open source code: http://deeplearning4j.org/doc2vec.html). We refer to it as sent2vec as we use it to quantify sentences. The input of the neural network includes a sequence of observed words (e.g. “aortic valve peak”), each represented by a fixed-length vector, along with a text snippet token, also in the form of a dense vector and corresponding to the sentence/document source of the sequence. The concatenation or average of the word and sentence vectors is used to predict the next word (e.g. “velocity”) in the snippet. The two types of vectors were trained on the 35,150 available sentences. Training was performed using stochastic gradient descent via backpropagation. At the testing stage, given an unseen sentence, we freeze the word vectors from training time and just infer the sentence vector.

The fixed length of the text feature vector is a parameter of the sent2vec model. Given the small size of the input (one sentence), we limit the size of the text vectors to 10, after experimenting with a range of possibilities. An additional advantage of a small vector is that the transform model will have fewer weights to train.
computation.

2.4. Text-inspired image transform

A feedforward multi-layer perceptron with at least one hidden layer is a universal function approximator, according to the universal approximation theorem. In practice, of course, this only holds in the presence of sufficient data. In the current work, we have 3251 images matched with sentences that can be used to train the required transformation. Since the experiments are performed in five folds, each model is trained on 80% of this data. Note that this amount of data is relatively small. Therefore, we choose a relatively small network with a single hidden layer. This limits our ability to model complex relationships between the image and text spaces and impacts our performance. The remedy for this problem is more data.

Since this network acts as a regressor as opposed to a classifier, the output layer activation functions were set to linear as opposed to softmax. We optimized the network with the objective of minimizing the mean Euclidean distance between the output vector and the target text vector for the image.
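A sketch of this transform, again assuming PyTorch; the paper specifies only a single hidden layer, a linear 10-dimensional output, and the mean Euclidean distance objective, so the hidden width of 256 is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

transform = nn.Sequential(
    nn.Linear(4096, 256),   # CNN features in; 256 is an illustrative hidden width
    nn.ReLU(),
    nn.Linear(256, 10),     # 10-d sent2vec space out; linear activations, not softmax
)

def mean_euclidean_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Euclidean distance between predicted and target sentence vectors."""
    return torch.norm(pred - target, dim=1).mean()

optimizer = torch.optim.SGD(transform.parameters(), lr=0.01)
# One training step, given matched batches of image features and sentence vectors:
#   loss = mean_euclidean_loss(transform(image_feats), sentence_vecs)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```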

2.5. Test pipeline

In the test stage, given an image, we first run it through the CNN classifier. While this network produces a valve label, we do not use it, as the data already carries valve labels. Instead, we extract the outputs of the last fully connected layer of this model as the image feature vector. This vector is then fed to the transformation network to obtain a vector in the text space. We then search for the closest matches to this vector in the text dataset. The closest matches, in terms of Euclidean distance, are used for extraction of disease descriptors for the image. For consistency, we use the top 20 matches of each output vector along with majority voting to determine the existence of aortic stenosis and also to separate severe cases from mild or moderate cases of aortic stenosis.
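The retrieval and voting step reduces to a nearest-neighbor search, sketched below with NumPy; `sentence_vecs` and `stenosis_labels` are assumed inputs holding the sent2vec vectors of the report sentences and binary stenosis labels extracted from them.

```python
import numpy as np

def stenosis_label(image_vec, sentence_vecs, stenosis_labels, k=20):
    """image_vec: 10-d output of the transform; sentence_vecs: (N, 10) sent2vec
    vectors; stenosis_labels: (N,) binary labels extracted from the sentences."""
    dists = np.linalg.norm(sentence_vecs - image_vec, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                             # top-20 matches
    return int(stenosis_labels[nearest].sum() > k / 2)          # majority vote
```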
2.6. BLEU score

We also calculate and report the BLEU score [10] between the actual and predicted/retrieved image descriptions. The BLEU (bilingual evaluation understudy) score was originally developed for automatic evaluation of machine translation against reference translations. The cornerstone of the metric is a modified version of n-gram precision, which is language independent. Scores are calculated on individual translations and can be averaged over the whole corpus to approximate the overall performance. It has frequently been reported that BLEU scores correlate well with human judgment, and BLEU remains one of the most popular automated, efficient, and inexpensive metrics for machine translation and natural language generation evaluation. Our use of the BLEU score is a new application for this classic measure. We used the open source implementation in NLTK [11] for the corpus-BLEU computation.
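A minimal example of the corpus-BLEU computation with NLTK, using toy strings in place of the actual and retrieved descriptions:

```python
from nltk.translate.bleu_score import corpus_bleu

# One reference (the sentence actually written for the image) per hypothesis
# (the retrieved sentence); toy strings shown in place of real report text.
references = [["no evidence of aortic stenosis".split()]]
hypotheses = ["there is no evidence of aortic stenosis".split()]
print(corpus_bleu(references, hypotheses))   # the paper reports an average of 0.34
```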

3. RESULTS AND CONCLUSION

Semantic consistency of the sent2vec language model: A primary assumption of the methodology presented here is that if two vectors in the sent2vec space are close to each other in terms of a similarity measure, they mean the same thing and include the same keywords. The authors of [9] provided evidence of this in the context of a dataset of film reviews. Here, we present qualitative evidence of the semantic consistency of the model in the context of sentences extracted from clinical reports written for cardiac echo Doppler images. Table 1 shows the top match for each given sentence from the set of 10,978 analyzed sentences.

Table 1. Test sentences and their top matches based on Euclidean distance from the other sentences in the dataset.
Test sentence: no evidence of mitral stenosis.
Top match: no evidence of mitral stenosis.
Test sentence: peak aortic valve velocity is 275 cm/s with a mean gradient of 15 mmhg.
Top match: peak instantaneous aortic valve velocity 2.3 m/s with a mean gradient of 11 mmhg.
Test sentence: pulmonic valve: structurally normal pulmonic valve. trace pulmonic regurgitation.
Top match: pulmonic valve: structurally normal pulmonic valve. mild pulmonic regurgitation.

Performance of the valve classifier: While we do not use the valve classifier labels, the performance of this network on the task of valve detection can be an indicator of the quality of the features. The optimal network was obtained at training epoch 184, when the performance of the model on the validation set started to decrease. This model was used for testing on the 5% holdout data, which consisted of 450 images. The accuracy for this four-way classification task was 80.1%.

Analysis of the results in terms of disease labeling: Based on majority voting, in 78.2% of the images the retrieved sentences agreed with the true disease label (positive or negative for aortic stenosis). Estimating the severity of the disease (severe vs. mild or moderate) is a more challenging problem. Our accuracy rate for this problem was 60.2%.

Quality of top matches in terms of BLEU score: By calculating the BLEU score between the actual and predicted/retrieved image descriptions, we quantitatively evaluate the performance of the learned (deep) networks. A high BLEU score of 0.34 was achieved on average, which is comparable to the scores that the current top machine translation systems achieve (between 0.2 and 0.4 for different languages, as reported in [12]).

Conclusions: We propose a method for labeling medical images that relies on transforming the images to a space defined by the sent2vec representation of sentences written for similar images. We have developed a solution that provides an accuracy of 78.2% in disease labeling for aortic stenosis and also shows a high BLEU score, indicating consistency between the retrieved sentences and the true sentences matching the test images. As our dataset grows, we can train increasingly larger image-sentence transforms to model more complicated relationships between the two domains and improve the accuracy of labeling. We will also expand our work to other diseases and domains.

4. REFERENCES

[1] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015.

[2] G. Kulkarni, et al., “Understanding and generating image descriptions,” in CVPR, 2011.

[3] Mehdi Moradi, Yufan Guo, Yaniv Gur, Mohammadreza Negahdar, and Tanveer Syeda-Mahmood, “A cross-modality neural network transform for semi-automatic medical image annotation,” in MICCAI, 2016, pp. 300–307.

[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.

[5] Mehdi Moradi, Yaniv Gur, Hongzhi Wang, Prasanth Prasanna, and Tanveer Syeda-Mahmood, “A hybrid learning approach for semantic labeling of cardiac CT slices and recognition of body position,” in IEEE ISBI, 2016, pp. 1418–1421.

[6] H. C. Shin, et al., “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE Transactions on Medical Imaging, vol. 35, pp. 1285–1298, 2016.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[8] Tanveer Syeda-Mahmood, et al., “Identifying patients at risk for aortic stenosis through learning from multimodal data,” in MICCAI, 2016, pp. 238–245.

[9] Quoc Le and Tomas Mikolov, “Distributed representations of sentences and documents,” in ICML, 2014, pp. 1188–1196.

[10] K. Papineni, et al., “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[11] Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, O’Reilly Media, Inc., 2009.

[12] Philipp Koehn, “Europarl: A parallel corpus for statistical machine translation,” in MT Summit, 2005, vol. 5, pp. 79–86.

