ABSTRACT

As medical imaging datasets grow, we are approaching the era of big data for radiologic decision support systems. This requires renewed efforts in dataset curation and labeling. We propose a methodology for weak labeling of medical images for attributes such as anatomy and disease that relies on image-to-sentence transformation. The methodology consists of three models: a convolutional neural network that is trained on a coarse classification task and acts as an image feature generator, a language model that maps sentences to a fixed-length space, and a multi-layer perceptron that acts as a function approximator to map images to the sentence space. The transform model is trained on matched image-sentence pairs from a dataset of echocardiography studies. For a given image, labels are extracted from the sentences closest to the output of the image-sentence transform. We show that the resulting solution has a 78.2% accuracy in labeling Doppler images with aortic stenosis. We also show that the retrieved sentences are consistent in meaning with the true sentences, with an average BLEU score of 0.34, on par with current high-performing machine translation solutions.

Index Terms: Image labeling, multimodal classification.

1. INTRODUCTION

The promise of big data also brings the challenge of curating large datasets for training machine learning algorithms. The success of learning approaches in medical imaging is slowed by the fact that most of the available large datasets fall within the category of dark data: unorganized data, without clinical labels, that is collected in the course of routine clinical practice. The true labeling and annotation of this data can be performed by clinicians. However, this is very expensive. As such, we require methods for weak initial labeling of medical data to reduce the amount of time clinicians spend on such images. Crowd-sourcing has been proposed as a solution in this domain, but it is limited by privacy concerns and by the difficulty of radiology tasks.

Another potential solution lies in the domain of multimodal analysis of text and images, where one can use a large dataset of images and text to train a model that predicts labels or composes text for a new image. These models are trained on very large datasets for natural images and domains such as travel blogs [1, 2]. However, in the medical imaging domain, there is limited work in this area. In a recent work, we proposed a cross-modality image transform for this purpose [3]. The essence of this methodology is to separate the process of quantifying text and images from the process of training a transformation that maps the image space to the text space. Unlike work in the automatic image captioning community, which uses hybrid models, our approach limits the need for pairs of matched images and text. Once the transformation is trained, one can utilize a large volume of non-parallel images and text segments to generate labels, by transforming a given image and extracting keywords from the nearest neighbors of the output vector.

This work reports an improved version of the methodology in [3] and the results on an image dataset that is 20 times larger. The main methodological novelties of this work include: 1) In [3], we used a pre-trained convolutional neural network (CNN) as a source of features for images. That network was trained for the ImageNet challenge and was not altered [4]. This was due to the very limited size of the dataset and the lack of diagnostic labels to train a classifier in a supervised fashion. In this work, given a much larger dataset, we propose to train a CNN on our imaging data that performs a coarser-level classification task on the images of interest, for which labels are typically available. This removes the need to use a network with irrelevant categories as a source of features. We implement this idea in the domain of echocardiography Doppler image analysis by training a model as a feature generator that classifies images by the cardiac valve imaged, as opposed to by disease. 2) The language model used in [3] transformed a full paragraph to a fixed-length feature vector. In the current work we use sentences as the language unit. A paragraph could consist of many sentences irrelevant to a given new image. For example, in a clinical report written for an echocardiography study, there is usually a paragraph written for the aortic valve. This includes sentences that describe the anatomy and function of the valve. The function is relevant to continuous wave (CW) Doppler images, while anatomy is visible in B-mode images. Text modeling at the sentence level allows us to retrieve more relevant sentences for images.

∗ Corresponding author: mmoradi@us.ibm.com
2.3. Sentence model

As described earlier, in this work we have chosen to use a sentence as the unit of text. For our work, it is critical that words/sentences with similar meanings have a similar contribution to anchoring a textual instance in the feature space. Bag of words is a traditional method of converting a text segment to a fixed-length vector. However, in models such as bag of words, words such as “narrowing” and “stenosis” are as distant from each other as “normal” and “stenosis”, regardless of meaning. Given the complexity and flexibility of natural languages, features such as bag of words or word sequences usually result in a high-dimensional vector, which may cause data sparsity issues when the size of the training data is small compared with the number of features.

In this work, we used an unsupervised method to create a fixed-length vector given a sentence. This is based on the neural network language model proposed in [9] to generate distributed representations of texts. This network is often referred to as Doc2Vec in the literature (open source code: http://deeplearning4j.org/doc2vec.html). We refer to it as sent2vec as we use it to quantify sentences. The input of the neural network includes a sequence of observed words (e.g. “aortic valve peak”), each represented by a fixed-length vector, along with a text snippet token, also in the form of a dense vector and corresponding to the sentence/document source for the sequence. The concatenation or average of the word and sentence vectors was used to predict the next word (e.g. “velocity”) in the snippet. The two types of vectors were trained on the 35,150 available sentences. Training was performed using stochastic gradient descent via backpropagation. At the testing stage, given an unseen sentence, we freeze the word vectors from training time and infer only the sentence vector.

The fixed length of the text feature vector is a parameter of the sent2vec model. Given the small size of the input (one sentence), we limit the size of the text vectors to 10, after experimenting with a range of possibilities. An additional advantage of a small vector is that the transform model will have fewer weights to train.

Since the transformation network acts as a regressor as opposed to a classifier, the output layer activation functions were set to linear as opposed to SoftMax. We optimized the network with the objective of minimizing the mean Euclidean distance between the output vector and the target text vector for the image.

2.5. Test pipeline

In the test stage, given an image, we first run it through the CNN classifier. While this network produces a valve label, we do not use the label, since the data already has valve labels. Instead, we extract the outputs of the last fully connected layer of this model as the image feature vector. The output is then fed to the transformation network to obtain a vector in the text space. Then we search for the closest matches to this vector in the text dataset. The closest matches, in terms of Euclidean distance, are used for extraction of disease descriptors for the image. For consistency, we use the top 20 matches of each output vector along with majority voting to determine the existence of aortic stenosis and also to separate severe cases from mild or moderate cases of aortic stenosis.

2.6. BLEU score

We also calculate and report the BLEU score [10] between the actual and predicted/retrieved image descriptions. The BLEU (bilingual evaluation understudy) score was originally developed for automatic evaluation of machine translation against reference translations. The cornerstone of the metric is a modified version of n-gram precision, which is language independent. Scores are calculated on individual translations and can be averaged over the whole corpus to approximate the overall performance. It has been frequently reported that there is a good correlation between BLEU scores and human judgment, and BLEU remains one of the most popular automated, efficient, and inexpensive metrics for machine translation and natural language generation evaluation. Our use of the BLEU score is a new application for this classic measure. We used the open source implementation in NLTK [11] for corpus-BLEU computation.
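To make the modified n-gram precision concrete, the following is a minimal sketch of corpus-level BLEU. It is not the NLTK implementation used in this work; it is a standard-library illustration that assumes whitespace tokenization, a single reference sentence per retrieved sentence, uniform weights up to 4-grams, and no smoothing, none of which are specified in the text. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, uniform weights.

    Assumption: no smoothing, so the score is 0.0 whenever some n-gram
    order has no match anywhere in the corpus.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # n-grams present in the hypotheses, per order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Modified precision: clip each hypothesis n-gram count
            # by the count of the same n-gram in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the per-order precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy usage with one true/retrieved sentence pair.
true_sentence = "pulmonic valve: structurally normal pulmonic valve. trace pulmonic regurgitation.".split()
retrieved = "pulmonic valve: structurally normal pulmonic valve. mild pulmonic regurgitation.".split()
print(corpus_bleu([true_sentence], [retrieved]))
```

Because counts are pooled over the whole corpus before the precisions are computed, this differs from averaging per-sentence scores; short sentences with no 4-gram match do not force the corpus score to zero unless no sentence has one.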
Table 1. Test sentences and their top three matches based on Euclidean distance from other sentences in the dataset.
no evidence of mitral stenosis.
no evidence of mitral stenosis.
peak aortic valve velocity is 275 cm/s with a mean gradient of 15 mmhg.
peak instantaneous aortic valve velocity 2.3 m/s with a mean gradient of 11 mmhg.
pulmonic valve: structurally normal pulmonic valve. trace pulmonic regurgitation.
pulmonic valve: structurally normal pulmonic valve. mild pulmonic regurgitation.
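The retrieval step behind Table 1, combined with the majority vote described for the test pipeline, might be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: it assumes every dataset sentence has already been mapped to a vector and tagged with a disease label, and the labels, the two-dimensional toy vectors, and the function names are all hypothetical (this work uses 10-dimensional sent2vec vectors and the top 20 matches).

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_k_matches(query, sentence_vectors, k=20):
    """Indices of the k sentence vectors closest to the query vector."""
    order = sorted(range(len(sentence_vectors)),
                   key=lambda i: euclidean(query, sentence_vectors[i]))
    return order[:k]


def majority_label(query, sentence_vectors, sentence_labels, k=20):
    """Most frequent label among the k nearest sentences.

    Assumption: each sentence carries one pre-extracted label; ties are
    resolved in favor of the label encountered first among the matches.
    """
    nearest = top_k_matches(query, sentence_vectors, k)
    votes = Counter(sentence_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy usage: hypothetical 2-D sentence vectors and labels.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["stenosis", "stenosis", "no stenosis", "no stenosis", "no stenosis"]
transformed_image = (0.1, 0.1)  # stands in for the transform network output
print(majority_label(transformed_image, vectors, labels, k=3))
```

A linear scan is adequate for a text dataset of this size (tens of thousands of sentences); a spatial index would only matter for much larger collections.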
Performance of the valve classifier: While we do not use the valve classifier labels, the performance of this network on the task of valve detection could be an indicator of the quality of the features. The optimal network was obtained at training epoch 184, after which the performance of the model on the validation set started to decrease. This model was used for testing on the 5% holdout data, which consisted of 450 images. The accuracy for this four-way classification task was 80.1%.

Analysis of the results in terms of disease labeling: Based on majority voting, in 78.2% of the images, the retrieved sentences agreed with the true disease label (positive or negative for aortic stenosis). Estimating the severity of the disease (severe vs. mild or moderate) is a more challenging problem. Our accuracy rate for this problem was 60.2%.

Quality of top matches in terms of BLEU score: By calculating the BLEU score between actual and predicted/retrieved image descriptions, we quantitatively evaluate the performance of the learned (deep) networks. A high BLEU score of 0.34 was achieved on average, which is comparable to the score that the current top machine translation systems achieve (between 0.2 and 0.4 for different languages, as reported in [12]).

Conclusions: We propose a method for labeling medical images that relies on transforming the images to a space defined by the sent2vec representation of sentences written for similar images. We have developed a solution that provides an accuracy of 78.2% in disease labeling for aortic stenosis and also shows a high BLEU score, indicating consistency between the retrieved sentences and the true sentences matching the test images. As our dataset grows, we can train increasingly larger image-sentence transforms to model more complicated relationships between the two domains and improve the accuracy of labeling. We will also expand our work to other diseases and domains.

4. REFERENCES

[1] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015.
[2] G. Kulkarni, et al., “Understanding and generating image descriptions,” in CVPR, 2011.
[3] Mehdi Moradi, Yufan Guo, Yaniv Gur, Mohammadreza Negahdar, and Tanveer Syeda-Mahmood, “A cross-modality neural network transform for semi-automatic medical image annotation,” in MICCAI, 2016, pp. 300–307.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.
[5] Mehdi Moradi, Yaniv Gur, Hongzhi Wang, Prasanth Prasanna, and Tanveer Syeda-Mahmood, “A hybrid learning approach for semantic labeling of cardiac CT slices and recognition of body position,” in IEEE ISBI, 2016, pp. 1418–1421.
[6] H. C. Shin, et al., “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE Transactions on Medical Imaging, vol. 35, pp. 1285–1298, 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] Tanveer Syeda-Mahmood, et al., “Identifying patients at risk for aortic stenosis through learning from multimodal data,” in MICCAI, 2016, pp. 238–245.
[9] Quoc Le and Tomas Mikolov, “Distributed representations of sentences and documents,” in ICML, 2014, pp. 1188–1196.
[10] K. Papineni, et al., “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[11] Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, O’Reilly Media, Inc., 2009.
[12] Philipp Koehn, “Europarl: A parallel corpus for statistical machine translation,” in MT Summit, 2005, vol. 5, pp. 79–86.