
ORIGINAL RESEARCH

Automated Image Quality Evaluation of T2-Weighted Liver MRI Utilizing Deep Learning Architecture
Steven J. Esses, MD,1 Xiaoguang Lu, PhD,2 Tiejun Zhao, PhD,2
Krishna Shanbhogue, MD,1 Bari Dane, MD,1 Mary Bruno, BS,1 and
Hersh Chandarana, MD1*

Purpose: To develop and test a deep learning approach based on a convolutional neural network (CNN) for automated screening of T2-weighted (T2WI) liver acquisitions for nondiagnostic images, and to compare this automated approach to evaluation by two radiologists.
Materials and Methods: We evaluated 522 liver magnetic resonance imaging (MRI) exams performed at 1.5T and 3T at our institution between November 2014 and May 2016 for CNN training and validation. The CNN consisted of an input layer, convolutional layer, fully connected layer, and output layer. In all, 351 T2WI were anonymized for training. Each case was annotated with a label of being diagnostic or nondiagnostic for detecting lesions and assessing liver morphology. Another independently collected 171 cases were sequestered for a blind test. These 171 T2WI were assessed independently by two radiologists and annotated as being diagnostic or nondiagnostic. They were then presented to the CNN algorithm, and the image quality (IQ) output of the algorithm was compared to that of the two radiologists.
Results: There was concordance in IQ label between Reader 1 and CNN in 79% of cases and between Reader 2 and CNN in 73%. The sensitivity and the specificity of the CNN algorithm in identifying nondiagnostic IQ were 67% and 81% with respect to Reader 1 and 47% and 80% with respect to Reader 2. The negative predictive value of the algorithm for identifying nondiagnostic IQ was 94% and 86% (relative to Readers 1 and 2, respectively).
Conclusion: We demonstrate a CNN algorithm that yields a high negative predictive value when screening for nondiagnostic T2WI of the liver.
Level of Evidence: 2
Technical Efficacy: Stage 2
J. MAGN. RESON. IMAGING 2018;47:723–728.

Abdominal magnetic resonance imaging (MRI) is routinely performed to evaluate chronic liver diseases such as liver cirrhosis and to detect and characterize focal liver lesions.1,2 Although liver MRI is a powerful tool, it suffers from a number of limitations: chief among them is inconsistent image quality and decreased robustness related to long acquisition time, motion artifact, and the need for acquiring data in multiple breath-holds.3,4 T2-weighted sequences are particularly subject to suboptimal imaging quality. A recently published study demonstrated the need for sequence repetition in as many as 55% of the exams.5

Novel methods are being developed and implemented to improve image quality (IQ) and acquisition speed. These advances are tested by comparing images generated with a newly implemented scheme to the conventional scheme. IQ in such studies is routinely evaluated based on qualitative and quantitative metrics. Some of the quantitative measures include signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR), uniformity, ghosting, and geometric distortion. These can often be assessed with the use of a phantom, especially during the technique development phase.6 An advantage of this process is that it is objective and can be measured more consistently. In addition, these parameters lend themselves to automation, as the images do not need to be reviewed by a human reader.7 However, a disadvantage of such metrics is that they do not address whether images

View this article online at wileyonlinelibrary.com. DOI: 10.1002/jmri.25779

Received Mar 9, 2017, Accepted for publication May 15, 2017.

*Address reprint requests to: H.C., Department of Radiology, 660 1st Ave., 3rd Fl., New York, NY 10016. E-mail: Hersh.Chandarana@nyumc.org

From the 1Center for Biomedical Imaging, Department of Radiology, New York University School of Medicine, New York, New York, USA; and 2Siemens
Healthineers, New York, New York, USA

© 2017 International Society for Magnetic Resonance in Medicine
Journal of Magnetic Resonance Imaging

TABLE 1. MRI Parameters of T2WI

Parameter                  1.5T System               3T System
TR                         2170-3130 msec            2140-3490 msec
TE                         90-115 msec               73-105 msec
Flip angle                 120-180°                  121-132°
# Echoes                   1                         1
Echo train length (ETL)    21-29 msec                17-31 msec
Section thickness          4-8 mm                    4-5 mm
Intersection gap           4.8-9.6 mm                4.8-6 mm
Field of view              262-300 × 350-400 mm      255-300 × 339-375 mm
Matrix                     100-205 × 192-256         162-203 × 256-320

are adequate for completing an underlying diagnostic task, such as lesion detection or organ morphology analysis, and sometimes may be of limited value with advances in parallel imaging and nonlinear reconstruction schemes such as compressed sensing.8

Numerous imaging studies have therefore routinely incorporated “task-based” qualitative metrics to evaluate new sequences or machines. For example, Kenkel et al utilized a “lesion conspicuity” metric, scored on the 5-point Likert scale.9 In another study, Fischer et al measured quality by grading artifacts, ease of abdominal organ delineation, and diagnostic confidence.10 Such methodologies, although routinely used, are labor-intensive, requiring availability of trained radiologists. Furthermore, there is interreader variability between radiologists,11 which can make it challenging to qualitatively assess MRI quality consistently.

One potential solution is to develop automated methods for task-based qualitative IQ evaluation. These can streamline IQ evaluation, making it efficient and cheap to develop novel technologies. Even more importantly they can enable real-time scanning optimization (while the patient is being scanned) to improve the robustness of the MR examination. However, qualitative tasks are more difficult to automate, as the “acceptability” of a set of images relies on subjective human assessment, not on objective measurements. One field which shows promise for this type of assessment is the field of machine learning. Over the past two decades, much work has been done on using machine vision to identify objects and features.12 A subdivision of machine learning is deep learning (DL), an artificial intelligence and powerful data analytics tool that is rapidly growing into the mainstream of machine learning research and practice (NIPS https://nips.cc/; ICML http://icml.cc/; ICLR, http://www.iclr.cc/).13

The goal of our study was to develop and test a DL approach using Convolutional Neural Network (CNN) for automated task-based IQ evaluation of T2-weighted (T2WI) liver acquisition, and compare this automated approach to IQ evaluation by two radiologists.

Materials and Methods

Patients
We conducted a HIPAA-compliant retrospective study. The Institutional Review Board deemed that informed consent was not required. A search of the radiology department’s MRI database identified 1595 cases of liver MRI with and without contrast performed for indication of known or suspected liver cirrhosis or focal liver lesion evaluation over a period dating from November of 2014 to May of 2016. We randomly selected 522 liver MRI cases for the purpose of the project.

MRI Protocol
All patients underwent MRI of the liver using either a 1.5T or 3T magnet with a torso phased-array coil. All liver examinations routinely include 2D T2W sequence with frequency-selective fat suppression, performed prior to contrast administration. A standard TSE T2W sequence was used with the range of parameters at 1.5T and 3T as described in Table 1.

DL Architecture
CNNs. CNNs are composed of connected neural nodes with learnable parameters.14,15 We applied a CNN for image analysis with a large number of layers to establish a hierarchical representation of MR images. The CNN that we developed for the purpose of this study consists of an input layer, convolutional layer, fully connected layer, and output layer or loss layer (Fig. 1) with input dimension of 150 × 150 × 3 and two output nodes. This network architecture was adapted from architecture available in the public domain.16 This neural net was trained through backpropagation. Our training was conducted using an open source package Caffe.17 Our testing was validated on two packages, one is in-house implementation, and the other is a MatLab (MathWorks, Natick, MA) wrapper on top of Caffe.
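To make the stated dimensions concrete, the following sketch traces how a 150 × 150 input would shrink through five convolutional layers with interleaved pooling. It is only an illustration under stated assumptions: the kernel sizes, strides, padding, and channel counts are taken from the public AlexNet definition cited as reference 16 and are assumed to have been carried over unchanged, which the paper does not state. The rounding conventions (floor for convolution, ceil for pooling) are likewise assumed to follow Caffe. The names `conv_out`, `pool_out`, `ALEXNET_LIKE`, and `trace_shapes` are hypothetical.

```python
import math

def conv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a convolution (floor rounding, assumed Caffe convention)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a pooling layer (ceil rounding, assumed Caffe convention)."""
    return math.ceil((size - kernel) / stride) + 1

# (name, kind, channels, kernel, stride, pad)
# ASSUMPTION: hyperparameters copied from the public AlexNet definition (ref. 16);
# the paper only states five convolutional and three fully connected layers.
ALEXNET_LIKE = [
    ("conv1", "conv", 96, 11, 4, 0),
    ("pool1", "pool", 96, 3, 2, 0),
    ("conv2", "conv", 256, 5, 1, 2),
    ("pool2", "pool", 256, 3, 2, 0),
    ("conv3", "conv", 384, 3, 1, 1),
    ("conv4", "conv", 384, 3, 1, 1),
    ("conv5", "conv", 256, 3, 1, 1),
    ("pool5", "pool", 256, 3, 2, 0),
]

def trace_shapes(input_size: int = 150):
    """Return (layer name, channels, height, width) after each layer."""
    size = input_size
    shapes = []
    for name, kind, channels, kernel, stride, pad in ALEXNET_LIKE:
        if kind == "conv":
            size = conv_out(size, kernel, stride, pad)
        else:
            size = pool_out(size, kernel, stride)
        shapes.append((name, channels, size, size))
    return shapes

if __name__ == "__main__":
    for name, channels, height, width in trace_shapes(150):
        print(f"{name}: {channels} x {height} x {width}")
    _, channels, height, width = trace_shapes(150)[-1]
    # Features entering the first fully connected layer; the last layer has 2 outputs.
    print("flattened features:", channels * height * width)
```

Under these assumptions the feature map entering the fully connected layers is much smaller than with the 227 × 227 input of the reference network, so the first fully connected layer would need far fewer weights. The actual adapted layer sizes are not given in the paper.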

724 Volume 47, No. 3


Esses et al.: Automated Image Quality Evaluation

TABLE 2. Patient Characteristics in the Training and Validation Datasets

                Training    Validation    P value
Age             58          56            P = 0.39
Male            61%         49%           P = 0.06
1.5T            59%         58%           P = 1
3T              41%         42%           P = 1
Cirrhosis       33%         25%           P = 0.18
Liver lesion    48%         53%           P = 0.46
Ascites         13%         9%            P = 0.44

FIGURE 1: Convolutional Neural Network (CNN) architecture used in our experiment in identifying diagnostic and nondiagnostic T2WI

The optimization technique used in training this neural net is the classic stochastic gradient descent algorithm; the weights are updated following the rule $w_{ij}(t+1) = w_{ij}(t) + \alpha \frac{\partial C}{\partial w_{ij}}$, where $C$ is the cost function, $w$ are the neuron weights, and $\alpha$ is the learning rate.18 We formulate our quality assessment as a classification task (diagnostic vs. nondiagnostic IQ). The softmax function19 is used as cost function, rectified linear unit (ReLU) is chosen as an activation function, ie, f(x) = max(0, x), as it introduces nonlinearity into the neural net to handle complex mapping learning,20 and Dropout is used to regularize the network weight updates to avoid overfitting.21 Following the input layer, five convolution layers with ReLU as the activation function are applied. Pooling layers are added after convolution layers to propagate and consolidate information at various image scales. Subsequently, three fully connected layers are introduced in the end. The entire pipeline is fully automatic without any hand-crafted features and is purely data-driven.

Training
In all, 351 T2WI datasets from clinical liver exams were anonymized and collected for training. Each case was visually inspected and annotated with a label of being diagnostic or nondiagnostic by a trained observer (under the supervision of a board-certified radiologist). Another independently collected 171 cases (validation datasets) were sequestered for blind test.

For each subject, a stack of slices of T2WI was available. Ten middle 2D trans-axial slices were selected, as such selection tends to cover the liver, which is the area of interest. Each slice was resized to have apparent resolution of 3 mm/pixel using bi-cubic interpolation. Subsequently, 150 × 150 pixel² image patches were cropped (zero padding is applied when necessary around the boundaries) beginning from the center of the image. These 150 × 150 pixel² patches were duplicated into three channels, resulting in a 150 × 150 × 3 tensor as the input to the network. During training, each slice was also rotated and scaled to generate more image variations to augment the training database for improving robustness in models. In total, our training data contained 14,670 images labeled nondiagnostic and 15,120 images as diagnostic.

Testing and Validation
Another independently collected set of 171 cases (validation datasets) were sequestered for blind test. Each of these cases was independently inspected by two radiologists who evaluated the IQ of the T2WI and annotated these as being diagnostic or nondiagnostic for detecting lesions and assessing liver morphology. These 171 T2WI datasets were then presented to the CNN algorithm. For each case, the middle seven slices were selected and assessed by the learned models. Empirically, if five or more out of these seven slices are classified as nondiagnostic, the case is classified as nondiagnostic; otherwise, as diagnostic.

We next investigated the imaging features of the discordant cases (cases that were labeled as diagnostic by the radiologist but nondiagnostic by the algorithm and vice versa). One of the radiologists tabulated the qualitative features such as the presence or absence of lesion and presence or absence of artifacts.

Statistical Analysis
The label output from the algorithm (diagnostic or nondiagnostic IQ) was compared to the labels assigned by the two radiologists in 171 cases. A confusion matrix was constructed, and sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) along with the confidence interval in screening of T2WI liver acquisitions for nondiagnostic images were computed with respect to the two radiologists. A confusion matrix is a commonly used method for evaluating machine-learning algorithms.22 Analysis was performed using MedCalc v. 17 (Mariakerke, Belgium). We also compared age, sex, absence or presence of cirrhosis, liver lesion, and ascites between the training and validation patient datasets with Fisher’s exact test using GraphPad (La Jolla, CA) and unpaired t-test.

Results
Table 2 summarizes the patient characteristics of the training dataset and validation dataset. There were no significant differences between the two groups with respect to the presence or absence of liver cirrhosis, ascites, or liver lesions (all P > 0.05).
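The slice preparation described under Training (center crop to 150 × 150 with zero padding, then duplication into three channels) and the case-level rule described under Testing and Validation (five or more of the middle seven slices) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the bicubic resampling to 3 mm/pixel and the CNN itself are omitted, the channel-first layout is an assumption, and every function name is hypothetical.

```python
from typing import List, Sequence

PATCH = 150          # patch side length stated in the paper
VOTE_THRESHOLD = 5   # nondiagnostic slices (out of seven) needed to flag a case

Image = List[List[float]]

def center_crop_or_pad(img: Sequence[Sequence[float]], size: int = PATCH) -> Image:
    """Return a size x size patch centered on the image, zero-padded where needed."""
    rows, cols = len(img), len(img[0])
    top = (rows - size) // 2    # negative when the image is smaller than the patch
    left = (cols - size) // 2
    patch = [[0.0] * size for _ in range(size)]
    for r in range(size):
        src_r = r + top
        if not 0 <= src_r < rows:
            continue            # this output row lies in the zero padding
        for c in range(size):
            src_c = c + left
            if 0 <= src_c < cols:
                patch[r][c] = float(img[src_r][src_c])
    return patch

def to_three_channels(patch: Image) -> List[Image]:
    """Duplicate a single-channel patch into three identical channels.

    ASSUMPTION: channel-first layout (3 x H x W); the paper writes 150 x 150 x 3.
    """
    return [[row[:] for row in patch] for _ in range(3)]

def middle_slices(stack: Sequence, count: int) -> list:
    """Select `count` consecutive slices from the middle of a stack."""
    start = max((len(stack) - count) // 2, 0)
    return list(stack[start:start + count])

def classify_case(slice_is_nondiagnostic: Sequence[bool],
                  threshold: int = VOTE_THRESHOLD) -> str:
    """Apply the paper's case-level rule to per-slice CNN decisions."""
    votes = sum(bool(v) for v in slice_is_nondiagnostic)
    return "nondiagnostic" if votes >= threshold else "diagnostic"

if __name__ == "__main__":
    tiny = [[1.0] * 4 for _ in range(2)]                 # 2 x 4 toy image
    print(center_crop_or_pad(tiny, size=4))              # padded above and below
    print(middle_slices(list(range(30)), 7))             # indices of the middle seven
    print(classify_case([True] * 5 + [False] * 2))       # five votes flag the case
```

Because the threshold is five of seven rather than a simple majority of four, a case is flagged only when most of the sampled slices look nondiagnostic to the network, which trades some sensitivity for fewer flagged cases.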

March 2018 725



TABLE 3. Concordance Between Reader 1 and the CNN Algorithm in Identifying Nondiagnostic T2WI

                       Reader 1 Nondiagnostic    Reader 1 Diagnostic    Total
CNN Nondiagnostic      16                        28                     44
CNN Diagnostic         8                         119                    127
Total                  24                        147                    171
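As a worked check, the screening metrics reported in this paper can be recomputed from the Reader 1 confusion matrix above, treating "nondiagnostic" as the positive class and the reader as the reference standard. The function name `screening_metrics` is hypothetical, and the sketch covers point estimates only.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Point estimates from a 2 x 2 confusion matrix (positive = nondiagnostic)."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # flagged among reader-nondiagnostic cases
        "specificity": tn / (tn + fp),   # passed among reader-diagnostic cases
        "ppv": tp / (tp + fp),           # reader-nondiagnostic among flagged cases
        "npv": tn / (tn + fn),           # reader-diagnostic among passed cases
        "concordance": (tp + tn) / total,
    }

# Counts from the Reader 1 table: CNN rows, reader columns.
reader1 = screening_metrics(tp=16, fp=28, fn=8, tn=119)

if __name__ == "__main__":
    for name, value in reader1.items():
        print(f"{name}: {value:.1%}")
```

The same function applies to the Reader 2 counts. The low PPV alongside a high NPV follows directly from the counts: only 8 of the 127 cases the CNN passed as diagnostic were called nondiagnostic by Reader 1, whereas 28 of its 44 flags were cases the reader considered diagnostic.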

Comparison Between the CNN Algorithm and Reader 1
At the blind test, Reader 1 scored 86% (147/171) of cases diagnostic and 14% (24/171) nondiagnostic. The algorithm scored 74.3% (127/171) diagnostic and 25.7% (44/171) nondiagnostic. There was agreement between Reader 1 and the algorithm in 79% of cases (135/171). In 119/135 of the concordant cases, there was agreement between Reader 1 and the algorithm that the images were diagnostic. In 16/135 of the concordant cases, there was agreement that the images were nondiagnostic. There was disagreement between Reader 1 and the algorithm in 21% (36/171) of cases. In 28/36 disagreements, the reader labeled the study diagnostic and the algorithm labeled it nondiagnostic. In 8/36 disagreements, the reader labeled the study nondiagnostic, and the algorithm diagnostic (Table 3). Based on Reader 1, the sensitivity, specificity, PPV, and NPV of the algorithm for determining which studies were “nondiagnostic” were 67%, 81%, 36%, and 94%, respectively (Table 5; Fig. 2).

FIGURE 2: Concordant case of nondiagnostic T2WI. This case was identified as nondiagnostic by the two radiologists and the CNN algorithm

Comparison Between the CNN Algorithm and Reader 2
At the blind test, Reader 2 scored 80% (137/171) of cases diagnostic and 20% (34/171) nondiagnostic. There was agreement between Reader 2 and the algorithm in 73% (125/171) of cases. In 109/125 of the concordant cases, there was agreement between Reader 2 and the algorithm that the images were diagnostic. In 16/125 of the concordant cases, there was agreement that the images were nondiagnostic. There was disagreement between Reader 2 and the algorithm in 27% (46/171) of cases. In 28/46 disagreements, the reader labeled the study diagnostic and the algorithm labeled it nondiagnostic. In 18/46 disagreements, the reader labeled the study nondiagnostic and the algorithm diagnostic (Table 4). Based on Reader 2, the sensitivity, specificity, PPV, and NPV of the algorithm for determining which studies were “nondiagnostic” were 47%, 80%, 36%, and 86%, respectively (Table 5).

TABLE 4. Concordance Between Reader 2 and the CNN Algorithm in Identifying Nondiagnostic T2WI

                       Reader 2 Nondiagnostic    Reader 2 Diagnostic    Total
CNN Nondiagnostic      16                        28                     44
CNN Diagnostic         18                        109                    127
Total                  34                        137                    171

The concordance between Reader 1 and Reader 2 was 88% (151/171). In 133/151 of the concordant cases, there was agreement that the images were diagnostic. In 18/151 concordant cases there was agreement that the images were nondiagnostic.

IQ Features of Discordant Cases
In 23 cases there was agreement between the readers that the images were diagnostic, yet the algorithm labeled them nondiagnostic. Although the reason for this discordance is unknown, the following features were observed: four had liver lesions (Fig. 3A), 12 had inhomogeneous subcutaneous fat suppression (Fig. 3B), five had artifacts external to the body, and two had both inhomogeneous subcutaneous fat suppression and external artifacts (Table 6). In eight cases, there was agreement between the readers that the cases were nondiagnostic, yet the algorithm labeled them diagnostic. These cases had no readily discernible features.

Discussion
The clinical problem we aimed to address was creating a CNN deep learning algorithm that could identify liver T2W images with nondiagnostic IQ. The proposed algorithm demonstrated a high NPV such that cases that are considered diagnostic by the algorithm tend to also be considered diagnostic


TABLE 5. Algorithm Performance in Detecting “Nondiagnostic” T2WI Quality

                             Relative to Reader 1    Relative to Reader 2
Concordance rate             79%                     73%
Sensitivity                  67%; CI (45-84%)        47%; CI (30-65%)
Specificity                  81%; CI (74-87%)        80%; CI (72-86%)
Positive predictive value    36%; CI (27-47%)        36%; CI (26-48%)
Negative predictive value    94%; CI (89-96%)        86%; CI (81-89%)
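The paper does not state how the confidence intervals in the table above were derived. One common choice for sensitivity and specificity is the exact (Clopper-Pearson) binomial interval, and the sketch below assumes that method at the 95% level purely for illustration. It is applied to the Reader 1 sensitivity (16 of 24); the function names are hypothetical, and the bounds are found by bisection on the binomial distribution rather than with a statistics library.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k: int, n: int, alpha: float = 0.05, iters: int = 60):
    """Exact two-sided binomial interval for k successes in n trials.

    ASSUMPTION: the 95% level and this method are illustrative; the paper
    does not say which interval was used.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 when k == 0).
    if k == 0:
        lower = 0.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if 1 - binom_cdf(k - 1, n, mid) < alpha / 2:
                lo = mid      # tail still too small, so p must be larger
            else:
                hi = mid
        lower = (lo + hi) / 2
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 when k == n).
    if k == n:
        upper = 1.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > alpha / 2:
                lo = mid      # tail still too large, so p must be larger
            else:
                hi = mid
        upper = (lo + hi) / 2
    return lower, upper

if __name__ == "__main__":
    low, high = clopper_pearson(16, 24)   # Reader 1 sensitivity: 16 of 24
    print(f"sensitivity 16/24: {low:.1%} to {high:.1%}")
```

The wide interval for sensitivity reflects the small number of reader-nondiagnostic cases (24 and 34), whereas specificity rests on well over 100 cases per reader and is correspondingly tighter. Intervals for PPV and NPV are often computed with different methods, so this sketch should not be expected to reproduce those rows.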

by the two radiologists. The NPV was high due to the low number of false negatives. However, PPV was low, as some of the cases flagged as nondiagnostic by the algorithm tended to have acceptable diagnostic IQ on the review by the two radiologists. Although the accuracy of the algorithm needs to be improved, we envision the algorithm being utilized to flag cases for technologist review. Under this scheme, false positives are less concerning than false negatives, as any case flagged positive will be reviewed by a technologist.

The ability to flag low-quality images in real time would enable the technologist to address quality issues by altering technical parameters, re-running a sequence, or running additional sequences. For example, if a T2W image sequence is deemed nondiagnostic due to motion artifact, a propeller or BLADE sequence can be performed as a corrective measure.23 Furthermore, in a multicenter institution, “hot spot” problem areas could automatically be identified and addressed, avoiding the usual process of collecting negative feedback from radiologists, a process which can be slow and inefficient.

While the precise reason for the discordance between the CNN algorithm and the radiologists is unknown, several of the discordant cases had features which might enable future adjustment of the algorithm to improve accuracy. For example, one way to address external artifacts and inhomogeneous fat suppression would be to limit the portion of the image being analyzed by the algorithm. The algorithm can be trained to include only the liver in its field of view. In addition, to properly characterize studies where the liver contains lesions, the algorithm can be trained on more cases and taught that lesions do not make an image “nondiagnostic.”

One limitation of our study is that the algorithm was trained and tested based on subjective qualitative human assessment. It is possible that an image deemed nondiagnostic by the readers would have been deemed diagnostic by others. However, it should be noted that the algorithm can continue to learn what is diagnostic versus nondiagnostic based on multiple subsequent readers. As more labeled

TABLE 6. Imaging Characteristics of the Discordant Cases

Inhomogeneous fat suppression                          12
Lesions                                                4
External artifact                                      5
Inhomogeneous fat suppression and external artifact    2

FIGURE 3: A: Discordant case labeled as diagnostic by both readers, but nondiagnostic by the CNN algorithm. There is a hepatic lesion noted in this case on review by the radiologist. B: Discordant case labeled as diagnostic by both readers, but nondiagnostic by the CNN algorithm. There is inhomogeneous fat suppression noted on review by the radiologist.


images are fed into the system, the algorithm will more closely approximate a general concept of “diagnostic” versus “nondiagnostic.”

The algorithm evaluated the middle seven slices of the series instead of the entire stack of T2WI. We limited the assessment of the algorithm to the seven slices in the middle of the stack, as these axial slices are expected to cover the liver. We also advised the readers to assess IQ of the liver and ignore slices above and below the liver. Nevertheless, this may introduce variability in comparisons between the CNN algorithm and the readers. Future work will focus on developing and integrating an automated liver slice-selection algorithm.

This work reflects our initial experience with machine learning in automated assessment of IQ. We are currently implementing this in clinical practice for prospective assessment of IQ in hospitalized patients undergoing liver MRI. Although the machine learning algorithm is to be implemented to improve the robustness of the MRI exam, it is important to note that ethical and regulatory concerns will need to be resolved before such algorithms can be implemented for diagnostic purposes.

In our neural network, we have one input layer, five convolutional layers, three fully connected layers, and one output layer. Although this architecture performed well for the task at hand, it should be noted that this specific CNN architecture is not necessarily the only approach for this task. As CNNs continue to evolve, other approaches may be worth investigating.

In conclusion, we demonstrated the promise of a deep learning approach utilizing a Convolutional Neural Network (CNN) for automated task-based IQ evaluation of T2W liver acquisition. Future areas of research could include applications of CNN for quality analysis of other sequences such as T1WI, and the entire liver MRI exam.

Acknowledgments
Two employees of Siemens Healthineers, USA provided technical support for the training, development, and implementation of CNN architecture. Authors not associated with Siemens Healthineers maintained full control of the data at all times, and these authors were responsible for the validation study comparing the quality label from the machine learning algorithm to the image quality assessment by the two radiologists.

References
1. Danrad R, Martin DR. MR imaging of diffuse liver diseases. Magn Reson Imaging Clin N Am 2005;13:277–293, vi.
2. Hecht EM, Holland AE, Israel GM, et al. Hepatocellular carcinoma in the cirrhotic liver: gadolinium-enhanced 3D T1-weighted MR imaging as a stand-alone sequence for diagnosis. Radiology 2006;239:438–447.
3. Choi JY, Kim MJ, Chung YE, et al. Abdominal applications of 3.0-T MR imaging: comparative review versus a 1.5-T system. Radiographics 2008;28:e30.
4. Tsurusaki M, Semelka RC, Zapparoli M, et al. Quantitative and qualitative comparison of 3.0T and 1.5T MR imaging of the liver in patients with diffuse parenchymal liver disease. Eur J Radiol 2009;72:314–320.
5. Schreiber-Zinaman J, Rosenkrantz AB. Frequency and reasons for extra sequences in clinical abdominal MRI examinations. Abdom Radiol (NY) 2017;42:306–311.
6. Ihalainen T, Sipila O, Savolainen S. MRI quality control: six imagers studied using eleven unified image quality parameters. Eur Radiol 2004;14:1859–1865.
7. Davids M, Zollner FG, Ruttorf M, et al. Fully-automated quality assurance in multi-center studies using MRI phantom measurements. Magn Reson Imaging 2014;32:771–780.
8. Feng L, Benkert T, Block KT, Sodickson DK, Otazo R, Chandarana H. Compressed sensing for body MRI. J Magn Reson Imaging 2017;45:966–987.
9. Kenkel D, Barth BK, Piccirelli M, et al. Simultaneous multislice diffusion-weighted imaging of the kidney: a systematic analysis of image quality. Invest Radiol 2017;52:163–169.
10. Fischer S, Grodzki DM, Domschke M, et al. Quiet MR sequences in clinical routine: initial experience in abdominal imaging. Radiol Med 2017;122:194–203.
11. Rosenkrantz AB, Patel JM, Babb JS, Storey P, Hecht EM. Liver MRI at 3 T using a respiratory-triggered time-efficient 3D T2-weighted technique: impact on artifacts and image quality. AJR Am J Roentgenol 2010;194:634–641.
12. de Bruijne M. Machine learning approaches in medical image analysis: From detection to diagnosis. Med Image Anal 2016;33:94–97.
13. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
14. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278–2324.
15. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Conference Proceeding: Advances in neural information processing systems, 2012. p 1097–1105.
16. https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt
17. https://github.com/BVLC/caffe
18. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press; 2016.
19. Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006.
20. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc IEEE Int Conf Comput Vis; 2015. p 1026–1034.
21. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–1958.
22. Tarca AL, Carey VJ, Chen XW, Romero R, Draghici S. Machine learning and its applications to biology. PLoS Comput Biol 2007;3:e116.
23. Rosenkrantz AB, Mannelli L, Mossa D, Babb JS. Breath-hold T2-weighted MRI of the liver at 3T using the BLADE technique: impact upon image quality and lesion detection. Clin Radiol 2011;66:426–433.
