Automated Image Quality Evaluation of T2-Weighted Liver MRI Utilizing Deep Learning Architecture
Purpose: To develop and test a deep learning approach based on a convolutional neural network (CNN) for automated
screening of T2-weighted (T2WI) liver acquisitions for nondiagnostic images, and compare this automated approach to
evaluation by two radiologists.
Materials and Methods: We evaluated 522 liver magnetic resonance imaging (MRI) exams performed at 1.5T and 3T at
our institution between November 2014 and May 2016 for CNN training and validation. The CNN consisted of an input
layer, convolutional layer, fully connected layer, and output layer. 351 T2WI were anonymized for training. Each case was
annotated with a label of being diagnostic or nondiagnostic for detecting lesions and assessing liver morphology.
Another independently collected 171 cases were sequestered for a blind test. These 171 T2WI were assessed indepen-
dently by two radiologists and annotated as being diagnostic or nondiagnostic. These 171 T2WI were presented to the
CNN algorithm and image quality (IQ) output of the algorithm was compared to that of two radiologists.
Results: There was concordance in IQ label between Reader 1 and CNN in 79% of cases and between Reader 2 and
CNN in 73%. The sensitivity and specificity of the CNN algorithm in identifying nondiagnostic IQ were 67% and 81%
with respect to Reader 1 and 47% and 80% with respect to Reader 2. The negative predictive value of the algorithm for
identifying nondiagnostic IQ was 94% and 86% relative to Readers 1 and 2, respectively.
Conclusion: We demonstrate a CNN algorithm that yields a high negative predictive value when screening for nondiag-
nostic T2WI of the liver.
Level of Evidence: 2
Technical Efficacy: Stage 2
J. MAGN. RESON. IMAGING 2017;00:000–000.
*Address reprint requests to: H.C., Department of Radiology, 660 1st Ave., 3rd Fl., New York, NY 10016. E-mail: Hersh.Chandarana@nyumc.org
From the 1Center for Biomedical Imaging, Department of Radiology, New York University School of Medicine, New York, New York, USA; and 2Siemens
Healthineers, New York, New York, USA
are adequate for completing an underlying diagnostic task, such as lesion detection or organ morphology analysis, and
sometimes may be of limited value with advances in parallel imaging and nonlinear reconstruction schemes such as
compressed sensing.8

Numerous imaging studies have therefore routinely incorporated “task-based” qualitative metrics to evaluate new
sequences or machines. For example, Kenkel et al utilized a “lesion conspicuity” metric, scored on the 5-point Likert
scale.9 In another study, Fischer et al measured quality by grading artifacts, ease of abdominal organ delineation, and
diagnostic confidence.10 Such methodologies, although routinely used, are labor-intensive, requiring availability of
trained radiologists. Furthermore, there is interreader variability between radiologists,11 which can make it challenging
to qualitatively assess MRI quality consistently.

One potential solution is to develop automated methods for task-based qualitative IQ evaluation. These can streamline
IQ evaluation, making it efficient and cheap to develop novel technologies. Even more importantly they can enable
real-time scanning optimization (while the patient is being scanned) to improve the robustness of the MR examination.
However, qualitative tasks are more difficult to automate, as the “acceptability” of a set of images relies on subjective
human assessment, not on objective measurements. One field which shows promise for this type of assessment is the
field of machine learning. Over the past two decades, much work has been done on using machine vision to identify
objects and features.12 A subdivision of machine learning is deep learning (DL), an artificial intelligence and powerful
data analytics tool that is rapidly growing into the mainstream of machine learning research and practice (NIPS
https://nips.cc/; ICML http://icml.cc/; ICLR, http://www.iclr.cc/).13

The goal of our study was to develop and test a DL approach using Convolutional Neural Network (CNN) for automated
task-based IQ evaluation of T2-weighted (T2WI) liver acquisition, and compare this automated approach to IQ
evaluation by two radiologists.

Materials and Methods

Patients
We conducted a HIPAA-compliant retrospective study. The Institutional Review Board deemed that informed consent
was not required. A search of the radiology department’s MRI database identified 1595 cases of liver MRI with and
without contrast performed for indication of known or suspected liver cirrhosis or focal liver lesion evaluation over a
period dating from November of 2014 to May of 2016. We randomly selected 522 liver MRI cases for the purpose of
the project.

MRI Protocol
All patients underwent MRI of the liver using either a 1.5T or 3T magnet with a torso phased-array coil. All liver
examinations routinely include 2D T2W sequence with frequency-selective fat suppression, performed prior to contrast
administration. A standard TSE T2W sequence was used with the range of parameters at 1.5T and 3T as described in
Table 1.

DL Architecture
CNNs. CNNs are composed of connected neural nodes with learnable parameters.14,15 We applied a CNN for image
analysis with a large number of layers to establish a hierarchical representation of MR images. The CNN that we
developed for the purpose of this study consists of an input layer, convolutional layer, fully connected layer, and output
layer or loss layer (Fig. 1) with input dimension of 150 × 150 × 3 and two output nodes. This network architecture was
adapted from architecture available in the public domain.16 This neural net was trained through back-propagation. Our
training was conducted using an open source package Caffe.17 Our testing was validated on two packages, one is
in-house implementation, and the other is a MatLab (MathWorks, Natick, MA) wrapper on top of Caffe.
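To make the training ingredients named in this section concrete (ReLU activation, a softmax over two output nodes, and a gradient-descent weight update), the following is a minimal sketch in plain Python. It is not the Caffe network used in the study: the single linear layer, the toy feature values, the starting weights, the learning rate, and all function names are assumptions chosen only for illustration.

```python
import math

def relu(x):
    """Rectified linear unit, f(x) = max(0, x)."""
    return max(0.0, x)

def softmax(logits):
    """Convert output-node activations into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[target])

def forward(weights, features):
    """One linear layer followed by softmax; weights[k][j] links feature j to node k."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)

def sgd_step(weights, features, target, lr):
    """One gradient-descent update of the layer weights for a single example."""
    probs = forward(weights, features)
    updated = []
    for k, row in enumerate(weights):
        # Gradient of softmax cross-entropy with respect to logit k is (p_k - y_k).
        delta = probs[k] - (1.0 if k == target else 0.0)
        updated.append([w - lr * delta * x for w, x in zip(row, features)])
    return updated

# Hypothetical toy example: three features, two output nodes.
features = [relu(v) for v in (0.8, -0.3, 1.5)]
weights = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
target = 0  # assumed index of the true class for this example

before = cross_entropy(forward(weights, features), target)
weights = sgd_step(weights, features, target, lr=0.1)
after = cross_entropy(forward(weights, features), target)
print(after < before)  # the loss on this example decreases after one step
```

In a full network the same update is applied to every layer, with the gradients obtained by back-propagation rather than the closed-form expression used here.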
TABLE 2: Patient characteristics of the training and validation datasets
Characteristic   Training   Validation   P value
Age              58         56           0.39
Male             61%        49%          0.06
1.5T             59%        58%          1
3T               41%        42%          1
Cirrhosis        33%        25%          0.18
Liver lesion     48%        53%          0.46
Ascites          13%        9%           0.44
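The Statistical Analysis section states that categorical characteristics were compared with Fisher's exact test. As a rough sketch of how such a P value is obtained from a 2 × 2 table of counts, the function below sums hypergeometric probabilities over all tables no more likely than the observed one. That two-sided convention is an assumption (the text does not say which one the software applied), and the counts in the example are a made-up toy table, not study data, because the table reports only percentages.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = pmf(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is no more probable than the observed one.
    tolerance = 1 + 1e-9
    return sum(
        p for p in (pmf(k) for k in range(lowest, highest + 1))
        if p <= observed * tolerance
    )

# Hypothetical toy table: 3 of 4 in one group versus 1 of 4 in the other.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

With the actual per-group counts, each categorical row would be passed through the same function as (present in training, absent in training, present in validation, absent in validation).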
FIGURE 1: Convolutional Neural Network (CNN) architecture used in our experiment in identifying diagnostic and
nondiagnostic T2WI

The optimization technique used in training this neural net is the classic stochastic gradient descent algorithm; the
weights are updated following the rule $w_{ij}(t+1) = w_{ij}(t) - \alpha \, \partial C / \partial w_{ij}$, where C is the
cost function, w are the neuron weights, and α is the learning rate.18 We formulate our quality assessment as a
classification task (diagnostic vs. nondiagnostic IQ). The softmax function19 is used as cost function, rectified linear
unit (ReLU) is chosen as an activation function, ie, f(x) = max(0, x), as it introduces nonlinearity into the neural net to
handle complex mapping learning,20 and Dropout is used to regularize the network weight updates to avoid
overfitting.21 Following the input layer, five convolution layers with ReLU as the activation function are applied.
Pooling layers are added after convolution layers to propagate and consolidate information at various image scales.
Subsequently, three fully connected layers are introduced in the end. The entire pipeline is fully automatic without any
hand-crafted features and is purely data-driven.

Training
In all, 351 T2WI datasets from clinical liver exams were anonymized and collected for training. Each case was visually
inspected and annotated with a label of being diagnostic or nondiagnostic by a trained observer (under the supervision
of a board-certified radiologist). Another independently collected 171 cases (validation datasets) were sequestered for
blind test.

For each subject, a stack of T2WI slices was available. Ten middle 2D trans-axial slices were selected, as such selection
tends to cover the liver, which is the area of interest. Each slice was resized to have apparent resolution of 3 mm/pixel
using bi-cubic interpolation. Subsequently, 150 × 150 pixel image patches were cropped (zero padding is applied when
necessary around the boundaries) beginning from the center of the image. These 150 × 150 pixel patches were
duplicated into three channels, resulting in a 150 × 150 × 3 tensor as the input to the network. During training, each
slice was also rotated and scaled to generate more image variations to augment the training database for improving
robustness in models. In total, our training data contained 14,670 images labeled nondiagnostic and 15,120 images
labeled diagnostic.

Testing and Validation
Another independently collected set of 171 cases (validation datasets) were sequestered for blind test. Each of these
cases was independently inspected by two radiologists who evaluated the IQ of the T2WI and annotated these as being
diagnostic or nondiagnostic for detecting lesions and assessing liver morphology. These 171 T2WI datasets were then
presented to the CNN algorithm. For each case, the middle seven slices were selected and assessed by the learned
models. Empirically, if five or more out of these seven slices are classified as nondiagnostic, the case is classified as
nondiagnostic; otherwise, as diagnostic.

We next investigated the imaging features of the discordant cases (cases that were labeled as diagnostic by the
radiologist but nondiagnostic by the algorithm and vice versa). One of the radiologists tabulated the qualitative features
such as the presence or absence of lesion and presence or absence of artifacts.

Statistical Analysis
The label output from the algorithm (diagnostic or nondiagnostic IQ) was compared to the labels assigned by the two
radiologists in 171 cases. A confusion matrix was constructed, and sensitivity, specificity, positive predictive value
(PPV), and negative predictive value (NPV) along with the confidence interval in screening of T2WI liver acquisitions
for nondiagnostic images were computed with respect to the two radiologists. A confusion matrix is a commonly used
method for evaluating machine-learning algorithms.22 Analysis was performed using MedCalc v. 17 (Mariakerke,
Belgium). We also compared age, sex, absence or presence of cirrhosis, liver lesion, and ascites between the training
and validation patient datasets with Fisher’s exact test using GraphPad (La Jolla, CA) and unpaired t-test.

Results
Table 2 summarizes the patient characteristics of the training dataset and validation dataset. There were no significant
differences between the two groups with respect to the presence or absence of liver cirrhosis, ascites, or liver lesions
(all P > 0.05)
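The slice handling described in the Methods (taking the middle slices of a stack, center-cropping with zero padding to a fixed patch, duplicating the patch into three channels, and labeling a case nondiagnostic when five or more of seven slices are) can be sketched as below. This is an illustrative outline under stated assumptions, not the study's code: images are plain lists of rows, the bicubic resampling to 3 mm/pixel is omitted, and all function names are hypothetical.

```python
def middle_slices(stack, count):
    """Return `count` consecutive slices taken from the middle of the stack."""
    start = max(0, (len(stack) - count) // 2)
    return stack[start:start + count]

def center_crop_or_pad(image, size=150):
    """Cut a size x size patch around the image center, zero-padding past the edges."""
    rows, cols = len(image), len(image[0])
    top = (rows - size) // 2   # negative when the image is smaller than the patch
    left = (cols - size) // 2
    patch = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0)
        patch.append(row)
    return patch

def to_three_channels(patch):
    """Duplicate a 2D patch into three identical channels (size x size x 3)."""
    return [[[v, v, v] for v in row] for row in patch]

def classify_case(slice_labels, threshold=5):
    """Call the case nondiagnostic when at least `threshold` slices are nondiagnostic."""
    votes = sum(1 for label in slice_labels if label == "nondiagnostic")
    return "nondiagnostic" if votes >= threshold else "diagnostic"

# Toy usage with small sizes so the behavior is easy to inspect.
stack = list(range(30))                      # stand-ins for 30 slices
print(middle_slices(stack, 7))               # [11, 12, 13, 14, 15, 16, 17]
print(center_crop_or_pad([[1, 2], [3, 4]], size=4))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
labels = ["nondiagnostic"] * 5 + ["diagnostic"] * 2
print(classify_case(labels))                 # nondiagnostic
```

The five-of-seven threshold is the empirical rule stated in the text; exposing it as a parameter makes it easy to see how a stricter or looser rule would shift cases between the two labels.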
Journal of Magnetic Resonance Imaging
                              Reader 1
                       Nondiagnostic   Diagnostic   Total
CNN   Nondiagnostic         16             28         44
CNN   Diagnostic             8            119        127
      Total                 24            147        171
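As a check on how the reported screening statistics follow from the Reader 1 table, the sketch below recomputes them from the four cell counts, treating "nondiagnostic" as the positive class. The function and key names are illustrative; the counts are the ones in the table.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard 2x2 screening statistics with 'nondiagnostic' as the positive class."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # nondiagnostic cases the CNN caught
        "specificity": tn / (tn + fp),   # diagnostic cases the CNN passed
        "ppv": tp / (tp + fp),           # CNN 'nondiagnostic' calls that were right
        "npv": tn / (tn + fn),           # CNN 'diagnostic' calls that were right
        "concordance": (tp + tn) / total,
    }

# Cell counts from the CNN versus Reader 1 table.
metrics = screening_metrics(tp=16, fp=28, fn=8, tn=119)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# sensitivity: 66.7%
# specificity: 81.0%
# ppv: 36.4%
# npv: 93.7%
# concordance: 78.9%
```

Rounded to whole percentages these agree with the 67% sensitivity, 81% specificity, 94% NPV, and 79% concordance reported for Reader 1, and the PPV of about 36% shows why most flagged cases were still acceptable on radiologist review.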
by the two radiologists. The NPV was high due to the low number of false negatives. However, PPV was low, as some of
the cases flagged as nondiagnostic by the algorithm tended to have acceptable diagnostic IQ on the review by the two
radiologists. Although the accuracy of the algorithm needs to be improved, we envision the algorithm being utilized to
flag cases for technologist review. Under this scheme, false positives are less concerning than false negatives, as any
case flagged positive will be reviewed by a technologist.

The ability to flag low-quality images in real time would enable the technologist to address quality issues by altering
technical parameters, re-running a sequence, or running additional sequences. For example, if a T2W image sequence is
deemed nondiagnostic due to motion artifact, a propeller or BLADE sequence can be performed as a corrective
measure.23 Furthermore, in a multicenter institution, “hot spot” problem areas could automatically be identified and
addressed, avoiding the usual process of collecting negative feedback from radiologists, a process which can be slow
and inefficient.
While the precise reason for the discordance
between the CNN algorithm and the radiologists is
unknown, several of the discordant cases had features which
might enable future adjustment of the algorithm to improve
accuracy. For example, one way to address external artifacts
and inhomogeneous fat suppression would be to limit the
portion of the image being analyzed by the algorithm. The
algorithm can be trained to include only the liver in its field
of view. In addition, to properly characterize studies where
the liver contains lesions, the algorithm can be trained on
more cases and taught that lesions do not make an image
“nondiagnostic.”
One limitation of our study is that the algorithm was
trained and tested based on subjective qualitative human
assessment. It is possible that an image deemed nondiagnostic
by the readers would have been deemed diagnostic by
others. However, it should be noted that the algorithm can
continue to learn what is diagnostic versus nondiagnostic
based on multiple subsequent readers. As more labeled
images are fed into the system, the algorithm will more closely approximate a general concept of “diagnostic” versus
“nondiagnostic.”

The algorithm evaluated the middle seven slices of the series instead of the entire stack of T2WI. We limited the
assessment of the algorithm to the seven slices in the middle of the stack, as these axial slices are expected to cover the
liver. We also advised the readers to assess IQ of the liver and ignore slices above and below the liver. Nevertheless,
this may introduce variability in comparisons between the CNN algorithm and the readers. Future work will focus on
developing and integrating an automated liver slice-selection algorithm.

This work reflects our initial experience with machine learning in automated assessment of IQ. We are currently
implementing this in clinical practice for prospective assessment of IQ in hospitalized patients undergoing liver MRI.
Although the machine learning algorithm is to be implemented to improve the robustness of the MRI exam, it is
important to note that ethical and regulatory concerns will need to be resolved before such algorithms can be
implemented for diagnostic purposes.

In our neural network, we have one input layer, five convolutional layers, three fully connected layers, and one output
layer. Although this architecture performed well for the task at hand, it should be noted that this specific CNN
architecture is not necessarily the only approach for this task. As CNNs continue to evolve, other approaches may be
worth investigating.

In conclusion, we demonstrated the promise of a deep learning approach utilizing a Convolutional Neural Network
(CNN) for automated task-based IQ evaluation of T2W liver acquisition. Future areas of research could include
applications of CNN for quality analysis of other sequences such as T1WI, and the entire liver MRI exam.

mentation of CNN architecture. Authors not associated with Siemens Healthineers maintained full control of the data at
all times, and these authors were responsible for the validation study comparing the quality label from the machine
learning algorithm to the image quality assessment by the two radiologists.

References
2. Hecht EM, Holland AE, Israel GM, et al. Hepatocellular carcinoma in the cirrhotic liver: gadolinium-enhanced 3D T1-weighted MR imaging as a stand-alone sequence for diagnosis. Radiology 2006;239:438–447.
3. Choi JY, Kim MJ, Chung YE, et al. Abdominal applications of 3.0-T MR imaging: comparative review versus a 1.5-T system. Radiographics 2008;28:e30.
4. Tsurusaki M, Semelka RC, Zapparoli M, et al. Quantitative and qualitative comparison of 3.0T and 1.5T MR imaging of the liver in patients with diffuse parenchymal liver disease. Eur J Radiol 2009;72:314–320.
5. Schreiber-Zinaman J, Rosenkrantz AB. Frequency and reasons for extra sequences in clinical abdominal MRI examinations. Abdom Radiol (NY) 2017;42:306–311.
6. Ihalainen T, Sipila O, Savolainen S. MRI quality control: six imagers studied using eleven unified image quality parameters. Eur Radiol 2004;14:1859–1865.
7. Davids M, Zollner FG, Ruttorf M, et al. Fully-automated quality assurance in multi-center studies using MRI phantom measurements. Magn Reson Imaging 2014;32:771–780.
8. Feng L, Benkert T, Block KT, Sodickson DK, Otazo R, Chandarana H. Compressed sensing for body MRI. J Magn Reson Imaging 2017;45:966–987.
9. Kenkel D, Barth BK, Piccirelli M, et al. Simultaneous multislice diffusion-weighted imaging of the kidney: a systematic analysis of image quality. Invest Radiol 2017;52:163–169.
10. Fischer S, Grodzki DM, Domschke M, et al. Quiet MR sequences in clinical routine: initial experience in abdominal imaging. Radiol Med 2017;122:194–203.
11. Rosenkrantz AB, Patel JM, Babb JS, Storey P, Hecht EM. Liver MRI at 3 T using a respiratory-triggered time-efficient 3D T2-weighted technique: impact on artifacts and image quality. AJR Am J Roentgenol 2010;194:634–641.
12. de Bruijne M. Machine learning approaches in medical image analysis: From detection to diagnosis. Med Image Anal 2016;33:94–97.
13. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444.
14. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278–2324.
15. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Conference Proceeding: Advances in neural information processing systems, 2012. p 1097–1105.
16. https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt
17. https://github.com/BVLC/caffe
20. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc IEEE Int Conf Comput Vis; 2015. p 1026–1034.
21. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–1958.
22. Tarca AL, Carey VJ, Chen XW, Romero R, Draghici S. Machine learning and its applications to biology. PLoS Comput Biol 2007;3:e116.