Background and Purpose—Treatment options for patients with acute ischemic stroke depend on the volume of salvageable
tissue. This volume assessment is currently based on fixed thresholds and single imaging modalities, limiting accuracy.
We wish to develop and validate a predictive model capable of automatically identifying and combining acute imaging
features to accurately predict final lesion volume.
Methods—Using acute magnetic resonance imaging, we developed and trained a deep convolutional neural network
(CNNdeep) to predict final imaging outcome. A total of 222 patients were included, of which 187 were treated with
rtPA (recombinant tissue-type plasminogen activator). The performance of CNNdeep was compared with a shallow
CNN based on the perfusion-weighted imaging biomarker Tmax (CNNTmax), a shallow CNN based on a combination
of 9 different biomarkers (CNNshallow), a generalized linear model, and thresholding of the diffusion-weighted imaging
biomarker apparent diffusion coefficient (ADC) at 600×10−6 mm2/s (ADCthres). To assess whether CNNdeep is capable of
Downloaded from http://stroke.ahajournals.org/ by guest on May 2, 2018
differentiating outcomes of ±intravenous rtPA, patients not receiving intravenous rtPA were included to train CNNdeep,−rtPA
to assess a treatment effect. The networks’ performances were evaluated using visual inspection, area under the receiver
operating characteristic curve (AUC), and contrast.
Results—CNNdeep yields significantly better performance in predicting final outcome (AUC=0.88±0.12) than generalized
linear model (AUC=0.78±0.12; P=0.005), CNNTmax (AUC=0.72±0.14; P<0.003), and ADCthres (AUC=0.66±0.13;
P<0.0001) and a substantially better performance than CNNshallow (AUC=0.85±0.11; P=0.063). Measured by contrast,
CNNdeep improves the predictions significantly, showing superiority to all other methods (P≤0.003). CNNdeep also seems
to be able to differentiate outcomes based on treatment strategy with the volume of final infarct being significantly
different (P=0.048).
Conclusions—The considerable improvement in prediction accuracy over the current state of the art increases the potential for
automated decision support in providing recommendations for personalized treatment plans. (Stroke. 2018;49:00-00.
DOI: 10.1161/STROKEAHA.117.019740.)
Key Words: area under curve ◼ biomarkers ◼ follow-up studies ◼ humans ◼ magnetic resonance imaging ◼ stroke
Received October 12, 2017; final revision received April 4, 2018; accepted April 6, 2018.
From the Department of Clinical Medicine, Center of Functionally Integrative Neuroscience and MINDLAB, Aarhus University, Denmark (A.N.,
M.B.H., A.T., K.M.); Cercare Medical ApS, Aarhus, Denmark (A.N.); and Institute of Neuroradiology, Charité Universitätsmedizin, Germany (A.T.).
Presented in part at the International Society for Magnetic Resonance in Medicine Annual Meeting and Exhibition, Honolulu, HI, April 22–27, 2017.
The online-only Data Supplement is available with this article at http://stroke.ahajournals.org/lookup/suppl/doi:10.1161/STROKEAHA.
117.019740/-/DC1.
Correspondence to Anne Nielsen, MSc, Department of Clinical Medicine, Center of Functionally Integrative Neuroscience and MINDLAB, Aarhus
University Hospital, Bldg 10G, 4th Floor, Nørrebrogade 44, DK-8000 Aarhus C, Denmark. E-mail anne@cfin.au.dk
© 2018 American Heart Association, Inc.
Stroke is available at http://stroke.ahajournals.org DOI: 10.1161/STROKEAHA.117.019740
2 Stroke June 2018
model without the capacity to adapt as new data become available in the clinic or from trials. Third, thresholding
does usually not encompass the broad information range obtainable from, primarily, PWI scans, which may hold
further clues to tissue progression. Considering the small difference in blood flow rate between electrical silence and
energy failure, supportive information from blood volume, capillary transit times, and oxygen availability markers5 is
highly desirable. Fourth, the indirect assessment of the ischemic penumbra offered by PWI measurements is potentially
limited to the value and confidence afforded by individual metrics. Fifth, the dichotomous nature of the tissue
categorization into irreversibly damaged or potentially salvageable tissue is likely too simplistic and certainly lacks
reflection on the certainty level.
The volume and location of the final infarct depend on a complex interplay between many tissue characteristics and
clinical characteristics, of which probably only a few have been studied. Simultaneous inclusion of multiple modalities
GLM6,7 and thresholding of the ADC biomarker at 600×10−6 mm2/s2 (ADCthres). We hypothesize CNNdeep to be superior
compared with the other methods as a result of the model’s representational power.
Materials and Methods
Patients and Image Acquisition
Because of the sensitive nature of the data and compliance regulations pertaining to general data protection regulation,
requests to access the data set from qualified researchers trained in human subject confidentiality protocols may be sent
to Aarhus University at leif@cfin.au.dk and grethe.andersen@clin.au.dk. In this retrospective study, 222 patients
(91 women) from the I-KNOW multicenter (105)15–17 and remote ischemic perconditioning (117)18,19 studies were analyzed.
Patients were admitted with symptoms consistent with acute ischemic stroke and triaged with MRI for intravenous rtPA
(recombinant tissue-type plasminogen activator). Patient characteristics are summarized in the Table. Included were all
patients with acute DWI, acute PWI, acute T2-weighted fluid-attenuated inversion recovery (T2-FLAIR), and follow-up
T2-FLAIR. The main focus was to predict imaging out-
correction, arterial input function selection, and calculation of perfusion maps, were done using the PENGUIN perfusion
analysis software (www.cfin.au.dk/software/pgui). The arterial input function was initially identified automatically by
the software23 and then examined and adjusted, if necessary, by an expert neuroradiologist. Mean capillary transit time and
cerebral blood flow were computed using a parametric deconvolution5,24 of the concentration curve. In addition, the imaging
biomarkers cerebral metabolism of oxygen and relative transit time heterogeneity were computed using a procedure5,24 based
on a vascular model22 of capillary transport and oxygen availability, which have previously been hypothesized to contribute
to the characterization and prognosis of acute ischemic stroke.15,25 The temporal difference between site of measurement of
the tissue concentration curve and the arterial input function, the bolus delay, was estimated directly by the vascular
model and represented with oscillation index singular value decomposition26 as the Tmax map.
The DWI and T2-FLAIR images were linearly coregistered and resliced to acute PWI space using a 12-parameter affine
transformation with a normalized mutual information cost function, as implemented in the Statistical Parametric Mapping
v. 8 (SPM 8; Wellcome Trust Centre for Neuroimaging, London, United Kingdom) toolbox. An average DWI image (TRACE DWI)
was obtained from the DWI sequence by averaging over all b=1000 s/mm2 images and used
obtained in ≈60 seconds. The training of the networks, which needs only to be done once, took around 5 days for CNNdeep
and less than a day for the other CNNs on a standard workstation with an NVIDIA Quadro K2200 GPU with 4GB memory. The
statistical analyses were performed using R, version 3.2.3.30
Statistical Analysis
Performance Evaluation
The 187 patients who received intravenous rtPA treatment were randomly divided into an independent training set
(158 patients) and test set (29 patients) using an 85/15 split, thus allowing assessment of model performance in
independent patients, unknown to the models during training. The training process was monitored using independent
validation patches, to prevent overfitting. The online-only Data Supplement contains further details on data sampling.
The follow-up infarcts were independently delineated by 4 expert radiologists on the follow-up T2-FLAIR scan
(demonstrated to minimize interrater variability31) acquired 1 month after the stroke. The delin-
Figure 2. Results from the predictive models for patients from the test set. Four biomarkers used for predictions (TRACE diffusion-
weighted imaging [DWI], cerebral metabolism of oxygen [CMRO2], mean capillary transit time [MTT], and time point for the maximum
of the residue function [Tmax]) are shown, as well as the follow-up T2-weighted fluid-attenuated inversion recovery with the final infarct
shown as a red contour. Patient A is a 58-year-old man (NIHSS=23), scanned 2.5 hours after symptom onset. Patient B is a 44-year-old
man (NIHSS=13), scanned after 2 hours. Patient C is a 74-year-old woman (NIHSS=4) scanned after 1 hour. Patient D is a 69-year-old
woman (NIHSS=8) scanned after 2.5 hours. ADCthres indicates thresholding of the ADC biomarker; CNNdeep, deep convolutional neural net-
work; CNNshallow, simple CNN; CNNTmax, CNN based on the Tmax imaging biomarker; and GLM, generalized linear model.
infarct risk. The rightmost column shows the T2-FLAIR follow-up scan with the manually delineated infarcted tissue
as a contour. The examples overall show that the voxel-wise predictions of GLM and ADCthres lead to scattered risk maps
compared with the CNN-based methods. The CNN-based methods generally outperform other methods in producing
spatially coherent lesion estimates, albeit considerable differences in accuracy are observed.
CNNshallow tends to overestimate the final lesion volume, and CNNTmax predicts low infarct risk, which is also not well
aligned with the actual outcomes. CNNdeep provides visually superior predictions compared with the other models. Of
particular interest is patient A (a 58-year-old man), who did not have any visible lesion at follow-up, which is correctly
predicted by CNNdeep, whereas the other methods substantially overestimate the permanent lesion.
Performance
Figure 3A shows box plots of AUCs for each predictive model for the 20 test set patients with a final infarct. The highest
AUC was obtained for CNNdeep (0.88±0.12), followed by CNNshallow (0.85±0.11), GLM (0.78±0.12), CNNTmax (0.72±0.14), and
ADCthres (0.66±0.13). A significant difference was shown between CNNdeep and GLM (P=0.005), CNNTmax (P<0.003),
and ADCthres (P<0.0001). The difference was not significant (P=0.063) for CNNdeep and CNNshallow, despite substantial
difference in visual appearance as demonstrated in Figure 2. The test also yielded a significant difference between
CNNshallow and GLM (P=0.013), CNNTmax (P<0.005), and ADCthres (P<0.0001). Thus, the CNNs with many biomarkers as input
lead to superior performance measured by AUC.
Contrast
Figure 3B shows box plots of the image contrast. For CNNdeep, the contrast was (0.99±0.02), followed by CNNshallow
(0.88±0.04), CNNTmax (0.95±0.01), GLM (0.96±0.04), and ADCthres (0.88±0.19). Hence, CNNshallow and ADCthres had a
lower mean contrast compared with CNNdeep, CNNTmax, and GLM, which was consistent with the high infarction risk in
areas outside the final infarct observed in Figure 2. Indeed, pairwise paired t tests showed a significant difference
between CNNshallow and CNNdeep, CNNTmax, and GLM (P<0.0001). A significant difference was also found between CNNdeep and
CNNTmax and GLM (P<0.0001). ADCthres was significantly inferior to CNNdeep (P=0.004) and GLM (P=0.026), but not to
CNNTmax (P=0.058).
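The per-patient AUC and the paired t tests reported above can be illustrated with a minimal sketch. It assumes voxel-wise risk scores and binary follow-up labels are already available as flat lists; the function names `roc_auc` and `paired_t_statistic` are hypothetical and are not taken from the authors' R analysis. The AUC is computed through the rank-sum identity, which is undefined for a patient without infarcted voxels; this may be one reason the AUC is reported only for test patients with a final infarct.

```python
from math import sqrt
from statistics import mean, stdev


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 0/1 ground truth per voxel; scores: predicted infarct risk per voxel.
    Tied scores receive the average rank of their tie group.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both infarcted and non-infarcted voxels")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def paired_t_statistic(a, b):
    """t statistic of a paired t test on per-patient values a[i] versus b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

In this sketch, `roc_auc` would be called once per patient and per model, and `paired_t_statistic` would then compare the resulting per-patient AUC lists of two models; converting the statistic to a P value requires the t distribution with n−1 degrees of freedom, which is left out here.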
Nielsen et al Final Outcome Prediction of Acute Ischemic Stroke 5
Treatment Effect
AUC for the intravenous rtPA-treated test patients evaluated using CNNdeep,−rtPA was 0.85±0.15. This is not significantly
lower than AUC for CNNdeep (P=0.16), and hence, if the desired outcome is a difference in AUC, there appears to be
no compelling reason to train a network specifically for non–rtPA-treated patients. However, Figure 4 shows 2 patient cases
of which the first patient (a 58-year-old man) presents different DWI/PWI lesions. According to the penumbra model,
this would suggest a large treatment effect. This supposition is accurately picked up by the substantial treatment effect
predicted by CNNdeep. Conversely, the DWI/PWI lesions are similar for the second patient (a 65-year-old man), leading
to almost no treatment effect, consistent with the penumbra model.
As one might expect, the mean volume of the infarcted areas was lower (16.44 mL [0–121.21]) for CNNdeep compared with
CNNdeep,−rtPA (29.40 mL [0–108.17]) with a slightly significant difference (P=0.048), where especially
CNNdeep was close to the mean of the ground truth volume (17.84 mL [0–193.09]).
Discussion
We hypothesized CNNdeep to yield superior results compared with the other methods. Our analysis showed CNNdeep to be
able to use information from the acute images and transform them into accurate predictions of final outcome. In
independent test data, the performance of CNNdeep, as measured by AUC, is highly concordant with the actual outcome,
assessed by follow-up T2-FLAIR images, and superior compared with shallow networks and voxel-based approaches. This is
confirmed by visual inspection. A clear contrast between final infarct and voxels outside the infarction in the risk map
is crucial, as the image becomes easier to interpret. Contrast-wise, CNNdeep was significantly better than the other
methods. The treatment effect was slightly significant with no intravenous rtPA, yielding a higher volume of final infarct.
We think that the superior performance observed for CNNdeep is rooted in better utilization of the information encoded in
data from previous patients. During the training process, the model self-regulates, while simultaneously incorporating the
heterogeneity of the stroke progressions. This is exemplified by the predicted CNNdeep risk maps in Figure 2, showing a
fairly high prediction certainty in terms of infarction risk.
The CNNdeep model is capable of retaining and processing complex information and thereby discovering a more accurate
connection between the input and the output, considering the heterogeneity in stroke pathophysiology. This enables
CNNdeep to predict the final outcome more accurately in an automatic and user-independent manner, and we speculate
Figure 4. Assessment of the treatment effect. Both patients received intravenous rtPA (recombinant tissue-type plasminogen
activator). Patient A, a 58-year-old man, has a large diffusion-weighted imaging (DWI)/perfusion-weighted imaging mismatch
volume, with extensive hypoperfusion on mean capillary transit time (MTT), time point for the maximum of the residue
function (Tmax), and cerebral metabolism of oxygen (CMRO2) parameter maps compared with a small volume of restricted
diffusion on TRACE maps. No final infarct is demonstrated on T2-weighted fluid-attenuated inversion recovery (T2-FLAIR)
after reperfusion treatment. Patient B, a 65-year-old man, has no mismatch on MTT and CMRO2 maps compared with TRACE DWI
but a moderate mismatch when using Tmax. The predicted treatment effect is small, confirmed by follow-up T2-FLAIR images.
that the predictions may improve by incorporating additional data in subsequent model training.
During the training process, a CNN extracts important features in a data-driven fashion, whereas these features have to
be handcrafted for the GLM.
Although the results presented here show that CNNdeep predicts final outcome accurately, there is a need for external
validation of the model’s applicability in a clinical setting. An important first step would be to test the model on a
data set from another clinical study to assess generalizability.
Current state-of-the-art decision support for acute ischemic stroke in clinical use (Brain CT Perfusion Package [Philips
Healthcare, the Netherlands], Syngo Volume Perfusion CT Neuro [Siemens Healthcare, Erlangen, Germany], and RAPID [Rapid
Processing of Perfusion and Diffusion; iSchemaView, Inc, Menlo Park, CA]) identifies core and salvageable tissue as areas
exceeding population-wide or relative thresholds on individual imaging modalities.1 These methods use different imaging
biomarkers but are all limited to 2. Our analysis shows ADC thresholding to be insufficient, which might be because of an
interplay between tissue characteristics not captured by a single biomarker.
We decided to assess the ability of the CNN to identify treatment differences based on whether or not the patients
received intravenous rtPA treatment. The same methodology could equally well be used to examine the direct effects of
recanalization. Here, we took ±intravenous rtPA to avoid the need to handle partial recanalization, which would have
limited our data volume.
A considerable data volume representing actual clinical variability is necessary for any method attempting to uncover
and harness the complex relation between acute and follow-up imaging in acute stroke. In our opinion, it is pivotal to
gather all possibly available information to achieve the best predictions. Therefore, we decided to sample from all slices
in the training data ensuring a balanced data set, which hopefully positively influences the generalization of CNNdeep.
This is in contrast to the study by Stier et al,13 where only the slices with the largest lesion were selected, discarding
information from the remaining imaging voxels. In our view, this approach disregards the infarct heterogeneity and how
different parameters react in the presence of ischemia.
One important difference between GLM and CNN is GLM being a voxel-by-voxel–based technique. The CNNs include
spatial information by allowing 2-dimensional images as input. We speculate that this is one of the key differences,
giving CNN an advantage simply by having spatial information available and thereby making the predictions less sensitive
to noisy data.
We chose to use the performance measure presented by Jonsdottir et al32 to evaluate the performance of the predictions.
This approach ensures a threshold-independent measure (as opposed to measures, such as DICE coefficient or accuracy) and
emphasizes the performance of tissue infarction risk inside the hypoperfused tissue.
Limitations
The data used in this study were retrospectively acquired. To mimic a prospective study, we divided the data and used
some of the patients for testing only. However, because the data are not collected to improve our predictive model’s
performance, there might be a variation in new patients not included in the current data. A further drawback might be the
fact that we included patients scanned with a variety of scanners, scan parameters, and field strengths, which introduces
uncontrollable sources of data material variation. However, even with these challenges, CNNdeep yielded good results,
which in our view speaks to the generalization and robustness of CNNdeep.
All our patients experienced an acute ischemic stroke, thereby effectively omitting a control group from the study
and introducing a risk of being biased toward infarct overestimation. However, the data set contained numerous slices with
normoperfused voxels, and the predictive models effectively classified those accordingly. We think this constitutes a
robust approach, with overestimation bias being likelier in the training data preparation method used by Stier et al.13
One drawback of CNN methods is the training time. The training time is related to the complexity of the network and
becomes more pronounced with deeper networks. However, it is only necessary to train the network once, and the evaluation
of a new patient takes ≈1 minute. Another drawback of the CNN method is the amount of training data needed.
Without sufficient training data, the CNN is prone to overfitting, and CNNdeep is the most vulnerable because it contains
more parameters. We mitigated this by following the standard machine learning procedure and evaluated the models’
performance on an independent test set. Furthermore, we trained the networks using mini-batch stochastic gradient descent
to stabilize the training process and avoid too much adaptation to the training set. We suspect the performance of the
CNN-based methods in general, and CNNdeep in particular, will increase with more training data.
In this article, we found a treatment effect measured by final infarct volume, although the effect was small for most
patients. It could be speculated whether the network underestimates the treatment effect. However, the intravenous rtPA
treatment effect is time dependent, expected to be smaller than the effect of thrombectomy, and not guaranteed to lead to
reperfusion. Additionally, a minor treatment effect would be expected if the DWI/PWI mismatch is small, according to the
penumbra model. Therefore, we find it encouraging that the network was able to detect the small treatment effect. However,
the data set is relatively small, and further validation before clinical use is required.
Different end points can be considered in stroke prediction. Here, we chose to use imaging outcome because this
is a high-resolution representation of stroke outcome and, therefore, a demanding task. Functional outcome could be an
alternative; however, that would reduce the follow-up information available per patient (from voxels to a single score).
Moreover, functional outcome might even be obtainable via the predicted risk map. One issue concerns how to establish
the outcome reference. We chose to apply a consensus decision of at least 3 of 4 expert neuroradiologists using follow-up
T2-FLAIR images to minimize interrater variability21,31 to mitigate this problem.
Conclusions
The comparison of predictive models described in this article shows a clear advantage of using a deep CNN, such as CNNdeep,
to produce predictions of final infarct in acute ischemic stroke. A CNN has the advantage of being able to retain spatial
information, resulting in more accurate predictions compared with a GLM-based model. The depth of the CNN is important,
with many layers in the network yielding a better contrast and a higher accuracy of the predicted images.
The new model paradigms have been shown to lead to improved predictions and thereby a much increased potential
for use in automated decision support systems providing recommendations for personalized treatment and thereby hopefully
better outcome for the individual patient. CNNs will likely benefit from increasingly larger image collections, in
contrast to less-complex methods, such as GLM and population-wide thresholds, which lack the information-encoding
capabilities of CNNs. An advantage of CNNs is their ability to learn and become progressively better with every new patient.
Acknowledgments
We would like to thank Prof Grethe Andersen and Dr Kristina Dupont Hougaard for kindly making the Ischemic Perconditioning
Study available for model training and validation.
Sources of Funding
A. Nielsen is funded by Innovation Fund Denmark (5189-00209B), research training supplement from Aarhus University,
Denmark, and Cercare Medical ApS.
Disclosures
A. Nielsen is employed by Cercare Medical ApS. Drs Hansen and Mouridsen are shareholders in Cercare Medical ApS.
References
1. Austein F, Riedel C, Kerby T, Meyne J, Binder A, Lindner T, et al. Comparison of perfusion CT software to predict the
final infarct volume after thrombectomy. Stroke. 2016;47:2311–2317. doi: 10.1161/STROKEAHA.116.013147.
2. Straka M, Albers GW, Bammer R. Real-time diffusion-perfusion mismatch analysis in acute stroke. J Magn Reson Imaging.
2010;32:1024–1037. doi: 10.1002/jmri.22338.
3. Christensen S, Mouridsen K, Wu O, Hjort N, Karstoft H, Thomalla G, et al. Comparison of 10 perfusion MRI parameters in
97 sub-6-hour stroke patients using voxel-based receiver operating characteristics analysis. Stroke. 2009;40:2055–2061.
doi: 10.1161/STROKEAHA.108.546069.
4. Astrup J, Siesjö BK, Symon L. Thresholds in cerebral ischemia - the ischemic penumbra. Stroke. 1981;12:723–725.
5. Mouridsen K, Hansen MB, Østergaard L, Jespersen SN. Reliable estimation of capillary transit time distributions using
DSC-MRI. J Cereb Blood Flow Metab. 2014;34:1511–1521. doi: 10.1038/jcbfm.2014.111.
6. Wu O, Koroshetz WJ, Ostergaard L, Buonanno FS, Copen WA, Gonzalez RG, et al. Predicting tissue outcome in acute human
cerebral ischemia using combined diffusion- and perfusion-weighted MR imaging. Stroke. 2001;32:933–942.
7. Wu O, Sumii T, Asahi M, Sasamata M, Ostergaard L, Rosen BR, et al. Infarct prediction and treatment assessment with
MRI-based algorithms in experimental stroke models. J Cereb Blood Flow Metab. 2007;27:196–204.
doi: 10.1038/sj.jcbfm.9600328.
8. LeCun Y, Boser B, Denker JS, Howard RE, Hubbard W, Jackel LD, et al. Handwritten digit recognition with a
back-propagation network. Adv Neural Inf Process Syst. 1990;2:396–404.
9. Maier O, Schröder C, Forkert ND, Martinetz T, Handels H. Classifiers for ischemic stroke lesion segmentation: a
comparison study. PLoS One. 2015;10:e0145118. doi: 10.1371/journal.pone.0145118.
10. Kamnitsas K, Chen L, Ledig C, Rueckert D, Glocker B. Multi-scale 3D convolutional neural networks for lesion
segmentation in brain MRI. MICCAI Brain Lesion Workshop 2015. 2015.
http://www.doc.ic.ac.uk/~bglocker/pdfs/kamnitsas2015isles.pdf.
11. Dutil F, Havaei M, Pal C, Larochelle H, Jodoin PM. A convolutional neural network approach to brain lesion
segmentation. MICCAI Brain
Lesion Workshop 2015. 2015.
https://www.cbica.upenn.edu/sbia/Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2015_proceedings.pdf.
12. Huang S, Shen Q, Duong TQ. Artificial neural network prediction of ischemic tissue fate in acute stroke imaging.
J Cereb Blood Flow Metab. 2010;30:1661–1670. doi: 10.1038/jcbfm.2010.56.
13. Stier N, Vincent N, Liebeskind D, Scalzo F. Deep learning of tissue fate features in acute ischemic stroke.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2015;2015:1316–1321. doi: 10.1109/BIBM.2015.7359869.
14. Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image
segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39:2481–2495.
15. Engedal TS, Hjort N, Hougaard KD, Simonsen CZ, Andersen G, Mikkelsen IK, et al. Transit time homogenization in
ischemic stroke - a novel biomarker of penumbral microvascular failure? [published online ahead of print January 1, 2017].
J Cereb Blood Flow Metab. doi: 10.1177/0271678X17721666.
16. Alawneh JA, Jones PS, Mikkelsen IK, Cho TH, Siemonsen S, Mouridsen K, et al. Infarction of ‘non-core-non-penumbral’
tissue after stroke: multivariate modelling of clinical impact. Brain. 2011;134(pt 6):1765–1776.
doi: 10.1093/brain/awr100.
17. I-KNOW. Integrating information from molecules to man: knowledge discovery accelerates drug development and
personalized treatment in acute stroke. 2006. https://cordis.europa.eu/project/rcn/78374_en.html.
18. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Thomsen RB, et al. Remote ischemic perconditioning in
thrombolysed stroke patients: randomized study of activating endogenous neuroprotection - design and MRI measurements.
Int J Stroke. 2013;8:141–146. doi: 10.1111/j.1747-4949.2012.00786.x.
19. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Hansen TM, et al. Remote ischemic perconditioning as an
adjunct therapy to thrombolysis in patients with acute ischemic stroke: a randomized trial. Stroke. 2014;45:159–167.
doi: 10.1161/STROKEAHA.113.001346.
20. Carrera E, Jones PS, Alawneh JA, Klærke Mikkelsen I, Cho TH, Siemonsen S, et al. Predicting infarction within the
diffusion-weighted imaging lesion: does the mean transit time have added value? Stroke. 2011;42:1602–1607.
doi: 10.1161/STROKEAHA.110.606970.
21. Hansen MB, Nagenthiraja K, Ribe LR, Dupont KH, Østergaard L, Mouridsen K. Automated estimation of salvageable tissue:
comparison with expert readers. J Magn Reson Imaging. 2016;43:220–228. doi: 10.1002/jmri.24963.
22. Jespersen SN, Østergaard L. The roles of cerebral blood flow, capillary transit time heterogeneity, and oxygen tension
in brain oxygenation and metabolism. J Cereb Blood Flow Metab. 2012;32:264–277. doi: 10.1038/jcbfm.2011.153.
23. Mouridsen K, Christensen S, Gyldensted L, Ostergaard L. Automatic selection of arterial input function using cluster
analysis. Magn Reson Med. 2006;55:524–531. doi: 10.1002/mrm.20759.
24. Mouridsen K, Friston K, Hjort N, Gyldensted L, Østergaard L, Kiebel S. Bayesian estimation of cerebral perfusion using
a physiological model of microvasculature. Neuroimage. 2006;33:570–579. doi: 10.1016/j.neuroimage.2006.06.015.
25. Østergaard L, Jespersen SN, Mouridsen K, Mikkelsen IK, Jonsdottír KÝ, Tietze A, et al. The role of the cerebral
capillaries in acute ischemic stroke: the extended penumbra model. J Cereb Blood Flow Metab. 2013;33:635–648.
doi: 10.1038/jcbfm.2013.18.
26. Wu O, Østergaard L, Weisskoff RM, Benner T, Rosen BR, Sorensen AG. Tracer arrival timing-insensitive technique for
estimating flow in MR perfusion-weighted imaging using singular value decomposition with a block-circulant deconvolution
matrix. Magn Reson Med. 2003;50:164–174. doi: 10.1002/mrm.10522.
27. Olivot JM, Mlynash M, Thijs VN, Kemp S, Lansberg MG, Wechsler L, et al. Optimal Tmax threshold for predicting
penumbral tissue in acute stroke. Stroke. 2009;40:469–475. doi: 10.1161/STROKEAHA.108.526954.
28. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. Tensorflow: large-scale machine learning on
heterogeneous distributed systems. ArXiv e-prints. 2016;1603. https://arxiv.org/abs/1603.04467, software available from
tensorflow.org.
29. Python Software Foundation. Python 2.7, http://www.python.org/. 2016.
30. R Core Team; R Foundation for Statistical Computing. R: A Language and Environment for Statistical Computing. Vienna,
Austria; 2015.
31. Neumann AB, Jonsdottir KY, Mouridsen K, Hjort N, Gyldensted C, Bizzi A, et al. Interrater agreement for final infarct
MRI lesion delineation. Stroke. 2009;40:3768–3771. doi: 10.1161/STROKEAHA.108.545368.
32. Jonsdottir KY, Østergaard L, Mouridsen K. Predicting tissue outcome from acute stroke magnetic resonance imaging:
improving model performance by optimal sampling of training data. Stroke. 2009;40:3006–3011.
doi: 10.1161/STROKEAHA.109.552216.
33. Wu O, Christensen S, Hjort N, Dijkhuizen RM, Kucinski T, Fiehler J, et al. Characterizing physiological heterogeneity
of infarction risk in acute human ischaemic stroke using MRI. Brain. 2006;129(pt 9):2384–2393.
doi: 10.1093/brain/awl183.
Prediction of Tissue Outcome and Assessment of Treatment Effect in Acute Ischemic
Stroke Using Deep Learning
Anne Nielsen, Mikkel Bo Hansen, Anna Tietze and Kim Mouridsen
Stroke is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231
Copyright © 2018 American Heart Association, Inc. All rights reserved.
Print ISSN: 0039-2499. Online ISSN: 1524-4628
The online version of this article, along with updated information and services, is located on the
World Wide Web at:
http://stroke.ahajournals.org/content/early/2018/05/01/STROKEAHA.117.019740
Prediction of tissue outcome and assessment of treatment effect in acute ischemic stroke using
Deep Learning
Scanner parameters, implementation details, data sampling, and sample size considerations
Anne Nielsen (a,b), MSc; Mikkel Bo Hansen (a), PhD; Anna Tietze (a,c), PhD; Kim Mouridsen (a), PhD
(a) Center of Functionally Integrative Neuroscience and MINDLab, Inst. of Clinical Medicine, Aarhus
University, Denmark
(b) Cercare Medical ApS, Aarhus, Denmark
(c) Inst. of Neuroradiology, Charité Universitätsmedizin, Germany
Corresponding Author:
PhD Student, Anne Nielsen, MSc
CFIN, Aarhus University Hospital
Building 10G, 4th Floor, Nørrebrogade 44, DK-8000 Aarhus C, Denmark
Phone/e-mail +45 28902983 / anne@cfin.au.dk
Tables 2, Figures 0
                         Study I                     Study II
                         1.5T          3T            1.5T          3T
Number of patients       58            12            22            95
PWI
  Brain coverage (mm)    [72, 120]     [98, 124]     [91, 143]     [112, 160]
DWI
  FOV (mm)               [230, 280]    [192, 240]    [230, 240]    [230, 240]
  Brain coverage (mm)    [100, 330]    [105, 129]    110           [78, 90]
*The matrix is always square, ie, a number of 128 corresponds to a matrix of size 128 x 128.
Table I Summary of parameters used in the PWI and DWI sequences depending on magnetic field
strength and study, Study I being I-Know multicenter (105) [1-3] and Study II Remote Ischemic
Perconditioning (117) [4, 5]. Numbers stated as [x, y] refer to the minimum and maximum values.
Implementation details and data sampling
Training
The CNNs above are all trained using the TensorFlow [10] framework for 100 epochs by minimizing
the multinomial logistic loss function (representing the difference between the predicted and the
observed outcome), with stochastic gradient descent [11] as the optimization method.
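As an illustration of the loss named above, the following is a minimal sketch of the multinomial logistic loss (softmax cross-entropy) for voxels with three classes. It is not the authors' TensorFlow code: the function names and the example scores are hypothetical, and plain Python is used only to make the arithmetic explicit.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_logistic_loss(logits, observed_class):
    """Negative log-probability the model assigns to the observed class."""
    return -math.log(softmax(logits)[observed_class])


def mean_loss(batch_logits, batch_labels):
    """Average the per-voxel loss over a batch."""
    losses = [multinomial_logistic_loss(z, y)
              for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# Hypothetical scores for (background, healthy, infarcted) at two voxels.
logits = [[0.1, 2.0, 0.3], [0.0, 0.5, 1.5]]
labels = [1, 2]
print(round(mean_loss(logits, labels), 3))
```

The loss is zero only when the observed class receives probability 1, and it grows as probability mass moves to the wrong classes; with three equally likely classes it equals log(3).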
Test
At test time, all slices for the patients in the test data were examined using a sliding-window
approach to ensure predictions for all voxels. For CNNdeep and CNNshallow, the output for each
patch was a 64x64 image with the predicted probabilities of each voxel belonging to three classes
(background, healthy tissue, and infarcted tissue). The result for a given voxel is the class-wise
mean of these probabilities over the patches containing that voxel. For CNNTmax, the output of the
network was used as the probability of the midpoint of the patch belonging to the healthy and
infarcted tissue classes. For the GLM, predicted tissue risks were calculated for every voxel.
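The sliding-window averaging for the patch-based networks can be sketched as follows. This is an illustration under assumptions, not the published implementation: the stride, the clamping of the last window to the image border, and the `predict_patch` callback (standing in for a trained network) are all hypothetical.

```python
def window_origins(size, patch, stride):
    """Window start positions; the last window is clamped to the border."""
    last = size - patch
    origins = list(range(0, last + 1, stride))
    if origins[-1] != last:
        origins.append(last)
    return origins


def sliding_window_mean(height, width, patch, stride, predict_patch, n_classes=3):
    """Average overlapping patch predictions into per-voxel class probabilities.

    predict_patch(r0, c0) must return a patch x patch grid of probability
    lists (one value per class) for the window whose top-left corner is (r0, c0).
    """
    sums = [[[0.0] * n_classes for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for r0 in window_origins(height, patch, stride):
        for c0 in window_origins(width, patch, stride):
            probs = predict_patch(r0, c0)
            for dr in range(patch):
                for dc in range(patch):
                    counts[r0 + dr][c0 + dc] += 1
                    for k in range(n_classes):
                        sums[r0 + dr][c0 + dc][k] += probs[dr][dc][k]
    # Class-wise mean over every window that covered the voxel.
    return [[[s / counts[r][c] for s in sums[r][c]] for c in range(width)]
            for r in range(height)]
```

For a 128x128 slice and 64x64 patches, a call such as `sliding_window_mean(128, 128, 64, 32, predict_patch)` would give every voxel at least one prediction, with interior voxels averaged over several overlapping windows.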
Sample size considerations
The CNNs proposed in this paper have many parameters, making overfitting and learning ability
important topics. To our knowledge, the deep learning literature still lacks formal methods for
determining the sample size needed to fit a specific model. Below are our thoughts on the sample
size of our models.
In traditional statistics based on p-values, there is a risk of erroneous conclusions with very small
samples [12]. In machine learning, the risk of overfitting going undetected is smaller because
performance is evaluated on independent test sets. In fitting a classic regression model, the optimal
number of parameters is unknown, and regularization or hypothesis testing is used to find the
optimal parameter combination. Model complexity is typically determined via regularization with
cross-validation, iteratively finding the most important parameters for a subset of the data and then
assessing performance on an independent test set. With deep learning models, we employ
mini-batch stochastic gradient descent, which in practice acts as an implicit cross-validation during
training (each batch is a subset of the data, and if the model adapts too well to one batch, it will
perform poorly on the next).
The performance of a deep learning model is influenced by a combination of network architecture,
optimization method, hyperparameters, and training data. These models are by construction prone
to overfitting. To counteract overfitting, we used batch normalization, which is known to act as
(among other things) an effective regularizer [13], and carefully tuned our hyperparameters over
several training runs.
Furthermore, we used mini-batch stochastic gradient descent to train the algorithm with a batch size
of 40 patches for each iteration. Therefore, the model does not have access to the entire training set
in each iteration, making it less prone to overfitting. If the model adapts too much to a batch, it will
be punished by poor performance on the next batch.
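The batching described above can be sketched in a few lines. Only the batch size of 40 comes from the text; the shuffling scheme, the seed, the learning rate, and the function names are assumptions made for illustration.

```python
import random


def minibatches(n_samples, batch_size=40, seed=0):
    """Yield shuffled index batches that together cover one epoch."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


def sgd_step(weights, gradient, learning_rate=0.01):
    """One gradient-descent update computed from a single mini-batch."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Each update sees only 40 patches, never the full training set.
batches = list(minibatches(n_samples=200))
print(len(batches), len(batches[0]))
```

If every one of the 50,000 training patches were visited once per epoch (an assumption; the sampling schedule is not stated), a batch size of 40 would correspond to 1,250 updates per epoch.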
We have taken several measures to assess the magnitude of potential overfitting. One of them is to
compare the model’s performance on the training set and the test set. The AUC on the training set
was 0.97±0.06, compared with 0.88±0.12 on the test set. As expected, the training set AUC is
higher than the test set AUC because, when calculating the training set AUC, the same data are
used to train the model and to assess its error.
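For reference, the AUC used in this comparison can be computed directly from its rank interpretation. The sketch below is illustrative only: it assumes binary labels (1 for infarcted, 0 for non-infarcted voxels) and that the reported values are means and standard deviations of per-patient AUCs, neither of which is spelled out here. The pairwise loop is quadratic and meant for small examples, not whole volumes.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_sd(values):
    """Mean and sample standard deviation, e.g. over per-patient AUCs."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var ** 0.5
```

Computing `auc` once on training patients and once on held-out patients gives the two numbers being compared; the gap between them is the quantity of interest when judging overfitting.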
Furthermore, we included another regularizing mechanism to prevent overfitting by training the
CNNdeep network using dropout [14] in the middle layers (as in Kendall et al. [15]). This yielded a
similar AUC on the training set (0.97±0.03), but a lower AUC on the test set (0.80±0.22),
indicating an inferior ability to adapt to the data.
To assess the risk of overestimating the test performance, we trained the same model with half of
the data in the training set. This resulted in a training set AUC of 0.98±0.03 and a test set AUC of
0.84±0.16. As expected, CNNdeep yields a higher AUC on the training set when trained on a
smaller sample (smaller variance between batches yields a higher AUC), but a lower AUC on the
original test set. This shows that even with a small amount of data, a high AUC on the training set
does not translate into a high AUC on the independent test set, which is the performance measure
reported in the manuscript.
These results indicate that training using mini-batch stochastic gradient descent is a strong tool to
avoid overfitting.
We want to emphasize that even though we obtain a higher AUC on the training set than on the test
set, the models’ performances are evaluated solely on the independent test set. CNNdeep is most
prone to overfitting and still surpasses all the other models measured by performance, which is the
main point of the manuscript.
However, an increased amount of training data will probably increase CNNdeep’s performance, and
we consider collecting more data an important step towards improving our model.
Deep convolutional neural networks like CNNdeep contain many parameters. An interesting
question is how much data is needed to fit the number of parameters in the network.
Although 187 rtPA-treated patients (as used in this article) are a relatively small sample of patients,
the network is trained on 4270 slices of 128x128 pixels each, which are further presented to the
network as 50,000 image patches (64x64 randomly sampled image ‘sub-windows’). Although
patches and slices within patients are not fully independent, they nevertheless serve to exemplify
stroke progression complexity to the network. This subsampling approach has been shown to
improve independent test performance in earlier studies [16-18].
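The random sub-window sampling can be sketched as below. The slice size (128x128) and patch size (64x64) come from the text; the uniform choice of patch origin, the seed, and the function names are assumptions, and the actual sampling scheme may have differed.

```python
import random


def sample_patch_origins(n_patches, slice_size=128, patch_size=64, seed=0):
    """Draw top-left corners so that every patch lies fully inside the slice."""
    rng = random.Random(seed)
    max_origin = slice_size - patch_size
    return [(rng.randint(0, max_origin), rng.randint(0, max_origin))
            for _ in range(n_patches)]


def crop(image, origin, patch_size=64):
    """Cut a patch_size x patch_size sub-window out of a 2D list."""
    r0, c0 = origin
    return [row[c0:c0 + patch_size] for row in image[r0:r0 + patch_size]]


# A dummy 128x128 slice whose pixel values encode their own coordinates.
image = [[(r, c) for c in range(128)] for r in range(128)]
patches = [crop(image, o) for o in sample_patch_origins(12)]
print(len(patches), len(patches[0]), len(patches[0][0]))
```

With the figures quoted above, 50,000 patches drawn from 4270 slices amount to roughly 11.7 patches per slice on average.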
References
1. Alawneh JA, Jones PS, Mikkelsen IK, Cho TH, Siemonsen S, Mouridsen K, et al. Infarction
of 'non-core-non-penumbral' tissue after stroke: Multivariate modelling of clinical impact.
Brain. 2011;134:1765-1776
2. Engedal TS, Hjort N, Hougaard KD, Simonsen CZ, Andersen G, Mikkelsen IK, et al.
Transit time homogenization in ischemic stroke - a novel biomarker of penumbral
microvascular failure? J Cereb Blood Flow Metab. 2017:271678X17721666
3. I-KNOW. Integrating information from molecules to man: Knowledge discovery accelerates
drug development and personalized treatment in acute stroke. 2006
4. Hougaard KD, Hjort N, Zeidler D, Sorensen L, Norgaard A, Thomsen RB, et al. Remote
ischemic perconditioning in thrombolysed stroke patients: Randomized study of activating
endogenous neuroprotection - design and MRI measurements. Int J Stroke. 2013;8:141-146
5. Hougaard KD, Hjort N, Zeidler D, Sorensen L, Norgaard A, Hansen TM, et al. Remote
ischemic perconditioning as an adjunct therapy to thrombolysis in patients with acute
ischemic stroke: A randomized trial. Stroke. 2014;45:159-167
6. Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder
architecture for image segmentation. ArXiv e-prints. 2015;1511
7. Wu O, Sumii T, Asahi M, Sasamata M, Østergaard L, Rosen BR, et al. Infarct prediction
and treatment assessment with MRI-based algorithms in experimental stroke models. J
Cereb Blood Flow Metab. 2007;27:196-204
8. Wu O, Koroshetz WJ, Østergaard L, Buonanno FS, Copen WA, Gonzalez RG, et al.
Predicting tissue outcome in acute human cerebral ischemia using combined diffusion- and
perfusion-weighted MR imaging. Stroke. 2001;32:933-942
9. Stier N, Vincent N, Liebeskind D, Scalzo F. Deep learning of tissue fate features in acute
ischemic stroke. IEEE Int C Bioinform. 2015:1316-1321
10. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-scale
machine learning on heterogeneous distributed systems. ArXiv e-prints. 2016;1603
11. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
12. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, et al. Power failure:
Why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci.
2013;14:365-376
13. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. ArXiv e-prints. 2015;1502
14. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple
way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929-1958
15. Kendall A, Badrinarayanan V, Cipolla R. Bayesian SegNet: Model uncertainty in deep
convolutional encoder-decoder architectures for scene understanding. ArXiv e-prints.
2015;1511
16. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional
neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Advances in
neural information processing systems 25. Curran Associates, Inc.; 2012:1097-1105.
17. Pinheiro PHO, Collobert R. Recurrent convolutional neural networks for scene parsing.
ArXiv e-prints. 2013;1306
18. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation.
Proc CVPR IEEE. 2015:3431-3440