
Original Contribution

Prediction of Tissue Outcome and Assessment of Treatment Effect in Acute Ischemic Stroke Using Deep Learning
Anne Nielsen, MSc; Mikkel Bo Hansen, PhD; Anna Tietze, PhD; Kim Mouridsen, PhD

Background and Purpose—Treatment options for patients with acute ischemic stroke depend on the volume of salvageable
tissue. This volume assessment is currently based on fixed thresholds and single imaging modalities, limiting accuracy.
We wish to develop and validate a predictive model capable of automatically identifying and combining acute imaging
features to accurately predict final lesion volume.
Methods—Using acute magnetic resonance imaging, we developed and trained a deep convolutional neural network
(CNNdeep) to predict final imaging outcome. A total of 222 patients were included, of whom 187 were treated with
rtPA (recombinant tissue-type plasminogen activator). The performance of CNNdeep was compared with a shallow
CNN based on the perfusion-weighted imaging biomarker Tmax (CNNTmax), a shallow CNN based on a combination
of 9 different biomarkers (CNNshallow), a generalized linear model, and thresholding of the diffusion-weighted imaging
biomarker apparent diffusion coefficient (ADC) at 600×10⁻⁶ mm²/s (ADCthres). To assess whether CNNdeep is capable of
Downloaded from http://stroke.ahajournals.org/ by guest on May 2, 2018

differentiating outcomes with and without intravenous rtPA, patients not receiving intravenous rtPA were included to train CNNdeep, −rtPA
to assess a treatment effect. The networks’ performances were evaluated using visual inspection, area under the receiver
operating characteristic curve (AUC), and contrast.
Results—CNNdeep yields significantly better performance in predicting final outcome (AUC=0.88±0.12) than generalized
linear model (AUC=0.78±0.12; P=0.005), CNNTmax (AUC=0.72±0.14; P<0.003), and ADCthres (AUC=0.66±0.13;
P<0.0001) and a numerically but not significantly better performance than CNNshallow (AUC=0.85±0.11; P=0.063). Measured by contrast,
CNNdeep improves the predictions significantly, showing superiority to all other methods (P≤0.003). CNNdeep also seems
to be able to differentiate outcomes based on treatment strategy with the volume of final infarct being significantly
different (P=0.048).
Conclusions—The considerable improvement in prediction accuracy over the current state of the art increases the potential for
automated decision support in providing recommendations for personalized treatment plans.   (Stroke. 2018;49:00-00.
DOI: 10.1161/STROKEAHA.117.019740.)
Key Words: area under curve ◼ biomarkers ◼ follow-up studies ◼ humans ◼ magnetic resonance imaging ◼ stroke

The progression of ischemic stroke is highly complex and individual. However, although advanced magnetic resonance imaging (MRI) or computed tomographic (CT) imaging holds much potential, these modalities are often used only to measure the volume of tissue exceeding fixed uniform thresholds for hypoperfusion and tissue damage.1,2 With MRI, tissue is commonly considered irreversibly damaged if the apparent diffusion coefficient (ADC) calculated from diffusion-weighted imaging (DWI) is <600×10⁻⁶ mm²/s.2 Similarly, tissue exceeding 6 seconds on the time point for the maximum of the residue function (Tmax) image derived from MRI or CT perfusion-weighted imaging (PWI) is considered at risk of infarct.2,3 Threshold-based relations between compromised blood delivery and later infarct can be traced at least back to positron-emission tomography and electroencephalography studies by Astrup et al4 in animals and humans. They proposed using the volume of the so-called ischemic penumbra as a treatment target in acute stroke. Summarizing earlier studies, Astrup et al4 note that neuronal electrical activity vanishes around a blood flow of 16.0 mL/100 g per minute, whereas irreversible chronic infarction develops in areas with a blood flow <10.0 mL/100 g per minute, which is likely because of energy and ion pump failure. Thresholds on CT- and MRI-based imaging biomarkers may be considered modern instrumentations of these concepts.

The uniform thresholding approach suffers from several shortcomings. First, stroke progression is highly heterogeneous and probably defies population-based thresholds in imaging biomarkers. Second, thresholding represents a static

Received October 12, 2017; final revision received April 4, 2018; accepted April 6, 2018.
From the Department of Clinical Medicine, Center of Functionally Integrative Neuroscience and MINDLAB, Aarhus University, Denmark (A.N.,
M.B.H., A.T., K.M.); Cercare Medical ApS, Aarhus, Denmark (A.N.); and Institute of Neuroradiology, Charité Universitätsmedizin, Germany (A.T.).
Presented in part at the International Society for Magnetic Resonance in Medicine Annual Meeting and Exhibition, Honolulu, HI, April 22–27, 2017.
The online-only Data Supplement is available with this article at http://stroke.ahajournals.org/lookup/suppl/doi:10.1161/STROKEAHA.117.019740/-/DC1.
Correspondence to Anne Nielsen, MSc, Department of Clinical Medicine, Center of Functionally Integrative Neuroscience and MINDLAB, Aarhus
University Hospital, Bldg 10G, 4th Floor, Nørrebrogade 44, DK-8000 Aarhus C, Denmark. E-mail anne@cfin.au.dk
© 2018 American Heart Association, Inc.
Stroke is available at http://stroke.ahajournals.org DOI: 10.1161/STROKEAHA.117.019740

2  Stroke  June 2018

model without the capacity to adapt as new data become available in the clinic or from trials. Third, thresholding usually does not encompass the broad information range obtainable from, primarily, PWI scans, which may hold further clues to tissue progression. Considering the small difference in blood flow rate between electrical silence and energy failure, supportive information from blood volume, capillary transit times, and oxygen availability markers5 is highly desirable. Fourth, the indirect assessment of the ischemic penumbra offered by PWI measurements is potentially limited to the value and confidence afforded by individual metrics. Fifth, the dichotomous nature of the tissue categorization into irreversibly damaged or potentially salvageable tissue is likely too simplistic and certainly lacks reflection on the certainty level.

The volume and location of the final infarct depend on a complex interplay between many tissue characteristics and clinical characteristics, of which probably only a few have been studied. Simultaneous inclusion of multiple modalities into 1 model has primarily been attempted through generalized linear models (GLMs),6,7 where tissue categorization predictions are offered based on several input maps. The GLM approach overcomes the thresholding methods’ shortcomings and is capable of incorporating voxel-wise information from multiple input channels but neglects spatial information. Aiming to increase predictive accuracy, we set out to develop a statistical model capable of better integrating all available imaging information from the individual patient in a spatial manner.

In the present work, we apply a convolutional neural network (CNN)8 to MRI data to predict tissue outcome.

The CNNs accommodate stroke heterogeneity from databases containing information on tissue outcome from previous patients and have potential to increase predictive ability with additional patients. Additionally, CNN has the advantage of simultaneously including both multiple input biomarkers and spatial information and being capable of modeling complex interplays between the input images. The inclusion of spatial information and the modeling of complex interplay are the main differences compared with the GLM, making CNNs less sensitive to noise and artifacts and providing CNNs with the ability to learn more from the data. Furthermore, the predictive results from CNNs yield an infarction probability providing a much-needed certainty level, and CNN is thus a suitable candidate for stroke progression assessment. A few attempts have been made to use the technology in ischemic stroke lesion segmentation,9–11 but the use of CNNs for final infarct prediction of acute ischemic stroke is limited.12,13

We implemented 3 different CNNs in this study. First, as an alternative to Tmax thresholding,2 a CNN based on the Tmax imaging biomarker (CNNTmax)13 was applied to use spatial information from Tmax, while simultaneously learning to detect false-positive regions. Second, a simple CNN (CNNshallow), taking 9 MRI biomarkers into account, was developed to use available information from MRI scans. Finally, a deep CNN (CNNdeep), which is a modified version of SegNet,14 using 9 MRI biomarkers was implemented to account for the biomarkers’ complex interplay and their role in stroke development. The performance of the networks is compared with a GLM6,7 and thresholding of the ADC biomarker at 600×10⁻⁶ mm²/s (ADCthres).2 We hypothesize CNNdeep to be superior compared with the other methods as a result of the model’s representational power.

Materials and Methods

Patients and Image Acquisition
Because of the sensitive nature of the data and compliance regulations pertaining to general data protection regulation, requests to access the data set from qualified researchers trained in human subject confidentiality protocols may be sent to Aarhus University at leif@cfin.au.dk and grethe.andersen@clin.au.dk. In this retrospective study, 222 patients (91 women) from the I-KNOW multicenter (n=105)15–17 and remote ischemic perconditioning (n=117)18,19 studies were analyzed. Patients were admitted with symptoms consistent with acute ischemic stroke and triaged with MRI for intravenous rtPA (recombinant tissue-type plasminogen activator). Patient characteristics are summarized in the Table. Included were all patients with acute DWI, acute PWI, acute T2-weighted fluid-attenuated inversion recovery (T2-FLAIR), and follow-up T2-FLAIR. The main focus was to predict imaging outcome in the subgroup of patients receiving intravenous rtPA (n=187). The 35 untreated patients were used to assess the algorithm’s ability to identify treatment-based differences in disease progression by post-training of CNNdeep. The original clinical studies were conducted in accordance with the Helsinki declaration and approved by local ethics committees. All patients gave written informed consent. Subsequent data usage for the purpose of retrospective studies, such as ours, was included in the original study protocols.16–18,20

Table.  Summarized Patient Characteristics

Characteristic                     Median (Min–Max)
Age, y                             68 (18–90)
Time, onset to MRI                 120 (15–525)
Admission NIHSS                    8 (1–26)
Follow-up T2-FLAIR volume, mL      4.8 (0–211.1)
TRACE DWI volume, mL               4.0 (0–161.3)
SNR DWI                            8.0 (1.8–30.4)
SNR PWI                            39.9 (13.2–112.8)

DWI indicates diffusion-weighted imaging; Max, maximum; Min, minimum; MRI, magnetic resonance imaging; NIHSS, National Institutes of Health Stroke Scale; PWI, perfusion-weighted imaging; SNR, signal-to-noise ratio; and T2-FLAIR, T2-weighted fluid-attenuated inversion recovery.

The acute imaging protocols included standard gradient-echo dynamic susceptibility contrast PWI MRI, T2-FLAIR, and DWI MRI. The dynamic susceptibility contrast PWI sequence was acquired during intravenous gadolinium-based contrast injection (0.1 mmol/kg at rate 5 mL/s) followed by 30 mL physiological saline (injected at rate 5 mL/s). Echo-planar DWI was obtained at b values of 0 s/mm² and 1000 s/mm². The nonzero b-value images were acquired in 3 to 12 directions, depending on the scanner/vendor type at the admitting hospital (GE: Signa Excite 1.5T, Signa Excite 3T, Signa HDx 1.5T, Signa Genesis 1.5T [Milwaukee, WI]; Siemens: TrioTim 3T, Avanto 1.5T, Sonata 1.5T [Erlangen, Germany]; Philips: Gyroscan NT 1.5T, Achieva 3T, Intera 1.5T [Best, the Netherlands]). Further details on the imaging parameters are provided in the study by Hansen et al21 and Table I in the online-only Data Supplement.

Imaging Modalities
The following maps were derived from the perfusion data: mean capillary transit time, cerebral blood volume, cerebral blood flow, cerebral metabolism of oxygen,22 relative transit time heterogeneity, delay, and Tmax. The perfusion preprocessing steps, consisting of motion
Nielsen et al   Final Outcome Prediction of Acute Ischemic Stroke   3

correction, arterial input function selection, and calculation of perfusion maps, were done using the PENGUIN perfusion analysis software (www.cfin.au.dk/software/pgui). The arterial input function was initially identified automatically by the software23 and then examined and adjusted, if necessary, by an expert neuroradiologist. Mean capillary transit time and cerebral blood flow were computed using a parametric deconvolution5,24 of the concentration curve. In addition, the imaging biomarkers cerebral metabolism of oxygen and relative transit time heterogeneity were computed using a procedure5,24 based on a vascular model22 of capillary transport and oxygen availability, which have previously been hypothesized to contribute to the characterization and prognosis of acute ischemic stroke.15,25 The temporal difference between site of measurement of the tissue concentration curve and the arterial input function, the bolus delay, was estimated directly by the vascular model and represented with oscillation index singular value decomposition26 as the Tmax map.

The DWI and T2-FLAIR images were linearly coregistered and resliced to acute PWI space using a 12-parameter affine transformation with a normalized mutual information cost function, as implemented in the Statistical Parametric Mapping v. 8 (SPM 8; Wellcome Trust Centre for Neuroimaging, London, United Kingdom) toolbox. An average DWI image (TRACE DWI) was obtained from the DWI sequence by averaging over all b=1000 s/mm² images and used in conjunction with the b=0 s/mm² image to obtain an ADC image.2 The PWI, DWI, and T2-FLAIR values were normalized to normal-appearing contralateral white matter to facilitate comparison across subjects.

Prediction of Imaging Outcome
A CNN consists of layers with different properties, which are connected according to a network architecture (an example is shown in Figure 1). In general, the more layers a network has (the deeper the network is), the more complex the structures it is able to recognize. We developed CNNdeep to predict stroke imaging outcome based on SegNet14 with mean capillary transit time, cerebral blood volume, cerebral blood flow, cerebral metabolism of oxygen, relative transit time heterogeneity, delay, TRACE DWI, ADC, and T2-FLAIR biomarkers as input.

To assess whether the high performance shown by CNNdeep was caused by the inclusion of spatial information, a CNN with a few layers (shallow), called CNNshallow, taking the same input as CNNdeep was implemented.

The Tmax biomarker has been used in clinical trials for patient triaging and is believed to have a strong association with final outcome.2,3,27 Therefore, CNNTmax was implemented to evaluate whether Tmax contains sufficient information to make accurate predictions, if spatial information was included and the tissue predictions were not restricted to using a single population-wide threshold. CNNTmax is based on previously published work by Stier et al13 and is reconstructed as close to their reported implementation as possible.

The CNNs were compared with a regression approach, which in a simple linear fashion combines biomarkers into a predicted risk of infarct at the level of single voxels using a GLM.6,7 Thresholding of the ADC biomarker at 600×10⁻⁶ mm²/s (ADCthres) was used to estimate the final infarct of patients treated with intravenous rtPA.2

Implementation-relevant details are stated in the online-only Data Supplement.

The CNN was trained using TensorFlow28 with a Python 2.7 interface.29 Once a network is trained, risk maps for a new patient are obtained in ≈60 seconds. The training of the networks, which needs only to be done once, took around 5 days for CNNdeep and less than a day for the other CNNs on a standard work station with an NVIDIA Quadro K2200 GPU with 4GB memory. The statistical analyses were performed using R30 (version 3.2.3).

Statistical Analysis

Performance Evaluation
The 187 patients who received intravenous rtPA treatment were randomly divided into an independent training set (158 patients) and a test set (29 patients) using an 85/15 split, thus allowing assessment of model performance in independent patients, unknown to the models during training. The training process was monitored using independent validation patches, to prevent overfitting. The online-only Data Supplement contains further details on data sampling. The follow-up infarcts were independently delineated by 4 expert radiologists on the follow-up T2-FLAIR scan (demonstrated to minimize interrater variability31) acquired 1 month after the stroke. The delineations were performed using in-house developed software. The raters worked independently and were blinded to patients’ clinical data and any results from automated delineation methods. A consensus degree of 3 was used to determine a common final infarct.21 The predictive performance was evaluated using the area under the receiver operating characteristic curve (AUC), calculated following Jonsdottir et al.32 The AUC has the advantage of being threshold independent and can be interpreted as the probability of an infarcting voxel receiving a higher risk score than a noninfarcting voxel. The contrast is measured as 1 minus the mean of the predicted risks of infarction for voxels outside the final infarct. The calculated AUC and contrast metrics were compared pairwise through paired t tests. AUC and contrast values are presented as mean±SD.

Treatment Effect
To assess the effect of the intravenous rtPA treatment, the weights from CNNdeep were used as a starting point for post-training on 35 untreated patients, resulting in CNNdeep, −rtPA. The 29 intravenous rtPA-treated patients in the test set were then evaluated using the new model. The AUC and the size of the final infarct (stated as [mean (min–max)]) were compared. This approximation of treatment effect is the same approach used by Wu et al.33

Results
Figure 2 compares predicted and actual imaging outcome of 4 patients. The left column shows 4 selected acute images, whereas columns 2 to 6 show the 5 different models’ predicted

Figure 1.  The architecture of deep convolutional neural network. The colors represent different layers. See the online-only Data Supplement for further details.
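
The two evaluation metrics described under Performance Evaluation can be illustrated with a minimal Python sketch. It assumes flattened per-voxel lists of predicted risks and binary final-infarct labels; the function names `auc` and `contrast` and the toy values are hypothetical. The AUC shown is the plain definition (the probability that an infarcting voxel receives a higher risk score than a noninfarcting voxel), not the variant of Jonsdottir et al32 used in the study, which according to the text emphasizes performance inside the hypoperfused tissue.

```python
def auc(risks, infarct):
    """Plain AUC: probability that an infarcting voxel outranks a noninfarcting one."""
    pos = [r for r, y in zip(risks, infarct) if y]
    neg = [r for r, y in zip(risks, infarct) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one voxel of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def contrast(risks, infarct):
    """1 minus the mean predicted risk over voxels outside the final infarct."""
    outside = [r for r, y in zip(risks, infarct) if not y]
    if not outside:
        raise ValueError("no voxels outside the final infarct")
    return 1.0 - sum(outside) / len(outside)


# Hypothetical toy example: five voxels, the first two belong to the final infarct.
toy_risks = [0.75, 1.0, 0.25, 0.0, 0.5]
toy_infarct = [1, 1, 0, 0, 0]
print(auc(toy_risks, toy_infarct))       # 1.0
print(contrast(toy_risks, toy_infarct))  # 0.75
```

The pairwise loop is quadratic in the number of voxels and is kept only because it mirrors the probabilistic interpretation directly; a rank-based computation would be preferable for full image volumes.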

Figure 2.  Results from the predictive models for patients from the test set. Four biomarkers used for predictions (TRACE diffusion-weighted imaging [DWI], cerebral metabolism of oxygen [CMRO2], mean capillary transit time [MTT], and time point for the maximum of the residue function [Tmax]) are shown, as well as the follow-up T2-weighted fluid-attenuated inversion recovery with the final infarct shown as a red contour. Patient A is a 58-year-old man (NIHSS=23), scanned 2.5 hours after symptom onset. Patient B is a 44-year-old man (NIHSS=13), scanned after 2 hours. Patient C is a 74-year-old woman (NIHSS=4) scanned after 1 hour. Patient D is a 69-year-old woman (NIHSS=8) scanned after 2.5 hours. ADCthres indicates thresholding of the ADC biomarker; CNNdeep, deep convolutional neural network; CNNshallow, simple CNN; CNNTmax, CNN based on the Tmax imaging biomarker; and GLM, generalized linear model.

infarct risk. The rightmost column shows the T2-FLAIR follow-up scan with the manually delineated infarcted tissue as a contour. Overall, the examples show that the voxel-wise predictions of GLM and ADCthres lead to scattered risk maps compared with the CNN-based methods. The CNN-based methods generally outperform other methods in producing spatially coherent lesion estimates, although considerable differences in accuracy are observed.

CNNshallow tends to overestimate the final lesion volume, and CNNTmax predicts low infarct risk, which is also not well aligned with the actual outcomes. CNNdeep provides visually superior predictions compared with the other models. Of particular interest is patient A, a 58-year-old man, who did not have any visible lesion at follow-up, which is correctly predicted by CNNdeep, whereas the other methods substantially overestimate the permanent lesion.

Performance
Figure 3A shows box plots of AUCs for each predictive model for the 20 test set patients with a final infarct. The highest AUC was obtained for CNNdeep (0.88±0.12), followed by CNNshallow (0.85±0.11), GLM (0.78±0.12), CNNTmax (0.72±0.14), and ADCthres (0.66±0.13). A significant difference was shown between CNNdeep and GLM (P=0.005), CNNTmax (P<0.003), and ADCthres (P<0.0001). The difference was not significant (P=0.063) for CNNdeep and CNNshallow, despite substantial difference in visual appearance as demonstrated in Figure 2. The test also yielded a significant difference between CNNshallow and GLM (P=0.013), CNNTmax (P<0.005), and ADCthres (P<0.0001). Thus, the CNNs with many biomarkers as input lead to superior performance measured by AUC.

Contrast
Figure 3B shows box plots of the image contrast. The contrast was 0.99±0.02 for CNNdeep, 0.96±0.04 for GLM, 0.95±0.01 for CNNTmax, 0.88±0.04 for CNNshallow, and 0.88±0.19 for ADCthres. Hence, CNNshallow and ADCthres had a lower mean contrast compared with CNNdeep, CNNTmax, and GLM, which was consistent with the high infarction risk in areas outside the final infarct observed in Figure 2. Indeed, pairwise paired t tests showed a significant difference between CNNshallow and CNNdeep, CNNTmax, and GLM (P<0.0001). A significant difference was also found between CNNdeep and CNNTmax and GLM (P<0.0001). ADCthres was significantly inferior to CNNdeep (P=0.004) and GLM (P=0.026), but not to CNNTmax (P=0.058).

Figure 3.  Box plots for the predictive models used on the test data showing (A) the performance measure area under the receiver operating characteristic curve (AUC) and (B) the contrast in the predicted images. ADCthres indicates thresholding of the ADC biomarker; CNNdeep, deep convolutional neural network; CNNshallow, simple CNN; CNNTmax, CNN based on the Tmax imaging biomarker; and GLM, generalized linear model.
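
The ADCthres baseline named in the caption can be sketched in a few lines of Python. The sketch assumes the standard monoexponential diffusion model, ADC = ln(S0/Sb)/b, which the text does not spell out, with S0 the b=0 signal and Sb the TRACE DWI signal at b=1000 s/mm². The function names and the toy signal values are hypothetical. Voxels with an ADC below 600×10⁻⁶ mm²/s are flagged as predicted infarct.

```python
import math

ADC_THRESHOLD = 600e-6  # mm^2/s, the fixed threshold quoted in the text


def adc_value(s0, sb, b=1000.0):
    """ADC in mm^2/s from the b=0 signal s0 and the diffusion-weighted signal sb.

    Assumes the monoexponential model sb = s0 * exp(-b * ADC), with b in s/mm^2.
    Returns NaN where the signals are not positive (for example, background).
    """
    if s0 <= 0.0 or sb <= 0.0:
        return float("nan")
    return math.log(s0 / sb) / b


def adc_threshold_mask(s0_values, sb_values, b=1000.0, threshold=ADC_THRESHOLD):
    """Flag voxels whose ADC falls below the threshold (NaN compares as False)."""
    return [adc_value(s0, sb, b) < threshold for s0, sb in zip(s0_values, sb_values)]


# Hypothetical voxels: restricted diffusion, normal diffusion, and background.
s0 = [1000.0, 1000.0, 0.0]
sb = [1000.0 * math.exp(-0.5), 1000.0 * math.exp(-0.8), 10.0]
print(adc_threshold_mask(s0, sb))  # [True, False, False]
```

Each voxel is classified independently of its neighbors, which is one reason a fixed threshold of this kind can yield scattered maps on noisy data.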

Treatment Effect
AUC for the intravenous rtPA-treated test patients evaluated using CNNdeep, −rtPA was 0.85±0.15. This is not significantly lower than AUC for CNNdeep (P=0.16), and hence, if the desired outcome is a difference in AUC, there appears to be no compelling reason to train a network specifically for non–rtPA-treated patients. However, Figure 4 shows 2 patient cases of which the first patient, a 58-year-old man, presents different DWI/PWI lesions. According to the penumbra model, this would suggest a large treatment effect. This supposition is accurately picked up by the substantial treatment effect predicted by CNNdeep. Conversely, the DWI/PWI lesions are similar for the second patient, a 65-year-old man, leading to almost no treatment effect, consistent with the penumbra model.

As one might expect, the mean volume of the infarcted areas was lower (16.44 mL [0–121.21]) for CNNdeep compared with CNNdeep, −rtPA (29.40 mL [0–108.17]) with a marginally significant difference (P=0.048), where especially CNNdeep was close to the mean of the ground truth volume (17.84 mL [0–193.09]).

Discussion
We hypothesized CNNdeep to yield superior results compared with the other methods. Our analysis showed CNNdeep to be able to use information from the acute images and transform it into accurate predictions of final outcome. In independent test data, the predictions of CNNdeep, as measured by AUC, are highly concordant with the actual outcome, assessed by follow-up T2-FLAIR images, and superior compared with shallow networks and voxel-based approaches. This is confirmed by visual inspection. A clear contrast between final infarct and voxels outside the infarction in the risk map is crucial, as the image becomes easier to interpret. Contrast-wise, CNNdeep was significantly better than the other methods. The treatment effect was marginally significant, with no intravenous rtPA yielding a higher volume of final infarct.

We think that the superior performance observed for CNNdeep is rooted in better utilization of the information encoded in data from previous patients. During the training process, the model self-regulates, while simultaneously incorporating the heterogeneity of the stroke progressions. This is exemplified by the predicted CNNdeep risk maps in Figure 2, showing a fairly high prediction certainty in terms of infarction risk.

The CNNdeep model is capable of retaining and processing complex information and thereby discovering a more accurate connection between the input and the output, considering the heterogeneity in stroke pathophysiology. This enables CNNdeep to predict the final outcome more accurately in an automatic and user-independent manner, and we speculate

Figure 4.  Assessment of the treatment effect. Both patients received intravenous rtPA (recombinant tissue-type plasminogen activator). Patient A, a 58-year-old man, has a large diffusion-weighted imaging (DWI)/perfusion-weighted imaging mismatch volume with an extensive hypoperfusion on mean capillary transit time (MTT), time point for the maximum of the residue function (Tmax), and cerebral metabolism of oxygen (CMRO2) parameter maps compared with a small volume of restricted diffusion on TRACE maps. No final infarct is demonstrated on T2-weighted fluid-attenuated inversion recovery (T2-FLAIR) after reperfusion treatment. Patient B, a 65-year-old man, has no mismatch on MTT and CMRO2 maps compared with TRACE DWI but a moderate mismatch when using Tmax. The predicted treatment effect is small, confirmed by follow-up T2-FLAIR images.

that the predictions may improve by incorporating additional data in subsequent model training.

During the training process, a CNN extracts important features in a data-driven fashion, whereas these features have to be handcrafted for the GLM.

Although the results presented here show that CNNdeep predicts final outcome accurately, there is a need for external validation of the model’s applicability in a clinical setting. An important first step would be to test the model on a data set from another clinical study to assess generalizability.

Current state-of-the-art decision support for acute ischemic stroke in clinical use (Brain CT Perfusion Package [Philips Healthcare, the Netherlands], Syngo Volume Perfusion CT Neuro [Siemens Healthcare, Erlangen, Germany], and RAPID [Rapid Processing of Perfusion and Diffusion; iSchemaView, Inc, Menlo Park, CA]) identifies core and salvageable tissue as areas exceeding population-wide or relative thresholds on individual imaging modalities.1 These methods use different imaging biomarkers but are all limited to 2 biomarkers. Our analysis shows ADC thresholding to be insufficient, which might be because of an interplay between tissue characteristics not captured by a single biomarker.

We decided to assess the ability of the CNN to identify treatment differences based on whether or not the patients received intravenous rtPA treatment. The same methodology could equally well be used to examine the direct effects of recanalization. Here, we compared patients with and without intravenous rtPA to avoid the need to handle partial recanalization, which would have limited our data volume.

A considerable data volume representing actual clinical variability is necessary for any method attempting to uncover and harness the complex relation between acute and follow-up imaging in acute stroke. In our opinion, it is pivotal to gather all available information to achieve the best predictions. Therefore, we decided to sample from all slices in the training data, ensuring a balanced data set, which we hope improves the generalization of CNNdeep. This is in contrast to the study by Stier et al,13 where only the slices with the largest lesion were selected, discarding information from the remaining imaging voxels. In our view, this approach disregards the infarct heterogeneity and how different parameters react in the presence of ischemia.

One important difference between GLM and CNN is that the GLM is a voxel-by-voxel technique. The CNNs include spatial information by allowing 2-dimensional images as input. We speculate that this is one of the key differences, giving CNN an advantage simply by having spatial information available and thereby making the predictions less sensitive to noisy data.

We chose to use the performance measure presented by Jonsdottir et al32 to evaluate the performance of the predictions. This approach ensures a threshold-independent measure (as opposed to measures such as the DICE coefficient or accuracy) and emphasizes the performance of tissue infarction risk inside the hypoperfused tissue.

Limitations
The data used in this study were retrospectively acquired. To mimic a prospective study, we divided the data and used some of the patients for testing only. However, because the data were not collected to improve our predictive model’s performance, there might be a variation in new patients not included in the current data. A further drawback might be the

fact that we included patients scanned with a variety of scanners, scan parameters, and field strengths, which introduces uncontrolled sources of variation in the data. However, even with these challenges, CNNdeep yielded good results, which in our view speaks to the generalization and robustness of CNNdeep.

All our patients experienced an acute ischemic stroke, thereby effectively omitting a control group from the study and introducing a risk of being biased toward infarct overestimation. However, the data set contained numerous slices with normoperfused voxels, and the predictive models effectively classified those accordingly. We think this constitutes a robust approach, with overestimation bias being likelier in the training data preparation method used by Stier et al.13

One drawback of CNN methods is the training time. The training time is related to the complexity of the network and becomes more pronounced with deeper networks. However, it is only necessary to train the network once, and the evaluation of a new patient takes ≈1 minute. Another drawback of the CNN method is the amount of training data needed. Without sufficient training data, the CNN is prone to overfitting, and CNNdeep is the most vulnerable because it contains more parameters. We mitigated this by following the standard machine learning procedure and evaluating the models’ performance on an independent test set. Furthermore, we trained the networks using mini-batch stochastic gradient descent to stabilize the training process and avoid excessive adaptation to the training set. We suspect the performance of the CNN-based methods in general, and CNNdeep in particular, will increase with more training data.

In this article, we found a treatment effect measured by final infarct volume, although the effect was small for most patients. One could speculate that the network underestimates the treatment effect. However, the intravenous rtPA treatment effect is time dependent, expected to be smaller than the effect of thrombectomy, and not guaranteed to lead to reperfusion. Additionally, a minor treatment effect would be expected if the DWI/PWI mismatch is small, according to the penumbra

A CNN has the advantage of being able to retain spatial information, resulting in more accurate predictions compared with a GLM-based model. The depth of the CNN is important, with many layers in the network yielding a better contrast and a higher accuracy of the predicted images.

The new model paradigms have been shown to lead to improved predictions and thereby a much increased potential for use in automated decision support systems providing recommendations for personalized treatment and thereby hopefully better outcome for the individual patient. CNNs will likely benefit from increasingly larger image collections, in contrast to less-complex methods, such as GLM and population-wide thresholds, which lack the information-encoding capabilities of CNNs. An advantage of CNNs is their ability to learn and become progressively better with every new patient.

Acknowledgments
We would like to thank Prof Grethe Andersen and Dr Kristina Dupont Hougaard for kindly making the Ischemic Perconditioning Study available for model training and validation.

Sources of Funding
A. Nielsen is funded by Innovation Fund Denmark (5189-00209B), research training supplement from Aarhus University, Denmark, and Cercare Medical ApS.

Disclosures
A. Nielsen is employed by Cercare Medical ApS. Drs Hansen and Mouridsen are shareholders in Cercare Medical ApS.

References
1. Austein F, Riedel C, Kerby T, Meyne J, Binder A, Lindner T, et al. Comparison of perfusion CT software to predict the final infarct volume after thrombectomy. Stroke. 2016;47:2311–2317. doi: 10.1161/STROKEAHA.116.013147.
2. Straka M, Albers GW, Bammer R. Real-time diffusion-perfusion mismatch analysis in acute stroke. J Magn Reson Imaging. 2010;32:1024–1037. doi: 10.1002/jmri.22338.
3. Christensen S, Mouridsen K, Wu O, Hjort N, Karstoft H, Thomalla G, et al. Comparison of 10 perfusion MRI parameters in 97 sub-6-hour stroke
model. Therefore, we find it encouraging that the network was patients using voxel-based receiver operating characteristics analysis.
Stroke. 2009;40:2055–2061. doi: 10.1161/STROKEAHA.108.546069.
able to detect the small treatment effect. However, the data set 4. Astrup J, Siesjö BK, Symon L. Thresholds in cerebral ischemia - the
is relatively small, and further validation before clinical use ischemic penumbra. Stroke. 1981;12:723–725.
is required. 5. Mouridsen K, Hansen MB, Østergaard L, Jespersen SN. Reliable esti-
mation of capillary transit time distributions using DSC-MRI. J Cereb
Different end points can be considered in stroke predic- Blood Flow Metab. 2014;34:1511–1521. doi: 10.1038/jcbfm.2014.111.
tion. Here, we chose to use imaging outcome because this 6. Wu O, Koroshetz WJ, Ostergaard L, Buonanno FS, Copen WA, Gonzalez
is a high-resolution representation of stroke outcome and, RG, et al. Predicting tissue outcome in acute human cerebral ischemia
therefore, a demanding task. Functional outcome could be an using combined diffusion- and perfusion-weighted MR imaging. Stroke.
2001;32:933–942.
alternative, however, that would reduce the follow-up infor- 7. Wu O, Sumii T, Asahi M, Sasamata M, Ostergaard L, Rosen BR,
mation available per patient (from voxels to a single score). et al. Infarct prediction and treatment assessment with MRI-based
Moreover, functional outcome might even be obtainable via algorithms in experimental stroke models. J Cereb Blood Flow Metab.
the predicted risk map. One issue concerns how to establish 2007;27:196–204. doi: 10.1038/sj.jcbfm.9600328.
8. LeCun Y, Boser B, Denker JS, Howard RE, Habbard W, Jackel LD,
the outcome reference. We chose to apply a consensus deci- et al. Handwritten digit recognition with a back-propagation network.
sion of at least 3 of 4 expert neuroradiologists using follow- Adv Neural Inf Process Syst. 1990;2:396–404.
up T2-FLAIR images to minimize interrater variability21,31 to 9. Maier O, Schröder C, Forkert ND, Martinetz T, Handels H. Classifiers
for ischemic stroke lesion segmentation: a comparison study. PLoS One.
mitigate this problem.
2015;10:e0145118. doi: 10.1371/journal.pone.0145118.
10. Kamnitsas K, Chen L, Ledig C, Rueckert D, Glocker B. Multi-scale 3D
Conclusions convolutional neural networks for lesion segmentation in brain MRI.
MICCAI Brain Lesion Workshop 2015. 2015. http://www.doc.ic.ac.
The comparison of predictive models described in this article
uk/~bglocker/pdfs/kamnitsas2015isles.pdf.
shows a clear advantage of using a deep CNN, such as CNNdeep, 11. Dutil F, Havaei M, Pal C, Larochelle H, Jodoin PM. A convolutional
to produce predictions of final infarct in acute ischemic stroke. neural network approach to brain lesion segmentation. MICCAI Brain
8  Stroke  June 2018

Lesion Workshop 2015. 2015. https://www.cbica.upenn.edu/sbia/ and metabolism. J Cereb Blood Flow Metab. 2012;32:264–277. doi:
Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2015_proceed- 10.1038/jcbfm.2011.153.
ings.pdf. 23. Mouridsen K, Christensen S, Gyldensted L, Ostergaard L. Automatic
12. Huang S, Shen Q, Duong TQ. Artificial neural network prediction of selection of arterial input function using cluster analysis. Magn Reson
ischemic tissue fate in acute stroke imaging. J Cereb Blood Flow Metab. Med. 2006;55:524–531. doi: 10.1002/mrm.20759.
2010;30:1661–1670. doi: 10.1038/jcbfm.2010.56. 24. Mouridsen K, Friston K, Hjort N, Gyldensted L, Østergaard L, Kiebel
13. Stier N, Vincent N, Liebeskind D, Scalzo F. Deep learning of tissue fate fea- S. Bayesian estimation of cerebral perfusion using a physiologi-
tures in acute ischemic stroke. Proceedings (IEEE Int Conf Bioinformatics cal model of microvasculature. Neuroimage. 2006;33:570–579. doi:
Biomed). 2015;2015:1316–1321. doi: 10.1109/BIBM.2015.7359869. 10.1016/j.neuroimage.2006.06.015.
14. Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional 25. Østergaard L, Jespersen SN, Mouridsen K, Mikkelsen IK, Jonsdottír
encoder-decoder architecture for image segmentation. IEEE Trans KÝ, Tietze A, et al. The role of the cerebral capillaries in acute isch-
Pattern Anal Mach Intell. 2017;39:2481–2495. emic stroke: the extended penumbra model. J Cereb Blood Flow Metab.
15. Engedal TS, Hjort N, Hougaard KD, Simonsen CZ, Andersen G, Mikkelsen 2013;33:635–648. doi: 10.1038/jcbfm.2013.18.
IK, et al. Transit time homogenization in ischemic stroke - a novel biomarker 26. Wu O, Østergaard L, Weisskoff RM, Benner T, Rosen BR, Sorensen AG.
of penumbral microvascular failure? [published online ahead of print January Tracer arrival timing-insensitive technique for estimating flow in MR
1, 2017]. J Cereb Blood Flow Metab. doi: 10.1177/0271678X17721666. perfusion-weighted imaging using singular value decomposition with a
16. Alawneh JA, Jones PS, Mikkelsen IK, Cho TH, Siemonsen S, Mouridsen block-circulant deconvolution matrix. Magn Reson Med. 2003;50:164–
K, et al. Infarction of ‘non-core-non-penumbral’ tissue after stroke: mul- 174. doi: 10.1002/mrm.10522.
tivariate modelling of clinical impact. Brain. 2011;134(pt 6):1765–1776. 27. Olivot JM, Mlynash M, Thijs VN, Kemp S, Lansberg MG, Wechsler
doi: 10.1093/brain/awr100. L, et al. Optimal Tmax threshold for predicting penumbral tissue in
17. I-KNOW. Integrating information from molecules to man: knowledge acute stroke. Stroke. 2009;40:469–475. doi: 10.1161/STROKEAHA.
discovery accelerates drug development and personalized treatment in 108.526954.
acute stroke. 2006. https://cordis.europa.eu/project/rcn/78374_en.html. 28. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al.
18. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Thomsen Tensorflow: large-scale machine learning on heterogeneous distributed
Downloaded from http://stroke.ahajournals.org/ by guest on May 2, 2018

RB, et al. Remote ischemic perconditioning in thrombolysed stroke systems. ArXiv e-prints. 2016;1603. https://arxiv.org/abs/1603.04467,
patients: randomized study of activating endogenous neuroprotection software available from tensorflow.org.
- design and MRI measurements. Int J Stroke. 2013;8:141–146. doi: 29. Python Software Foundation. Python 2.7, http://www.python.org/. 2016.
10.1111/j.1747-4949.2012.00786.x. 30. R Core Team; R Foundation for Statistical Computing. R: A Language
19. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Hansen TM, and Environment for Statistical Computing. Vienna, Austria; 2015.
et al. Remote ischemic perconditioning as an adjunct therapy to throm- 31. Neumann AB, Jonsdottir KY, Mouridsen K, Hjort N, Gyldensted C,
bolysis in patients with acute ischemic stroke: a randomized trial. Stroke. Bizzi A, et al. Interrater agreement for final infarct MRI lesion delin-
2014;45:159–167. doi: 10.1161/STROKEAHA.113.001346. eation. Stroke. 2009;40:3768–3771. doi: 10.1161/STROKEAHA.
20. Carrera E, Jones PS, Alawneh JA, Klærke Mikkelsen I, Cho TH, 108.545368.
Siemonsen S, et al. Predicting infarction within the diffusion-weighted 32. Jonsdottir KY, Østergaard L, Mouridsen K. Predicting tissue outcome
imaging lesion: does the mean transit time have added value? Stroke. from acute stroke magnetic resonance imaging: improving model perfor-
2011;42:1602–1607. doi: 10.1161/STROKEAHA.110.606970. mance by optimal sampling of training data. Stroke. 2009;40:3006–3011.
21. Hansen MB, Nagenthiraja K, Ribe LR, Dupont KH, Østergaard L, Mouridsen doi: 10.1161/STROKEAHA.109.552216.
K. Automated estimation of salvageable tissue: comparison with expert read- 33. Wu O, Christensen S, Hjort N, Dijkhuizen RM, Kucinski T, Fiehler J,
ers. J Magn Reson Imaging. 2016;43:220–228. doi: 10.1002/jmri.24963. et al. Characterizing physiological heterogeneity of infarction risk in
22. Jespersen SN, Østergaard L. The roles of cerebral blood flow, capillary acute human ischaemic stroke using MRI. Brain. 2006;129(pt 9):2384–
transit time heterogeneity, and oxygen tension in brain oxygenation 2393. doi: 10.1093/brain/awl183.
Prediction of Tissue Outcome and Assessment of Treatment Effect in Acute Ischemic
Stroke Using Deep Learning
Anne Nielsen, Mikkel Bo Hansen, Anna Tietze and Kim Mouridsen

Stroke. published online May 2, 2018;



Stroke is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231
Copyright © 2018 American Heart Association, Inc. All rights reserved.
Print ISSN: 0039-2499. Online ISSN: 1524-4628

The online version of this article, along with updated information and services, is located on the
World Wide Web at:
http://stroke.ahajournals.org/content/early/2018/05/01/STROKEAHA.117.019740

Data Supplement (unedited) at:


http://stroke.ahajournals.org/content/suppl/2018/05/02/STROKEAHA.117.019740.DC1

SUPPLEMENTAL MATERIAL

Prediction of tissue outcome and assessment of treatment effect in acute ischemic stroke using
Deep Learning

Scanner parameters, implementation details, data sampling, and sample size considerations

Anne Nielsen (a,b), MSc; Mikkel Bo Hansen (a), PhD; Anna Tietze (a,c), PhD; Kim Mouridsen (a), PhD

(a) Center of Functionally Integrative Neuroscience and MINDLab, Inst. of Clinical Medicine, Aarhus University, Denmark
(b) Cercare Medical ApS, Aarhus, Denmark
(c) Inst. of Neuroradiology, Charité Universitätsmedizin, Germany

Corresponding Author:
PhD Student, Anne Nielsen, MSc
CFIN, Aarhus University Hospital
Building 10G, 4th Floor, Nørrebrogade 44, DK-8000 Aarhus C, Denmark
Phone/e-mail +45 28902983 / anne@cfin.au.dk

Tables 2, Figures 0
                         Study I                      Study II
                         1.5T           3T            1.5T            3T
Number of patients       58             12            22              95

PWI
Matrix*                  [128, 256]     [96, 128]     128             144
FOV (mm)                 [230, 260]     [192, 240]    [230, 240]      230
Slices                   [12, 20]       [15, 19]      [14, 20]        [28, 40]
Slice thickness (mm)     [6, 6.5]       6.5           6.5             4
TE (ms)                  [30, 45]       [30, 45]      [30, 45]        30
TR (ms)                  [260, 1540]    1500          [1500, 1630]    1500
Brain coverage (mm)      [72, 120]      [98, 124]     [91, 143]       [112, 160]

DWI
Matrix*                  [128, 256]     [128, 256]    [192, 256]      [192, 240]
FOV (mm)                 [230, 280]     [192, 240]    [230, 240]      [230, 240]
Slices                   [18, 55]       [21, 43]      22              [24, 30]
Slice thickness (mm)     [3, 6]         [3, 5]        5               [4, 5]
TE (ms)                  [69, 124]      [71, 104]     [85, 133]       [69, 100]
TR (ms)                  [2395, 6500]   [6000, 6500]  [3600, 6200]    [2648, 6000]
Brain coverage (mm)      [100, 330]     [105, 129]    110             [78, 90]

*The matrix is always square, ie, a value of 128 corresponds to a matrix of size 128 x 128.
Table I Summary of parameters used in the PWI and DWI sequences depending on magnetic field
strength and study, Study I being I-Know multicenter (105)1-3 and Study II Remote Ischemic
Perconditioning (117)4, 5. Numbers stated as [x, y] refer to the minimum and maximum value.

Implementation details and data sampling

Deep Convolutional Neural Networks (CNNdeep)


This network has 37 layers, is the most complicated and sophisticated presented in this article, and
is a modified version of SegNet6 with nine input channels and convolutional downsampling. The
network extracts the important features during the training process and utilizes spatial information
in the input to make predictions of the probability of tissue-death at time of follow-up as a spatial
output (segmentation). This enables smoother predictive images compared to voxel-by-voxel
methods. The input image biomarkers are delay, MTT, CBV, CBF, CMRO2, RTH, TRACE DWI,
ADC and T2-FLAIR. For further implementation details, see Table II.
The training set consisted of 50,000 patches. Each biomarker in every slice was standardized
according to the mean value and variance. Image patches of size 64x64 voxels were randomly
sampled and matched by the final infarct mask, as drawn by an expert neuroradiologist on the
follow-up T2-FLAIR scan. To get a balanced data set we ensured that half of the patches contain
voxels classified as lesion on the follow-up scan.
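
The patch preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the (channels, height, width) array layout, the random seed, and the retry limit are hypothetical. Only the per-slice standardization, the 64x64 patch size, and the requirement that half of the patches contain lesion voxels are taken from the text.

```python
import numpy as np


def standardize_slice(stack):
    """Scale each biomarker channel of one slice to zero mean and unit variance.

    stack: array of shape (channels, height, width); this layout is an assumption.
    """
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    # Constant channels are centred but left unscaled to avoid division by zero.
    return (stack - mean) / np.where(std > 0, std, 1.0)


def sample_balanced_patches(slices, masks, n_patches, size=64, seed=0,
                            max_tries=100_000):
    """Randomly crop patches so that half of them contain lesion voxels."""
    rng = np.random.default_rng(seed)
    # Remaining number of patches to draw with (True) and without (False) lesion.
    quota = {True: n_patches // 2, False: n_patches - n_patches // 2}
    images, labels = [], []
    for _ in range(max_tries):
        if len(images) == n_patches:
            break
        i = int(rng.integers(len(slices)))
        _, height, width = slices[i].shape
        r = int(rng.integers(0, height - size + 1))
        c = int(rng.integers(0, width - size + 1))
        mask_patch = masks[i][r:r + size, c:c + size]
        has_lesion = bool(mask_patch.any())
        if quota[has_lesion] == 0:
            continue  # this kind of patch is already fully sampled
        quota[has_lesion] -= 1
        images.append(slices[i][:, r:r + size, c:c + size])
        labels.append(mask_patch)
    if len(images) < n_patches:
        raise RuntimeError("could not draw a balanced set of patches")
    return np.stack(images), np.stack(labels)
```

Rejection sampling is used here only because it is short; any scheme that yields the same half-and-half split would serve equally well.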

Generalized Linear Model (GLM)


GLM is a voxel-by-voxel based predictive modeling technique, where image biomarkers for a
single voxel are used to predict the probability of tissue-death for that single voxel7, 8. GLMs use
handcrafted features with an associated weighting coefficient for each type, which adapts during the
training process. We used a feature vector consisting of delay, MTT, CBV, CBF, CMRO2, RTH,
TRACE DWI, ADC and T2-FLAIR. For the input data, we standardized each biomarker in each
patient to have zero mean and unit variance. We sampled 50,000 random voxels.
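
A voxel-by-voxel model of this kind can be sketched as below. The text does not state the link function or the fitting procedure, so the logistic link and the plain gradient-descent fit are assumptions made for illustration, as are the function names, learning rate, and iteration count. Each row of X holds the nine standardized biomarkers of one voxel.

```python
import numpy as np


def _with_intercept(X):
    """Prepend a column of ones so that the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic_glm(X, y, lr=0.1, n_iter=500):
    """Fit one weight per biomarker (plus intercept) by gradient descent.

    X: (n_voxels, n_biomarkers) standardized features; y: (n_voxels,) 0/1 outcome.
    """
    X1 = _with_intercept(X)
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))    # predicted infarct probability
        w -= lr * X1.T @ (p - y) / len(y)    # gradient of the mean log-loss
    return w


def predict_risk(w, X):
    """Return the predicted probability of tissue death for every voxel in X."""
    return 1.0 / (1.0 + np.exp(-_with_intercept(X) @ w))
```

Because each voxel is treated independently, the resulting risk map carries no spatial context, which is the contrast with the CNNs drawn in the article.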

Tmax based Convolutional Neural Network (CNNTmax)


This is a network with four layers and a linear classifier. The slice from each patient exhibiting the
largest lesion is used in the training set and the rest of the slices are omitted9. Only Tmax was used
for this network and the slices were standardized. From these slices, 50,000 random 23x23 voxel
patches were sampled and the patches matched by the class of the midpoint. To ensure binary
classification, the midpoints are classified into healthy or dead tissue depending on status on follow
up.
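
The midpoint labelling used for CNNTmax can be illustrated with the sketch below. It assumes a single standardized Tmax slice and a binary follow-up mask of the same shape; the function name, the seed, and the choice to keep midpoints away from the image border are hypothetical.

```python
import numpy as np


def sample_midpoint_patches(tmax_slice, outcome_mask, n_patches, size=23, seed=0):
    """Crop size x size Tmax patches and label each one by its centre voxel."""
    rng = np.random.default_rng(seed)
    half = size // 2  # 11 voxels on each side of the midpoint for size 23
    height, width = tmax_slice.shape
    patches = np.empty((n_patches, size, size), dtype=tmax_slice.dtype)
    labels = np.empty(n_patches, dtype=np.int64)
    for k in range(n_patches):
        # Keep the midpoint far enough from the border to fit a full patch.
        r = int(rng.integers(half, height - half))
        c = int(rng.integers(half, width - half))
        patches[k] = tmax_slice[r - half:r + half + 1, c - half:c + half + 1]
        labels[k] = int(outcome_mask[r, c])  # 1 = infarcted, 0 = healthy
    return patches, labels
```

Unlike the segmentation networks, each patch here yields a single class label, so one forward pass predicts one voxel.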

Shallow Convolutional Neural Network (CNNshallow)


In order to investigate the effect of network depth, a shallow CNN with three layers was
constructed as a point of reference. The input data to CNNshallow are the same as for CNNdeep,
and the network is likewise used for segmentation of the input image.

Training
The CNNs above are all trained using the TensorFlow10 framework for 100 epochs by minimizing
the multinomial logistic loss function (representing the difference between the predicted outcome
and the observed) using stochastic gradient descent11 as optimization method.
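
For readers less familiar with the terminology, the loss and the update rule can be written out as in the sketch below. It is a NumPy illustration, not the TensorFlow code used in the study. The default learning rate, momentum, and weight decay are the values listed in Table II, whereas the exact form of the momentum update and the treatment of weight decay as an L2 term added to the gradient are assumptions.

```python
import numpy as np


def multinomial_logistic_loss(logits, labels):
    """Mean softmax cross-entropy; logits (n, classes), labels (n,) integer classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def sgd_momentum_step(weights, grad, velocity, lr=1e-4, momentum=0.9,
                      weight_decay=1e-3):
    """One stochastic gradient descent update with momentum and L2 weight decay."""
    grad = grad + weight_decay * weights       # L2 penalty (assumed form)
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity
```

With three classes and uninformative (all-zero) logits, the loss equals ln 3, which is a convenient sanity check at the start of training.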

Test
At test time, all slices for the patients in the test data were examined using a sliding window
approach to ensure predictions for all voxels. For CNNdeep and CNNshallow, the output for each
patch was a 64x64 image with the predicted probabilities of each voxel belonging to three classes
(background, healthy tissue and dead tissue). The result for a given voxel is the class-wise mean of
the probabilities that the voxel belonged to the three classes. For the CNNTmax, the outcome of the
network was used as the probability of the midpoint of the patch belonging to the healthy and
infarcted tissue classes. For GLM, predicted tissue risks were calculated for every voxel.
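
The sliding window evaluation for the segmentation networks can be sketched as follows. The stride of 32 voxels, the function names, and the rule of placing a final window flush with the image border are assumptions; the text specifies only the 64x64 patch output, the three classes, and the class-wise averaging. The argument predict_patch stands in for a trained network.

```python
import numpy as np


def window_starts(length, size, stride):
    """Start indices of windows that together cover the range [0, length)."""
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)  # final window flush with the border
    return starts


def sliding_window_predict(image, predict_patch, size=64, stride=32, n_classes=3):
    """Average class probabilities over all patches that cover each voxel.

    image: (channels, height, width), with height and width at least `size`.
    predict_patch: maps a (channels, size, size) patch to
    (n_classes, size, size) probabilities.
    """
    _, height, width = image.shape
    prob_sum = np.zeros((n_classes, height, width))
    counts = np.zeros((height, width))
    for r in window_starts(height, size, stride):
        for c in window_starts(width, size, stride):
            prob_sum[:, r:r + size, c:c + size] += predict_patch(
                image[:, r:r + size, c:c + size])
            counts[r:r + size, c:c + size] += 1
    return prob_sum / counts
```

Averaging over overlapping windows is one way to obtain the smoother predictive images mentioned earlier, since each voxel is seen in several spatial contexts.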

                                     CNNdeep            CNNshallow   CNNTmax
Convolutional layers  Number         26                 2            2
                      Kernel size    3x3                3x3          6x6
                      Filters        2∙64, 2∙128,       64, 3        12, 12
                                     3∙256, 11∙512,
                                     3∙256, 2∙128,
                                     2∙64, 3
                      Stride         1                  1            1
                      Pad            1                  1            0
ReLU                  Number         25                 2            -
Downsampling layers   Number         5                  0            2
                      Type           Convolution        -            Average
                      Kernel size    2                  -            2
                      Stride         2                  -            2
Upsampling layers     Number         5                  0            0
                      Type           Convolution        -            -
                      Kernel size    2                  -            -
                      Stride         2                  -            -
Hyperparameters       Learning rate  0.0001
                      Momentum       0.9
                      Weight decay   0.001
                      Epochs         100
Table II CNN implementation details
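
The CNNdeep column of Table II can be cross-checked with a few lines of arithmetic, sketched below. The filter list, the nine input channels, the 64x64 patch size, and the five stride-2 downsampling layers are taken from the table and text. The parameter count covers the 3x3 convolutions only and assumes one bias per filter, so it should be read as a rough lower bound rather than the true model size.

```python
# (repeats, filters) for the 3x3 convolutions of CNNdeep, as listed in Table II.
CNN_DEEP_FILTERS = [(2, 64), (2, 128), (3, 256), (11, 512),
                    (3, 256), (2, 128), (2, 64), (1, 3)]
INPUT_CHANNELS = 9   # nine input biomarkers
PATCH_SIZE = 64      # 64x64 input patches
N_DOWNSAMPLING = 5   # stride-2 downsampling layers


def count_conv_layers(filters):
    """Total number of 3x3 convolutional layers in the filter list."""
    return sum(repeats for repeats, _ in filters)


def bottleneck_size(patch_size, n_downsampling, stride=2):
    """Spatial side length after all downsampling layers have been applied."""
    return patch_size // stride ** n_downsampling


def count_conv3x3_parameters(filters, in_channels, kernel=3):
    """Weights plus biases of the 3x3 convolutions only (biases are an assumption)."""
    total = 0
    for repeats, out_channels in filters:
        for _ in range(repeats):
            total += kernel * kernel * in_channels * out_channels + out_channels
            in_channels = out_channels
    return total


n_conv = count_conv_layers(CNN_DEEP_FILTERS)
bottleneck = bottleneck_size(PATCH_SIZE, N_DOWNSAMPLING)
n_params = count_conv3x3_parameters(CNN_DEEP_FILTERS, INPUT_CHANNELS)
```

The filter list sums to 26 convolutional layers, matching the table, and a 64x64 patch shrinks to 2x2 at the deepest point before the five upsampling layers restore the original size.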
Thoughts on sample size

The CNNs proposed in this paper have many parameters, making overfitting and learning ability
important topics. To our knowledge, the deep learning literature still lacks formal methods for
determining sample size for fitting a specific model. These are our thoughts on the sample size of our
models.

For traditional statistics using p-values, there is a risk of erroneous conclusions with very small
samples12. In machine learning, the risk of overfitting is smaller due to the use of independent test
sets in the performance evaluation. In fitting a classic regression model, the optimal number of
parameters is unknown and regularization or hypothesis testing is used to find the optimal
parameter combination. Model complexity determined via regularization is typically done by cross-
validation, iteratively finding the most important parameters for a subset of the data and then
assessing performance on an independent test set. With deep learning models, we employ
mini-batch stochastic gradient descent, which in practice acts as an implicit cross-validation
during the training process (each batch is a subset of the data, and if the model adapts too well
to a batch, it will perform poorly on the next).
The performance of a deep learning model is influenced by a combination of network architecture,
optimization method, hyperparameters, and training data. These models are by construction prone
to overfitting. To account for the overfitting, we used batch normalization known to act as (among
other things) an effective regularizer13 and carefully tuned our hyperparameters through several
training runs.

Furthermore, we used mini-batch stochastic gradient descent to train the algorithm with a batch size
of 40 patches for each iteration. Therefore, the model does not have access to the entire training set
in each iteration, making it less prone to overfitting. If the model adapts too much to a batch, it will
be punished by poor performance on the next batch.
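
The batching scheme can be made concrete with the sketch below. It assumes that the 50,000 patches are shuffled and split into non-overlapping batches once per epoch, which the text does not state explicitly; the function name and seed are hypothetical.

```python
import numpy as np


def iterate_minibatches(n_samples, batch_size=40, seed=0):
    """Yield index arrays for one epoch of shuffled, non-overlapping mini-batches."""
    order = np.random.default_rng(seed).permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


batches = list(iterate_minibatches(50_000))
iterations_per_epoch = len(batches)            # 50,000 patches / 40 per batch
total_iterations = iterations_per_epoch * 100  # 100 epochs, as in Table II
```

Under these assumptions each weight update sees 40 of the 50,000 patches, that is, 0.08% of the training set.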
We have taken several measures to assess the magnitude of potential overfitting. One of them is to
compare the model’s performance on the training set and test set. We have calculated the AUC on
the training set to be (0.97±0.06) compared to the test set AUC (0.88±0.12). As expected, the
training set AUC is higher than the test set AUC because, when calculating the training set AUC,
the same data is used to train the model and assess its error.
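
The AUC used for this comparison can be computed as in the sketch below, which relies on the equivalence between the area under the receiver operating characteristic curve and the probability that a randomly chosen infarcted voxel receives a higher predicted risk than a randomly chosen healthy voxel. The function name is hypothetical, and how the AUC values were aggregated into mean±SD is not specified here, so that step is left out.

```python
import numpy as np


def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a win.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

The pairwise comparison needs memory proportional to the number of positive voxels times the number of negative voxels, so a rank-based formulation would be preferable for whole-brain voxel counts.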
Furthermore, we included another regularizing mechanism to prevent overfitting by training the
CNNdeep network using dropout14 in the middle layers (as in Kendall et al.15). This yielded a
similar AUC on the training set (0.97±0.03), but a lower AUC on the test set (0.80±0.22),
indicating an inferior ability to adapt to the data.

To assess the risk of overestimating the test performance, we trained the same model with half of
the data in the training set. This resulted in a training set AUC at (0.98±0.03) and a test set AUC at
(0.84±0.16). As expected, CNNdeep yields a higher AUC on the training set when trained on a
smaller sample size (small variance in batches yields a higher AUC), but a lower AUC on the
original test set. This shows that even with a small amount of data, a high AUC on the training set
does not imply a high AUC on the independent test set, which is the performance measure reported
in the manuscript.

These results indicate that training using mini-batch stochastic gradient descent is a strong tool to
avoid overfitting.
We want to emphasize that even though we obtain a higher AUC on the training set than on the test
set, the models’ performances are evaluated solely on the independent test set. CNNdeep is most
prone to overfitting and still surpasses all the other models measured by performance, which is the
main point of the manuscript.
However, an increased amount of training data will probably increase CNNdeep’s performance, and
we consider collecting more data an important step towards improving our model.

Deep convolutional neural networks like CNNdeep contain a lot of parameters. An interesting
question is how much data is needed to fit the number of parameters in the network.
Although 187 rtPA-treated patients (as used in this article) are a relatively small sample of patients,
the network is trained on 4270 slices with 128x128 pixels each, which are further presented to the
network as 50,000 image patches (64x64 randomly sampled image ‘sub-windows’). Although
patches and slices within patients are not fully independent, they nevertheless serve to exemplify
stroke progression complexity to the network. This subsampling approach has been shown to
improve independent test performance in earlier studies16-18.
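
The data volumes quoted above can be put side by side with a short calculation, sketched here. All four input numbers come from the paragraph; the variable names are hypothetical, and the ratio is only an average, since randomly sampled patches overlap unevenly.

```python
n_slices = 4270
slice_voxels = 128 * 128
n_patches = 50_000
patch_voxels = 64 * 64

voxels_in_slices = n_slices * slice_voxels    # distinct voxel positions
voxels_in_patches = n_patches * patch_voxels  # voxel samples, counted with overlap
mean_reuse = voxels_in_patches / voxels_in_slices
```

The patches contain about 205 million voxel samples drawn from about 70 million voxel positions, so each position appears in roughly three patches on average.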
References
1. Alawneh JA, Jones PS, Mikkelsen IK, Cho TH, Siemonsen S, Mouridsen K, et al. Infarction of 'non-core-non-penumbral' tissue after stroke: multivariate modelling of clinical impact. Brain. 2011;134:1765-1776.
2. Engedal TS, Hjort N, Hougaard KD, Simonsen CZ, Andersen G, Mikkelsen IK, et al. Transit time homogenization in ischemic stroke - a novel biomarker of penumbral microvascular failure? J Cereb Blood Flow Metab. 2017:271678X17721666.
3. I-KNOW. Integrating information from molecules to man: knowledge discovery accelerates drug development and personalized treatment in acute stroke. 2006.
4. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Thomsen RB, et al. Remote ischemic perconditioning in thrombolysed stroke patients: randomized study of activating endogenous neuroprotection - design and MRI measurements. Int J Stroke. 2013;8:141-146.
5. Hougaard KD, Hjort N, Zeidler D, Sørensen L, Nørgaard A, Hansen TM, et al. Remote ischemic perconditioning as an adjunct therapy to thrombolysis in patients with acute ischemic stroke: a randomized trial. Stroke. 2014;45:159-167.
6. Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. ArXiv e-prints. 2015;1511.
7. Wu O, Sumii T, Asahi M, Sasamata M, Ostergaard L, Rosen BR, et al. Infarct prediction and treatment assessment with MRI-based algorithms in experimental stroke models. J Cereb Blood Flow Metab. 2007;27:196-204.
8. Wu O, Koroshetz WJ, Østergaard L, Buonanno FS, Copen WA, Gonzalez RG, et al. Predicting tissue outcome in acute human cerebral ischemia using combined diffusion- and perfusion-weighted MR imaging. Stroke. 2001;32:933-942.
9. Stier N, Vincent N, Liebeskind D, Scalzo F. Deep learning of tissue fate features in acute ischemic stroke. IEEE Int Conf Bioinformatics Biomed. 2015:1316-1321.
10. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. Tensorflow: large-scale machine learning on heterogeneous distributed systems. ArXiv e-prints. 2016;1603.
11. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
12. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14:365-376.
13. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv e-prints. 2015;1502.
14. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929-1958.
15. Kendall A, Badrinarayanan V, Cipolla R. Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. ArXiv e-prints. 2015;1511.
16. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Advances in neural information processing systems 25. Curran Associates, Inc.; 2012:1097-1105.
17. Pinheiro PHO, Collobert R. Recurrent convolutional neural networks for scene parsing. ArXiv e-prints. 2013;1306.
18. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. Proc IEEE Conf Comput Vis Pattern Recognit. 2015:3431-3440.
