PII: S0165-0327(19)31141-3
DOI: https://doi.org/10.1016/j.jad.2019.11.093
Reference: JAD 11340
Please cite this article as: Miché Marcel PhD , Studerus Erich PhD , Meyer Andrea Hans PhD ,
Gloster Andrew Thomas PhD , Beesdo-Baum Katja PhD , Wittchen Hans-Ulrich PhD ,
Lieb Roselind PhD , Prospective prediction of suicide attempts in community adolescents and
young adults, using regression methods and machine learning, Journal of Affective Disorders (2019),
doi: https://doi.org/10.1016/j.jad.2019.11.093
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
Highlights
To the best of our knowledge, this is the first study to apply Machine Learning (ML)
alongside conventional prediction models to predict future suicide attempts,
using data from a 10-year prospective longitudinal study.
We used a community sample with ages 14-34 years (full study period) that
covers the high-risk period for the first lifetime suicide attempt, which according
to the WHO (2014) is between 15-29 years of age.
We adhered to the TRIPOD guidelines (Collins et al., 2015) in order to increase
transparency and reproducibility, as well as to facilitate cross-study
comparisons.
We adhered to further recommendations in order to meet current standards for
studies that apply ML, for instance, we used the best current approach for
internal cross-validation, as recommended by Krstajic et al. (2014).
The overall prediction performance of all our selected models falls into the category
"very good," according to Šimundić (2009).
Running head: PREDICT SA – REGRESSION METHODS AND MACHINE LEARNING 2
1 University of Basel, Department of Psychology, Division of Clinical Psychology and Epidemiology,
Basel, Switzerland
2 University of Basel, Department of Psychology, Division of Personality and Developmental
Germany
6 Ludwig Maximilians University Munich, Department of Psychiatry and Psychotherapy, Munich,
Germany
Corresponding Author
Department of Psychology
University of Basel
Missionsstrasse 60-62
4055 Basel
Switzerland
Phone: 0041-61-2070278
Email: roselind.lieb@unibas.ch
Acknowledgments
This work is part of the Early Developmental Stages of Psychopathology (EDSP) Study and is
funded by the German Federal Ministry of Education and Research (BMBF) project nos.
01EB9405/6, 01EB9901/6, EB01016200, 01EB0140, and 01EB0440. Part of the field work
and analyses were also supported by grants LA1148/1-1, WI2246/1-1, WI 709/7-1, and WI 709/8-1. Principal investigators are Dr.
Hans-Ulrich Wittchen and Dr. Roselind Lieb, who take responsibility for the integrity of the
study data. Core staff members of the EDSP group are Dr. Kirsten von Sydow, Dr. Gabriele
Lachner, Dr. Axel Perkonigg, Dr. Peter Schuster, Dr. Michael Höfler, Dipl.-Psych. Holger
Sonntag, Dr. Tanja Brückl, Dipl.-Psych. Elzbieta Garczynski, Dr. Barbara Isensee, Dr. Agnes
Nocon, Dr. Chris Nelson, Dipl.-Inf. Hildegard Pfister, Dr. Victoria Reed, Dipl.-Soz. Barbara
Spiegel, Dr. Andrea Schreier, Dr. Ursula Wunderlich, Dr. Petra Zimmermann, Dr. Katja
Beesdo-Baum, Dr. Antje Bittner, Dr. Silke Behrendt, and Dr. Susanne Knappe. Scientific
advisors are Dr. Jules Angst (Zurich), Dr. Jürgen Margraf (Basel), Dr. Günther Esser
(Potsdam), Dr. Kathleen Merikangas (NIMH, Bethesda), and Dr. Ron Kessler (Harvard).
Dr. Katja Beesdo-Baum is currently funded by the BMBF (project nos. 01ER1303,
01ER1703).
Abstract
Background. The use of machine learning (ML) algorithms to study suicidality has recently
been recommended. Our aim was to explore whether ML approaches have the potential to
improve the prediction of suicide attempt (SA) risk, using data from the epidemiological
multiwave EDSP Study.
Methods. The EDSP Study prospectively assessed, over the course of 10 years, adolescents
and young adults aged 14–24 years at baseline. Of 3021 subjects, 2797 were eligible for
prospective analyses because they participated in at least one of the three follow-up
assessments. Sixteen baseline predictors, all selected a priori from the literature, were used to
predict follow-up SAs. Model performance was assessed using repeated nested 10-fold cross-
validation. As the main measure of predictive performance we used the area under the curve
(AUC).
Results. The mean AUCs of the four predictive models, logistic regression, lasso, ridge, and
random forest, ranged between 0.824 and 0.829, indicating very good performance in
distinguishing between a future SA case and a non-SA case in community adolescents and
young adults. Conclusions. When choosing an algorithm, other considerations, such as ease of
implementation, might in some instances lead to one algorithm being prioritized over another.
Keywords: Machine learning, future suicide attempt, prediction, adolescents and young adults
Introduction
Suicide research has suggested many correlates and some risk factors for completed
suicide, suicide attempt (SA), and suicidal ideation. Nonetheless, according to a recent meta-
analysis, the ability to accurately predict SAs remains poor, rarely exceeding chance
level. To improve predictive accuracy, machine learning (ML) algorithms have been recommended (Bentley et al., 2016; Franklin et
al., 2017; Walsh et al., 2018, 2017), in addition to the use of more traditional statistical
approaches, for example, multiple logistic regression (for a brief comparison of both
approaches see Bennett et al., 2019). One advantage of ML algorithms is that they can better
deal with the problem of "overfitting." Overfitting occurs when a statistical model fits well
with one data set, yet fails to accurately predict new observations, a problem for which the
ML framework provides several solutions, for example, adjusting the flexibility with which
the model learns from the data in order to control the degree of overfitting (Krstajic et al.,
2014).
In suicidality research, some studies that have applied ML have found that suicide-related
outcomes can be predicted above chance level, for example, SA (Delgado-Gomez et al., 2012, 2011; Hettige
et al., 2017; Just et al., 2017; Mann et al., 2008; Passos et al., 2016; Simon et al., 2018; Walsh
et al., 2017) and for suicidal behavior (i.e., suicide and SA combined) (Barak-Corren et al.,
2017).
When dealing with categorical outcomes, prediction is often quantified using the area
under the receiver operating characteristic curve (AUC). Chance prediction is thereby defined
as an AUC of 0.5. Šimundić (2009) suggested five heuristic categories of AUC results that
she termed "bad" (0.5–0.59), "sufficient" (0.6–0.69), "good" (0.7–0.79), "very good" (0.8–
0.89), and "excellent" (0.9–1.0). Walsh et al. (2017) achieved very good prediction accuracy
for a future SA among adult patients, using electronic health record data (EHR; AUC range
0.80–0.84). Furthermore, the random forest model yielded a better prognostic performance
than multiple logistic regression (AUC range 0.66–0.68; Walsh et al., 2017). Walsh et al.
(2018) replicated this finding in a sample of adolescent patients and controls, again using
EHR data, with the random forest model yielding AUCs of more than 0.8, while logistic
regression yielded AUCs of less than 0.7. In the National Comorbidity Survey (NCS), a
community study of 15- to 54-year-olds, Kessler et al. (2016) reported that logistic regression
models performed somewhat worse than ML models for one of the outcomes examined (AUC: 0.70 vs. 0.76). Delgado-Gomez and colleagues (2012, 2011) also
compared SA prediction accuracies, applying both ML models, for example, support vector
machines (SVMs), and a traditional model, multiple linear regression, using questionnaire
data of almost 900 adults (admitted to an emergency department, inpatients, and blood
donors) in each of the two cross-sectional studies. In the first study, Delgado-Gomez et al.
(2011) reported that ML models outperformed the traditional model, for example, prediction
accuracy (with 100 being the best possible result) of SVM being 76.7 vs. 71.5 in the linear
regression model, whereas in the second study the ML models and the linear regression model
rendered comparable results (Delgado-Gomez et al., 2012). Other studies that reported an
overall measure of prediction performance with SA as outcome did not report any comparison
between ML and statistical models (Barak-Corren et al., 2017; Hettige et al., 2017; Mann et
al., 2008; Nock et al., 2018; Passos et al., 2016; Simon et al., 2018). While four of these other
SA prediction studies applied ML models only (AUCs ranging between 0.65 and 0.8; Barak-
Corren et al., 2017; Hettige et al., 2017; Mann et al., 2008; Passos et al., 2016), the other two
also applied techniques (e.g., replicated n-fold cross-validation) to control overfitting (AUCs being 0.85 [Simon et al.,
2018]). The present study extends this research by comparing conventional regression models with
ML models and/or techniques to control overfitting, and using prospective longitudinal data
from an epidemiological sample of adolescents and young adults (aged 14–24 years at
baseline). This age range can be regarded as a time of "high risk" for incident SA; in fact,
among 15- to 29-year-olds, suicide is the second leading cause of death (World Health
Organization [WHO], 2014). Thus the three properties, namely, prospective study design,
general community, and young age group, are important, both methodologically (e.g.,
temporally prospective vs. cross-sectional data analysis; Kraemer, 2010; Kraemer et al., 1997)
and practically. That is, in terms of testing the utility of ML approaches it is essential to
derive indicators that are able to help clinical decision makers, such as general practitioners or
pediatricians, better recognize the individual risk of a future SA (or suicide) as early as
possible. In the present study, we compared the prediction performance of
four prediction approaches, namely, three regression-based models (logistic, lasso, and ridge),
and one ML model (random forest), using the data of the epidemiological Early
Developmental Stages of Psychopathology (EDSP) Study.
Methods
Sample
In the EDSP Study, community adolescents and young adults were assessed up to four
times between 1995 and 2005. At baseline, participants were between 14 and 24 years of age.
The four assessments T0–T3 included sample sizes of, respectively, 3021 (T0, response =
70.9%), 1228 (T1, response = 88%, range 1.2–2.1 years after baseline), 2548 (T2, response =
84.3%, range 2.8–4.1 years after baseline), and 2210 (T3, response = 73.2%, range = 7.3–10.6
years after baseline). At baseline, T2, and T3, subjects from the full sample were assessed; at
T1 a subsample of those 14–17 years old at baseline was assessed. Subjects were selected
from the government registries of the greater Munich area, Germany; 14- to 15-year-olds
were sampled at twice the probability of 16- to 21-year-olds, whereas 22- to 24-year-olds
were sampled at half the probability. Sample weights were generated to account for this
sampling scheme. Further details of the EDSP Study methods, design, and sample
characteristics have been presented elsewhere (Beesdo-Baum et al., 2015; Lieb et al., 2000a;
Wittchen et al., 1998b). The EDSP project was reviewed by the Ethics Committee of the
Medical Faculty at the Dresden University of Technology. All participants provided informed
consent.
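The age-stratified sampling scheme just described implies inverse-probability design weights. The following is a minimal Python sketch of that idea (illustrative only; it is not the EDSP weighting code, which may include further adjustments):

```python
def design_weight(age: int) -> float:
    """Relative inverse-probability design weight for the sampling scheme
    described in the text: 14- to 15-year-olds were sampled at twice, and
    22- to 24-year-olds at half, the probability of 16- to 21-year-olds
    (reference stratum, weight 1.0)."""
    if 14 <= age <= 15:
        return 0.5   # sampled at 2x the reference probability -> half the weight
    if 16 <= age <= 21:
        return 1.0   # reference stratum
    if 22 <= age <= 24:
        return 2.0   # sampled at half the reference probability -> double the weight
    raise ValueError("age outside the 14-24-year baseline range")
```

With such relative weights, the oversampled younger stratum counts for less and the undersampled older stratum for more, so weighted estimates reflect the source population.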
Predictors
We selected 16 predictors. First, predictors were derived a priori from the research
literature on suicidality (Cha et al., 2018; Franklin et al., 2017; Miché et al., 2018; Nock et al.,
2008), as currently recommended for ML studies (e.g., Passos et al., 2016; Steyerberg, 2009).
Our literature-guided predictor selection was based on the broad risk and protective factor
categories presented in the extensive meta-analysis by Franklin et al. (2017) to ensure each of
our predictors maps onto one of these categories identified in the last 50 years of suicidality
research. Our predictors map onto the categories of demographics, cognitive abilities, family
or behaviors, social factors, and treatment history. Second, predictors were selected from the
EDSP baseline assessment only, in order to ensure the temporal order of predictors and the
outcome, that is, future SA (between T1 and T3). Third, we remained close to a recommended
event per variable (EPV) value of 10, that is, to have 10 outcome cases per predictor (Studerus et al.,
2017; Mushkudiani et al., 2008). Since we observed 137 future SAs, our EPV was 8.5. It should be
noted, however, that high EPV values are not as important in penalized regression methods as
in conventional regression models.
Of the 16 baseline predictors (in the following labeled with letters a–p), 10 were
assessed with the Munich-Composite International Diagnostic Interview
(DIA-X/M-CIDI; Wittchen and Pfister, 1997), a fully structured clinical interview for the
assessment of syndromes, symptoms, and mental disorders pertaining to the Diagnostic and
Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association,
1994), along with various items of personal information. The DIA-X/M-CIDI has shown good
to excellent reliability (Wittchen et al., 1998a) and validity (Reed et al., 1998). The baseline
predictors assessed with the DIA-X/M-CIDI were (a) sex, (b) age, (c) education, (d) the
number of DSM-IV lifetime mental diagnoses (including panic disorder [PD], agoraphobia
with or without PD, social phobia, specific phobia, generalized anxiety disorder, post-
traumatic stress disorder, obsessive compulsive disorder, major depressive disorder [MDD],
dysthymia, any bipolar disorder, nicotine dependence, alcohol abuse or dependence, drug
abuse or dependence, pain disorder, and any eating disorder), (e) the number of lifetime
traumatic events (including war experience, physical attack, natural disaster, serious accident,
or another traumatic event), (f) rape or childhood sexual abuse (excluded from predictor (e)), (g) parental loss or
separation, (h) prior help seeking for any kind of psychological difficulty, and (i) parental
psychopathology (assessed at baseline; for its criterion-related validity, see Lieb et al., 2000b). The baseline predictor (j),
prior SA (lifetime), as well as the outcome, future SA (follow-up), was assessed in section E
of the DIA-X/M-CIDI. At baseline the SA question read: "Have you ever attempted suicide?"
At each follow-up (DIA-X/M-CIDI interval versions) it read: "Since our last interview, have
you attempted suicide?" At both baseline and T1, only those participants who had confirmed
at least one of the MDD stem questions were asked the SA question (unavailable baseline
data on lifetime SA was set to "no SA"), whereas at both T2 and T3, all participants were
asked the SA question. Further baseline predictors were (k) behavioral inhibition (assessed
with the Retrospective Self-Report of Inhibition [RSRI]; Reznick et al., 1992), (l) subclinical
psychotic experiences during the previous 7 days (assessed with the SCL-90-R; Derogatis et
al., 1973), (m) negative life events in the previous 5 years (assessed with the Munich Life
Event List; Maier-Diewald et al., 1983), (n) daily hassles in the previous 2 weeks (assessed
with the Daily Hassles Scale; Perkonigg and Wittchen, 1995a), whether the participant was
(o) living in a rural area (population density of 553 inhabitants per square mile) or in an
urban area (population density of 4061 inhabitants per square mile) (Spauwen et al., 2004),
and (p) subjectively perceived coping efficacy within the next 6 months (assessed with the
German Scale for Self-Control and Coping Skills; Perkonigg and Wittchen, 1995b; higher
values indicating higher perceived coping efficacy).
Data analysis
The outcome predicted was a reported SA after baseline (binary: yes–no). We used
four prediction models: logistic regression, lasso, ridge (both penalized variants of logistic regression),
and random forest. All analyses were conducted in R,
version 3.3.3 (R Core Team, 2017). In the preprocessing of the data we excluded all cases
without any follow-up data (n = 224), or missing data (n = 4) in any predictor variable at
baseline, resulting in an N of 2793. Our chosen ML models could not deal with missing data
and since there were only four such cases, we did not see the need to apply imputation
methods, assuming that results would not be much different. The categories for the predictor
of education (low, middle, high, other) were modified by merging the categories low and
other, the latter representing a high-risk group of low educational attainment (endorsed by
2.7% of N = 2793). In our sample there were 137 future SA cases (weighted percentage =
4.9). For the application of all prediction models, we used the R package mlr (Machine
Learning in R; Bischl et al., 2016) in R (R Core Team, 2017).
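The case-exclusion and category-merging steps described above can be sketched as follows (a Python illustration with hypothetical field names; the actual analyses were run in R):

```python
def preprocess(records):
    """Preprocessing sketch: drop cases without any follow-up participation or
    with a missing baseline predictor value, and merge the 'other' education
    category into 'low'. Field names ('any_followup', 'predictors',
    'education') are illustrative, not the EDSP variable names."""
    kept = []
    for rec in records:
        if not rec.get("any_followup", False):
            continue  # no follow-up data: the outcome is undefined, case excluded
        if any(v is None for v in rec["predictors"].values()):
            continue  # the chosen models cannot handle missing predictor values
        rec = dict(rec, predictors=dict(rec["predictors"]))  # shallow copy
        if rec["predictors"]["education"] == "other":
            rec["predictors"]["education"] = "low"  # merge high-risk 'other' into 'low'
        kept.append(rec)
    return kept
```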
As regression-based models, we used standard multiple logistic regression (simultaneously entering all 16
predictors, yet testing for collinearity, with maximum absolute correlation between predictors
of 0.4 and a maximum variance inflation factor of 1.74) and two other models of the logistic
regression family, lasso and ridge, that include an additional parameter for penalizing factors
with low predictive contributions. The ML model we selected, random forest, belongs to the
family of ensemble classifiers. Random forests have been shown to make the best predictions
across diverse data sets in comparison to many other algorithms, for example, neural
networks (Fernández-Delgado et al., 2014). The single prediction models were computed by
mlr (Bischl et al., 2016), accessing the R-packages that were relevant for our analyses: For
logistic regression this was the R base package stats; for both lasso and ridge this was
LiblineaR (Helleputte, 2017); and for the random forest model this was the ranger package.
The procedure of obtaining the final results in mlr (Bischl et al., 2016) consisted of the
following steps:
First, each prediction model weighted all 16 predictors, which were entered simultaneously. Second, in
line with the Transparent Reporting of a multivariable prediction
Model for Individual Prognosis or Diagnosis (TRIPOD) statement (Collins et al., 2015), we
selected performance measures relating to both discrimination and calibration, with the
former measuring a model’s ability to accurately discriminate new outcome cases and the
latter the agreement between predicted and observed outcome rates (i.e.,
calibration). We chose the AUC as the measure of discrimination, which summarizes the
trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity), across all
possible thresholds of predicted probabilities (from 0 to 1), according to which each observed
case is assigned to the outcome class of either 0 (no event) or 1 (event). As the measure of
overall performance, we chose the scaled Brier score. The best model performance is denoted
by the highest scaled Brier score, which is conceptually similar to Pearson’s R² statistic
(Steyerberg et al., 2010). Calibration denotes one particular aspect of a prediction model’s
accuracy, namely the agreement of predicted SA risk and actually observed SA rates (Alba et
al., 2017; Steyerberg, 2009; Studerus et al., 2017). Due to limitations of the AUC in
imbalanced datasets (e.g., Lobo et al., 2008), where the outcome group is much smaller than
the non-outcome group, we additionally report two other important performance metrics:
sensitivity (in ML termed recall) and positive predictive value (PPV; in ML termed
precision). Whereas sensitivity describes the proportion of those the model classifies as
having the outcome (testing positive) among those who actually have the outcome, PPV
describes the proportion of those who actually have the outcome among those who tested
positive. Values for both sensitivity and PPV can range between 0 (worst) and 1 (best).
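All three metrics can be computed directly from predicted probabilities and observed outcomes; a stdlib Python sketch (rank-based AUC, not the mlr implementation used in the study):

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive case
    receives a higher predicted risk than a randomly chosen negative case
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_ppv(labels, scores, threshold=0.5):
    """Sensitivity (recall) and PPV (precision) at a fixed probability cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tp / (tp + fp)
```

Note that the AUC integrates over all thresholds, whereas sensitivity and PPV are tied to one specific cutoff, which is why the two kinds of measures can rank models differently.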
Repeated nested cross-validation is the internal validation method
of choice (Krstajic et al., 2014), whenever the gold standard, external validation, cannot be
applied (Bleeker et al., 2003). Internal cross-validation includes the strict separation of a
given data set into a training data set, used to build a prediction model, and a test data set,
used to validate the model (Steyerberg, 2009; Studerus et al., 2017). Repeated nested cross-
validation is a two-stage process. At stage 1, the selected hyperparameters of the model are
tuned, such that the model’s performance is optimized, as measured on a validation data set.
Hyperparameters are different from the standard model parameters (e.g., weights in a
regression model) in that they do not represent the learning from the data itself but instead
define higher level properties of the model, which cannot be learned from the data. Tuning of
hyperparameters means specifying how the model will learn from the data, for example, the
degree of model complexity, for which we used an automated grid search with 10 different
hyperparameter values. For the lasso and ridge regression models we chose to tune the
parameter cost (cost of constraints violation) in the range of 0.001 and 0.3. For the random
forest model we chose to tune the parameter mtry (number of variables randomly sampled as
candidates at each split) in the range of 1 and 16, while keeping the parameter ntree (number
of trees to grow) at its default value of 500, because tuning this parameter is generally not
recommended (Probst and Boulesteix, 2018). For selecting the best tuning-based prediction
model, we used bootstrapping (a resampling approach that is also
useful for imbalanced class sizes). Bootstrapping generates multiple samples
from and of the same size as the original data set. The training of the model uses the sampled
cases (for each of the 10 hyperparameter values), after which the model is validated on the so-
called out-of-bag data, which has not been used for model building in the respective bootstrap
sample; in this way, overfitting is strongly avoided (Kuhn and Johnson, 2013, p. 78). At stage 2 of the repeated
nested cross-validation, the optimal prediction model of stage 1 is used, with the aim of
estimating this model’s final prediction performance, for which we used 10-fold
cross-validation, repeated 10 times. Repeated cross-validation is
recommended for several reasons, for example, to obtain robust estimates of model
performance (Kuhn and Johnson, 2013, p. 78). With this setup we obtained 100 estimates of
model performance.
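The two-stage procedure can be outlined in a compact Python sketch (the study itself used mlr in R; `fit` and `score` below are placeholders for any learner and performance measure, and fold/bootstrap counts are parameters):

```python
import random
from statistics import mean

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def tune_by_bootstrap(train, grid, fit, score, rng, n_boot=25):
    """Inner stage: pick the hyperparameter value with the best mean
    out-of-bag (OOB) score across bootstrap resamples of the training set."""
    def oob_score(h):
        scores = []
        for _ in range(n_boot):
            boot = [rng.randrange(len(train)) for _ in range(len(train))]
            oob = [i for i in range(len(train)) if i not in set(boot)]
            if not oob:
                continue  # degenerate bootstrap sample, skip
            model = fit([train[i] for i in boot], h)
            scores.append(score(model, [train[i] for i in oob]))
        return mean(scores)
    return max(grid, key=oob_score)

def repeated_nested_cv(data, grid, fit, score, k=10, repeats=10, seed=1):
    """Outer stage: k-fold CV repeated `repeats` times; each outer fold is
    scored with a model tuned only on that fold's training data, yielding
    k * repeats performance estimates (100 for k=10, repeats=10)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        for fold in kfold_indices(len(data), k, rng):
            test = [data[i] for i in fold]
            train = [d for i, d in enumerate(data) if i not in set(fold)]
            best_h = tune_by_bootstrap(train, grid, fit, score, rng)
            estimates.append(score(fit(train, best_h), test))
    return estimates
```

The essential point is that the test fold never influences hyperparameter tuning, which is what keeps the 100 outer estimates honest about out-of-sample performance.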
Results
Means and medians of AUC and scaled Brier score of all four models are shown in
Table 1. AUC values were very similar among the four models for both mean (0.824–0.829)
and median (0.822–0.830), with strongly overlapping boxplots (Fig. 1). The scaled Brier
score was highest for the ridge model (mean: 0.466, median: 0.461) while the values of the
other three models ranged between 0.136 and 0.245 (mean) and 0.167 and 0.246 (median).
Mean sensitivity and positive predictive value (PPV), each based on a predicted
probability cutoff of 0.5, are shown in Table 1. Mean sensitivity ranged from 2.8% (random
forest) to 25% (ridge regression), whereas both logistic and lasso regression showed similar
sensitivities of around 22%. PPV among the logistic regression family fell into a close range
of between 66% and 72%, whereas the random forest achieved a PPV of 87%.
- Figure 1 here -
- Table 1 here -
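The scaled Brier score used in Table 1 can be computed as follows (a Python sketch of the formulation in Steyerberg et al., 2010, in which the reference is a non-informative model predicting the outcome prevalence for everyone):

```python
def scaled_brier(labels, probs):
    """Scaled Brier score: 1 - Brier / Brier_max, where Brier_max is the
    Brier score of a model that predicts the outcome prevalence for every
    subject. 1 = perfect predictions; 0 = no better than the prevalence."""
    n = len(labels)
    brier = sum((p - y) ** 2 for y, p in zip(labels, probs)) / n
    prev = sum(labels) / n
    brier_max = prev * (1 - prev)
    return 1 - brier / brier_max
```

Because it penalizes miscalibrated probabilities as well as poor discrimination, two models with near-identical AUCs can still differ markedly on this score, as observed for the ridge model above.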
Predictor importance
Predictor importance values for each model are summarized in Table 2. In all four
prediction models, the most important predictor was prior SA. In the logistic-regression-based
models, it increased the odds of a future SA by 57% (logistic), 55% (lasso), and 14% (ridge).
All following ranks, that is, ranks 2 to 16, were not consistent across all four models. Whereas
education ranked second in the logistic and lasso models (33% and 30% risk decrease,
respectively), prior help seeking ranked second in the ridge model (5% risk increase), and
number of DSM-IV lifetime mental disorders ranked second in the random forest model. Prior
help seeking ranked third in all models except for the ridge model, showing a risk increase for
a future SA of around 30% (logistic and lasso models). In the ridge model the number of
DSM-IV lifetime mental disorders ranked third, with a risk increase of 4%. Negative life
events and psychotic experiences were discarded by the lasso model, indicating that these two
predictors contributed little to the prediction.
Regarding the overall predictor importance ranking, the logistic and lasso regression models
showed a 44% concordance, that is, 7 of 16 predictors had the exact same rank in both
models. Rank concordance ranged between 6% and 12% for all other possible comparisons of
two models. When permitting ranks per predictor to differ by a maximum of 1 between two
models, rank concordance increased to 100% when comparing logistic and lasso regression,
models, while for the other model comparisons concordance increased to between 19% and 38%. All three regression-based
models assigned similarly high ranks to the predictors parental loss or separation and
- Table 2 here -
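The rank-concordance comparison reported above (exact agreement, and agreement within a tolerance of one rank) amounts to a simple per-predictor count; a Python sketch:

```python
def rank_concordance(ranks_a, ranks_b, tolerance=0):
    """Share of predictors whose importance ranks in two models agree,
    optionally allowing the ranks to differ by at most `tolerance`.
    Both inputs list the rank of predictor i at position i."""
    hits = sum(1 for a, b in zip(ranks_a, ranks_b) if abs(a - b) <= tolerance)
    return hits / len(ranks_a)
```

For example, with `tolerance=0` this reproduces the "exact same rank" criterion, and with `tolerance=1` the relaxed criterion under which logistic and lasso rankings agreed completely.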
Discussion
All four prediction models, that is, logistic regression, lasso, ridge, and random forest,
yielded comparable prediction accuracies. According to categories of AUC results, our results
(median AUC ranging between 0.822 and 0.830) represent a very good prediction (Šimundić,
2009). In terms of Cohen’s d, our AUC results can be translated to an effect size of about 1.3
(Rice and Harris, 2005). When comparing the discriminative ability of our prediction models
with other studies predicting SA on the individual level, our results fit into the upper part of
the AUC range of 0.65–0.93 across these studies (Hettige et al., 2017; Kessler et al., 2016;
Mann et al., 2008; Passos et al., 2016; Simon et al., 2018; Walsh et al., 2018, 2017). However,
we refrain from comparisons with most of these studies, because of the fundamental
differences between them and our study, for instance, in terms of sample type (mostly patients
or army soldiers vs. community), sample size, study design (mostly cross-sectional or
electronic health record data vs. prospectively assessed data), and age group (almost
exclusively adults vs. adolescents and young adults). The study that offers the closest
comparability is the NCS study by Kessler et al. (2016) who also used a representative
community sample to prospectively predict SAs. However, the sample they used consisted of
a subsample of 1056 respondents (age range reported only for the full sample) with a DSM-
III-R (American Psychiatric Association, 1987) lifetime MDD diagnosis at baseline (1990-
1992), who were reinterviewed once 10–12 years after baseline. SA was reported by 4.5% of
those respondents. Whereas the ML models contained between 9 and 13 predictors, the
logistic models contained 23 predictors.
Several possible reasons might explain the difference between Kessler et al.’s (2016)
results for SA (AUC: 0.70 by logistic models, 0.76 by ML models) and our results for SA
(AUC: around 0.82 by both logistic models and the ML model). First and foremost, Kessler
et al. (2016) used prediction models that were developed using the baseline data (van Loo et
al., 2014; Wardenaar et al., 2014), and then applied these models independently to the follow-
up data. Other possible explanations for differing results might be sample source (NCS: MDD
diagnosis vs. EDSP Study: general community), diagnostic criteria (DSM-III-R vs. DSM-IV),
age range at baseline (15–54 years vs. 14–24 years), number of assessment waves within the
respective study period (two in 10–12 years vs. a maximum of four in 10 years), and number
of predictors used in both the logistic models and the ML models (23 for logistic and 9–13 for
ML vs. 16 in both logistic and ML). Notably, Kessler et al. (2016) did not use prior SA as one
of the predictors, which turned out to be the most important predictor across all of our
prediction models.
Unlike the AUC, the scaled Brier score does not come with recommended cut-off
categories. We can therefore only descriptively note that the ridge regression performed best
in terms of the scaled Brier score (combination of prediction accuracy and calibration),
whereas the other three models performed less well, with a 47% to 71% reduced scaled Brier
score. Interestingly, even though the ridge model showed no particularly increased AUC
values (see Fig. 1, left panel), the scaled Brier score markedly differed from the other models,
in terms of both the median and the variability (see Fig. 1, right panel).
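The AUC-to-effect-size translation cited earlier in the Discussion follows from the equal-variance binormal model, d = sqrt(2) * Phi^{-1}(AUC) (cf. Rice and Harris, 2005); a quick stdlib check:

```python
from statistics import NormalDist

def auc_to_cohens_d(auc_value: float) -> float:
    """Translate an AUC into Cohen's d under the equal-variance binormal
    model: d = sqrt(2) * inverse-normal-CDF(AUC)."""
    return 2 ** 0.5 * NormalDist().inv_cdf(auc_value)
```

For median AUCs of about 0.82-0.83, this yields d of roughly 1.3, matching the effect size reported above.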
Some studies reported better prediction performance of ML models compared with
conventional logistic or linear regression models (Delgado-Gomez et al., 2011; Kessler et al.,
2016; Walsh et al., 2018, 2017), whereas other studies (SA: Delgado-Gomez et al., 2012;
suicide: Kessler et al., 2017, 2015), including ours, reported comparable prediction
performances. Whether ML models outperform conventional regression models might
depend on several data-related properties, for example, on sample size (Hahn et al., 2017)
(ML prefers "big data"), on high-dimensional complexity (e.g., nonlinear associations, high-
order interactions) actually being present in the data (Walsh et al., 2018), on predictor sets
that contain different data types and sources (Lee et al., 2018), and, according to Walsh et al.
(2018), on how difficult group differences are to detect, which might be more difficult in two
relatively homogeneous groups (e.g., suicide ideators with vs. without SA) than in
heterogeneous groups (e.g., general community members with vs. without SA). Another
relevant property might be whether there is a sufficient number of outcome cases per predictor (the EPV
recommendation is 10; Studerus et al., 2017). On the one hand, the above-mentioned studies
reporting an ML advantage (Delgado-Gomez et al., 2011;
Kessler et al., 2016; Walsh et al., 2018, 2017) used patient or MDD-diagnosis samples of
various sizes (Ns ranging between 879 and over 33000), which additionally fulfill some of the
other criteria that the ML approach seems to favor. However, there are two studies by Kessler
et al. (2017, 2015) on U.S. army soldiers, both using suicide as outcome in different groups of
individuals. In these studies, conventional regression models performed as well as ML models, despite the large sample
sizes (between 40000 and 975000), despite a presumably high complexity in the actual data,
despite predictor sets of different data types and sources, and despite the homogeneity of the
samples, which might have made it somewhat difficult to detect group differences. Notably,
ML models were used, both to predict the outcome and to select a lower number of relevant
predictors, which then were used in discrete-time survival (Kessler et al., 2015) or logistic
regression (Kessler et al., 2017) models. Nonetheless, the overall prediction performance was
comparably high between conventional regression and ML models (0.84 vs. 0.85 [Kessler et
al., 2015] and 0.72 vs. 0.72 [Kessler et al., 2017]). Therefore, our study results might not be
fully explained by the above-mentioned criteria that favor the use of ML, which are not
completely met by the EDSP data. Of note, a current systematic review by Christodoulou et
al. (2019) found no performance benefit of ML over logistic regression for clinical prediction
models in medical fields such as cardiology or oncology. Similarly, Belsher et al. (2019) conclude that ML models currently
are not ready for clinical applications across health systems concerning SA and suicide
deaths, due to several critical concerns that in their view have remained unaddressed.
In addition to our main performance metric AUC, we also calculated both sensitivity
and PPV. All of these performance metrics may make most sense in combination, since each
captures a specific aspect of model performance. While the AUC is recommended by some
authors as a global model performance metric (e.g., Bradley, 1997), others acknowledge its
widespread use (Saito and Rehmsmeier, 2014), and yet others call for it to be abandoned or
replaced (Lobo et al., 2008; Wald and Bestwick, 2014). However, to date the AUC still seems
to be useful for comparing model performances across studies, which in our view is
somewhat less the case with sensitivity and the PPV. Unlike the AUC, sensitivity is not a
global measure applicable across all possible thresholds of predicted probabilities, but it is a
local measure for one specific threshold. The PPV depends on the outcome base rate, whereas
the AUC does not (Hajian-Tilaki, 2013), which makes comparison across studies difficult to
the degree that base rates differ. When applying both sensitivity and the PPV to compare our
models with each other, the approximate model performance equality (in terms of the AUC)
disappears. Instead, only the logistic regression family performs in a close range, with
sensitivities (for a predicted probabilities threshold of 0.5) being relatively low between 20%
and 25%, and PPVs (for the average outcome rate of about 5%) being fairly high between
66% and 72%. The random forest model, on the other hand, shows an extremely low
sensitivity of 3%, yet the highest PPV of 87% across all four models. We emphasize that the
AUC and measures such as sensitivity and PPV evaluate model performance very differently.
One important aspect that must not be neglected is the context in which one of these measures
is more appropriate than another. For instance, in a model comparison study such as this one,
the AUC is more appropriate since it captures overall model performance, whereas when it
comes to the clinical application of the model, finding and setting a probability threshold
(to balance specificity and sensitivity) by applying a loss/utility/cost function that depends on
the clinical context is more appropriate.
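The base-rate dependence of the PPV, in contrast to the AUC, can be made concrete with Bayes' rule. The following sketch uses illustrative sensitivity and specificity values, not values taken from our models:

```python
def ppv(sens, spec, base_rate):
    """Positive predictive value from sensitivity, specificity, and
    outcome base rate, via Bayes' rule."""
    true_pos = sens * base_rate
    false_pos = (1.0 - spec) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

# Same classifier (sensitivity 80%, specificity 90%), different base rates:
print(round(ppv(0.80, 0.90, 0.05), 3))  # base rate 5%  -> 0.296
print(round(ppv(0.80, 0.90, 0.20), 3))  # base rate 20% -> 0.667
```

At fixed sensitivity and specificity, quadrupling the outcome base rate here more than doubles the PPV, which is why PPVs are difficult to compare across studies whose outcome rates differ.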
Many ML models are considered black boxes (Gilpin et al., 2018); that is, even though
the importance of the predictors can be extracted from a model, the self-learning algorithm
might have used the predictors for computing the outcome in such a way that human beings
are not able to comprehend it, for example, via a 10th-order interaction. The random forest model
selected constructs as most important predictors that differed from those of the logistic
regression models. Even within the logistic regression models there were some differences
(see Table 2, e.g., logistic and ridge). This poses the difficult question of which predictor
selection mechanisms to "trust" when trying to interpret the results. Irrespective of this issue,
it is interesting to note that prior SA was the most important predictor across all models,
confirming this variable’s reputation as supplying the highest predictive power for a
subsequent SA (Borges et al., 2010, 2006; Brown et al., 2000; Glenn and Nock, 2014; Joiner
et al., 2005; Kuo et al., 2001; Nordström et al., 1995; Ribeiro et al., 2016; WHO, 2014). In
particular, we would emphasize that we compared the predictors’ rank across models, so the
magnitude of the coefficients should not be compared between nonpenalized and penalized
logistic regression on account of the coefficients being regularized (biased) in the latter case.
The second most important predictor was educational level in the logistic and lasso models.
This confirms the plausibility of this variable as being protective against SA, for example, in
that higher educational achievement in adolescence is associated with greater life satisfaction
(Crede et al., 2015). In the ridge model, prior psychological help seeking was selected as
second most important predictor, whereas it ranked third in the random forest, logistic and
lasso models, respectively. Prior psychological help seeking might thus be seen as indicating
a greater severity of psychological problem(s) or disorder(s) present at that time (Han et al.,
2018; Hom et al., 2015), which might serve as one possible explanation for the positive
association with SA. Finally, the number of prior mental disorders (comorbidity) has often
been found to be associated with SA (e.g., Bronisch and Wittchen, 1994; Lewinsohn et al.,
1995; Miché et al., 2018), which is confirmed by the random forest and the ridge regression models.
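As an illustration of how a ranked predictor importance can be extracted from a random forest, the following Python sketch uses scikit-learn's impurity-based importances on placeholder data; the study itself used the ranger implementation in R, so all names and settings below are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with 16 predictors, standing in for the EDSP baseline set.
X, y = make_classification(n_samples=500, n_features=16, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Rank predictors by impurity-based importance, analogous to the
# importance ranking reported in Table 2.
ranking = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
top_three = [idx for idx, _ in ranking[:3]]
print(top_three)
```

Note that such a ranking says which predictors the forest relied on, but not how it combined them, which is exactly the black-box concern discussed above.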
We want to mention several strengths of our study. First, to the best of our knowledge
this is the first study that applied ML procedures to prospectively predict SAs in community
adolescents and young adults (an assumption supported by a recent systematic review
on the use of ML in the study of suicidal behaviors; Burke et al., 2019), a group that is known
to be the high-risk group for first lifetime SA (WHO, 2014). Second, we used repeated nested
cross-validation, which Krstajic et al. (2014) recommended as the best approach for training
and testing a prediction model within a single dataset, that is, when external validation is
inapplicable. Third, we adhered to the reporting guidelines known as the TRIPOD statement
(Collins et al., 2015). This strength is also supported by two systematic reviews (Burke et al.,
2019; Christodoulou et al., 2019), which criticize the inconsistent reporting of
classifier performance across studies. Fourth, we used predictors that were a priori defined,
taken from the suicide literature. We assume that this and the EDSP data quality might have
led to the very good (Šimundić, 2009) discriminative ability of the predictive models we
applied.
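The repeated nested cross-validation scheme can be sketched as follows. This is a minimal illustration in Python/scikit-learn (the study itself used the mlr package in R), with placeholder data and a placeholder tuning grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder data standing in for the EDSP predictors/outcome (~5% positives).
X, y = make_classification(n_samples=300, n_features=16, weights=[0.95],
                           random_state=0)

scores = []
for rep in range(5):  # repeated nested CV: new fold splits in each repetition
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=rep)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    # The inner loop tunes the penalty strength; the outer loop estimates
    # performance on data never seen during tuning.
    model = GridSearchCV(LogisticRegression(penalty="l2", solver="liblinear"),
                         param_grid={"C": [0.01, 0.1, 1.0]},
                         cv=inner, scoring="roc_auc")
    scores.extend(cross_val_score(model, X, y, cv=outer, scoring="roc_auc"))

print(len(scores))  # 5 repetitions x 5 outer folds = 25 AUC estimates
```

Keeping hyperparameter tuning strictly inside the outer training folds is what prevents the optimistic bias that Krstajic et al. (2014) warn against.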
There are also limitations of our study. First, the predictive performance of ML
algorithms such as random forests depends on the sample size, with larger sample sizes at
times leading to an increased performance result (Raudys and Jain, 1991). In that respect our
sample size may be considered a weakness. It may also be argued that it is not sample size per
se which matters, but rather the relationship between predictor and outcome in the data, that is
ML techniques such as random forest may simply not be able to show their predictive
superiority with these data. Second, the study design meant that our predictors must be
conceived as distal, as opposed to
proximal. Future research on predicting individual SA risk should include both distal and
proximal risk factors, since the main purpose of predictive analytics is to offer tools for risk
assessment in the near future, rather than in the distant future. Third, we used self-reported
data, which is subject to several inherent biases, for example, recall bias. It also
means that we lacked data other than self-report, e.g., genetic or neuropsychological data.
Fourth, we did not apply external cross-validation, which is considered the gold-standard in
estimating the degree of overfitting and which might have yielded lower model performances
compared to our cross-validation procedure. Fifth, our outcome was assessed with a one-item
measure, which might have led to an increased misclassification rate, estimated by Millner et
al. (2015) to be 11%. However, this possible error rate must not be overstated either. Mazza et
al. (2011) empirically support the notion that single-item SA responses appear to be valid.
Sixth, there might have been undetected SA cases at T1, depending on whether participants
entered the MDD interview section. However, we consider this a minor limitation because T1
was the only one of the four EDSP waves where a subsample was assessed.
Despite these limitations, our study has shown that all four models resulted in a very
good overall ability to discriminate between individuals who attempt suicide in the future
from individuals who do not, in a high-risk sample of community adolescents and young
adults. This might be seen as a promising contribution to the ongoing pursuit of fruitfully
combining statistical and ML methods, aiming to improve SA risk assessment. A possible next
step with data from the general community might be to use the best model, or a combination of
models, as a basis for such risk assessment.
Author declaration
We wish to confirm that there are no known conflicts of interest associated with this
publication and there has been no significant financial support for this work that could have influenced its outcome.
We confirm that the manuscript has been read and approved by all named authors and that
there are no other persons who satisfied the criteria for authorship but are not listed. We
further confirm that the order of authors listed in the manuscript has been approved by all of
us.
We confirm that we have given due consideration to the protection of intellectual property
associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property.
We understand that the Corresponding Author is the sole contact for the Editorial process
(including Editorial Manager and direct communications with the office). He/she is
responsible for communicating with the other authors about progress, submissions of
revisions and final approval of proofs. We confirm that we have provided a current, correct
email address which is accessible by the Corresponding Author and which has been configured to accept email from the journal.
Brief statement concerning each named author's contributions to the paper under the heading
Contributors:
Author Marcel Miché did the literature searches, undertook the statistical analyses, and wrote
the first draft of the manuscript.
Author Erich Studerus reviewed the statistical analyses and the reporting of our study results.
Author Andrea Meyer reviewed methodological parts of the manuscript.
Author Andrew Gloster reviewed the manuscript.
Conflict of Interest
Declarations of interest:
none.
References
Alba, A.C., Agoritsas, T., Walsh, M., Hanna, S., Iorio, A., Devereaux, P.J., McGinn, T.,
https://doi.org/10.1001/jama.2017.12126
Barak-Corren, Y., Castro, V.M., Javitt, S., Hoffnagle, A.G., Dai, Y., Perlis, R.H., Nock, M.K.,
Smoller, J.W., Reis, B.Y., 2017. Predicting Suicidal Behavior From Longitudinal
https://doi.org/10.1176/appi.ajp.2016.16010077
Beesdo-Baum, K., Knappe, S., Asselmann, E., Zimmermann, P., Bruckl, T., Hofler, M.,
Behrendt, S., Lieb, R., Wittchen, H.U., 2015. The "Early Developmental Stages of
1062-x
Belsher, B.E., Smolenski, D.J., Pruitt, L.D., Bush, N.E., Beech, E.H., Workman, D.E.,
Morgan, R.L., Evatt, D.P., Tucker, J., Skopp, N.A., 2019. Prediction Models for
Psychiatry. https://doi.org/10.1001/jamapsychiatry.2019.0174
Bennett, D., Silverstein, S.M., Niv, Y., 2019. The Two Cultures of Computational Psychiatry.
Bentley, K.H., Franklin, J.C., Ribeiro, J.D., Kleiman, E.M., Fox, K.R., Nock, M.K., 2016.
Anxiety and its disorders as risk factors for suicidal thoughts and behaviors: A meta-
https://doi.org/10.1016/j.cpr.2015.11.008
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G.,
Jones, Z.M., 2016. mlr: Machine Learning in R. J. Mach. Learn. Res. 17, 1–5.
Bleeker, S.E., Moll, H.A., Steyerberg, E.W., Donders, A.R.T., Derksen-Lubsen, G., Grobbee,
Borges, G., Angst, J., Nock, M.K., Ruscio, A.M., Walters, E.E., Kessler, R.C., 2006. A risk
index for 12-month suicide attempts in the National Comorbidity Survey Replication
Borges, G., Nock, M.K., Haro Abad, J.M., Hwang, I., Sampson, N.A., Alonso, J., Andrade,
L.H., Angermeyer, M.C., Beautrais, A., Bromet, E., Bruffaerts, R., de Girolamo, G.,
Florescu, S., Gureje, O., Hu, C., Karam, E.G., Kovess-Masfety, V., Lee, S., Levinson,
D., Medina-Mora, M.E., Ormel, J., Posada-Villa, J., Sagar, R., Tomov, T., Uda, H.,
Williams, D.R., Kessler, R.C., 2010. Twelve-Month Prevalence of and Risk Factors
for Suicide Attempts in the World Health Organization World Mental Health Surveys.
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine
3203(96)00142-2
Bronisch, T., Wittchen, H.U., 1994. Suicidal Ideation and Suicide Attempts - Comorbidity
Brown, G.K., Beck, A.T., Steer, R.A., Grisham, J.R., 2000. Risk factors for suicide in
371–377. https://doi.org/10.1037/0022-006X.68.3.371
Burke, T.A., Ammerman, B.A., Jacobucci, R., 2019. The use of machine learning in the study
Cha, C.B., Franz, P.J., Guzmán, E.M., Glenn, C.R., Kleiman, E.M., Nock, M.K., 2018.
https://doi.org/10.1111/jcpp.12831
Christodoulou, E., Ma, J., Collins, G.S., Steyerberg, E.W., Verbakel, J.Y., van Calster, B.,
https://doi.org/10.1016/j.jclinepi.2019.02.004
Collins, G.S., Reitsma, J.B., Altman, D.G., Moons, K., 2015. Transparent reporting of a
Crede, J., Wirthwein, L., McElvany, N., Steinmayr, R., 2015. Adolescents’ academic
achievement and life satisfaction: the role of parents’ education. Front. Psychol. 6, 52.
https://doi.org/10.3389/fpsyg.2015.00052
A., Baca-Garcia, E., 2011. Improving the accuracy of suicide attempter classification.
Delgado-Gomez, D., Blasco-Fontecilla, H., Sukno, F., Socorro Ramos-Plasencia, M., Baca-
https://doi.org/10.1016/j.neucom.2011.08.033
Derogatis, L.R., Lipman, R.S., Covi, L., 1973. The SCL-90-R: An outpatient psychiatric
9, 13–27.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we Need Hundreds of
Classifiers to Solve Real World Classification Problems? J. Mach. Learn. Res. 15,
3133–3181.
Franklin, J.C., Ribeiro, J.D., Fox, K.R., Bentley, K.H., Kleiman, E.M., Huang, X., Musacchio,
K.M., Jaroszewski, A.C., Chang, B.P., Nock, M.K., 2017. Risk factors for suicidal
187–232. https://doi.org/10.1037/bul0000084
Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L., 2018. Explaining
Glenn, C.R., Nock, M.K., 2014. Improving the Short-Term Prediction of Suicidal Behavior.
Hahn, T., Nierenberg, A.A., Whitfield-Gabrieli, S., 2017. Predictive analytics in mental
health: applications, guidelines, challenges and perspectives. Mol. Psychiatry 22, 37–
43. https://doi.org/10.1038/mp.2016.201
Hajian-Tilaki, K., 2013. Receiver Operating Characteristic (ROC) Curve Analysis for
Han, J., Batterham, P.J., Calear, A.L., Randall, R., 2018. Factors Influencing Professional
https://doi.org/10.1027/0227-5910/a000485
Helleputte, T., 2017. LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++
Library.
Hettige, N.C., Nguyen, T.B., Yuan, C., Rajakulendran, T., Baddour, J., Bhagwat, N., Bani-
Fatemi, A., Voineskos, A.N., Mallar Chakravarty, M., De Luca, V., 2017.
https://doi.org/10.1016/j.genhosppsych.2017.03.001
Hom, M.A., Stanley, I.H., Joiner, T.E., 2015. Evaluating factors and interventions that
https://doi.org/10.1016/j.cpr.2015.05.006
Joiner, T.E., Conwell, Y., Fitzpatrick, K.K., Witte, T.K., Schmidt, N.B., Berlim, M.T., Fleck,
M.P.A., Rudd, M.D., 2005. Four Studies on How Past and Current Suicidality Relate
Even When “Everything But the Kitchen Sink” Is Covaried. J. Abnorm. Psychol. 114,
291–303. https://doi.org/10.1037/0021-843X.114.2.291
Just, M.A., Pan, L., Cherkassky, V.L., McMakin, D.L., Cha, C., Nock, M.K., Brent, D., 2017.
0234-y
Kessler, R.C., Stein, M.B., Petukhova, M.V., Bliese, P., Bossarte, R.M., Bromet, E.J.,
Fullerton, C.S., Gilman, S.E., Ivany, C., Lewandowski-Romps, L., Millikan Bell, A.,
Naifeh, J.A., Nock, M.K., Reis, B.Y., Rosellini, A.J., Sampson, N.A., Zaslavsky,
A.M., Ursano, R.J., 2017. Predicting suicides after outpatient mental health visits in
the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS).
Kessler, R.C., van Loo, H.M., Wardenaar, K.J., Bossarte, R.M., Brenner, L.A., Cai, T., Ebert,
D.D., Hwang, I., Li, J., de Jonge, P., Nierenberg, A.A., Petukhova, M.V., Rosellini,
A.J., Sampson, N.A., Schoevers, R.A., Wilcox, M.A., Zaslavsky, A.M., 2016. Testing
https://doi.org/10.1038/mp.2015.198
Kessler, R.C., Warner, C.H., Ivany, C., Petukhova, M.V., Rose, S., Bromet, E.J., Brown, M.,
Cai, T., Colpe, L.J., Cox, K.L., Fullerton, C.S., Gilman, S.E., Gruber, M.J., Heeringa,
S.G., Lewandowski-Romps, L., Li, J., Millikan-Bell, A.M., Naifeh, J.A., Nock, M.K.,
Rosellini, A.J., Sampson, N.A., Schoenbaum, M., Stein, M.B., Wessely, S., Zaslavsky,
Army Soldiers: The Army Study to Assess Risk and Resilience in Servicemembers
https://doi.org/10.1001/jamapsychiatry.2014.1754
Kraemer, H.C., 2010. Epidemiological Methods: About Time. Int. J. Environ. Res. Public.
Kraemer, H.C., Kazdin, A.E., Offord, D.R., Kessler, R.C., Jensen, P.S., Kupfer, D.J., 1997.
Coming to terms with the terms of risk. Arch Gen Psychiatry 54, 337–343.
Krstajic, D., Buturovic, L.J., Leahy, D.E., Thomas, S., 2014. Cross-validation pitfalls when
https://doi.org/10.1186/1758-2946-6-10
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling, 5th ed. Springer, New York.
Kuo, W.-H., Gallo, J.J., Tien, A.Y., 2001. Incidence of suicide ideation and attempts in
Lee, Y., Ragguett, R.-M., Mansur, R.B., Boutilier, J.J., Rosenblat, J.D., Trevizol, A.,
Brietzke, E., Lin, K., Pan, Z., Subramaniapillai, M., Chan, T.C.Y., Fus, D., Park, C.,
Musial, N., Zuckerman, H., Chen, V.C.-H., Ho, R., Rong, C., McIntyre, R.S., 2018.
https://doi.org/10.1016/j.jad.2018.08.073
Lewinsohn, P.M., Rohde, P., Seeley, J.R., 1995. Adolescent Psychopathology: III. The
Lieb, R, Isensee, B., von Sydow, K., Wittchen, H.U., 2000a. The Early Developmental Stages
Lieb, Roselind, Wittchen, H.-U., Höfler, M., Fuetsch, M., Stein, M.B., Merikangas, K.R.,
2000b. Parental Psychopathology, Parenting Styles, and the Risk of Social Phobia in
859–866. https://doi.org/10.1001/archpsyc.57.9.859
Lobo, J.M., Jiménez-Valverde, A., Real, R., 2008. AUC: a misleading measure of the
https://doi.org/10.1111/j.1466-8238.2007.00358.x
Maier-Diewald, W., Wittchen, H.-U., Hecht, H., Werner-Eilert, K., 1983. Die Münchner
Munich.
Mann, J.J., Ellis, S.P., Waternaux, C.M., Liu, X., Oquendo, M.A., Malone, K.M., Brodsky,
B.S., Haas, G.L., Currier, D., 2008. Classification Trees Distinguish Suicide
Mazza, J.J., Catalano, R.F., Abbott, R.D., Haggerty, K.P., 2011. An Examination of the
Miché, M., Hofer, P.D., Voss, C., Meyer, A.H., Gloster, A.T., Beesdo-Baum, K., Lieb, R.,
2018. Mental disorders and the risk for the subsequent first suicide attempt: results of
a community study on adolescents and young adults. Eur. Child Adolesc. Psychiatry
Millner, A.J., Lee, M.D., Nock, M.K., 2015. Single-Item Measurement of Suicidal Behaviors:
Mushkudiani, N.A., Hukkelhoven, C.W.P.M., Hernández, A.V., Murray, G.D., Choi, S.C.,
https://doi.org/10.1016/j.jclinepi.2007.06.011
Nock, M.K., Borges, G., Bromet, E.J., Cha, C.B., Kessler, R.C., Lee, S., 2008. Suicide and
https://doi.org/10.1093/epirev/mxn002
Nock, M.K., Millner, A.J., Joiner, T.E., Gutierrez, P.M., Han, G., Hwang, I., King, A.,
Naifeh, J.A., Sampson, N.A., Zaslavsky, A.M., Stein, M.B., Ursano, R.J., Kessler,
R.C., 2018. Risk factors for the transition from suicide ideation to suicide attempt:
Results from the Army Study to Assess Risk and Resilience in Servicemembers
https://doi.org/10.1037/abn0000317
Nordström, P., Samuelsson, M., Åsberg, M., 1995. Survival analysis of suicide risk after
0447.1995.tb09791.x
Passos, I.C., Mwangi, B., Cao, B., Hamilton, J.E., Wu, M.-J., Zhang, X.Y., Zunta-Soares,
G.B., Quevedo, J., Kauer-Sant’Anna, M., Kapczinski, F., Soares, J.C., 2016.
pilot study using a machine learning approach. J. Affect. Disord. 193, 109–116.
https://doi.org/10.1016/j.jad.2015.12.066
Pavlou, M., Ambler, G., Seaman, S., De Iorio, M., Omar, R.Z., 2016. Review and evaluation
of penalised regression methods for risk prediction in low-dimensional data with few
Perkonigg, A., Wittchen, H.-U., 1995a. The Daily-Hassles Scale. Research version. Max
Probst, P., Boulesteix, A.-L., 2018. To Tune or Not to Tune the Number of Trees in Random
R Core Team, 2017. R: a language and environment for statistical computing. R Foundation
Raudys, S.J., Jain, A.K., 1991. Small sample size effects in statistical pattern recognition:
Recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell. 3, 252–
264.
Reed, V., Gander, F., Pfister, H., Steiger, A., Sonntag, H., Trenkwalder, C., Sonntag, A.,
Hundt, W., Wittchen, H.-U., 1998. To what degree does the Composite International
https://doi.org/10.1002/mpr.44
Reznick, J.S., Hegeman, I.M., Kaufman, E.R., Woods, S.W., Jacobs, M., 1992. Retrospective
and concurrent self-report of behavioral inhibition and their relation to adult mental
Ribeiro, J.D., Franklin, J.C., Fox, K.R., Bentley, K.H., Kleiman, E.M., Chang, B.P., Nock,
M.K., 2016. Self-injurious thoughts and behaviors as risk factors for future suicide
Rice, M.E., Harris, G.T., 2005. Comparing effect sizes in follow-up studies: ROC Area,
6832-7
Saito, T., Rehmsmeier, M., 2014. The Precision-Recall plot is more informative than the ROC
plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE 10,
e0118432. https://doi.org/10.6084/m9.figshare.1245061.v1
Simon, G.E., Johnson, E., Lawrence, J.M., Rossom, R.C., Ahmedani, B., Lynch, F.L., Beck,
A., Waitzfelder, B., Ziebell, R., Penfold, R.B., Shortreed, S.M., 2018. Predicting
Suicide Attempts and Suicide Deaths Following Outpatient Visits Using Electronic
https://doi.org/10.1176/appi.ajp.2018.17101167
Šimundić, A.-M., 2009. Measures of Diagnostic Accuracy: Basic Definitions. J. Int. Fed.
Spauwen, J., Krabbendam, L., Lieb, R., Wittchen, H.-U., van Os, J., 2004. Does urbanicity
https://doi.org/10.1016/j.jpsychires.2004.04.003
Validation, and Updating. In Statistics for Biology and Health (series eds. M
https://doi.org/10.1007/978-0-387-77244-8
Steyerberg, E.W., Vickers, A.J., Cook, N.R., Gerds, T., Gonen, M., Obuchowski, N., Pencina,
Studerus, E., Ramyead, A., Riecher-Rössler, A., 2017. Prediction of transition to psychosis in
patients with a clinical high risk for psychosis: a systematic review of methodology
https://doi.org/10.1017/S0033291716003494
van Loo, H.M., Cai, T., Gruber, M.J., Li, J., de Jonge, P., Petukhova, M., Rose, S., Sampson,
N.A., Schoevers, R.A., Wardenaar, K.J., Wilcox, M.A., Al-Hamzawi, A.O., Andrade,
L.H., Bromet, E.J., Bunting, B., Fayyad, J., Florescu, S.E., Gureje, O., Hu, C., Huang,
Y., Levinson, D., Medina-Mora, M.E., Nakane, Y., Posada-Villa, J., Scott, K.M.,
Xavier, M., Zarkov, Z., Kessler, R.C., 2014. MAJOR DEPRESSIVE DISORDER
https://doi.org/10.1002/da.22233
Wald, N., Bestwick, J., 2014. Is the area under an ROC curve a valid measure of the
https://doi.org/10.1177/0969141313517497
Walsh, C.G., Ribeiro, J.D., Franklin, J.C., 2018. Predicting suicide attempts in adolescents
with longitudinal clinical data and machine learning. J. Child Psychol. Psychiatry 59,
1261–1270. https://doi.org/10.1111/jcpp.12916
Walsh, C.G., Ribeiro, J.D., Franklin, J.C., 2017. Predicting Risk of Suicide Attempts Over
https://doi.org/10.1177/2167702617691560
Wardenaar, K.J., van Loo, H.M., Cai, T., Fava, M., Gruber, M.J., Li, J., de Jonge, P.,
Nierenberg, A.A., Petukhova, M.V., Rose, S., Sampson, N.A., Schoevers, R.A.,
Wilcox, M.A., Alonso, J., Bromet, E.J., Bunting, B., Florescu, S.E., Fukao, A., Gureje,
O., Hu, C., Huang, Y.Q., Karam, A.N., Levinson, D., Medina Mora, M.E., Posada-
Villa, J., Scott, K.M., Taib, N.I., Viana, M.C., Xavier, M., Zarkov, Z., Kessler, R.C.,
https://doi.org/10.1017/S0033291714000993
Wittchen, H.U., Lachner, G., Wunderlich, U., Pfister, H., 1998a. Test-retest reliability of the
https://doi.org/10.1007/s001270050095
Wittchen, H.U., Perkonigg, A., Lachner, G., Nelson, C.B., 1998b. Early Developmental
Stages of Psychopathology Study (EDSP): Objectives and design. Eur. Addict. Res. 4,
Wittchen, H.-U., Pfister, H., 1997. DIA-X-Interviews: Manual für Screening-Verfahren und
Document]. URL
https://apps.who.int/iris/bitstream/handle/10665/131056/9789241564779_eng.pdf
Wright, M.N., Ziegler, A., 2017. ranger: A Fast Implementation of Random Forests for High
https://doi.org/10.18637/jss.v077.i01
Yarkoni, T., Westfall, J., 2017. Choosing Prediction Over Explanation in Psychology:
https://doi.org/10.1177/1745691617693393
Table 1
Overview of the performance estimates for each prediction model.
Model          AUC (M)  AUC (Md)  BS (M)  BS (Md)  Sens   PPV
Logistic       0.828    0.825    0.179   0.190    0.223  0.704
Lasso          0.826    0.822    0.245   0.246    0.212  0.716
Ridge          0.829    0.830    0.466   0.461    0.251  0.658
Random forest  0.824    0.826    0.136   0.167    0.028  0.870
Note. AUC, area under the receiver operating characteristic (ROC) curve; M, mean; Md, median;
BS, Brier-scaled (scaled Brier score); Sens, sensitivity; PPV, positive predictive value.
Figure 1. Boxplot of 100 resampling results for each prediction model (see median results in
Table 1). Logistic, Logistic regression model; Rf, Random forest model. Left: Area under the
curve (AUC), including the AUC of 0.58 as reported in the meta-analysis by Franklin et al.
(2017). Right: Scaled Brier score, with values below zero indicating a model
performance/calibration inferior to that of a chance prediction model applied to the validation
dataset.
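The scaled Brier score referenced in the figure caption can be computed as 1 - BS/BS_ref, assuming the common definition in which BS_ref is the Brier score of a reference model that always predicts the observed base rate. A minimal sketch with illustrative probabilities:

```python
import numpy as np

def brier_scaled(y_true, y_prob):
    """Scaled Brier score: 1 - BS / BS_ref, where BS_ref is the Brier
    score of a reference model that always predicts the observed base
    rate. Values below zero indicate performance worse than that
    chance model."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bs = np.mean((y_prob - y_true) ** 2)
    bs_ref = np.mean((y_true.mean() - y_true) ** 2)
    return 1.0 - bs / bs_ref

# Illustrative: 1 case among 20, reasonably well-calibrated predictions.
y = [1] + [0] * 19
print(round(brier_scaled(y, [0.6] + [0.05] * 19), 3))  # -> 0.782
```

By construction, a model that always predicts the base rate scores exactly 0 on this scale, which is the chance reference mentioned in the caption.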
Table 2
Overview of the decreasing importance of the 16 baseline predictors for each prediction model.
Logistic Lasso Ridge Random forest
Predictor β OR Rank % β OR Rank* % β OR Rank* % Importance Rank*
Prior SA (j) 0.454 1.57 1 57.5 0.439 1.55 1 55.2 0.130 1.14 1 13.9 4.434 1
Education (c) -0.405 0.67 2 33.3 -0.361 0.70 2 30.3 -0.031 0.97 5 3.1 0.235 10
Prior help-seeking (h) 0.296 1.34 3 34.5 0.276 1.32 3 31.7 0.048 1.05 2 4.9 0.664 3
Any parental mental dx (i) 0.278 1.32 4 32.1 0.226 1.25 4 25.4 0.018 1.02 10 1.8 0.117 14
Parental loss or separation (g) 0.245 1.28 5 27.8 0.216 1.24 5 24.1 0.030 1.03 6 3.1 0.199 11
BI (k) 0.221 1.25 6 24.8 0.193 1.21 6 21.2 0.029 1.03 7 3.0 0.428 5
Number of mental dx (d) 0.187 1.21 7 20.6 0.164 1.18 7 17.8 0.043 1.04 3 4.3 1.020 2
DH (n) -0.171 0.84 8 15.7 -0.107 0.90 9 10.2 -0.007 0.99 16 0.7 0.323 7
Number of traumatic events (e) 0.157 1.17 9 17.0 0.109 1.12 8 11.6 0.016 1.02 11 1.6 -0.086 15
Age (b) -0.142 0.87 10 13.2 -0.097 0.91 11 9.3 -0.011 0.99 15 1.1 0.173 13
Rural (o) -0.142 0.87 11 13.2 -0.097 0.91 10 9.3 -0.014 0.99 12 1.4 0.182 12
Sex (a) 0.106 1.11 12 11.2 0.060 1.06 13 6.2 0.011 1.01 14 1.1 0.045 16
PCE (p) -0.099 0.91 13 9.4 -0.081 0.92 12 7.8 -0.019 0.98 9 1.9 0.300 8
NLE (m) -0.020 0.98 14 2.0 0.000 1.00 15 0.0 0.013 1.01 13 1.3 0.480 4
Rape/Childhood sexual abuse (f) 0.011 1.01 15 1.1 0.013 1.01 14 1.3 0.034 1.03 4 3.4 0.240 9
PE (l) 0.001 1.00 16 0.1 0.000 1.00 15 0.0 0.023 1.02 8 2.4 0.343 6
Note. Letter in brackets after each predictor corresponds to ordering of predictors in section "Selection and assessment of predictors"; Rank*, Order
according to the predictor ranking of the logistic regression model; β, beta-coefficient of the (penalized) logistic regression model; OR, odds ratio;
%, OR translated to percentage. The original importance values of the random forest model have been multiplied by 1000, to avoid having to
display too many digits. Prior SA, lifetime suicide attempt reported at baseline; Education, 1 = low, 2 = middle, 3 = high; dx, disorder; BI,
behavioral inhibition; Number of mental dx, number of DSM-IV diagnoses; DH, daily hassles; Rural, 0 = living in an urban area; PCE, perceived
coping efficacy (higher PCE values denote lower PCE); NLE, negative life events; PE, psychotic experiences.
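The beta-to-OR-to-percent translation in Table 2 can be reproduced as follows. This is a sketch assuming the percent column equals |OR - 1| x 100, an interpretation inferred from the reported values rather than taken from the analysis code:

```python
import math

def beta_to_or_and_percent(beta):
    """Translate a logistic regression coefficient (beta) into an odds
    ratio, and the odds ratio into a percent change in the odds.
    The |OR - 1| * 100 rule is an assumption inferred from Table 2."""
    odds_ratio = math.exp(beta)
    percent = abs(odds_ratio - 1.0) * 100.0
    return odds_ratio, percent

or_sa, pct_sa = beta_to_or_and_percent(0.454)     # Prior SA row
or_edu, pct_edu = beta_to_or_and_percent(-0.405)  # Education row
print(round(or_sa, 2), round(pct_sa, 1))    # -> 1.57 57.5
print(round(or_edu, 2), round(pct_edu, 1))  # -> 0.67 33.3
```

Both results match the corresponding Table 2 entries (OR 1.57 / 57.5% for prior SA; OR 0.67 / 33.3% for education), supporting this reading of the percent column.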