Marcel 2019

Journal Pre-proof
Prospective prediction of suicide attempts in community adolescents

and young adults, using regression methods and machine learning
Miché Marcel PhD , Studerus Erich PhD ,

Meyer Andrea Hans PhD , Gloster Andrew Thomas PhD ,
Beesdo-Baum Katja PhD , Wittchen Hans-Ulrich PhD ,
Lieb Roselind PhD
PII: S0165-0327(19)31141-3
DOI: https://doi.org/10.1016/j.jad.2019.11.093
Reference: JAD 11340
To appear in: Journal of Affective Disorders
Received date: 3 May 2019

Revised date: 20 September 2019
Accepted date: 12 November 2019
Please cite this article as: Miché Marcel PhD , Studerus Erich PhD , Meyer Andrea Hans PhD ,
Gloster Andrew Thomas PhD , Beesdo-Baum Katja PhD , Wittchen Hans-Ulrich PhD ,
Lieb Roselind PhD , Prospective prediction of suicide attempts in community adolescents and
young adults, using regression methods and machine learning, Journal of Affective Disorders (2019),
doi: https://doi.org/10.1016/j.jad.2019.11.093
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.

Running head: PREDICT SA – REGRESSION METHODS AND MACHINE LEARNING 1
Highlights
 First study – to the best of our knowledge – to apply Machine Learning (ML)
alongside conventional prediction models to predict future suicide attempts, by
using data from a 10-year prospective longitudinal study.
 We used a community sample with ages 14-34 years (full study period) that
covers the high-risk period for the first lifetime suicide attempt, which according
to the WHO (2014) is between 15-29 years of age.
 We adhered to the TRIPOD guidelines (Collins et al., 2015) in order to increase
transparency and reproducibility, as well as to facilitate cross-study
comparisons.
 We adhered to further recommendations in order to meet current standards for
studies that apply ML, for instance, we used the best current approach for
internal cross-validation, as recommended by Krstajic et al. (2014).
 Our overall prediction performance of our selected models all fall in the category
“very good”, according to Šimundić (2009).
Prospective prediction of suicide attempts in community adolescents and young
adults, using regression methods and machine learning
Miché, Marcel, PhD1
Studerus, Erich, PhD2
Meyer, Andrea Hans, PhD1
Gloster, Andrew Thomas, PhD3
Beesdo-Baum, Katja, PhD4, 5
Wittchen, Hans-Ulrich, PhD5, 6
Lieb, Roselind, PhD1
1
University of Basel, Department of Psychology, Division of Clinical Psychology and Epidemiology,
Basel, Switzerland
2
University of Basel, Department of Psychology, Division of Personality and Developmental
Psychology, Basel, Switzerland

3
University of Basel, Department of Psychology, Division of Clinical Psychology and Intervention
Science, Basel, Switzerland

4
Technische Universitaet Dresden, Department of Behavioral Epidemiology, Dresden, Germany
5
Technische Universitaet Dresden, Institute of Clinical Psychology and Psychotherapy, Dresden,
Germany
6
Ludwig Maximilians University Munich, Department of Psychiatry and Psychotherapy, Munich,
Germany
Corresponding Author
Roselind Lieb, PhD
Division of Clinical Psychology and Epidemiology
Department of Psychology
University of Basel
Missionsstrasse 60-62
4055 Basel
Switzerland
Phone: 0041-61-2070278
Email: roselind.lieb@unibas.ch
Acknowledgments
This work is part of the Early Developmental Stages of Psychopathology (EDSP) Study and is
funded by the German Federal Ministry of Education and Research (BMBF) project nos.
01EB9405/6, 01EB9901/6, EB01016200, 01EB0140, and 01EB0440. Part of the field work
and analyses were also additionally supported by Deutsche Forschungsgemeinschaft (DFG)
grants LA1148/1-1, WI2246/1-1, WI 709/7-1, and WI 709/8-1. Principal investigators are Dr.
Hans-Ulrich Wittchen and Dr. Roselind Lieb, who take responsibility for the integrity of the
study data. Core staff members of the EDSP group are Dr. Kirsten von Sydow, Dr. Gabriele
Lachner, Dr. Axel Perkonigg, Dr. Peter Schuster, Dr. Michael Höfler, Dipl.-Psych. Holger
Sonntag, Dr. Tanja Brückl, Dipl.-Psych. Elzbieta Garczynski, Dr. Barbara Isensee, Dr. Agnes
Nocon, Dr. Chris Nelson, Dipl.-Inf. Hildegard Pfister, Dr. Victoria Reed, Dipl.-Soz. Barbara
Spiegel, Dr. Andrea Schreier, Dr. Ursula Wunderlich, Dr. Petra Zimmermann, Dr. Katja
Beesdo-Baum, Dr. Antje Bittner, Dr. Silke Behrendt, and Dr. Susanne Knappe. Scientific
advisors are Dr. Jules Angst (Zurich), Dr. Jürgen Margraf (Basel), Dr. Günther Esser
(Potsdam), Dr. Kathleen Merikangas (NIMH, Bethesda), Dr. Ron Kessler (Harvard
University, Boston), and Dr. Jim van Os (Maastricht University).

Dr. Katja Beesdo-Baum is currently funded by the BMBF (project nos. 01ER1303,
01ER1703).
Abstract
Background. The use of machine learning (ML) algorithms to study suicidality has recently
been recommended. Our aim was to explore whether ML approaches have the potential to
improve the prediction of suicide attempt (SA) risk. Using the epidemiological multiwave
prospective-longitudinal Early Developmental Stages of Psychopathology (EDSP) data set,
we compared four algorithms—logistic regression, lasso, ridge, and random forest—in
predicting a future SA in a community sample of adolescents and young adults.
Methods. The EDSP Study prospectively assessed, over the course of 10 years, adolescents
and young adults aged 14–24 years at baseline. Of 3021 subjects, 2797 were eligible for
prospective analyses because they participated in at least one of the three follow-up
assessments. Sixteen baseline predictors, all selected a priori from the literature, were used to
predict follow-up SAs. Model performance was assessed using repeated nested 10-fold cross-
validation. As the main measure of predictive performance we used the area under the curve
(AUC).
Results. The mean AUCs of the four predictive models, logistic regression, lasso, ridge, and
random forest, were 0.828, 0.826, 0.829, and 0.824, respectively.
Conclusions. Based on our comparison, each algorithm performed equally well in
distinguishing between a future SA case and a non-SA case in community adolescents and
young adults. When choosing an algorithm, different considerations, however, such as ease of
implementation, might in some instances lead to one algorithm being prioritized over another.
Further research and replication studies are required in this regard.
Keywords: Machine learning, future suicide attempt, prediction, adolescents and young adults,
community sample, prospective design

Introduction
Suicide research has suggested many correlates and some risk factors for completed
suicide, suicide attempt (SA), and suicidal ideation. Nonetheless, according to a recent meta-
analysis, the ability to accurately predict SAs remains poor, rarely exceeding chance
prediction (Franklin et al., 2017). In an endeavor to increase accuracy of SA prediction,
machine learning (ML) algorithms have been recommended (Bentley et al., 2016; Franklin et
al., 2017; Walsh et al., 2018, 2017), in addition to the use of more traditional statistical
approaches, for example, multiple logistic regression (for a brief comparison of both
approaches see Bennett et al., 2019). One advantage of ML algorithms is that they can better
deal with the problem of ―overfitting.‖ Overfitting occurs where a statistical model fits well
with one data set, yet fails to accurately predict new observations, a problem for which the
ML framework provides several solutions, for example, adjusting the flexibility with which
the model will learn from the data in order to control the degree of overfitting (Krstajic et al.,
2014; Studerus et al., 2017; Yarkoni and Westfall, 2017).
In suicidality research, some studies that have applied ML have found that SA can be
predicted above chance level, for example for SA (Delgado-Gomez et al., 2012, 2011; Hettige
et al., 2017; Just et al., 2017; Mann et al., 2008; Passos et al., 2016; Simon et al., 2018; Walsh
et al., 2017) and for suicidal behavior (i.e., suicide and SA combined) (Barak-Corren et al.,
2017).
When dealing with categorical outcomes, prediction is often quantified using the area
under the receiver operating characteristic curve (AUC). Chance prediction is thereby defined
as an AUC of 0.5. Šimundić (2009) suggested five heuristic categories of AUC results that
she termed ―bad‖ (0.5–0.59), ―sufficient‖ (0.6–0.69), ―good‖ (0.7–0.79), ―very good‖ (0.8–
0.89), and ―excellent‖ (0.9–1.0). Walsh et al. (2017) achieved very good prediction accuracy
for a future SA among adult patients, using electronic health record data (EHR; AUC range
0.80–0.84). Furthermore, the random forest model yielded a better prognostic performance
than multiple logistic regression (AUC range 0.66–0.68; Walsh et al., 2017). Walsh et al.
(2018) replicated this finding in a sample of adolescent patients and controls, again using
EHR data, with the random forest model yielding AUCs of more than 0.8, while logistic
regression yielded AUCs of less than 0.7. In the National Comorbidity Survey (NCS), a
community study of 15- to 54-year-olds, Kessler et al. (2016) reported logistic regression
being outperformed by ML models across all reported depression-related outcomes, SA being
one of them (AUC: 0.70 vs. 0.76). Delgado-Gomez and colleagues (2012, 2011) also
compared SA prediction accuracies, applying both ML models, for example, support vector
machines (SVMs), and a traditional model, multiple linear regression, using questionnaire
data of almost 900 adults (admitted to an emergency department, inpatients, and blood
donors) in each of the two cross-sectional studies. In the first study, Delgado-Gomez et al.
(2011) reported that ML models outperformed the traditional model, for example, prediction
accuracy (with 100 being the best possible result) of SVM being 76.7 vs. 71.5 in the linear
regression model, whereas in the second study the ML models and the linear regression model
rendered comparable results (Delgado-Gomez et al., 2012). Other studies that reported an
overall measure of prediction performance with SA as outcome did not report any comparison
between ML and statistical models (Barak-Corren et al., 2017; Hettige et al., 2017; Mann et
al., 2008; Nock et al., 2018; Passos et al., 2016; Simon et al., 2018). While four of these other
SA prediction studies applied ML models only (AUCs ranging between 0.65 and 0.8; Barak-
Corren et al., 2017; Hettige et al., 2017; Mann et al., 2008; Passos et al., 2016), the other two
studies applied statistical models in combination with techniques (e.g., penalization,
replicated n-fold cross-validation) to control overfitting (AUCs being 0.85 [Simon et al.,
2018] and 0.93 [Nock et al., 2018]).
We are unaware of any study that investigated SA prediction accuracy by applying
ML models and/or techniques to control overfitting, and using prospective longitudinal data
from an epidemiological sample of adolescents and young adults (aged 14–24 years at
baseline). This age range can be regarded as a time of ―high risk‖ for incident SA; in fact
among 15- to 29-year-olds, suicide is the second leading cause of death (World Health
Organization [WHO], 2014). Thus the three properties, namely, prospective study design,
general community, and young age group, are important, both methodologically (e.g.,
temporally prospective vs. cross-sectional data analysis; Kraemer, 2010; Kraemer et al., 1997)
and practically. That is, in terms of testing the utility of ML approaches it is essential to
derive indicators that are able to help clinical decision makers, such as general practitioners or
pediatricians, better recognize the individual risk of a future SA (or suicide) as early as
possible. To explore the utility of ML approaches we examined the prediction accuracy of
four prediction approaches, namely, three regression-based models (logistic, lasso, and ridge),
and one ML model (random forest), using the data of the epidemiological Early
Developmental Stages of Psychopathology (EDSP) Study, which prospectively assessed
community adolescents and young adults over the course of 10 years.
Methods
Sample
In the EDSP Study, community adolescents and young adults were assessed up to four
times between 1995 and 2005. At baseline, participants were between 14 and 24 years of age.
The four assessments T0–T3 included sample sizes of, respectively, 3021 (T0, response =
70.9%), 1228 (T1, response = 88%, range 1.2–2.1 years after baseline), 2548 (T2, response =
84.3%, range 2.8–4.1 years after baseline), and 2210 (T3, response = 73.2%, range = 7.3–10.6
years after baseline). At baseline, T2, and T3, subjects from the full sample were assessed; at
T1 a subsample of those 14–17 years old at baseline was assessed. Subjects were selected
from the government registries of the greater Munich area, Germany; 14- to 15-year-olds
were sampled at twice the probability of 16- to 21-year-olds, whereas 22- to 24-year-olds
were sampled at half the probability. Sample weights were generated to account for this
sampling scheme. Further details of the EDSP Study methods, design, and sample
characteristics have been presented elsewhere (Beesdo-Baum et al., 2015; Lieb et al., 2000a;
Wittchen et al., 1998b). The EDSP project was reviewed by the Ethics Committee of the
Medical Faculty at the Dresden University of Technology. All participants provided informed
consent.
Selection and assessment of predictors
We selected 16 predictors. First, predictors were derived a priori from the research
literature on suicidality (Cha et al., 2018; Franklin et al., 2017; Miché et al., 2018; Nock et al.,
2008), as currently recommended for ML studies (e.g., Passos et al., 2016; Steyerberg, 2009).
Our literature-guided predictor selection was based on the broad risk and protective factor
categories presented in the extensive meta-analysis by Franklin et al. (2017) to ensure each of
our predictors maps onto one of these categories identified in the last 50 years of suicidality
research. Our predictors map onto the categories of demographics, cognitive abilities, family
history of psychopathology, general psychopathology, psychosis, prior self-injurious thoughts
or behaviors, social factors, and treatment history. Second, predictors were selected from the
EDSP baseline assessment only, in order to ensure the temporal order of predictors and the
outcome, that is, future SA (between T1 and T3). Third, we remained close to a recommended
event per variable (EPV) value of 10, that is, to have 10 cases per predictor (Studerus et al.,
2017) in order to avoid methodological shortcomings, such as unreliable predictor selection
(Mushkudiani et al., 2008). Since we observed 137 future SAs, our EPV was 8.5. It should be
noted, however, that high EPV values are not as important in penalized regression methods as
they are in other methods (Pavlou et al., 2016).
Of the 16 baseline predictors (in the following labeled with letters a–p), 10 were
assessed with the computer-assisted Munich-Composite International Diagnostic Interview
(DIA-X/M-CIDI; Wittchen and Pfister, 1997), a fully structured clinical interview for the
assessment of syndromes, symptoms, and mental disorders pertaining to the Diagnostic and
Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association,
1994), along with various items of personal information. The DIA-X/M-CIDI has shown good
to excellent reliability (Wittchen et al., 1998a) and validity (Reed et al., 1998). The baseline
predictors assessed with the DIA-X/M-CIDI were (a) sex, (b) age, (c) education, (d) the
number of DSM-IV lifetime mental diagnoses (including panic disorder [PD], agoraphobia
with or without PD, social phobia, specific phobia, generalized anxiety disorder, post-
traumatic stress disorder, obsessive compulsive disorder, major depressive disorder [MDD],
dysthymia, any bipolar disorder, nicotine dependence, alcohol abuse or dependence, drug
abuse or dependence, pain disorder, and any eating disorder), (e) the number of lifetime
traumatic events (including war experience, physical attack, natural disaster, serious accident,
imprisonment/kept hostage/abduction, witness to someone else experiencing a traumatic
event), (f) rape or childhood sexual abuse (excluded from predictor (e)), (g) parental loss or
separation, (h) prior help seeking for any kind of psychological difficulty, and (i) parental
psychopathology (assessed via family history information provided by the offspring at
baseline; for its criterion-related validity, see Lieb et al., 2000b). The baseline predictor (j),
prior SA (lifetime), as well as the outcome, future SA (follow-up), was assessed in section E
of the DIA-X/M-CIDI. At baseline the SA question read: ―Have you ever attempted suicide?‖
At each follow-up (DIA-X/M-CIDI interval versions) it read: ―Since our last interview, have
you attempted suicide?‖ At both baseline and T1, only those participants who had confirmed
at least one of the MDD stem questions were asked the SA question (unavailable baseline
data on lifetime SA was set to ―no SA‖), whereas at both T2 and T3, all participants were
asked the SA question (T2: lifetime, T3: since last interview).
Additional predictors assessed at baseline were (k) behavioral inhibition (assessed
with the Retrospective Self-Report of Inhibition (RSRI); Reznick et al., 1992), (l) subclinical
psychotic experiences during the previous 7 days (assessed with the SCL-90-R; Derogatis et
al., 1973), (m) negative life events in the previous 5 years (assessed with the Munich Life
Event List; Maier-Diewald et al., 1983), (n) daily hassles in the previous 2 weeks (assessed
with the Daily Hassles Scale; Perkonigg and Wittchen, 1995a), whether the participant was
(o) living in a rural area (population density of 553 inhabitants per square mile) or in an
urban area (population density of 4061 inhabitants per square mile) (Spauwen et al., 2004),
and (p) subjectively perceived coping efficacy within the next 6 months (assessed with the
German Scale for Self-Control and Coping Skills; Perkonigg and Wittchen, 1995b; higher
scale values denote lower perceived coping efficacy).
Data analysis
The outcome predicted was a reported SA after baseline (binary: yes–no). We used
four prediction models: Logistic regression, lasso, ridge (both variants of logistic regression),
and random forest, a widely used ML algorithm (Fernández-Delgado et al., 2014).
All data-related procedures were done in the statistical software environment R,
version 3.3.3 (R Core Team, 2017). In the preprocessing of the data we excluded all cases
without any follow-up data (n = 224), or missing data (n = 4) in any predictor variable at
baseline, resulting in an N of 2793. Our chosen ML models could not deal with missing data
and since there were only four such cases, we did not see the need to apply imputation
methods, assuming that results would not be much different. The categories for the predictor
of education (low, middle, high, other) were modified by merging the categories low and
other, the latter representing a high-risk group of low educational attainment (endorsed by
2.7% of N = 2793). In our sample there were 137 future SA cases (weighted percentage =
4.9). For the application of all prediction models, we used the R package mlr (Machine
Learning in R; Bischl et al., 2016), which is a framework for ML experiments in R (R Core
Team, 2017).
Prediction models and performance measures
We selected a conventional logistic regression model (based on the full set of
predictors, yet testing for collinearity, with maximum absolute correlation between predictors
of 0.4 and a maximum variance inflation factor of 1.74) and two other models of the logistic
regression family, lasso and ridge, that include an additional parameter for penalizing factors
with low predictive contributions. The ML model we selected, random forest, belongs to the
family of ensemble classifiers. Random forests have been shown to make the best predictions
across diverse data sets in comparison to many other algorithms, for example, neural
networks (Fernández-Delgado et al., 2014). The single prediction models were computed by
mlr (Bischl et al., 2016), accessing the R-packages that were relevant for our analyses: For
logistic regression this was the R base package stats; for both lasso and ridge this was
LiblineaR (Helleputte, 2017), and for the random forest model this was the ranger package
(Wright and Ziegler, 2017).
The procedure of obtaining the final results in mlr (Bischl et al., 2016) consisted of the
following steps:
First, with the aim of having each prediction model weighting all 16 predictor
variables equally, we normalized them.
Second, in accordance with the Transparent Reporting of a Multivariable Prediction
Model for Individual Prognosis or Diagnosis (TRIPOD) statement (Collins et al., 2015), we
selected performance measures relating to both discrimination and calibration, with the
former measuring a model’s ability to accurately discriminate new outcome cases and the
latter measuring a model’s overall performance (combination of both discrimination and
calibration). We chose the AUC as the measure of discrimination, which summarizes the
ratios of the true positive rate (sensitivity) and the false positive rate (1–specificity), across all
possible thresholds of predicted probabilities (from 0 to 1), according to which each observed
case is assigned to the outcome class of either 0 (no event) or 1 (event). As the measure of
overall performance, we chose the scaled Brier score. The best model performance is denoted
by the highest scaled Brier score, which is conceptually similar to Pearson’s statistic
(Steyerberg et al., 2010). Calibration denotes one particular aspect of a prediction model’s
accuracy, namely the agreement of predicted SA risk and actually observed SA rates (Alba et
al., 2017; Steyerberg, 2009; Studerus et al., 2017). Due to limitations of the AUC in
imbalanced datasets (e.g., Lobo et al., 2008), where the outcome group is much smaller than
the non-outcome group, we additionally report two other important performance metrics:
sensitivity (in ML termed recall) and positive predictive value (PPV; in ML termed
precision). Whereas sensitivity describes the proportion of those the model classifies as
having the outcome (testing positive) among those who actually have the outcome, PPV
describes the proportion of those who actually have the outcome among those who tested
positive. Values for both sensitivity and PPV can range between 0 (worst) and 1 (best).
Third, as a means to avoid performance estimates that are optimistically biased
(overfitting), we applied repeated nested cross-validation, which is the recommended method
of choice (Krstajic et al., 2014), whenever the gold-standard, external validation, cannot be
applied (Bleeker et al., 2003). Internal cross-validation includes the strict separation of a
given data set into a training data set, used to build a prediction model, and a test data set,
used to validate the model (Steyerberg, 2009; Studerus et al., 2017). Repeated nested cross-
validation is a two-stage process. At stage 1, the selected hyperparameters of the model are
tuned, such that the model’s performance is optimized, as measured on a validation data set.
Hyperparameters are different from the standard model parameters (e.g., weights in a
regression model) in that they do not represent the learning from the data itself but instead
define higher level properties of the model, which cannot be learned from the data. Tuning of
hyperparameters means specifying how the model will learn from the data, for example, the
degree of model complexity, for which we used an automated grid search with 10 different
hyperparameter values. For the lasso and ridge regression models we chose to tune the
parameter cost (cost of constraints violation) in the range of 0.001 and 0.3. For the random
forest model we chose to tune the parameter mtry (number of variables randomly sampled as
candidates at each split) in the range of 1 and 16, while keeping the parameter ntree (number
of trees to grow) at its default value of 500, because tuning this parameter is generally not
recommended (Probst and Boulesteix, 2018). For selecting the best tuning-based prediction
model, we used stratified out-of-bag bootstrapping with 50 iterations (stratification being
useful for imbalanced class sizes). Bootstrapping with iterations generates multiple samples
from and of the same size as the original data set. The training of the model uses the sampled
cases (for each of the 10 hyperparameter values), after which the model is validated on the so-
called out-of-bag data, which has not been used for model building in the respective bootstrap
iteration. Bootstrapping is a recommended choice when selecting a prediction model, because
overfitting is strongly avoided (Kuhn and Johnson, 2013, p. 78). At stage 2 of the repeated
nested cross-validation, the optimal prediction model of stage 1 is used, with the aim of
estimating this model’s final prediction performance, for which we used 10-fold repeated
cross-validation, with 10 repetitions per fold. Repeated n-fold cross-validation is
recommended for several reasons, for example, to obtain robust estimates of model
performance (Kuhn and Johnson, 2013, p. 78). With this setup we obtained 100 estimates of
prediction performances for each model.
Results
Predictive performance measures
Means and medians of AUC and scaled Brier score of all four models are shown in
Table 1. AUC values were very similar among the four models for both mean (0.824–0.829)
and median (0.822–0.830), with strongly overlapping boxplots (Fig. 1). The scaled Brier
score was highest for the ridge model (mean: 0.466, median: 0.461) while the values of the
other three models ranged between 0.136 and 0.245 (mean) and 0.167 and 0.246 (median).
Sensitivity and positive predictive value
Mean sensitivity and positive predictive value (PPV), each based on a predicted
probability cutoff of 0.5, are shown in table 1. Mean sensitivity ranged from 2.8% (random
forest) to 25% (ridge regression), whereas both logistic and lasso regression showed similar
sensitivities of around 22%. PPV among the logistic regression family fell into a close range
of between 66% and 72%, whereas the random forest achieved a PPV of 87%.
- Figure 1 here -
- Table 1 here -
Predictor importance
Predictor importance values for each model are summarized in Table 2. In all four
prediction models, the most important predictor was prior SA. In the logistic-regression-based
models, it increased the odds of a future SA by 57% (logistic), 55% (lasso), and 14% (ridge).
All following ranks, that is, ranks 2 to 16, were not consistent across all four models. Whereas
education ranked second in the logistic and lasso models (33% and 30% risk decrease,
respectively), prior help seeking ranked second in the ridge model (5% risk increase), and
number of DSM-IV lifetime mental disorders ranked second in the random forest model. Prior
help seeking ranked third in all models except for the ridge model, showing a risk increase for
a future SA of around 30% (logistic and lasso models). In the ridge model the number of
DSM-IV lifetime mental disorders ranked third, with a risk increase of 4%. Negative life
events and psychotic experiences were discarded by the lasso model, indicating that these two
variables were not useful in predicting the risk of future SAs.
Regarding the overall predictor importance ranking, the logistic and lasso regression models
showed a 44% concordance, that is, 7 of 16 predictors had the exact same rank in both
models. Rank concordance ranged between 6% and 12% for all other possible comparisons of
two models. When permitting ranks per predictor to differ by a maximum of 1 between two
models, rank concordance increased to 100% when comparing logistic and lasso regression,
while the range of concordant ranks increased to 19% and 38%. All three regression-based
models assigned similarly high ranks to the predictors parental loss or separation and
behavioral inhibition, respectively (ranking between fifth and seventh).
- Table 2 here -
Discussion
All four prediction models, that is, logistic regression, lasso, ridge, and random forest,
yielded comparable prediction accuracies. According to categories of AUC results, our results
(median AUC ranging between 0.822 and 0.830) represent a very good prediction (Šimundić,
2009). In terms of Cohen’s d, our AUC results can be translated to an effect size of about 1.3
(Rice and Harris, 2005). When comparing the discriminative ability of our prediction models
with other studies predicting SA on the individual level, our results fit into the upper part of
the AUC range of 0.65–0.93 across these studies (Hettige et al., 2017; Kessler et al., 2016;
Mann et al., 2008; Passos et al., 2016; Simon et al., 2018; Walsh et al., 2018, 2017). However,
we refrain from comparisons with most of these studies, because of the fundamental
differences between them and our study, for instance, in terms of sample type (mostly patients
or army soldiers vs. community), sample size, study design (mostly cross-sectional or
electronic health record data vs. prospectively assessed data), and age group (almost
exclusively adults vs. adolescents/young adults). The only exception in terms of
comparability is the NCS study by Kessler et al. (2016) who also used a representative
community sample to prospectively predict SAs. However, the sample they used consisted of
a subsample of 1056 respondents (age range reported only for the full sample) with a DSM-
III-R (American Psychiatric Association, 1987) lifetime MDD diagnosis at baseline (1990-
1992), who were reinterviewed once 10–12 years after baseline. SA was reported by 4.5% of
those respondents. Whereas the ML models contained between 9 and 13 predictors, logistic
models contained 23 predictors.

Several possible reasons might explain the difference between Kessler et al.’s (2016)
results for SA (AUC: 0.70 by logistic models, 0.76 by ML models) and our results for SA
(AUC: around 0.82 by both logistic models and the ML model). First and foremost, Kessler
et al. (2016) used prediction models that were developed using the baseline data (van Loo et
al., 2014; Wardenaar et al., 2014), and then applied these models independently to the follow-
up data. Other possible explanations for differing results might be sample source (NCS: MDD
diagnosis vs. EDSP Study: general community), diagnostic criteria (DSM-III-R vs. DSM-IV),
age range at baseline (15–54 years vs. 14–24 years), number of assessment waves within the
respective study period (two in 10–12 years vs. a maximum of four in 10 years), and number
of predictors used in both the logistic models and the ML models (23 for logistic and 9–13 for
ML vs. 16 in both logistic and ML). Notably, Kessler et al. (2016) did not use prior SA as one
of the predictors, which turned out to be the most important predictor across all of our
prediction models. This, too, might explain differences between results.
Unlike the AUC, the scaled Brier score does not come with recommended cut-off
categories. We can therefore only descriptively note that the ridge regression performed best
in terms of the scaled Brier score (combination of prediction accuracy and calibration),
whereas the other three models performed less well, with a 47% to 71% reduced scaled Brier
score. Interestingly, even though the ridge model showed no particularly increased AUC
values (see Fig. 1, left panel), the scaled Brier score markedly differed from the other models,
in terms of both the median and the variability (see Fig. 1, right panel).
The question arises as to why some studies reported ML models outperforming
conventional logistic or linear regression models (Delgado-Gomez et al., 2011; Kessler et al.,
2016; Walsh et al., 2018, 2017), whereas other studies (SA: Delgado-Gomez et al., 2012;
suicide: Kessler et al., 2017, 2015), including ours, reported comparable prediction
performances. Several suggestions can be found in the literature. The advantages of ML
depend on several data-related properties, for example, on sample size (Hahn et al., 2017)
(ML prefers ―big data‖), on high-dimensional complexity (e.g., nonlinear associations, high-
order interactions) actually being present in the data (Walsh et al., 2018), on predictor sets
that contain different data types and sources (Lee et al., 2018), and, according to Walsh et al.
(2018), on how difficult group differences are to detect, which might be more difficult in two
relatively homogeneous groups (e.g., suicide ideators with vs. without SA) than in
heterogeneous groups (e.g., general community members with vs. without SA). Another
explanation for comparable prediction performances across different models might be
whether there is a sufficient number of outcome cases per predictor (the EPV
recommendation is 10; Studerus et al., 2017). On the one hand, the above-mentioned studies
that found ML to outperform logistic or linear regression (Delgado-Gomez et al., 2011;
Kessler et al., 2016; Walsh et al., 2018, 2017), used patient or MDD-diagnosis samples of
various sizes (Ns ranging between 879 and over 33000), which additionally fulfill some of the
other criteria that the ML approach seems to favor. However, there are two studies by Kessler
et al. (2017, 2015) on U.S. army soldiers, both using suicide as outcome, in either individuals
hospitalized with mental disorders or psychiatric outpatients. In both studies conventional
regression models performed equally high, compared to ML models, despite the large sample
sizes (between 40000 and 975000), despite a presumably high complexity in the actual data,
despite predictor sets of different data types and sources, and despite the homogeneity of the
samples, which might have made it somewhat difficult to detect group differences. Notably,
ML models were used, both to predict the outcome and to select a lower number of relevant
predictors, which then were used in discrete-time survival (Kessler et al., 2015) or logistic
regression (Kessler et al., 2017) models. Nonetheless, the overall prediction performance was
comparably high between conventional regression and ML models (0.84 vs. 0.85 (Kessler et
al., 2015) and 0.72 vs. 0.72 (Kessler et al., 2017)). Therefore, our study results might not be
fully explained by the above-mentioned criteria that favor the use of ML, which are not
completely met by the EDSP data. Of note, a current systematic review by Christodoulou et
al. (2019) found no performance benefit of ML over logistic regression for clinical prediction
models in 71 studies across several epidemiological research fields, such as psychiatry,
cardiology, or oncology. Similarly, Belsher et al. (2019) conclude that ML models currently
are not ready for clinical applications across health systems concerning SA and suicide
deaths, due to several critical concerns that in their view have remained unaddressed.
In addition to our main performance metric AUC, we also calculated both sensitivity
and PPV. All of these performance metrics may make most sense in combination, since each
captures a specific aspect of model performance. While the AUC is recommended by some
authors as a global model performance metric (e.g., Bradley, 1997), others acknowledge its
widespread use (Saito and Rehmsmeier, 2014), and yet others call for it to be abandoned or
replaced (Lobo et al., 2008; Wald and Bestwick, 2014). However, to date the AUC still seems
to be useful for comparing model performances across studies, which in our view is
somewhat less the case with sensitivity and the PPV. Unlike the AUC, sensitivity is not a
global measure applicable across all possible thresholds of predicted probabilities, but it is a
local measure for one specific threshold. The PPV depends on the outcome base rate, whereas
the AUC does not (Hajian-Tilaki, 2013), which makes comparison across studies difficult to
the degree that base rates differ. When applying both sensitivity and the PPV to compare our
models with each other, the approximate model performance equality (in terms of the AUC)
disappears. Instead, only the logistic regression family performs in a close range, with
sensitivities (for a predicted probabilities threshold of 0.5) being relatively low between 20%
and 25%, and PPVs (for the average outcome rate of about 5%) being fairly high between
66% and 72%. The random forest model, on the other hand, shows an extremely low
sensitivity of 3%, yet the highest PPV of 87% across all four models. We emphasize that the
AUC and measures such as sensitivity and PPV evaluate model performance very differently.
One important aspect that must not be neglected is the context in which one of these measures
is more appropriate than another. For instance, in a model comparison study such as this one,
the AUC is more appropriate since it captures overall model performance, whereas when it
comes to the clinical application of the model, then finding and setting a probability threshold
(to balance specificity and sensitivity) by applying a loss/utility/cost function that depends on
several contextual considerations is more appropriate.
Many ML models are considered black boxes (Gilpin et al., 2018); that is, even though
the importance of the predictors can be extracted from a model, the self-learning algorithm
might have used the predictors for computing the outcome in such a way that human beings
are not able to comprehend it, for example, 10th-order interaction. The random forest model
selected constructs as most important predictors that differed from those of the logistic
regression models. Even within the logistic regression models there were some differences
(see Table 2, e.g., logistic and ridge). This poses the difficult question of which predictor
selection mechanisms to ―trust‖ when trying to interpret the results. Irrespective of this issue,
it is interesting to note that prior SA was the most important predictor across all models,
confirming this variable’s reputation as supplying the highest predictive power for a
subsequent SA (Borges et al., 2010, 2006; Brown et al., 2000; Glenn and Nock, 2014; Joiner
et al., 2005; Kuo et al., 2001; Nordström et al., 1995; Ribeiro et al., 2016; WHO, 2014). In
particular, we would emphasize that we compared the predictors’ rank across models, so the
magnitude of the coefficients should not be compared between nonpenalized and penalized
logistic regression on account of the coefficients being regularized (biased) in the latter case.
The second most important predictor was educational level in the logistic and lasso models.
This confirms the plausibility of this variable as being protective against SA, for example, in
that higher educational achievement in adolescence is associated with greater life satisfaction
(Crede et al., 2015). In the ridge model, prior psychological help seeking was selected as
second most important predictor, whereas it ranked third in the random forest, logistic and
lasso models, respectively. Prior psychological help seeking might thus be seen as indicating
a greater severity of psychological problem(s) or disorder(s) present at that time (Han et al.,
2018; Hom et al., 2015), which might serve as one possible explanation for the positive
association with SA. Finally, the number of prior mental disorders (comorbidity) has often
been found to be associated with SA (e.g., Bronisch and Wittchen, 1994; Lewinsohn et al.,
1995; Miché et al., 2018), which is confirmed by the random forest and the ridge regression
models ranking this predictor second and third respectively.
We want to mention several strengths of our study. First, to the best of our knowledge
this is the first study that applied ML procedures to prospectively predict SAs in community
adolescents and young adults (an assumption being supported by a current systematic review
on the use of ML in the study of suicidal behaviors; Burke et al., 2019), a group that is known
to be the high-risk group for first lifetime SA (WHO, 2014). Second, we used repeated nested
cross-validation, which Krstajic et al. (2014) recommended as the best approach for training
and testing a prediction model within a single dataset, that is, external validation being
inapplicable. Third, we adhered to the reporting guidelines known as the TRIPOD statement
(Collins et al., 2015). This strength is also supported by two systematic reviews (Burke et al.,
2019; Christodoulou et al., 2019), who criticize the inconsistent reporting methods of
classifier performance across studies. Fourth, we used predictors that were a priori defined,
taken from the suicide literature. We assume that this and the EDSP data quality might have
led to the very good (Šimundić, 2009) discriminative ability of the predictive models we
applied.
There are also limitations of our study. First, the predictive performance of ML
algorithms such as random forests depends on the sample size, with larger sample sizes at
times leading to an increased performance result (Raudys and Jain, 1991). In that respect our
sample size may be considered a weakness. It may also be argued that it is not sample size per
se which matters, but rather the relationship between predictor and outcome in the data, that is
to say, whether additive or multiplicative (interaction). In the case of an additive association,
ML techniques such as random forest may simply not be able to show their predictive
potential, as opposed to a robust multiplicative association. Second, our 10-year longitudinal
study design enabled us to include predictors that must be conceived as distal, as opposed to
proximal. Future research on predicting individual SA risk should include both distal and
proximal risk factors, since the main purpose of predictive analytics is to offer tools for risk
assessment in the near future, rather than in the distant future. Third, we used self-reported
data, which is critical in terms of several inherent biases, for example, recall bias. It also
means that we lacked other than self-report data, e.g., genetic or neuropsychological data.
Fourth, we did not apply external cross-validation, which is considered the gold-standard in
estimating the degree of overfitting and which might have yielded lower model performances
compared to our cross-validation procedure. Fifth, our outcome was assessed with a one-item
measure, which might have led to an increased misclassification rate, estimated by Millner et
al. (2015) to be 11%. However, this possible error rate must not be overstated either. Mazza et
al. (2011) empirically support the notion that single-item SA responses appear to be valid.
Sixth, there might have been undetected SA cases at T1, depending on whether participants
entered the MDD interview section. However, we consider this a minor limitation because T1
was the only one of the four EDSP waves where a subsample was assessed.
Despite these limitations, our study has shown that all four models resulted in a very
good overall ability to discriminate between individuals who attempt suicide in the future
from individuals who do not, in a high-risk sample of community adolescents and young
adults. This might be seen as a promising contribution to the ongoing pursuit of fruitfully
combining both statistical methods and ML methods, aiming to improve SA risk assessment
of individuals. One possible clinical implication of ML studies using survey-based self-report
data from the general community might be to use the best model or combination of models as
a screening tool for SA in primary prevention efforts.
Author declaration
We wish to confirm that there are no known conflicts of interest associated with this
publication and there has been no significant financial support for this work that could have
influenced its outcome.
We confirm that the manuscript has been read and approved by all named authors and that
there are no other persons who satisfied the criteria for authorship but are not listed. We
further confirm that the order of authors listed in the manuscript has been approved by all of
us.
We confirm that we have given due consideration to the protection of intellectual property
associated with this work and that there are no impediments to publication, including the
timing of publication, with respect to intellectual property. In so doing we confirm that we
have followed the regulations of our institutions concerning intellectual property.
We understand that the Corresponding Author is the sole contact for the Editorial process
(including Editorial Manager and direct communications with the office). He/she is
responsible for communicating with the other authors about progress, submissions of
revisions and final approval of proofs. We confirm that we have provided a current, correct
email address which is accessible by the Corresponding Author and which has been
configured to accept email from roselind.lieb@unibas.ch.
Brief statement concerning each named author's contributions to the paper under the heading
Contributors:
Author Marcel Miché did the literature searches, undertook the statistical analyses, and wrote
the first draft of the manuscript.
Author Erich Studerus reviewed the statistical analyses and the reporting of our study results.
Author Andrea Meyer reviewed methodological parts of the manuscript.
Author Andrew Gloster reviewed the manuscript.
Author Katja Beesdo-Baum reviewed the manuscript.

Author Hans-Ulrich Wittchen is one of two principal investigators of the EDSP study and
reviewed the manuscript.
Author Roselind Lieb is the other principal investigator of the EDSP study. She reviewed the
manuscript and is the corresponding author.
All authors contributed to and have approved the final manuscript.
Conflict of Interest
Declarations of interest:
none.
References
Alba, A.C., Agoritsas, T., Walsh, M., Hanna, S., Iorio, A., Devereaux, P.J., McGinn, T.,
Guyatt, G., 2017. Discrimination and Calibration of Clinical Prediction Models:
Users’ Guides to the Medical Literature. JAMA Psychiatry 318, 1377–1348.
https://doi.org/10.1001/jama.2017.12126
American Psychiatric Association, 1994. Diagnostic and statistical manual of mental
disorders (4th ed.). Author: Washington, DC.
American Psychiatric Association, 1987. Diagnostic and statistical manual of mental
disorders (3rd ed., revised). Author: Washington, DC.
Barak-Corren, Y., Castro, V.M., Javitt, S., Hoffnagle, A.G., Dai, Y., Perlis, R.H., Nock, M.K.,
Smoller, J.W., Reis, B.Y., 2017. Predicting Suicidal Behavior From Longitudinal
Electronic Health Records. Am. J. Psychiatry 174, 154–162.
https://doi.org/10.1176/appi.ajp.2016.16010077
Beesdo-Baum, K., Knappe, S., Asselmann, E., Zimmermann, P., Bruckl, T., Hofler, M.,
Behrendt, S., Lieb, R., Wittchen, H.U., 2015. The ―Early Developmental Stages of
Psychopathology (EDSP) Study‖: a 20-year review of methods and findings. Soc.
Psychiatry Psychiatr. Epidemiol. 50, 851–866. https://doi.org/10.1007/s00127-015-
1062-x
Belsher, B.E., Smolenski, D.J., Pruitt, L.D., Bush, N.E., Beech, E.H., Workman, D.E.,
Morgan, R.L., Evatt, D.P., Tucker, J., Skopp, N.A., 2019. Prediction Models for
Suicide Attempts and Deaths: A Systematic Review and Simulation. JAMA
Psychiatry. https://doi.org/10.1001/jamapsychiatry.2019.0174
Bennett, D., Silverstein, S.M., Niv, Y., 2019. The Two Cultures of Computational Psychiatry.
JAMA Psychiatry. https://doi.org/10.1001/jamapsychiatry.2019.0231
Bentley, K.H., Franklin, J.C., Ribeiro, J.D., Kleiman, E.M., Fox, K.R., Nock, M.K., 2016.
Anxiety and its disorders as risk factors for suicidal thoughts and behaviors: A meta-
analytic review. Clin. Psychol. Rev. 43, 30–46.
https://doi.org/10.1016/j.cpr.2015.11.008
Bischl, B., Lang, M., Kotthoff , L., Schiff ner, J., Richter, J., Studerus, E., Casalicchio, G.,
Jones, Z.M., 2016. mlr: Machine Learning in R. J. Mach. Learn. Res. 17, 1–5.
Bleeker, S.E., Moll, H.A., Steyerberg, E.W., Donders, A.R.T., Derksen-Lubsen, G., Grobbee,
D.E., Moons, K.G.M., 2003. External validation is necessary in prediction research: J.
Clin. Epidemiol. 56, 826–832. https://doi.org/10.1016/S0895-4356(03)00207-5
Borges, G., Angst, J., Nock, M.K., Ruscio, A.M., Walters, E.E., Kessler, R.C., 2006. A risk
index for 12-month suicide attempts in the National Comorbidity Survey Replication
(NCS-R). Psychol. Med. 36, 1747–1757. https://doi.org/10.1017/S0033291706008786
Borges, G., Nock, M.K., Haro Abad, J.M., Hwang, I., Sampson, N.A., Alonso, J., Andrade,
L.H., Angermeyer, M.C., Beautrais, A., Bromet, E., Bruffaerts, R., de Girolamo, G.,
Florescu, S., Gureje, O., Hu, C., Karam, E.G., Kovess-Masfety, V., Lee, S., Levinson,
D., Medina-Mora, M.E., Ormel, J., Posada-Villa, J., Sagar, R., Tomov, T., Uda, H.,
Williams, D.R., Kessler, R.C., 2010. Twelve-Month Prevalence of and Risk Factors
for Suicide Attempts in the World Health Organization World Mental Health Surveys.
J. Clin. Psychiatry 71, 1617–1628. https://doi.org/10.4088/JCP.08m04967blu
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern Recognit. 30, 1145–1159. https://doi.org/10.1016/S0031-

3203(96)00142-2
Bronisch, T., Wittchen, H.U., 1994. Suicidal Ideation and Suicide Attempts - Comorbidity
with Depression, Anxiety Disorders, and Substance-Abuse Disorder. Eur. Arch.
Psychiatry Clin. Neurosci. 244, 93–98. https://doi.org/Doi 10.1007/Bf02193525
Brown, G.K., Beck, A.T., Steer, R.A., Grisham, J.R., 2000. Risk factors for suicide in
psychiatric outpatients: A 20-year prospective study. J. Consult. Clin. Psychol. 68,
371–377. https://doi.org/10.1037/0022-006X.68.3.371
Burke, T.A., Ammerman, B.A., Jacobucci, R., 2019. The use of machine learning in the study
of suicidal and non-suicidal self-injurious thoughts and behaviors: A systematic
review. J. Affect. Disord. 245, 869–884. https://doi.org/10.1016/j.jad.2018.11.073
Cha, C.B., Franz, P.J., Guzmán, E.M., Glenn, C.R., Kleiman, E.M., Nock, M.K., 2018.
Annual Research Review: Suicide among youth - epidemiology, (potential) etiology,
and treatment. J. Child Psychol. Psychiatry 59, 460–482.
https://doi.org/10.1111/jcpp.12831
Christodoulou, E., Ma, J., Collins, G.S., Steyerberg, E.W., Verbakel, J.Y., van Calster, B.,
2019. A systematic review shows no performance benefit of machine learning over
logistic regression for clinical prediction models. J. Clin. Epidemiol.
https://doi.org/10.1016/j.jclinepi.2019.02.004
Collins, G.S., Reitsma, J.B., Altman, D.G., Moons, K., 2015. Transparent reporting of a
multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the
TRIPOD Statement. BMC Med. 13, 1. https://doi.org/10.1186/s12916-014-0241-z

Crede, J., Wirthwein, L., McElvany, N., Steinmayr, R., 2015. Adolescents’ academic
achievement and life satisfaction: the role of parents’ education. Front. Psychol. 6, 52.
https://doi.org/10.3389/fpsyg.2015.00052
Delgado-Gomez, D., Blasco-Fontecilla, H., Alegria, A.A., Legido-Gil, T., Artes-Rodriguez,
A., Baca-Garcia, E., 2011. Improving the accuracy of suicide attempter classification.
Artif. Intell. Med. 52, 165–168. https://doi.org/10.1016/j.artmed.2011.05.004
Delgado-Gomez, D., Blasco-Fontecilla, H., Sukno, F., Socorro Ramos-Plasencia, M., Baca-
Garcia, E., 2012. Suicide attempters classification: Toward predictive models of
suicidal behavior. Neurocomputing 92, 3–8.
https://doi.org/10.1016/j.neucom.2011.08.033
Derogatis, L.R., Lipman, R.S., Covi, L., 1973. The SCL-90-R: An outpatient psychiatric
rating scale: Preliminary report. Deutsche Bearbeitung CIPS. Psychopharmacol. Bull.
9, 13–27.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we Need Hundreds of
Classifiers to Solve Real World Classification Problems? J. Mach. Learn. Res. 15,
3133–3181.
Franklin, J.C., Ribeiro, J.D., Fox, K.R., Bentley, K.H., Kleiman, E.M., Huang, X., Musacchio,
K.M., Jaroszewski, A.C., Chang, B.P., Nock, M.K., 2017. Risk factors for suicidal
thoughts and behaviors: A meta-analysis of 50 years of research. Psychol. Bull. 143,
187–232. https://doi.org/10.1037/bul0000084
Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L., 2018. Explaining
Explanations: An Approach to Evaluating Interpretability of Machine Learning. ArXiv
Prepr. arXiv, 1806.00069.

Glenn, C.R., Nock, M.K., 2014. Improving the Short-Term Prediction of Suicidal Behavior.
Am. J. Prev. Med. 47, S176–S180. https://doi.org/10.1016/j.amepre.2014.06.004
Hahn, T., Nierenberg, A.A., Whitfield-Gabrieli, S., 2017. Predictive analytics in mental
health: applications, guidelines, challenges and perspectives. Mol. Psychiatry 22, 37–
43. https://doi.org/10.1038/mp.2016.201
Hajian-Tilaki, K., 2013. Receiver Operating Characteristic (ROC) Curve Analysis for
Medical Diagnostic Test Evaluation. Casp. J. Intern. Med. 4, 627–635.
Han, J., Batterham, P.J., Calear, A.L., Randall, R., 2018. Factors Influencing Professional
Help-Seeking for Suicidality: A Systematic Review. Crisis 39, 175–196.
https://doi.org/10.1027/0227-5910/a000485
Helleputte, T., 2017. LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++
Library.
Hettige, N.C., Nguyen, T.B., Yuan, C., Rajakulendran, T., Baddour, J., Bhagwat, N., Bani-
Fatemi, A., Voineskos, A.N., Mallar Chakravarty, M., De Luca, V., 2017.
Classification of suicide attempters in schizophrenia using sociocultural and clinical
features: A machine learning approach. Gen. Hosp. Psychiatry 47, 20–28.
https://doi.org/10.1016/j.genhosppsych.2017.03.001
Hom, M.A., Stanley, I.H., Joiner, T.E., 2015. Evaluating factors and interventions that
influence help-seeking and mental health service utilization among suicidal
individuals: A review of the literature. Clin. Psychol. Rev. 40, 28–39.
https://doi.org/10.1016/j.cpr.2015.05.006
Joiner, T.E., Conwell, Y., Fitzpatrick, K.K., Witte, T.K., Schmidt, N.B., Berlim, M.T., Fleck,
M.P.A., Rudd, M.D., 2005. Four Studies on How Past and Current Suicidality Relate
Even When “Everything But the Kitchen Sink” Is Covaried. J. Abnorm. Psychol. 114,
291–303. https://doi.org/10.1037/0021-843X.114.2.291
Just, M.A., Pan, L., Cherkassky, V.L., McMakin, D.L., Cha, C., Nock, M.K., Brent, D., 2017.
Machine learning of neural representations of suicide and emotion concepts identifies
suicidal youth. Nat. Hum. Behav. 1, 911–919. https://doi.org/10.1038/s41562-017-
0234-y
Kessler, R.C., Stein, M.B., Petukhova, M.V., Bliese, P., Bossarte, R.M., Bromet, E.J.,
Fullerton, C.S., Gilman, S.E., Ivany, C., Lewandowski-Romps, L., Millikan Bell, A.,
Naifeh, J.A., Nock, M.K., Reis, B.Y., Rosellini, A.J., Sampson, N.A., Zaslavsky,
A.M., Ursano, R.J., 2017. Predicting suicides after outpatient mental health visits in
the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS).
Mol. Psychiatry 22, 544–551. https://doi.org/10.1038/mp.2016.110
Kessler, R.C., van Loo, H.M., Wardenaar, K.J., Bossarte, R.M., Brenner, L.A., Cai, T., Ebert,
D.D., Hwang, I., Li, J., de Jonge, P., Nierenberg, A.A., Petukhova, M.V., Rosellini,
A.J., Sampson, N.A., Schoevers, R.A., Wilcox, M.A., Zaslavsky, A.M., 2016. Testing
a machine-learning algorithm to predict the persistence and severity of major
depressive disorder from baseline self-reports. Mol. Psychiatry 21, 1366–1371.
https://doi.org/10.1038/mp.2015.198
Kessler, R.C., Warner, C.H., Ivany, C., Petukhova, M.V., Rose, S., Bromet, E.J., Brown, M.,
Cai, T., Colpe, L.J., Cox, K.L., Fullerton, C.S., Gilman, S.E., Gruber, M.J., Heeringa,
S.G., Lewandowski-Romps, L., Li, J., Millikan-Bell, A.M., Naifeh, J.A., Nock, M.K.,
Rosellini, A.J., Sampson, N.A., Schoenbaum, M., Stein, M.B., Wessely, S., Zaslavsky,
A.M., Ursano, R.J., 2015. Predicting Suicides After Psychiatric Hospitalization in US

Army Soldiers: The Army Study to Assess Risk and Resilience in Servicemembers
(Army STARRS). JAMA Psychiatry 72, 49–57.
https://doi.org/10.1001/jamapsychiatry.2014.1754
Kraemer, H.C., 2010. Epidemiological Methods: About Time. Int. J. Environ. Res. Public.
Health 7, 29–45. https://doi.org/10.3390/ijerph7010029
Kraemer, H.C., Kazdin, A.E., Offord, D.R., Kessler, R.C., Jensen, P.S., Kupfer, D.J., 1997.
Coming to terms with the terms of risk. Arch Gen Psychiatry 54, 337–343.
Krstajic, D., Buturovic, L.J., Leahy, D.E., Thomas, S., 2014. Cross-validation pitfalls when
selecting and assessing regression and classification models. J. Cheminformatics 6, 10.
https://doi.org/10.1186/1758-2946-6-10
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling, 5th ed. Springer, New York.
Kuo, W.-H., Gallo, J.J., Tien, A.Y., 2001. Incidence of suicide ideation and attempts in
adults: the 13-year follow-up of a community sample in Baltimore, Maryland.
Psychol. Med. 31, 1181–1191. https://doi.org/10.1017/S0033291701004482
Lee, Y., Ragguett, R.-M., Mansur, R.B., Boutilier, J.J., Rosenblat, J.D., Trevizol, A.,
Brietzke, E., Lin, K., Pan, Z., Subramaniapillai, M., Chan, T.C.Y., Fus, D., Park, C.,
Musial, N., Zuckerman, H., Chen, V.C.-H., Ho, R., Rong, C., McIntyre, R.S., 2018.
Applications of machine learning algorithms to predict therapeutic outcomes in
depression: A meta-analysis and systematic review. J. Affect. Disord. 241, 519–532.
https://doi.org/10.1016/j.jad.2018.08.073
Lewinsohn, P.M., Rohde, P., Seeley, J.R., 1995. Adolescent Psychopathology: III. The
Clinical Consequences of Comorbidity. J. Am. Acad. Child Adolesc. Psychiatry 34,

510–519. https://doi.org/Doi 10.1097/00004583-199504000-00018
Lieb, R, Isensee, B., von Sydow, K., Wittchen, H.U., 2000a. The Early Developmental Stages
of Psychopathology Study (EDSP): A methodological update. Eur. Addict. Res. 6,
170–182. https://doi.org/Doi 10.1159/000052043
Lieb, Roselind, Wittchen, H.-U., Höfler, M., Fuetsch, M., Stein, M.B., Merikangas, K.R.,
2000b. Parental Psychopathology, Parenting Styles, and the Risk of Social Phobia in
Offspring: A Prospective-Longitudinal Community Study. Arch. Gen. Psychiatry 57,
859–866. https://doi.org/10.1001/archpsyc.57.9.859
Lobo, J.M., Jiménez-Valverde, A., Real, R., 2008. AUC: a misleading measure of the
performance of predictive distribution models. Glob. Ecol. Biogeogr. 17, 145–151.
https://doi.org/10.1111/j.1466-8238.2007.00358.x
Maier-Diewald, W., Wittchen, H.-U., Hecht, H., Werner-Eilert, K., 1983. Die Münchner
Ereignisliste (MEL)–Anwendungsmanual. Max Planck Institute for Psychiatry,
Munich.
Mann, J.J., Ellis, S.P., Waternaux, C.M., Liu, X., Oquendo, M.A., Malone, K.M., Brodsky,
B.S., Haas, G.L., Currier, D., 2008. Classification Trees Distinguish Suicide
Attempters in Major Psychiatric Disorders: A Model of Clinical Decision Making. J.
Clin. Psychiatry 69, 23–31.
Mazza, J.J., Catalano, R.F., Abbott, R.D., Haggerty, K.P., 2011. An Examination of the
Validity of Retrospective Measures of Suicide Attempts in Youth. J. Adolesc. Health
49, 532–537. https://doi.org/10.1016/j.jadohealth.2011.04.009
Miché, M., Hofer, P.D., Voss, C., Meyer, A.H., Gloster, A.T., Beesdo-Baum, K., Lieb, R.,
2018. Mental disorders and the risk for the subsequent first suicide attempt: results of
a community study on adolescents and young adults. Eur. Child Adolesc. Psychiatry
27, 839–848. https://doi.org/10.1007/s00787-017-1060-5
Millner, A.J., Lee, M.D., Nock, M.K., 2015. Single-Item Measurement of Suicidal Behaviors:
Validity and Consequences of Misclassification. PLOS ONE 17.
Mushkudiani, N.A., Hukkelhoven, C.W.P.M., Hernández, A.V., Murray, G.D., Choi, S.C.,
Maas, A.I.R., Steyerberg, E.W., 2008. A systematic review finds methodological
improvements necessary for prognostic models in determining traumatic brain injury
outcomes. J. Clin. Epidemiol. 61, 331–343.
https://doi.org/10.1016/j.jclinepi.2007.06.011
Nock, M.K., Borges, G., Bromet, E.J., Cha, C.B., Kessler, R.C., Lee, S., 2008. Suicide and
suicidal behavior. Epidemiol. Rev. 30, 133–154.
https://doi.org/10.1093/epirev/mxn002
Nock, M.K., Millner, A.J., Joiner, T.E., Gutierrez, P.M., Han, G., Hwang, I., King, A.,
Naifeh, J.A., Sampson, N.A., Zaslavsky, A.M., Stein, M.B., Ursano, R.J., Kessler,
R.C., 2018. Risk factors for the transition from suicide ideation to suicide attempt:
Results from the Army Study to Assess Risk and Resilience in Servicemembers
(Army STARRS). J. Abnorm. Psychol. 127, 139–149.
https://doi.org/10.1037/abn0000317
Nordström, P., Samuelsson, M., Åsberg, M., 1995. Survival analysis of suicide risk after
attempted suicide. Acta Psychiatr. Scand. 91, 336–340. https://doi.org/10.1111/j.1600-
0447.1995.tb09791.x
Passos, I.C., Mwangi, B., Cao, B., Hamilton, J.E., Wu, M.-J., Zhang, X.Y., Zunta-Soares,
G.B., Quevedo, J., Kauer-Sant’Anna, M., Kapczinski, F., Soares, J.C., 2016.
Identifying a clinical signature of suicidality among patients with mood disorders: A
pilot study using a machine learning approach. J. Affect. Disord. 193, 109–116.
https://doi.org/10.1016/j.jad.2015.12.066
Pavlou, M., Ambler, G., Seaman, S., De Iorio, M., Omar, R.Z., 2016. Review and evaluation
of penalised regression methods for risk prediction in low-dimensional data with few
events. Stat. Med. 35, 1159–1177. https://doi.org/10.1002/sim.6782
Perkonigg, A., Wittchen, H.-U., 1995a. The Daily-Hassles Scale. Research version. Max
Planck Institute for Psychiatry, Munich.
Perkonigg, A., Wittchen, H.-U., 1995b. Scale for investigation of problem-solving
competences. Research version. Max Planck Institute for Psychiatry, Munich.
Probst, P., Boulesteix, A.-L., 2018. To Tune or Not to Tune the Number of Trees in Random
Forest. J. Mach. Learn. Res. 18, 1–18.
R Core Team, 2017. R: a language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna.
Raudys, S.J., Jain, A.K., 1991. Small sample size effects in statistical pattern recognition:
Recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell. 3, 252–
264.
Reed, V., Gander, F., Pfister, H., Steiger, A., Sonntag, H., Trenkwalder, C., Sonntag, A.,
Hundt, W., Wittchen, H.-U., 1998. To what degree does the Composite International
Diagnostic Interview (CIDI) correctly identify DSM-IV disorders? Testing validity
issues in a clinical sample. Int. J. Methods Psychiatr. Res. 7, 142–155.

https://doi.org/10.1002/mpr.44
Reznick, J.S., Hegeman, I.M., Kaufman, E.R., Woods, S.W., Jacobs, M., 1992. Retrospective
and concurrent self-report of behavioral inhibition and their relation to adult mental
health. Dev. Psychopathol. 4, 301–321. https://doi.org/10.1017/S095457940000016X
Ribeiro, J.D., Franklin, J.C., Fox, K.R., Bentley, K.H., Kleiman, E.M., Chang, B.P., Nock,
M.K., 2016. Self-injurious thoughts and behaviors as risk factors for future suicide
ideation, attempts, and death: a meta-analysis of longitudinal studies. Psychol. Med.
46, 225–236. https://doi.org/10.1017/S0033291715001804
Rice, M.E., Harris, G.T., 2005. Comparing effect sizes in follow-up studies: ROC Area,
Cohen’s d, and r. Law Hum. Behav. 29, 615–620. https://doi.org/10.1007/s10979-005-
6832-7
Saito, T., Rehmsmeier, M., 2014. The Precision-Recall plot is more informative than the ROC
plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE 10,
e0118432. https://doi.org/10.6084/m9.figshare.1245061.v1
Simon, G.E., Johnson, E., Lawrence, J.M., Rossom, R.C., Ahmedani, B., Lynch, F.L., Beck,
A., Waitzfelder, B., Ziebell, R., Penfold, R.B., Shortreed, S.M., 2018. Predicting
Suicide Attempts and Suicide Deaths Following Outpatient Visits Using Electronic
Health Records. Am. J. Psychiatry 175, 951–960.
https://doi.org/10.1176/appi.ajp.2018.17101167
Šimundić, A.-M., 2009. Measures of Diagnostic Accuracy: Basic Definitions. J. Int. Fed.
Clin. Chem. Lab. Med. 19, 203–211.
Spauwen, J., Krabbendam, L., Lieb, R., Wittchen, H.-U., van Os, J., 2004. Does urbanicity
shift the population expression of psychosis? J. Psychiatr. Res. 38, 613–618.
https://doi.org/10.1016/j.jpsychires.2004.04.003
Steyerberg, E.W., 2009. Clinical Prediction Models. A Practical Approach to Development,
Validation, and Updating, In Statistics for Biology and Health Health (series eds. M
Gail, JM Samet, B Singer, A Tsiatis). Springer, New York.
https://doi.org/10.1007/978-0-387-77244-8
Steyerberg, E.W., Vickers, A.J., Cook, N.R., Gerds, T., Gonen, M., Obuchowski, N., Pencina,
M.J., Kattan, M.W., 2010. Assessing the Performance of Prediction Models: A
Framework for Traditional and Novel Measures. Epidemiology 21, 128–138.
Studerus, E., Ramyead, A., Riecher-Rössler, A., 2017. Prediction of transition to psychosis in
patients with a clinical high risk for psychosis: a systematic review of methodology
and reporting. Psychol. Med. 47, 1163–1178.
https://doi.org/10.1017/S0033291716003494
van Loo, H.M., Cai, T., Gruber, M.J., Li, J., de Jonge, P., Petukhova, M., Rose, S., Sampson,
N.A., Schoevers, R.A., Wardenaar, K.J., Wilcox, M.A., Al-Hamzawi, A.O., Andrade,
L.H., Bromet, E.J., Bunting, B., Fayyad, J., Florescu, S.E., Gureje, O., Hu, C., Huang,
Y., Levinson, D., Medina-Mora, M.E., Nakane, Y., Posada-Villa, J., Scott, K.M.,
Xavier, M., Zarkov, Z., Kessler, R.C., 2014. MAJOR DEPRESSIVE DISORDER
SUBTYPES TO PREDICT LONG-TERM COURSE: Research Article: MDD
Subtypes to Predict Long-Term Course. Depress. Anxiety 31, 765–777.
https://doi.org/10.1002/da.22233
Wald, N., Bestwick, J., 2014. Is the area under an ROC curve a valid measure of the
performance of a screening or diagnostic test? J. Med. Screen. 21, 51–56.

https://doi.org/10.1177/0969141313517497
Walsh, C.G., Ribeiro, J.D., Franklin, J.C., 2018. Predicting suicide attempts in adolescents
with longitudinal clinical data and machine learning. J. Child Psychol. Psychiatry 59,
1261–1270. https://doi.org/10.1111/jcpp.12916
Walsh, C.G., Ribeiro, J.D., Franklin, J.C., 2017. Predicting Risk of Suicide Attempts Over
Time Through Machine Learning. Clin. Psychol. Sci. 5, 457–469.
https://doi.org/10.1177/2167702617691560
Wardenaar, K.J., van Loo, H.M., Cai, T., Fava, M., Gruber, M.J., Li, J., de Jonge, P.,
Nierenberg, A.A., Petukhova, M.V., Rose, S., Sampson, N.A., Schoevers, R.A.,
Wilcox, M.A., Alonso, J., Bromet, E.J., Bunting, B., Florescu, S.E., Fukao, A., Gureje,
O., Hu, C., Huang, Y.Q., Karam, A.N., Levinson, D., Medina Mora, M.E., Posada-
Villa, J., Scott, K.M., Taib, N.I., Viana, M.C., Xavier, M., Zarkov, Z., Kessler, R.C.,
2014. The effects of co-morbidity in defining major depression subtypes associated
with long-term course and severity. Psychol. Med. 44, 3289–3302.
https://doi.org/10.1017/S0033291714000993
Wittchen, H.U., Lachner, G., Wunderlich, U., Pfister, H., 1998a. Test-retest reliability of the
computerized DSM-IV version of the Munich Composite International Diagnostic
Interview (M-CIDI). Soc. Psychiatry Psychiatr. Epidemiol. 33, 568–578.
https://doi.org/DOI 10.1007/s001270050095
Wittchen, H.U., Perkonigg, A., Lachner, G., Nelson, C.B., 1998b. Early Developmental
Stages of Psychopathology Study (EDSP): Objectives and design. Eur. Addict. Res. 4,
18–27. https://doi.org/Doi 10.1159/000018921
Wittchen, H.-U., Pfister, H., 1997. DIA-X-Interviews: Manual für Screening-Verfahren und
Interview; Interviewheft Längsschnittuntersuchung (DIA-X-Lifetime);
Ergänzungsheft (DIA-XLifetime); Interviewheft Querschnittsuntersuchung (DIA-X-
12 Monate); Ergänzungsheft (DIA-X-12 Monate); PC-Programm zur Durchführung
des Interviews (Längs- und Querschnittuntersuchung); Auswertungsprogramm. Swets
& Zeitlinger, Frankfurt.
World Health Organization, 2014. Preventing suicide: A global imperative [WWW
Document]. URL
https://apps.who.int/iris/bitstream/handle/10665/131056/9789241564779_eng.pdf;jses
sionid=ABD7A55A03CF6B869C727FE5E25CEE77?sequence=1 (accessed 3.1.19).
Wright, M.N., Ziegler, A., 2017. ranger: A Fast Implementation of Random Forests for High
Dimensional Data in C++ and R. J. Stat. Softw. 77, 1–17.
https://doi.org/10.18637/jss.v077.i01
Yarkoni, T., Westfall, J., 2017. Choosing Prediction Over Explanation in Psychology:
Lessons From Machine Learning. Perspect. Psychol. Sci. 12, 1100–1122.
https://doi.org/10.1177/1745691617693393
Table 1
Overview of the performance estimates for each prediction model.
Model AUC M AUC Md BS M BS Md Sens PPV
Logistic 0.828 0.825 0.179 0.190 0.223 0.704
Lasso 0.826 0.822 0.245 0.246 0.212 0.716
Ridge 0.829 0.830 0.466 0.461 0.251 0.658
Random forest 0.824 0.826 0.136 0.167 0.028 0.870
Note. AUC, area under the receiver operating curve; BS, Brier-scaled; Sens, sensitivity; PPV,
positive predictive value.
Figure 1. Boxplot of 100 resampling results for each prediction model (see median results in
Table 1). Logistic, Logistic regression model; Rf, Random forest model. Left: Area under the
curve (AUC), including the AUC of 0.58 as reported in the meta-analysis by Franklin et al.
(2017). Right: Scaled Brier score, with values below zero indicating a model
performance/calibration inferior to that of a chance prediction model applied to the validation
dataset.
Table 2
Overview of the decreasing importance of the 16 baseline predictors for each prediction model.
Logistic Lasso Ridge Random forest
Predictor β OR Rank % β OR Rank* % β OR Rank* % Importance Rank*
Prior SA (j) 0.454 1.57 1 57.5 0.439 1.55 1 55.2 0.130 1.14 1 13.9 4.434 1
Education (c) -0.405 0.67 2 33.3 -0.361 0.70 2 30.3 -0.031 0.97 5 3.1 0.235 10
Prior help-seeking (h) 0.296 1.34 3 34.5 0.276 1.32 3 31.7 0.048 1.05 2 4.9 0.664 3
Any parental mental dx (i) 0.278 1.32 4 32.1 0.226 1.25 4 25.4 0.018 1.02 10 1.8 0.117 14
Parental loss or separation (g) 0.245 1.28 5 27.8 0.216 1.24 5 24.1 0.030 1.03 6 3.1 0.199 11
BI (k) 0.221 1.25 6 24.8 0.193 1.21 6 21.2 0.029 1.03 7 3.0 0.428 5
Number of mental dx (d) 0.187 1.21 7 20.6 0.164 1.18 7 17.8 0.043 1.04 3 4.3 1.020 2
DH (n) -0.171 0.84 8 15.7 -0.107 0.90 9 10.2 -0.007 0.99 16 0.7 0.323 7
Number of traumatic events (e) 0.157 1.17 9 17.0 0.109 1.12 8 11.6 0.016 1.02 11 1.6 -0.086 15
Age (b) -0.142 0.87 10 13.2 -0.097 0.91 11 9.3 -0.011 0.99 15 1.1 0.173 13
Rural (o) -0.142 0.87 11 13.2 -0.097 0.91 10 9.3 -0.014 0.99 12 1.4 0.182 12
Sex (a) 0.106 1.11 12 11.2 0.060 1.06 13 6.2 0.011 1.01 14 1.1 0.045 16
PCE (p) -0.099 0.91 13 9.4 -0.081 0.92 12 7.8 -0.019 0.98 9 1.9 0.300 8
NLE (m) -0.020 0.98 14 2.0 0.000 1.00 15 0.0 0.013 1.01 13 1.3 0.480 4
Rape/Childhood sexual abuse (f) 0.011 1.01 15 1.1 0.013 1.01 14 1.3 0.034 1.03 4 3.4 0.240 9
PE (l) 0.001 1.00 16 0.1 0.000 1.00 15 0.0 0.023 1.02 8 2.4 0.343 6
Note. Letter in brackets after each predictor corresponds to ordering of predictors in section ―Selection and assessment of predictors‖; Rank*, Order
according to the predictor ranking of the logistic regression model; β, beta-coefficient of the (penalized) logistic regression model; OR, odds ratio;
%, OR translated to percentage. The original importance values of the random forest model have been multiplied by 1000, to avoid having to
display too many digits. Prior SA, Lifetime suicide attempt, reported at baseline; Education 1 = low, 2 = middle, 3 = high; dx disorder; BI,
behavioral inhibition; Number of mental dx, Number of DSM-IV diagnoses; DH daily hassles; Rural, 0 = living in an urban area; PCE perceived
coping efficacy (higher PCE values denote lower PCE); NLE, negative life events; PE, psychotic experiences.

Marcel 2019

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Marcel 2019

Uploaded by

Copyright:

Available Formats

Journal Pre-proof

Prospective prediction of suicide attempts in community adolescents

Miché Marcel PhD , Studerus Erich PhD ,

To appear in: Journal of Affective Disorders

Received date: 3 May 2019

© 2019 Published by Elsevier B.V.

Prospective prediction of suicide attempts in community adolescents and young

adults, using regression methods and machine learning

Miché, Marcel, PhD1

Studerus, Erich, PhD2

Meyer, Andrea Hans, PhD1

Gloster, Andrew Thomas, PhD3

Beesdo-Baum, Katja, PhD4, 5

Wittchen, Hans-Ulrich, PhD5, 6

Lieb, Roselind, PhD1

Psychology, Basel, Switzerland

Science, Basel, Switzerland

Roselind Lieb, PhD

Division of Clinical Psychology and Epidemiology

and analyses were also additionally supported by Deutsche Forschungsgemeinschaft (DFG)

University, Boston), and Dr. Jim van Os (Maastricht University).

prospective-longitudinal Early Developmental Stages of Psychopathology (EDSP) data set,

we compared four algorithms—logistic regression, lasso, ridge, and random forest—in

predicting a future SA in a community sample of adolescents and young adults.

random forest, were 0.828, 0.826, 0.829, and 0.824, respectively.

Conclusions. Based on our comparison, each algorithm performed equally well in

Further research and replication studies are required in this regard.

community sample, prospective design

prediction (Franklin et al., 2017). In an endeavor to increase accuracy of SA prediction,

2014; Studerus et al., 2017; Yarkoni and Westfall, 2017).

being outperformed by ML models across all reported depression-related outcomes, SA being

studies applied statistical models in combination with techniques (e.g., penalization,

2018] and 0.93 [Nock et al., 2018]).

We are unaware of any study that investigated SA prediction accuracy by applying

possible. To explore the utility of ML approaches we examined the prediction accuracy of

Developmental Stages of Psychopathology (EDSP) Study, which prospectively assessed

community adolescents and young adults over the course of 10 years.

Selection and assessment of predictors

history of psychopathology, general psychopathology, psychosis, prior self-injurious thoughts

2017) in order to avoid methodological shortcomings, such as unreliable predictor selection

they are in other methods (Pavlou et al., 2016).

assessed with the computer-assisted Munich-Composite International Diagnostic Interview

imprisonment/kept hostage/abduction, witness to someone else experiencing a traumatic

psychopathology (assessed via family history information provided by the offspring at

asked the SA question (T2: lifetime, T3: since last interview).

Additional predictors assessed at baseline were (k) behavioral inhibition (assessed

scale values denote lower perceived coping efficacy).

and random forest, a widely used ML algorithm (Fernández-Delgado et al., 2014).

All data-related procedures were done in the statistical software environment R,

Learning in R; Bischl et al., 2016), which is a framework for ML experiments in R (R Core

Prediction models and performance measures

We selected a conventional logistic regression model (based on the full set of

(Wright and Ziegler, 2017).

variables equally, we normalized them.

Second, in accordance with the Transparent Reporting of a Multivariable Prediction

latter measuring a model’s overall performance (combination of both discrimination and

Third, as a means to avoid performance estimates that are optimistically biased

(overfitting), we applied repeated nested cross-validation, which is the recommended method

model, we used stratified out-of-bag bootstrapping with 50 iterations (stratification being

iteration. Bootstrapping is a recommended choice when selecting a prediction model, because

cross-validation, with 10 repetitions per fold. Repeated n-fold cross-validation is

prediction performances for each model.

Predictive performance measures