
Drug and Alcohol Dependence 236 (2022) 109476

Contents lists available at ScienceDirect

Drug and Alcohol Dependence


journal homepage: www.elsevier.com/locate/drugalcdep

A Bayesian learning model to predict the risk for cannabis use disorder
Rajapaksha Mudalige Dhanushka S. Rajapaksha a, Francesca Filbey b, Swati Biswas a,*, Pankaj Choudhary a,*
a Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
b School of Behavioral and Brain Sciences, University of Texas at Dallas, Richardson, TX, USA

ARTICLE INFO

Keywords: Cannabis use disorder; Prediction model; Bayesian methods; Machine learning; Model validation

ABSTRACT

Background: The prevalence of cannabis use disorder (CUD) has been increasing recently and is expected to increase further due to the rising trend of cannabis legalization. To help stem this public health concern, a model is needed that predicts for an adolescent or young adult cannabis user their personalized risk of developing CUD in adulthood. However, there exists no such model that is built using nationally representative longitudinal data.

Methods: We use a novel Bayesian learning approach and data from Add Health (n = 8712), a nationally representative longitudinal study, to build logistic regression models using four different regularization priors: lasso, ridge, horseshoe, and t. The models are compared by their prediction performance on unseen data via 5-fold cross-validation (CV). We assess model discrimination using the area under the curve (AUC) and calibration by comparing the expected (E) and observed (O) number of CUD cases. We also externally validate the final model on independent test data from Add Health (n = 570).

Results: Our final model is based on lasso prior and has seven predictors: biological sex; scores on personality traits of neuroticism, openness, and conscientiousness; and measures of adverse childhood experiences, delinquency, and peer cannabis use. It has good discrimination and calibration performance as reflected by its respective AUC and E/O of 0.69 and 0.95 based on 5-fold CV and 0.71 and 1.10 on validation data.

Conclusion: This externally validated model may help in identifying adolescent or young adult cannabis users at high risk of developing CUD in adulthood.

1. Introduction

Substance use disorders (SUDs) constitute a serious public health issue in the US (SAMHSA, 2016). In 2019, among persons aged 12 or older, 20.4 million (7.4%) met the clinical diagnostic criteria for a past year SUD of alcohol or any illicit drugs (SAMHSA, 2020). Cannabis is the most commonly used illicit substance in the world (NIDA, 2019). In the US, several states have legalized medical and/or recreational use of cannabis (Bridgeman and Abazia, 2017; Hasin et al., 2017). These new regulations may be partly responsible for a recent increase in cannabis use and its adverse consequences (NCDAS, 2018).

Cannabis use among adolescents and young adults has also increased in recent years (Schulenberg et al., 2021). In 2019, 43% of college students consumed cannabis, which was the highest rate of cannabis consumption in that group since 1983 (NCDAS, 2018). It is estimated that around 34% of cannabis users develop cannabis use disorder (CUD) during their lifetime based on the 4th edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) (Marel et al., 2019). Another study using DSM-5 criteria found that the lifetime probability of transition from cannabis use to CUD is 27% (Feingold et al., 2020). Furthermore, people who use cannabis are more likely to be polysubstance users (NCDAS, 2018). Given that SUD is a chronic brain disorder associated with long-term, negative consequences (Heilig et al., 2021), the need to identify adolescent and young adult substance users who are at risk of developing CUD/SUD in future has become more important than ever.

Several risk factors are known to be associated with substance use and SUDs in general and CUD in particular. These include male sex, substance initiation at an early age, early exposure to traumatic events, family and peer substance use, mood related risk factors such as depression and anxiety, personality traits such as impulsivity, conduct disorder symptoms such as delinquent behavior, and attention deficit hyperactivity disorder (ADHD) (Beaton et al., 2014; Douglas et al., 2010; Gray and Squeglia, 2018; Ketcherside et al., 2016; Koh et al., 2017; Lowe

* Correspondence to: 800 W Campbell Rd, FO 35, Richardson, TX 75025, USA.


E-mail addresses: swati.biswas@utdallas.edu (S. Biswas), pankaj@utdallas.edu (P. Choudhary).

https://doi.org/10.1016/j.drugalcdep.2022.109476
Received 28 December 2021; Received in revised form 19 April 2022; Accepted 23 April 2022
Available online 29 April 2022
0376-8716/© 2022 Elsevier B.V. All rights reserved.
R.M.D.S. Rajapaksha et al. Drug and Alcohol Dependence 236 (2022) 109476

et al., 2020; Meier et al., 2016; Rajapaksha et al., 2020; Tomko et al., 2019; Verdejo-García et al., 2008; Zhang-James et al., 2020).

With a vast literature available on risk factors for SUD, a natural and practically important next step is to build risk prediction models for SUD. Indeed, such models have been developed for several diseases and disorders, including depression, ADHD, and mental illness (Bernardini et al., 2017; Cattelani et al., 2019; Caye et al., 2019; Chowdhury et al., 2018; D’Agostino et al., 2008; Gail et al., 1989). Although some studies have developed simple cumulative risk indices and/or risk scores for SUD outcomes (Hayatbakhsh et al., 2009; Meier et al., 2016), efforts to build risk prediction models to predict an SUD outcome are a recent development. Rajapaksha et al. (2020) proposed a preliminary model for predicting CUD by applying statistical and machine learning (ML) techniques. Another recent study applied ML techniques to a longitudinal dataset and predicted SUD (Jing et al., 2020). A few other studies also used ML techniques to predict substance use outcomes (Hu et al., 2020; Nasir et al., 2021; Zhang-James et al., 2020; Zoboroski et al., 2021).

However, all these studies have a common and major limitation of being based on data from a limited geographic location or high-risk population (Hayatbakhsh et al., 2009; Hu et al., 2020; Jing et al., 2020; Nasir et al., 2021; Rajapaksha et al., 2020; Zhang-James et al., 2020). Therefore, these models are not generalizable to a larger or nationwide population of substance users and thus may not be suitable for risk assessment in clinical practice. Moreover, some studies have focused only on certain specific types of predictors rather than considering a comprehensive set of risk factors and hence are not suitable for risk prediction as such. Another crucial aspect of any risk prediction model is the use of longitudinal data so that one can ensure that the risk factors have been measured before the development of the outcome rather than their measurements being effects of the outcome. Although some studies used longitudinal data, they did not consider the effect of the longitudinal trajectory of risk factors and used their cross-sectional summaries (Hayatbakhsh et al., 2009; Hu et al., 2020; Jing et al., 2020; Meier et al., 2016). Finally, and perhaps most importantly, none of these tools has been independently validated on external data.

Given the large number of potential risk factors for SUD, for practical utility it is important to ensure that a risk prediction model is parsimonious. This may be achieved using regularization methods as they shrink (penalize) regression coefficients towards zero, thereby avoiding overfitting and improving prediction accuracy on future unseen data (Park and Casella, 2008; Tibshirani, 1996, 2011). Regularization is accomplished by adding a penalty term to the negative log-likelihood function that we need to minimize to fit a model. For example, the penalty term under lasso (ridge) regularization is the sum of absolute values (squares) of the slope coefficients multiplied by a non-negative penalty parameter. This parameter is chosen optimally through cross-validation.

Such regularization can be achieved through Bayesian learning methods in a more flexible way. In the classical framework, the penalty parameter has to be estimated separately from the regression coefficients. In the Bayesian framework, by contrast, the penalty parameter is part of the whole model, which contains not only the regression coefficients but other parameters as well (e.g., variance parameters). All parameters have priors (e.g., in Bayesian lasso, regression coefficients have Laplace priors) and thus regularization takes place within the model building process in an integrated manner (Van Erp et al., 2019). Furthermore, Bayesian regularization methods naturally quantify the uncertainty in estimates through posterior distributions, which is not so straightforward for their classical counterparts (Carvalho et al., 2010; O’Hara and Sillanpaa, 2009; Park and Casella, 2008; Van Erp et al., 2019). Despite these advantages, there is no study in the SUD literature that has used Bayesian learning methods for risk prediction.

In this study, we built Bayesian risk prediction models to predict the risk of developing CUD in adulthood based on risk factors (which may vary over time) measured in adolescence and young adulthood. We used nationally representative longitudinal data from the National Longitudinal Study on Adolescent Health (Add Health) (Harris et al., 2009). A comprehensive set of potential predictors, including demographic, behavioral, personality, and cognitive characteristics of individuals, was considered. The final model was independently validated on an external test dataset, also obtained from Add Health. This study has been approved by the Institutional Review Board of the University of Texas at Dallas.

2. Methods

2.1. Participants

Add Health used a multistage stratified cluster sampling design to ensure that the sample reflected the adolescent population of the United States in terms of urbanicity, region, school size, school type (public/private), and ethnicity (Harris et al., 2009). Adolescents were first enrolled when they were in grades 7–12 during the 1994–95 school year (wave I). They were followed up in 1996 (wave II), 2001–2002 (wave III), and 2008 (wave IV). The most recent wave (wave V) was in 2016–18; however, information regarding SUDs was not collected in this wave. We therefore used data up to wave IV, during which the participants were adults aged 24–32.

2.2. Data preparation

Our response variable is a binary measure of lifetime diagnosis of CUD, which was measured only in wave IV. For model building, we included only those participants who started using cannabis during any of the first three waves, participated in wave IV, and have survey weights available. Diagnosis of CUD was derived from Add Health items that were originally based on DSM-IV for lifetime diagnosis of cannabis use dependence. Specifically, each item had dichotomous (Yes/No) responses indicating whether or not one engaged in a certain substance dependence behavior corresponding to each of the diagnostic criteria outlined in DSM-IV for cannabis dependence, such as tolerance, withdrawal, etc. An answer of “yes” to three or more questions within a 12-month period was indicative of CUD. This is a widely used criterion to measure CUD (Feingold et al., 2020). Potential participants were excluded if they developed CUD but did not provide an age of CUD onset or had inconsistency between their reported status of ever using cannabis and age of first cannabis use across different waves. There were 9491 participants after applying these inclusion and exclusion criteria.

Add Health administered several age-specific questionnaires containing more than 1000 items. We constructed around 50 potential predictors from these questionnaires based on literature review. Some predictors were cross-sectional, measured in a single wave, while the others were longitudinal, measured across multiple waves. We considered two ways of summarizing longitudinal predictors for inclusion in the CUD model: (1) using random effects obtained by fitting a separate linear mixed effects model (LMM) for each predictor relating the predictor to age of participants at different waves (see Supplementary Materials), and (2) taking the average/maximum exposure over different waves (Chen et al., 2015; Dandis et al., 2020).

2.3. Initial variable processing

While most of the predictors came from waves I to III, the following two types came from wave IV. One retrospectively measured adverse childhood experiences (ACEs), which occurred before age 18 and were used to create the ACE scale. The other measured personality traits that are believed to remain relatively stable over time (Caspi et al., 2005; Damian et al., 2019). Further, although CUD was measured in wave IV, a person could have developed CUD for the first time before wave IV. Hence, for each CUD case, we only considered the portion of their data that were recorded before their age of CUD onset. Thus, it was effectively ensured that all predictors were measured before the outcome.
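The temporal filtering described in Section 2.3 (keeping, for each CUD case, only the measurements recorded before the age of CUD onset) can be illustrated with a minimal sketch. The record layout, the function name `pre_onset`, the toy values, and the use of a strict inequality are assumptions made for illustration; they are not taken from the Add Health data or from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Measurement:
    wave: int     # survey wave in which the predictor was recorded
    age: float    # participant's age at that wave
    value: float  # predictor value (hypothetical, scaled to [0, 1])


def pre_onset(records: List[Measurement],
              onset_age: Optional[float]) -> List[Measurement]:
    """Keep only measurements taken before the age of CUD onset.

    Controls (onset_age is None) keep all of their measurements.
    The strict '<' is an assumption about how "before" is interpreted.
    """
    if onset_age is None:
        return list(records)
    return [r for r in records if r.age < onset_age]


# Hypothetical participant measured at ages 15, 16, and 21.
records = [
    Measurement(wave=1, age=15.0, value=0.12),
    Measurement(wave=2, age=16.0, value=0.20),
    Measurement(wave=3, age=21.0, value=0.05),
]

# A case with onset at age 19 contributes only waves 1 and 2.
kept_case = pre_onset(records, onset_age=19.0)
# A control contributes all three waves.
kept_control = pre_onset(records, onset_age=None)
```

Filtering at the measurement level, rather than dropping the participant, keeps cases with early onset in the sample as long as at least one measurement precedes onset.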


Some predictors had substantial missing data. Therefore, an initial variable filtering process was implemented to identify potentially important predictors while maximizing the number of participants with complete data. As complex survey design must be accounted for even in this initial variable filtering, we fitted survey multiple logistic regression models using varying sample sizes (corresponding to different proportions of missing values) and considered a variable to be potentially important if its p-value was less than 0.3 in at least one of the models. This process identified 15 predictors (see Table 1) and 8712 participants with complete data on them. All quantitative predictors were scaled/transformed to lie within [0,1] by dividing by their maximum possible value (all except TV hours) or their maximum value observed in the data (TV hours). This helps in ensuring portability of the model to data wherein predictors are measured in different scales. This final data set is used as the training data.

Table 1
Summary of predictors that passed initial variable filtering.

| Predictor | Description |
|---|---|
| Biological sex | 0 = Female, 1 = Male |
| Race | 1 = White, 2 = Black/African American, 3 = American Indian/Native American, 4 = Asian, 5 = other |
| ACE scale | Number of adverse childhood experiences that occurred before age 18. |
| Neuroticism scale | Measure of general tendency to experience negative feelings. A higher value implies a greater tendency. |
| Conscientiousness scale | Measure of forward planning, organization, and ability to carry out tasks. A higher value implies greater conscientiousness. |
| Agreeableness scale | Measure of compassion, eagerness to cooperate, and tendency to avoid conflict. A higher value implies more agreeableness. |
| Openness scale | Measure of openness to new experiences and imaginativeness. A higher value implies more openness. |
| Anxiety scale | Number of times experienced anxiety symptoms. |
| Depression scale | Number of times experienced depression symptoms. |
| Delinquency scale | Number of times involved in delinquent activities. A higher value implies a greater involvement. |
| Violence victimization scale | Number of times experienced violent incidents. |
| Peer alcohol use | Number of best friends (out of 3 best friends) who drink alcohol at least once a month. |
| Peer cannabis use | Number of best friends (out of 3 best friends) who use cannabis at least once a month. |
| Peer smoking | Number of best friends (out of 3 best friends) who smoke at least 1 cigarette a day. |
| TV hours | Number of hours per week watched television. |

2.4. Statistical methods

2.4.1. Variable selection and Bayesian learning models

We used a Bayesian learning approach within the framework of logistic regression models with random effects. This involves specifying priors for the unknown model parameters. The priors on regression coefficients serve to regularize them (Gelman et al., 2014). We used four different regularization priors: lasso, ridge, horseshoe, and t (Van Erp et al., 2019). All models incorporated the complex survey sampling design by weighting each participant’s data with their survey weight in the likelihood and including the stratification variable region as a fixed covariate and the clustering variable school as a random effect (see Supplementary Materials for details).

As priors for regression coefficients are continuous, they do not automatically provide variable selection. Therefore, we used two methods for variable selection after fitting the full models with 15 predictors: credible interval and probability thresholding (Li and Lin, 2010; Van Erp et al., 2019). Briefly, in the first method, a predictor is selected if a credible interval for its regression coefficient excludes 0. The level of the interval is varied, e.g., 70%, 80%, 95%, etc. As the level increases, the interval becomes wider, it includes 0 more often, and hence fewer predictors are selected. The second method excludes a variable if the posterior probability that its regression coefficient is within ± 1 posterior standard deviation (SD) of 0 exceeds a certain threshold. The threshold is varied, e.g., 0.3, 0.4, 0.5, etc. A higher threshold leads to fewer variables meeting the exclusion requirement and hence to fewer excluded variables. By applying these variable selection methods with different levels/thresholds and different regularization priors, we obtained a total of 40 competing models. Each model provides a predicted probability of CUD for an individual (see Supplementary Materials for its calculation).

2.4.2. Using prediction accuracy to obtain the best model

To identify the best among the competing models, we compared their prediction accuracy on unseen data via 5-fold cross-validation (CV) (James et al., 2013) and Bayesian leave-one-out (LOO) cross-validation (Vehtari et al., 2017). These approaches provide a good assessment of model performance on unseen data by protecting against overfitting. The model discrimination is assessed by area under the receiver operating characteristic curve (AUC) based on 5-fold CV and the Bayesian LOO estimate. Higher AUC and LOO indicate better prediction accuracy. To evaluate model calibration, we compared the expected number of cases (E) based on 5-fold CV with the observed (O) number of cases. The closer the E/O value is to 1, the better is the model performance.

2.4.3. External validation

To externally validate the final proposed model, we utilized an independent test dataset. It consisted of Add Health participants whose survey weights were missing and hence were not included in the training data. We calculated their risk of developing CUD using the final model and compared it with their actual CUD status. Then we calculated AUC and E/O.

More information about statistical methods is available in Supplementary Materials. The models were fitted using the statistical software system R (R Core Team, 2019) with the following packages: RStan (Stan Development Team, 2020) for Bayesian inference, lme4 (Bates et al., 2015) to summarize longitudinal predictors using random effects, survey (Lumley, 2004) for initial variable filtering, loo for cross-validation (Vehtari et al., 2020), and pROC (Robin et al., 2011) for AUC.

3. Results

3.1. Sample characteristics

In our final dataset (n = 8712), the unweighted and weighted prevalence of lifetime CUD are 7.51% and 7.84%, respectively. As seen in Table 2, compared to controls (i.e., the non-CUD participants), the cases are more likely to be males (61% vs 50%); and on average experienced more ACEs (0.29 vs 0.25) and reported higher neuroticism (0.56 vs 0.52) and openness (0.76 vs 0.73) but lower conscientiousness (0.70 vs 0.73). Table 3 presents sample characteristics for longitudinal predictors across the waves in which they were measured. Generally, depressive symptoms, peer alcohol use, peer cannabis use, and peer smoking increased over time while delinquent activities and violence victimization decreased. For all the predictors, the average score of a predictor was at least as high for cases as for controls.

3.2. Results from Bayesian learning models

As described in the Methods section, we applied four shrinkage priors and performed variable selection among 15 predictors that passed the initial filtering using credible interval and thresholding methods. Interestingly, the model with highest prediction accuracy for each prior contained the same seven predictors: biological sex, ACE, conscientiousness, neuroticism, openness, delinquency, and peer cannabis use. Next, we fitted models with the same priors but with data on only these seven predictors, resulting in a slightly larger sample size of 8753 due to


fewer missing observations.

Table 2
Summary of cross-sectional predictors: percentage for categorical predictors and mean (standard deviation) for continuous predictors. The p-values are based on chi-square test of association in case of categorical predictors and on univariate logistic regression models in case of continuous predictors.

| Predictor | Total (n = 8712) | Cases (n = 654) | Controls (n = 8058) | p-value |
|---|---|---|---|---|
| Biological sex | | | | 0.001 |
| Male % | 50.85 | 61.16 | 50.01 | |
| Female % | 49.15 | 38.84 | 49.99 | |
| Race | | | | 0.507 |
| White % | 70.14 | 71.87 | 70 | |
| Black/African American % | 20.14 | 20.34 | 20.13 | |
| American Indian/Native American % | 2.03 | 1.99 | 2.04 | |
| Asian % | 5.75 | 4.13 | 5.88 | |
| Other % | 1.93 | 1.68 | 1.95 | |
| ACE scale | 0.25 (0.19) | 0.29 (0.20) | 0.25 (0.19) | 0.001 |
| Neuroticism scale | 0.53 (0.14) | 0.56 (0.14) | 0.52 (0.14) | < 0.001 |
| Conscientiousness scale | 0.73 (0.14) | 0.70 (0.14) | 0.73 (0.14) | < 0.001 |
| Agreeableness scale | 0.76 (0.12) | 0.76 (0.12) | 0.76 (0.12) | 0.876 |
| Openness scale | 0.73 (0.12) | 0.76 (0.13) | 0.73 (0.12) | 0.001 |
| TV hours | 0.16 (0.15) | 0.17 (0.15) | 0.16 (0.15) | 0.229 |

As delinquency and peer cannabis use are longitudinal predictors, we explored the possibility of including them in the model simply as average or maximum over the waves in which they were measured rather than as random effects. We found that the average provided better prediction performance for delinquency (Table 4). As average is also simpler for practical implementation, we use it in the final model. On the other hand, for peer cannabis use, the random effect provided better performance and hence was retained. Next, as the model performance was practically identical for all four priors (see Table 4), we chose the lasso model because it is the most popular among the models considered. Lastly, the stratification variable region was dropped from the final model as it did not increase prediction accuracy and the simpler model is easier to use in practice.

The results of our final model are presented in Table 5. It shows posterior means of regression coefficients (β) and odds ratios (OR) for predictors in the model and their 95% credible intervals. For continuous cross-sectional predictors in the model, which are scaled to lie between 0 and 1, it also shows posterior means and 95% credible intervals for odds ratios on original scale (OR*), which are given by exp(β/M), where M is the maximum possible value for the predictor (used for scaling). For a longitudinal predictor, obtaining odds ratio on the original scale is not straightforward as its maximum possible value varies across waves. A higher likelihood of CUD was associated with male sex, greater adverse childhood experiences and involvement in delinquent activities, higher neuroticism and openness, lower conscientiousness, and larger proportion of friends using cannabis. The AUC of this model was 0.69 and its E/O was 0.953. These indicate good discrimination ability and calibration performance of the model.

3.3. External validation results

Next, we applied the final model to the independent validation dataset (n = 570). The prevalence of lifetime CUD in this sample was 7.19%, comparable to 7.51% in the training data. The AUC was 0.71 and the E/O was 1.10. Thus, although the model slightly overpredicted the number of cases, its discrimination ability was good, and it was well calibrated. Table 6 shows the E/O for the five risk quintile groups, two levels of biological sex, and the groups obtained using median as cutoff for each risk factor. We see that the E/O value is quite close to 1 for the two highest risk quintile groups, male sex, below median neuroticism, above median openness, and all delinquency, conscientiousness, and ACE groups, indicating that the model performed very well for these groups. For the remaining groups also, the E/O generally does not deviate too far from 1 (except the risk quintile group 3 consisting of moderate risk individuals).

4. Discussion

Substance use disorders constitute a growing public health problem in the US, incurring enormous societal costs (NIDA, 2017, 2019; WHO, 2020). Increasing legalization of cannabis usage is likely to exacerbate this trend. Thus, there is an urgent need to develop prediction models to identify adolescents and young adults who are at high risk of developing CUD in future so that early intervention and preventive measures may be provided. In this study, we built such a model using nationally representative longitudinal data. To our knowledge, this is the first such risk prediction model for any SUD. This model was obtained by applying state-of-the-art regularization methods under the framework of Bayesian learning. It uses only seven predictors and validated well on external test data. For comparison, we also tried unregularized Bayesian models, but they had a larger number of predictors and lower prediction accuracy than the proposed model.

The risk factors identified in our model are consistent with the

Table 4
Comparison of prediction performance of Bayesian learning models using the random effects (RE) and average for delinquency scale.

| Model | Delinquency summary | AUC | E/O | LOO |
|---|---|---|---|---|
| Lasso model | RE for delinquency | 0.66 | 0.952 | -2233.5 |
| Lasso model | average for delinquency | 0.69 | 0.953 | -2192.4 |
| Ridge model | RE for delinquency | 0.66 | 0.950 | -2233.9 |
| Ridge model | average for delinquency | 0.69 | 0.952 | -2192.5 |
| t model | RE for delinquency | 0.66 | 0.951 | -2233.5 |
| t model | average for delinquency | 0.69 | 0.952 | -2192.4 |
| Horseshoe model | RE for delinquency | 0.66 | 0.952 | -2233.7 |
| Horseshoe model | average for delinquency | 0.69 | 0.955 | -2192.8 |
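The two performance measures reported in Table 4, AUC for discrimination and E/O for calibration, can be sketched in a few lines. This is an illustrative sketch only: the authors report using the R package pROC for AUC, whereas the pairwise AUC below is a plain-Python stand-in, and the function names and toy scores are assumptions. The E and O values in the calibration example are the risk quintile rows of Table 6.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case (label 1) has a
    higher predicted risk than a randomly chosen control (label 0);
    ties count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def expected_over_observed(expected: Sequence[float],
                           observed: Sequence[float]) -> float:
    """E/O ratio: total expected cases divided by total observed cases."""
    total_observed = sum(observed)
    if total_observed == 0:
        raise ValueError("no observed cases")
    return sum(expected) / total_observed


# Toy discrimination example (hypothetical predicted risks).
toy_auc = auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])  # 3 of 4 pairs ordered correctly

# Calibration on the validation data, using the quintile rows of Table 6.
e_by_quintile = [3.66, 5.25, 6.80, 9.44, 20.14]
o_by_quintile = [3, 6, 3, 9, 20]
e_over_o = expected_over_observed(e_by_quintile, o_by_quintile)  # about 1.10
```

Because the quintile groups partition the validation sample, summing their E and O values recovers the overall E/O of about 1.10 reported in Section 3.3. In practice E for a group would be the sum of the individual predicted probabilities in that group.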

Table 3
Summary of longitudinal predictors: Mean (standard deviation) across different waves. The p-values are based on a logistic regression model with SUD status as
response and random effects associated with the longitudinal predictor as covariates.
Wave 1 Wave 2 Wave 3

Variable Overall Cases Controls Overall Cases Controls Overall Cases Controls p-value

Anxiety scale 0.15 (0.10) 0.16 (0.11) 0.15 (0.10) 0.15 (0.10) 0.16 (0.10) 0.15 (0.10) 0.006
Depression scale 0.27 (0.09) 0.29 (0.10) 0.27 (0.09) 0.28 (0.09) 0.28 (0.09) 0.28 (0.09) 0.30 (0.11) 0.33 (0.11) 0.30 (0.11) 0.008
Delinquency scale 0.12 (0.13) 0.16 (0.14) 0.12 (0.13) 0.09 (0.11) 0.12 (0.13) 0.08 (0.11) 0.03 (0.06) 0.05 (0.07) 0.03 (0.06) < 0.001
Violence victimization scale 0.05 (0.07) 0.06 (0.08) 0.05 (0.07) 0.02 (0.06) 0.03 (0.07) 0.02 (0.06) 0.02 (0.06) 0.02 (0.06) 0.02 (0.05) 0.002
Peer alcohol use 0.45 (0.40) 0.48 (0.41) 0.45 (0.40) 0.47 (0.40) 0.47 (0.40) 0.46 (0.40) 0.66 (0.39) 0.72 (0.36) 0.65 (0.39) 0.01
Peer cannabis use 0.28 (0.36) 0.34 (0.39) 0.28 (0.36) 0.33 (0.37) 0.4 (0.40) 0.32 (0.37) <0.001
Peer smoking 0.34 (0.38) 0.36 (0.39) 0.33 (0.37) 0.37 (0.39) 0.42 (0.40) 0.37 (0.39) <0.001
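The second way of summarizing a longitudinal predictor described in Section 2.2, taking the average or maximum over the waves in which it was measured, can be sketched as follows. The function name `summarize_waves`, the use of `None` for a wave with no measurement, and the toy values are assumptions for illustration, not the authors' implementation.

```python
from typing import Optional, Sequence


def summarize_waves(values: Sequence[Optional[float]],
                    how: str = "mean") -> Optional[float]:
    """Summarize one participant's predictor over the waves where it was measured.

    `None` marks a wave with no measurement. Returns None when the
    predictor was not measured in any wave.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None
    if how == "mean":
        return sum(observed) / len(observed)
    if how == "max":
        return max(observed)
    raise ValueError(f"unknown summary: {how!r}")


# Hypothetical scaled delinquency scores at waves I-III, wave II missing.
delinquency = [0.12, None, 0.03]
avg_score = summarize_waves(delinquency, "mean")
max_score = summarize_waves(delinquency, "max")
```

Skipping unmeasured waves rather than discarding the participant is consistent with the statement in the Discussion that a longitudinal predictor could be included provided it was measured in at least one wave.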


Table 5
Final Bayesian learning model using lasso prior: Posterior means of regression coefficients, posterior means of odds ratios (OR), 95% credible intervals for OR, posterior means of odds ratio (OR*) on original scale (only for continuous cross-sectional variables), and 95% credible intervals for OR*.

| Variable | Posterior mean of coefficient | Posterior mean of OR | Credible interval for OR | Posterior mean of OR* | Credible interval for OR* |
|---|---|---|---|---|---|
| Intercept | -4.68 | | | | |
| Biological sex | 0.33 | 1.40 | (1.17, 1.66) | | |
| ACE scale | 0.51 | 1.71 | (1.08, 2.56) | 1.06 | (1.01, 1.11) |
| Conscientiousness scale | -1.14 | 0.34 | (0.17, 0.58) | 0.94 | (0.92, 0.97) |
| Neuroticism scale | 1.72 | 5.86 | (3.01, 10.49) | 1.09 | (1.06, 1.12) |
| Openness scale | 1.68 | 5.71 | (2.66, 10.89) | 1.09 | (1.05, 1.13) |
| Delinquency scale | 3.84 | 50.07 | (22.43, 98.38) | | |
| Peer cannabis use | 0.67 | 2.01 | (1.24, 3.06) | | |
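The original-scale odds ratio in Table 5 is defined in Section 3.2 as OR* = exp(β/M). The sketch below evaluates that formula for a single coefficient value. The coefficient 1.72 is the posterior mean for neuroticism from Table 5, but the maximum M = 20 is a hypothetical value chosen for illustration, since the source does not state the scale maxima, and the function name is likewise an assumption.

```python
import math


def odds_ratio_original_scale(beta: float, max_value: float) -> float:
    """OR* = exp(beta / M): odds ratio per one-unit change on the unscaled predictor."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return math.exp(beta / max_value)


beta_neuroticism = 1.72  # posterior mean coefficient from Table 5
M = 20.0                 # HYPOTHETICAL maximum possible score (not given in the source)

# Plug-in odds ratio per one-point increase on the hypothetical 0-20 scale.
or_star_plugin = odds_ratio_original_scale(beta_neuroticism, M)

# Plug-in odds ratio for moving across the whole scaled range [0, 1].
or_plugin = math.exp(beta_neuroticism)
```

These are plug-in values computed from the posterior mean of β. They need not match the posterior means of OR and OR* in Table 5, which average exp(β) or exp(β/M) over the posterior draws; because the exponential function is convex, the posterior mean of exp(β) is at least exp of the posterior mean of β, which is consistent with the tabulated OR of 5.86 exceeding exp(1.72).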

literature. In particular, it is known that males are more likely to develop CUD than females (Hayatbakhsh et al., 2009; Jing et al., 2020; Meier et al., 2016). In line with our finding that peer cannabis use increases the likelihood of CUD, a recent study reported that peer substance use increases the likelihood of becoming a user of cannabis and other substances (Lowe et al., 2020). Three of our seven predictors are related to personality traits, which have been previously identified as important in predicting substance use and dependence (Rajapaksha et al., 2020; Zoboroski et al., 2021). A recent study identified several delinquency-related activities to be important in predicting SUD (Jing et al., 2020). Moreover, individuals with more delinquent behavior are more likely to start using substances at a young age and to develop drug use disorders (Koh et al., 2017; Richmond-Rakerd et al., 2016). Previous studies have also identified childhood adverse events as an important predictor of substance use and dependence outcomes (Hayatbakhsh et al., 2009; Moss et al., 2020). Some seemingly important predictors such as personal substance use and socio-demographics were excluded during our initial variable filtering or variable selection steps. It is possible that their effects are partially captured by the variables in the final model.

Table 6
Expected (E) and observed (O) number of CUD cases for external validation data.

| Grouping | Group | n | E | O | E/O |
|---|---|---|---|---|---|
| Risk quintile groups | Group 1 | 114 | 3.66 | 3 | 1.22 |
| Risk quintile groups | Group 2 | 114 | 5.25 | 6 | 0.88 |
| Risk quintile groups | Group 3 | 114 | 6.80 | 3 | 2.27 |
| Risk quintile groups | Group 4 | 114 | 9.44 | 9 | 1.05 |
| Risk quintile groups | Group 5 | 114 | 20.14 | 20 | 1.01 |
| Biological sex | Male | 293 | 27.89 | 29 | 0.96 |
| Biological sex | Female | 277 | 17.40 | 12 | 1.45 |
| ACE scale | Below median | 318 | 21.12 | 19 | 1.11 |
| ACE scale | Above median | 252 | 24.17 | 22 | 1.10 |
| Conscientiousness scale | Below median | 335 | 29.87 | 28 | 1.07 |
| Conscientiousness scale | Above median | 235 | 15.43 | 13 | 1.19 |
| Neuroticism scale | Below median | 289 | 18.65 | 18 | 1.04 |
| Neuroticism scale | Above median | 281 | 26.64 | 23 | 1.16 |
| Openness scale | Below median | 366 | 25.06 | 19 | 1.32 |
| Openness scale | Above median | 204 | 20.23 | 22 | 0.92 |
| Delinquency scale | Below median | 286 | 14.32 | 13 | 1.10 |
| Delinquency scale | Above median | 284 | 30.97 | 28 | 1.11 |
| Peer cannabis use | Below median | 285 | 15.67 | 19 | 0.82 |
| Peer cannabis use | Above median | 285 | 29.62 | 22 | 1.35 |

The relationship between social inequality and substance use is currently of great scientific interest. Add Health has risk factors such as socioeconomic position during childhood (Walsh et al., 2019). Our final model contained the ACE scale, which measures several dimensions of social inequality, including childhood maltreatment, parental separation and divorce, and household criminality and incarceration.

There are a few limitations in this study. Add Health data are relatively old and do not capture recent changes in the cannabis access landscape due to the introduction of new cannabis-based products and methods of use (Knapp et al., 2019; Spindle et al., 2019). Moreover, Add Health did not measure neurocognitive and neuro-imaging data that may be important in predicting SUD. Unfortunately, to our knowledge, there is no other publicly available dataset wherein the participants are nationally representative, have been longitudinally followed from adolescence/youth to adulthood, and have clinical measures of SUD and its potential risk factors measured. Nonetheless, the known personal factors that make one susceptible to SUD are likely to remain important even with the increasing prevalence of cannabis use (Afuseh et al., 2020; CDC, 2020). As newer and more comprehensive data become available, potentially including neurocognitive and neuro-imaging data as well, it will be important to build new models using them and compare them with the proposed model.

Also, some participants had missing data in one or more waves (Supplementary Table 1 shows percentage of missing values for each predictor). Although this contributed to the overall amount of missing data for cross-sectional predictors, a longitudinal predictor could be included for a participant provided there was at least one wave in which the predictor was measured. We did not attempt to impute missing data because variables related to substance use are not amenable to imputation, e.g., the variable past 12 months alcohol days is undefined (and missing) for a non-user of alcohol and should not be imputed. Moreover, there were no predictors that had a strong relationship with CUD (based on univariate analyses) and also had a large proportion of missing data except for past 12 months alcohol use. This variable had 7.85% missing values and about half of them could not be imputed due to the above-mentioned reason. Nonetheless, we tried adding this predictor to the final model and found that it did not increase the predictive accuracy of the model.

We included personality and ACE related variables from wave IV. Even though the ACEs were measured in wave IV, they are retrospective measures of experiences that occurred before age 18 (much before wave IV). Moreover, for someone with age of CUD onset less than 18, we counted only those ACEs that occurred before the CUD onset age. With personality traits, such chronological ordering could not be ensured. However, there is evidence from the literature that the personality traits included in this study are relatively stable over time (Caspi et al., 2005; Damian et al., 2019).

These limitations are, however, outweighed by the various strengths of the study, with the final product being a novel and methodologically rigorous Bayesian learning model for CUD risk prediction in adulthood based on risk factors from adolescence and young adulthood. Moreover, the model validated well on external data. Its discrimination and cali­
childhood welfare status, neighborhood status, etc., which are measures bration performance on external data compared favorably with those of
of social inequality, but these did not pass our initial variable filtering. the widely used risk prediction models for various diseases/disorders
Nonetheless, there is evidence that ACEs are highly associated with low (Chowdhury et al., 2018; Costantino et al., 1999; Min et al., 2014;


Spiegelman et al., 1994). Although future validation studies are
important for building more confidence, the model is ready to be
considered for potential adoption in practice. To aid in this task, an R
package is under construction (relevant steps are provided in
Supplementary Materials). The model can help in identifying adolescent or
young adult cannabis users who are at high risk of developing CUD in
adulthood. Such users may then be provided with appropriate
intervention and prevention measures to help them divert from the path
towards CUD. The model may also be helpful in medical settings where
patients are considering using medical cannabis in consultation with
clinicians.

Role of funding sources

This work was funded by the University of Texas at Dallas SPIRe seed
grant. The sponsor had no role in the study design, collection, analysis or
interpretation of data, writing the manuscript and the decision to submit
this manuscript for publication.

CRediT authorship contribution statement

SB, PKC, and RMDR conceived the study. RMDR carried out all data
pre-processing and analyses. SB and PKC supervised RMDR throughout
the entire project. FF provided subject matter expertise in designing the
study and interpreting the results. All authors participated in
interpreting the results and writing the manuscript. All authors have read
and approved the final version of the manuscript.

Acknowledgement

The data used in this work are from Add Health, a program project
directed by Kathleen Mullan Harris and designed by J. Richard Udry,
Peter S. Bearman, and Kathleen Mullan Harris at the University of North
Carolina at Chapel Hill and funded by grant P01-HD31921 from the
Eunice Kennedy Shriver National Institute of Child Health and Human
Development, with cooperative funding from 23 other federal agencies
and foundations. Information on how to obtain the Add Health data files
is available on the Add Health website (http://www.cpc.unc.edu/addhealth).
No direct support was received from grant P01-HD31921 for
this analysis. The authors thank Thanthirige Lakshika Ruberu for
helping with initial exploration of the data.

Declaration of competing interest

None.

Appendix A. Supporting information

Supplementary data associated with this article can be found in the
online version at doi:10.1016/j.drugalcdep.2022.109476.

References

Afuseh, E., Pike, C.A., Oruche, U.M., 2020. Individualized approach to primary prevention of substance use disorder: age-related risks. Subst. Abus. Treat. Prev. Policy 15 (1), 58.
Bates, D., Maechler, M., Bolker, B., Walker, S., 2015. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67 (1), 1–48.
Beaton, D., Abdi, H., Filbey, F.M., 2014. Unique aspects of impulsive traits in substance use and overeating: specific contributions of common assessments of impulsivity. Am. J. Drug Alcohol Abus. 40 (6), 463–475.
Bernardini, F., Attademo, L., Cleary, S.D., Luther, C., Shim, R.S., Quartesan, R., Compton, M.T., 2017. Risk prediction models in psychiatry: toward a new frontier for the prevention of mental illnesses. J. Clin. Psychiatry 78 (5), 572–583.
Bridgeman, M.B., Abazia, D.T., 2017. Medicinal cannabis: History, pharmacology, and implications for the acute care setting. Pharm. Ther. 42 (3), 180–188.
Carvalho, C.M., Polson, N.G., Scott, J.G., 2010. The horseshoe estimator for sparse signals. Biometrika 97 (2), 465–480.
Caspi, A., Roberts, B.W., Shiner, R.L., 2005. Personality development: stability and change. Annu Rev. Psychol. 56, 453–484.
Cattelani, L., Murri, M.B., Chesani, F., Chiari, L., Bandinelli, S., Palumbo, P., 2019. Risk prediction model for late life depression: development and validation on three large European datasets. IEEE J. Biomed. Health Inf. 23 (5), 2196–2204.
Caye, A., Agnew-Blais, J., Arseneault, L., Gonçalves, H., Kieling, C., Langley, K., Menezes, A.M.B., Moffitt, T.E., Passos, I.C., Rocha, T.B., Sibley, M.H., Swanson, J.M., Thapar, A., Wehrmeister, F., Rohde, L.A., 2019. A risk calculator to predict adult attention-deficit/hyperactivity disorder: Generation and external validation in three birth cohorts and one clinical sample. Epidemiol. Psychiatr. Sci. 29, e37.
Tomko, R., Williamson, N.A., McRae-Clark, A., Gray, K.M., 2019. Cannabis use disorder as a developmental disorder. In: Montoya, I., Weiss, D., R. B., S. (Eds.), Cannabis Use Disorders. Springer, New York, pp. 189–199.
WHO, 2020. Management of Substance Abuse: Cannabis. World Health Organization (Accessed 23 Sep, 2020). 〈https://www.who.int/substance_abuse/facts/cannabis/en/〉.
CDC, 2020. High-Risk Substance Use Among Youth. 〈https://www.cdc.gov/healthyyouth/substance-use/index.htm#4〉. (Accessed 05 Aug 2021).
Chen, Y.H., Ferguson, K.K., Meeker, J.D., McElrath, T.F., Mukherjee, B., 2015. Statistical methods for modeling repeated measures of maternal environmental exposure biomarkers during pregnancy in association with preterm birth. Environ. Health 14, 9.
Chowdhury, M., Euhus, D., Arun, B., Umbricht, C., Biswas, S., Choudhary, P., 2018. Validation of a personalized risk prediction model for contralateral breast cancer. Breast Cancer Res Treat. 170 (2), 415–423.
Costantino, J.P., Gail, M.H., Pee, D., Anderson, S., Redmond, C.K., Benichou, J., Wieand, H.S., 1999. Validation studies for models projecting the risk of invasive and total breast cancer incidence. J. Natl. Cancer Inst. 91 (18), 1541–1548.
D’Agostino, R.B., Vasan, R.S., Pencina, M.J., Wolf, P.A., Cobain, M., Massaro, J.M., Kannel, W.B., 2008. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation 117 (6), 743–753.
Damian, R.I., Spengler, M., Sutu, A., Roberts, B.W., 2019. Sixteen going on sixty-six: a longitudinal study of personality stability and change across 50 years. J. Pers. Soc. Psychol. 117 (3), 674–695.
Dandis, R., Teerenstra, S., Massuger, L., Sweep, F., Eysbouts, Y., IntHout, J., 2020. A tutorial on dynamic risk prediction of a binary outcome based on a longitudinal biomarker. Biom. J. 62 (2), 398–413.
Douglas, K.R., Chan, G., Gelernter, J., Arias, A.J., Anton, R.F., Weiss, R.D., Brady, K., Poling, J., Farrer, L., Kranzler, H.R., 2010. Adverse childhood events as risk factors for substance dependence: partial mediation by mood and anxiety disorders. Addict. Behav. 35 (1), 7–13.
Feingold, D., Livne, O., Rehm, J., Lev-Ran, S., 2020. Probability and correlates of transition from cannabis use to DSM-5 cannabis use disorder: results from a large-scale nationally representative study. Drug Alcohol Rev. 39 (2), 142–151.
Gail, M.H., Brinton, L.A., Byar, D.P., Corle, D.K., Green, S.B., Schairer, C., Mulvihill, J.J., 1989. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J. Natl. Cancer Inst. 81 (24), 1879–1886.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2014. Bayesian Data Analysis, third ed. CRC Press, Boca Raton.
Gray, K.M., Squeglia, L.M., 2018. Research review: What have we learned about adolescent substance use? J. Child Psychol. Psychiatry 59 (6), 618–627.
Harris, K.M., Halpern, C.T., Whitsel, E., Hussey, J., Tabor, J., Entzel, P., Udry, J.R., 2009. The National Longitudinal Study of Adolescent to Adult Health: Research Design. 〈https://addhealth.cpc.unc.edu/documentation/study-design/〉. (Accessed 05 Aug, 2021).
Hasin, D.S., Sarvet, A.L., Cerdá, M., Keyes, K.M., Stohl, M., Galea, S., Wall, M.M., 2017. US adult illicit cannabis use, cannabis use disorder, and medical marijuana laws: 1991-1992 to 2012-2013. JAMA Psychiatry 74 (6), 579–588.
Hayatbakhsh, M.R., Najman, J.M., Bor, W., O’Callaghan, M.J., Williams, G.M., 2009. Multiple risk factor model predicting cannabis use and use disorders: a longitudinal study. Am. J. Drug Alcohol Abus. 35 (6), 399–407.
Heilig, M., MacKillop, J., Martinez, D., Rehm, J., Leggio, L., Vanderschuren, L.J.M.J., 2021. Addiction as a brain disease revised: why it still matters, and the need for consilience. Neuropsychopharmacology 46 (10), 1715–1723.
Hu, Z., Jing, Y., Xue, Y., Fan, P., Wang, L., Vanyukov, M., Kirisci, L., Wang, J., Tarter, R.E., Xie, X.Q., 2020. Analysis of substance use and its outcomes by machine learning: II. Derivation and prediction of the trajectory of substance use severity. Drug Alcohol Depend. 206, 107604.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning: with Applications in R. Springer, New York.
Jing, Y., Hu, Z., Fan, P., Xue, Y., Wang, L., Tarter, R.E., Kirisci, L., Wang, J., Vanyukov, M., Xie, X.Q., 2020. Analysis of substance use and its outcomes by machine learning I. Childhood evaluation of liability to substance use disorder. Drug Alcohol Depend. 206, 107605.
Ketcherside, A., Jeon-Slaughter, H., Baine, J.L., Filbey, F.M., 2016. Discriminability of personality profiles in isolated and co-morbid marijuana and nicotine users. Psychiatry Res 238, 356–362.
Knapp, A.A., Lee, D.C., Borodovsky, J.T., Auty, S.G., Gabrielli, J., Budney, A.J., 2019. Emerging trends in cannabis administration among adolescent cannabis users. J. Adolesc. Health 64 (4), 487–493.
Koh, P.K., Peh, C.X., Cheok, C., Guo, S., 2017. Violence, delinquent behaviors, and drug use disorders among adolescents from an addiction-treatment sample. J. Child Adolesc. Subst. Abus. 26 (6), 463–471.
Li, Q., Lin, N., 2010. The Bayesian elastic net. Bayesian Anal. 5 (1), 151–170.
Lowe, C.C., Miller, B.L., Stogner, J., 2020. Comfortably numb? Revisiting and re-specifying the relationship between health strain and substance use. Crime. Delinq. 66 (13–14), 1937–1959.
Lumley, T., 2004. Analysis of complex survey samples. J. Stat. Softw. 9 (8), 1–19.


Marel, C., Sunderland, M., Mills, K.L., Slade, T., Teesson, M., Chapman, C., 2019. Conditional probabilities of substance use disorders and associated risk factors: progression from first use to use disorder on alcohol, cannabis, stimulants, sedatives and opioids. Drug Alcohol Depend. 194, 136–142.
Meier, M.H., Hall, W., Caspi, A., Belsky, D.W., Cerdá, M., Harrington, H.L., Houts, R., Poulton, R., Moffitt, T.E., 2016. Which adolescents develop persistent substance dependence in adulthood? Using population-representative longitudinal data to inform universal risk assessment. Psychol. Med 46 (4), 877–889.
Min, J.W., Chang, M.C., Lee, H.K., Hur, M.H., Noh, D.Y., Yoon, J.H., Jung, Y., Yang, J.H., Society, K.B.C., 2014. Validation of risk assessment models for predicting the incidence of breast cancer in korean women. J. Breast Cancer 17 (3), 226–235.
Moss, H.B., Ge, S., Trager, E., Saavedra, M., Yau, M., Ijeaku, I., Deas, D., 2020. Risk for substance use disorders in young adulthood: Associations with developmental experiences of homelessness, foster care, and adverse childhood experiences. Compr. Psychiatry 100, 152175.
Nasir, M., Summerfield, N.S., Oztekin, A., Knight, M., Ackerson, L.K., Carreiro, S., 2021. Machine learning-based outcome prediction and novel hypotheses generation for substance use disorder treatment. J. Am. Med Inf. Assoc. 28 (6), 1216–1224.
NCDAS, 2018. National Center for Drug Abuse Statistics. 〈https://drugabusestatistics.org/〉. (Accessed 05 Aug 2021).
NIDA, 2017. Trends and Statistics. 〈https://archives.drugabuse.gov/trends-statistics/costs-substance-abuse〉. (Accessed 05 Aug 2021).
NIDA, 2019. Media Guide: Most Commonly Used Addictive Drugs. National Institute on Drug Abuse. 〈https://www.drugabuse.gov/publications/media-guide/most-commonly-used-addictive-drugs〉 (Accessed 23 Sep, 2020).
O’Hara, R.B., Sillanpaa, M.J., 2009. A review of Bayesian variable selection methods: What, how and which. Bayesian Anal. 4 (1), 85–117.
Park, T., Casella, G., 2008. The Bayesian Lasso. J. Am. Stat. Assoc. 103 (482), 681–686.
R Core Team, 2019. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rajapaksha, R.M.D.S., Hammonds, R., Filbey, F., Choudhary, P.K., Biswas, S., 2020. A preliminary risk prediction model for cannabis use disorder. Prev. Med Rep. 20, 101228.
Richmond-Rakerd, L.S., Fleming, K.A., Slutske, W.S., 2016. Investigating progression in substance use initiation using a discrete-time multiple event process survival mixture (MEPSUM) approach. Clin. Psychol. Sci. 4 (2), 167–182.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.C., Müller, M., 2011. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77.
SAMHSA, 2016. Facing Addiction in America: The Surgeon General’s Report on Alcohol, Drugs, and Health. 〈https://www.hhs.gov/surgeongeneral/reports-and-publications/index.html〉. (Accessed Aug 05, 2021).
SAMHSA, 2020. Key substance use and mental health indicators in the united states: Results from the 2019 national survey on drug use and health. 〈https://www.samhsa.gov/data/〉. (Accessed 05 Aug 2021).
Schulenberg, J.E., Johnston, L.D., O’Malley, P.M., Bachman, J.G., Miech, R.A., Patrick, M.E., 2021. Monitoring the future national survey results on drug use, 1975–2019. Volume II, college students & adults ages 19–60. 〈http://www.monitoringthefuture.org/pubs.html#monographs〉. (Accessed Nov, 21. 2021).
Spiegelman, D., Colditz, G.A., Hunter, D., Hertzmark, E., 1994. Validation of the Gail et al. model for predicting individual breast cancer risk. J. Natl. Cancer Inst. 86 (8), 600–607.
Spindle, T.R., Bonn-Miller, M.O., Vandrey, R., 2019. Changing landscape of cannabis: novel products, formulations, and methods of administration. Curr. Opin. Psychol. 30, 98–102.
Stan Development Team, 2020. rstan: R Interface to Stan, 2.21.2 ed.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58 (1), 267–288.
Tibshirani, R., 2011. Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc., Ser. B, Stat. Methodol. 73, 273–282.
Van Erp, S., Oberski, D.L., Mulder, J., 2019. Shrinkage priors for Bayesian penalized regression. J. Math. Psychol. 89, 31–50.
Vehtari, A., Gelman, A., Gabry, J., 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27 (5), 1413–1432.
Vehtari, A., Gabry, J., Magnusson, M., Yao, Y., Bürkner, P., Gelman, A., 2020. loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. Statistics and Computing.
Verdejo-García, A., Lawrence, A.J., Clark, L., 2008. Impulsivity as a vulnerability marker for substance-use disorders: review of findings from high-risk research, problem gamblers and genetic association studies. Neurosci. Biobehav Rev. 32 (4), 777–810.
Walsh, D., McCartney, G., Smith, M., Armour, G., 2019. Relationship between childhood socioeconomic position and adverse childhood experiences (ACEs): a systematic review. J. Epidemiol. Community Health 73 (12), 1087–1093.
Zhang-James, Y., Chen, Q., Kuja-Halkola, R., Lichtenstein, P., Larsson, H., Faraone, S.V., 2020. Machine-learning prediction of comorbid substance use disorders in ADHD youth using Swedish registry data. J. Child Psychol. Psychiatry 61 (12), 1370–1379.
Zoboroski, L., Wagner, T., Langhals, B., 2021. Classical and neural network machine learning to determine the risk of marijuana use. Int J. Environ. Res Public Health 18 (14), 7466.
