Hepp, Niedtfeld, Schulze (Accepted Draft) - Experimental Paradigms in Personality Disorder Research - PDTRT
a Department of Psychosomatic Medicine and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
Corresponding author: Prof. (apl.) Dr. Inga Niedtfeld, Central Institute of Mental Health,
Department of Psychosomatic Medicine and Psychotherapy, PO Box 12 21 20, 68072
Mannheim, Germany. Tel: +49-621-1703-4403, Fax: +49-621-1703-4405, E-mail:
Inga.Niedtfeld@zi-mannheim.de
Funding: This research was supported by a grant of the German Research Foundation to Inga Niedtfeld
(NI 1591/1–2).
Abstract
We review studies in personality disorder (PD) research that used experimental paradigms and that were published between 2017 and 2021 in thirteen peer-reviewed journals. We structure the study content according to the NIMH Research Domain Criteria (RDoC), and report details on demographic variables, experimental design, sample size, and statistical analyses. We identify a strong focus on borderline personality disorder in the recruited clinical groups, and a lack of sample diversity. Finally, we review issues regarding statistical power and the data analytic designs that were used. Based on the literature review, we draw implications for future experimental PD research, encouraging researchers to increase the breadth of represented RDoC constructs, the representativeness and diversity of the recruited samples, the statistical power to detect between-person effects, the reliability of estimators, the adequacy of the statistical models, and the transparency of research practices.
Introduction
Laboratory studies that include behavioral paradigms with experimental manipulations are
central to personality disorder (PD) research in that they have helped shape our understanding
of the psychopathological processes that play out in PDs (Domes et al., 2009; Jeung et al., 2016).
These studies (we will refer to them as studies using “experimental paradigms” henceforth) typically involve the manipulation of an experimental between-factor (e.g., participants are randomly assigned to complete a task with either positive or negative stimuli), or the manipulation
of an experimental within-factor (e.g., participants complete several trials with positive and
negative stimuli in random order). In contrast to more naturalistic settings, the laboratory
environment affords control over potentially confounding variables (Myers & Hansen, 2011).
The use of (computerized) experimental tasks allows for the repeated measurement of the
processes of interest and can thus afford a level of reliability that is difficult to obtain outside of the laboratory. Moreover, the systematic manipulation of experimental factors allows for causal attributions of differences in outcome measures to the
experimental condition (Myers & Hansen, 2011). While findings from experimental paradigms
have been paramount for understanding psychopathological processes in PDs at a basic level,
their utility to the field is limited when considered without further context. The importance and
reliability of the effect of a specific experimental factor can only be recognized if it is embedded
in a methodological and content-related context. In other words, studies that use experimental
paradigms typically manipulate only one specific factor in a psychopathological process and
therefore generate evidence that is highly specific to that factor and process and can be difficult
to integrate into the larger research landscape. Therefore, we herein provide a review of recent studies that used experimental paradigms in PD research and were published in thirteen peer-reviewed journals within the last five years. We selected target journals from
clinical psychology, psychiatry, and psychosomatics, as well as two open access journals that
repeatedly published PD work in the past.1 To be included, studies had to 1) sample individuals with a PD diagnosis or elevated PD features and 2) include an experimental manipulation (i.e., random assignment to conditions between participants or the inclusion of experimental conditions that vary within participants). For details on the literature screening and extraction process, see supplement 1. We include 99 manuscripts in this
review, which report 102 unique studies and 123 experimental paradigms. For a reference list of all articles included in the review, see supplement 2. We review all studies with regard to
the covered content, study design, and data analytic aspects. All data that we extracted from the
articles are accessible at https://osf.io/xmwe9/ (Hepp et al., 2022), as is the full citation list for all
reviewed articles.
We structure the content covered by the reviewed studies according to the NIMH
Research Domain Criteria (RDoC, Cuthbert, 2014; Koudys et al., 2019). RDoC were proposed
to stimulate research on mental disorders that overcomes central limitations of research relying on disorder categories, especially the focus on individuals with low levels of overall functioning who meet strict diagnostic thresholds, while ignoring the broader continuum of functioning.
RDoC is not intended as a new diagnostic system, but as a research framework that helps
structure and cluster evidence on six major domains of human functioning. The six RDoC
domains are negative valence systems, positive valence systems, cognitive systems, social
processes, arousal and regulatory systems, and sensorimotor systems. Domains form the highest
level of the RDoC matrix, followed by constructs situated within each domain. For definitions of the individual constructs, see the RDoC matrix (National Institute of Mental Health, 2022). Constructs are conceptualized as falling on a dimension of functioning from normal to abnormal, and RDoC encourages studying this full dimension rather than a purely categorical diagnostic context. The lowest level in the RDoC matrix describes units of analysis used to measure the constructs. Experimental paradigms are one possible unit of analysis within the RDoC matrix.
1 The included journals are: Personality Disorders: Theory, Research and Treatment; Journal of Personality Disorders; Borderline Personality Disorder and Emotion Dysregulation; Journal of Abnormal Psychology; Clinical Psychological Science; Behaviour Research and Therapy; Psychological Medicine; Psychiatry Research; Journal of Affective Disorders; Biological Psychiatry; Neuroimage Clinical; PLoS One; Scientific Reports.
As described above, we review 99 articles that report 123 experimental paradigms. Of these, 35
paradigms (29.41%) tapped into more than one RDoC construct. For a visualization of the constructs covered, see Figure 1.
Figure 1: The circular bar chart shows the different constructs studied with experimental paradigms in personality psychopathology. Colors of the bars reflect overarching RDoC domains. Red bars represent the ‘negative valence systems’ (NVS), blue bars ‘social processes’ (SP), green bars ‘cognitive systems’ (CS), and purple bars ‘positive valence systems’ (PVS). The available studies did not investigate the RDoC domains ‘arousal and regulatory systems’ or ‘sensorimotor systems’. Note that we included the two additional constructs emotion regulation and prosocial behavior. These are currently not part of the RDoC system but were covered by several paradigms.
Negative Valence Systems
This domain subsumes responses to aversive situations or contexts, including fear, anxiety, and
loss (Cuthbert, 2014; National Institute of Mental Health, 2022). This domain was covered 38
times (24.36%). The construct acute threat (fear) was most prominent, as it was investigated 21
times (13.46%). Studies largely focused on stress reactivity in PDs and employed stress-
inducing stimuli (various stress paradigms, aversive pictures or film clips). In nine additional
paradigms, participants were asked to regulate their emotions, mostly by cognitive reappraisal.
Although emotion regulation is not a construct of RDoC (Fernandez et al., 2016), it is frequently
studied in the context of PDs. Further underlining the importance of threat processing, 14
paradigms that tap into more than one RDoC domain combined a stress induction (via aversive
pictures, negative facial expressions, negative words, or a stress test), located on the RDoC
construct acute threat (fear), with other RDoC constructs such as reward learning, cognitive
control, or emotion recognition. In addition to these studies on acute threat, two studies focused
on the construct frustrative nonreward, using different aggression paradigms. The construct loss
was investigated six times, for instance by inducing sadness via a film clip or via social
exclusion. No studies assessed the constructs potential threat (anxiety) or sustained threat.
Positive Valence Systems
This RDoC domain describes responses to positive motivational situations or contexts, such as
reward seeking, consummatory behavior, and reward/habit learning (Cuthbert, 2014), and was
investigated 17 times (10.90%). Reward responsiveness (i.e., the anticipation of reward and the response to reward cues or the receipt of reward) and reward valuation were investigated six times each. Reward valuation taps into computational processes about the probability and
benefits of a prospective outcome. Reward learning was studied five times using social
valuation or reinforcement learning paradigms. Notably, several of the reviewed studies used
paradigms that are explicitly not recommended in the RDoC matrix (e.g., the Iowa gambling
task), because they cannot disentangle the three constructs of the positive valence systems
domain. In addition to the unmet need to use experimental paradigms that assess RDoC
constructs in this domain, positive valence systems were clearly under-researched as compared to the other domains.
Cognitive Systems
The domain cognitive systems encompasses various cognitive processes (Cuthbert, 2014;
Morris & Cuthbert, 2012) that were investigated 31 times (19.87%). Fifteen paradigms targeted
the construct cognitive control by using the go/no-go task, stop signal task, Stroop task, or task-
switching task. In addition, five paradigms focused on attention, mostly in combination with
other RDoC domains, such as negative valence systems or social processes. The construct
declarative memory was investigated four times, with two paradigms combining declarative
memory and social cognition. Five paradigms assessed working memory (primarily with the n-
back task), and two paradigms captured visual perception (lateral masking task, binocular
rivalry paradigm).
Social Processes
The domain social processes focuses on responses to interpersonal stimuli, including the
perception and interpretation of others’ actions and the interpretation of the self (Cuthbert, 2014;
Hanegraaf et al., 2021). It was by far the most researched RDoC domain, investigated 70 times
(44.87%). The construct social communication was most often investigated (29 times). For the most part, these studies used emotion recognition paradigms, particularly the reading the mind in the eyes test2 (RMET, seven times). While most paradigms
included stimuli of negative, neutral and positive valence, two paradigms used a restricted
2 Please note that the reading the mind in the eyes test was listed as a paradigm to investigate the perception and understanding of others by the RDoC taskforce (https://www.nimh.nih.gov/about/advisory-boards-and-groups/namhc/reports/behavioral-assessment-methods-for-rdoc-constructs). However, since other emotion recognition tasks are explicitly listed under the construct social communication, we decided to count the reading the mind in the eyes test under that construct as well.
stimulus set with threatening and neutral faces only, possibly also tapping into the threat
construct of the negative valence systems domain. Facial emotion recognition was further
investigated in combination with the cognitive systems domain, with two studies presenting
emotional faces and concurrently studying attention or cognitive control (i.e., emotional faces as distractor stimuli).
A group of nine studies assessed the construct affiliation and attachment, primarily by
measuring responses to social rejection in the Cyberball paradigm. Interestingly, four studies used Cyberball or a group rejection paradigm not to study affiliation and attachment primarily,
but to induce negative affect (i.e. domain negative valence systems) and investigate subsequent
behavior.
The construct perception and understanding of self was assessed with eight paradigms,
which varied considerably in their theoretical background (e.g., self-referential processing tasks,
implicit association test). The construct perception and understanding of others was studied
more frequently (19 paradigms), and subsumes research on theory of mind as well as empathy.
Again, paradigms varied substantially and included (among others) metaphor comprehension and empathy tasks. Moreover, a group of studies investigated prosocial behavior in PDs, which is currently not covered by the RDoC framework. These studies employed economic games to measure prosocial behavior (see Hepp & Niedtfeld, 2022 for a conceptual paper; Jeung et al., 2016 for a literature review on prosociality in PDs). In light of its relevance to interpersonal functioning, we included prosocial behavior as an additional construct in Figure 1.
Remaining domains
Finally, within the last five years, there were no experimental studies on the RDoC domains arousal and regulatory systems or sensorimotor systems.
Study design
The reviewed 99 articles reported a total of 102 studies3. The studies show a clear focus on
samples comprising individuals with borderline personality disorder (BPD), which were
included in 89.22% of studies. Findings from these studies have clearly helped shape and update
our concept of BPD and personality pathology in a broader sense. At the same time, the narrow
focus on BPD constitutes one of the most striking limitations to current experimental PD
research. Other PDs were strongly under-represented in the reviewed literature. Most studies compared individuals with a PD (as discussed, largely BPD) to healthy control participants, clinical controls, or a combination of both. A healthy control group was included in 82.35% of studies, with moderate group sizes (Md = 30.00, SD = 24.36, see Figure 2). Notably, 17 of the reviewed studies (16.67%) assessed PD
symptom levels or PD features. However, four of these still opted to split the sample into discrete
groups of individuals with low versus high levels of PD pathology, rather than using the
dimensional indicator to predict behavior in the paradigms. The average sample size in studies that sampled dimensionally (and did not divide the sample into groups) was substantially higher than in the case-control studies.
3 We were unable to definitively determine how many of these samples were unique and therefore report the demographic and design data averaged across all reported samples.
Figure 2: Density plots visualizing probability distributions of socio-demographic and experimental variables. Note: average number of participants refers to experimental studies with between-group designs.
The strong focus on categorical PDs also reflects the fact that, until recently, PDs were
defined categorically in DSM-5 (American Psychiatric Association, 2013) and ICD-10 (World
Health Organization, 2016). Yet, diagnostic systems are currently shifting to a dimensional
concept of PDs. This is reflected in the DSM-5 alternative model for personality disorders
(AMPD; Oldham, 2015), which outlines a dimensional approach to diagnosing PDs and was
accompanied by a call for further investigation (for a review of studies using the DSM-5 alternative model, see Zimmermann et al., 2019). In addition, ICD-11 (World Health
Organization, 2019), which came into effect in 2022, includes a fully dimensional PD diagnosis. Nonetheless, dimensional approaches remained rare among the reviewed studies. However, this may in part be due to our selection of clinical target journals, which possibly were
preferred outlets for work on categorical and clinically diagnosed PD samples. Further studies
focusing on samples of individuals with varying levels of maladaptive traits in community or
clinical samples may have been published predominantly in other outlets, for instance in the
field of personality psychology (e.g., da Costa et al., 2018; Fossati et al., 2018; Papousek et al.,
2018). Nonetheless, even considering this, there is a relatively clear picture that the majority of
reviewed studies focused on categorical PDs (and thus - by definition - pathology above a certain
threshold). As discussed, these studies may have limited utility for understanding the broader
continuum of PD pathology.
The predominant case-control design, comparing individuals diagnosed with a PD to healthy individuals, further aggravates this problem. It introduces an artificial divide between extreme
groups at the high end of the severity continuum (i.e. those meeting diagnostic thresholds for
categorical PD diagnoses) and individuals specifically selected to be entirely free of any present
or past PD symptoms (i.e. participants in the healthy control group). This way, the field has
neglected to generate evidence about milder levels of PD pathology. Additionally, one must
expect that individuals in PD and healthy control groups differ on many characteristics beyond the disorder itself, rendering an unambiguous attribution of group differences to PD pathology impossible. The reliance on healthy control groups also severely limits any
conclusions regarding diagnostic specificity. In fact, for the majority of findings, it remains
entirely unclear whether they are at all specific to a certain PD. By relying on case-control
designs, the field further missed opportunities to investigate whether the associations between
PD pathology and the processes studied in experimental paradigms follow a continuous, dose-
response-like relationship with problems increasing at the higher end of the PD severity
spectrum (or whether they are unique to those with PD and completely absent in “healthy”
individuals). This severely limits our understanding of the processes themselves. Studies that
include clinical control groups in addition to healthy individuals attenuate this problem
somewhat, but such studies were rare among the reviewed studies and considered a wide range of different clinical control conditions.
In addition to the predominance of case-control studies and BPD samples, the reviewed
studies are also highly biased with regard to age, gender, and race4. Regarding age, except for
two studies, all reviewed studies included adult participants over the age of 18. The mean age
across studies was 27.66 years (Md = 28.26, SD = 5.18), indicating a focus on young adults.
The average percentage of female participants across all reviewed studies was M = 81.01 (Md
= 95, SD = 26.68) with 46.08% of reviewed studies including only women. Men, on average,
made up 17.87% of participants across samples (Md = 0, SD = 28.69)5. Other gender identities
4 These three demographic variables were selected for the review and the authors acknowledge that they are not exhaustive and that further bias is likely with regard to other variables such as sexual orientation, socioeconomic status, country of residence, religion, and many more.
5 Several studies only reported percentages for one gender, likely implying that the remaining participants were of the other binary gender (e.g., reporting that 70% of participants were female and implying the remaining 30% were male). This reporting is problematic insofar as it promotes the idea that the gender binary is the norm and excludes individuals of other gender identities (e.g., bigender, genderfluid). So as not to perpetuate this, we only coded the percentages for genders that were explicitly described and did not make further inferences based on an assumption of a gender binary. As a result, the percentages for women and men do not sum to 100% across studies.
were assessed in only one of the reviewed studies. The third demographic variable we aimed to
extract from all studies was race. However, only 29.41% of the reviewed studies even reported
any data on race, which precluded an adequate summary of this variable. This, in and of itself,
is a grave oversight that appears to be highly prevalent in our field and constitutes a form of
implicit racism that we must urgently address (see Haeny et al., 2021 for a nomenclature for
antiracist clinical research, and the article on diversity in this special issue for a discussion
focused on the field of PD research). The overall picture suggests that the reviewed evidence is
highly restricted in its generalizability and does a poor job of representing the broad range of individuals affected by personality pathology.
Experimental manipulations
In only a minority of studies were participants randomly assigned to one of two experimental conditions. Almost all experimental paradigms instead relied on within-person manipulations, in which participants completed multiple conditions within the same task. Most commonly, studies comprised one within-factor
(70.97% of studies) with two or three discrete factor levels (see Figure 2). The remaining studies
comprised two (21.77%) or three (3.23%) within-factors. The average number of within-factor levels
(across all within-factors) was M = 4.67 (Md = 3.00) and varied substantially between studies
(SD = 6.10), with some including more than thirty levels. If this large number of levels is not
accompanied by a proportional increase in trials per level, too few trials fill each cell of the
experimental design and estimators (e.g., means) become unreliable.6 Beyond any
considerations of statistical power, this substantially limits any conclusions that can be drawn
from the data. Lastly, it is important to note that almost no study included a continuously manipulated within-variable. Even variables that could be manipulated continuously, such as the intensity of emotional stimuli, were typically divided into a small number of discrete levels.
6 Note that we tried to extract the number of trials for each within-factor combination from all reviewed articles. However, only a few articles clearly reported this information and trial numbers often varied between levels. Therefore, we decided not to report these data as, for a substantial proportion of articles, we were left unsure about the exact design that was used.
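To make the trials-per-cell concern concrete, here is a toy simulation (all values assumed purely for illustration) showing that the standard error of a cell mean shrinks only with the square root of the number of trials per cell:

```python
# Toy illustration: reliability of a single cell mean as a function of
# trials per cell. Trial noise (SD = 1.0) and trial counts are assumed.
import numpy as np

rng = np.random.default_rng(42)
trial_sd = 1.0
for n_trials in (5, 20, 80):
    # Simulate 10,000 replications of one design cell and inspect the
    # sampling spread of its mean.
    cell_means = rng.normal(0.0, trial_sd, size=(10_000, n_trials)).mean(axis=1)
    print(f"{n_trials:>2} trials/cell: empirical SE = {cell_means.std():.3f} "
          f"(theory: {trial_sd / np.sqrt(n_trials):.3f})")
```

With five trials per cell, the cell mean fluctuates roughly four times as much as with eighty, which is the unreliability of estimators referred to above.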
Data analytic aspects
Finally, we coded different variables pertaining to the reporting as well as to the analysis of the
dependent variables (see Figure 3). Particularly with regard to the statistical analysis, this review
focuses on variables that we were able to classify for all manuscripts. A more detailed analysis
of statistical models was prevented by a lack of available information (see also ‘Open data and
materials’).
Power analyses
Power analyses are of utmost importance for performing informative experimental studies.
When power is high, studies can provide clear and more replicable answers to research
questions, whereas low power directly contributes to low replicability and heterogeneous
findings (Stanley et al., 2018). Given the importance of statistical power for research in general,
this aspect is discussed in greater detail by Vize and Lynam in this special issue.
Most commonly, the reviewed articles did not report power analyses (78.79%, n = 78).
Only a minority of articles included an a-priori power analysis (13.13%, n = 13) and few
(8.08%, n = 8) used alternative approaches, such as sensitivity analyses. Sensitivity analyses are
of particular interest for clinical research, which is commonly confronted with feasibility
concerns that can result in relatively fixed maximum sample sizes. This approach allows
determining the minimum effect size that can be reliably detected given the recruited sample
sizes (Bloom, 1995; Lakens, 2014). Reporting of sensitivity analyses thus allows the reader to
put the reported results into context with the accumulated knowledge of the field (Perugini et
al., 2018). Furthermore, almost half the articles reporting power analyses powered their studies for within-between interactions (42.86%, n = 9). Accordingly, these studies were adequately powered to detect condition-by-group interactions, but may have been underpowered to detect between-group main effects. To illustrate this issue, we calculated the achieved statistical power for all reviewed studies that included case-control comparisons7. The results of these
calculations with varying numbers of participants and effect sizes are presented in Figure 4.
Most case-control studies (median n per group = 30) were only adequately powered (at 1 - β =
86.14%) to detect large effects, whereas power was low for the detection of medium effects (1
- β = 47.79%) or small effects (1 - β = 20.79%). We repeated this procedure for all reviewed studies with correlational designs (median N = 103). Power estimates showed these studies were sufficiently powered to detect large and medium-sized, but not small effects (1 - β = 17.18% for a small effect of r = .1; 1 - β = 87.46% for a medium effect of r = .3).
Figure 4: Contour plot shows power estimates for a between-group comparison with different sample sizes and effect sizes of interest. For comparison, we included power estimates for detecting small, medium, or large effects considering the median sample size of current experimental studies (dotted line). Shaded area reflects the .25 and .75 quantiles of the observed sample sizes.
7 We assumed equal sample sizes, an alpha level of .05, and a two-sided t-test.
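To illustrate the two types of calculation described above, a short Python sketch using statsmodels (not the authors' original script; the effect sizes are the conventional Cohen benchmarks, which may differ from the exact values used in the review) computes achieved power at the median group size and the minimum detectable effect in a sensitivity analysis:

```python
# Power and sensitivity calculations for a two-sided, two-sample t-test with
# equal group sizes (n = 30 per group, the median of the reviewed studies)
# and alpha = .05, mirroring the assumptions in footnote 7.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Achieved power for conventional small/medium/large effects (Cohen's d).
for d in (0.2, 0.5, 0.8):
    power = analysis.power(effect_size=d, nobs1=30, alpha=0.05, ratio=1.0,
                           alternative='two-sided')
    print(f"d = {d}: 1 - beta = {power:.2%}")

# Sensitivity analysis: smallest effect detectable with 80% power at n = 30.
d_min = analysis.solve_power(effect_size=None, nobs1=30, alpha=0.05,
                             power=0.80, ratio=1.0, alternative='two-sided')
print(f"minimum detectable effect at 80% power: d = {d_min:.2f}")
```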
Dependent variables
On average, studies reported 2.77 (SD = 2.06) different dependent variables. Most studies used ratings, followed by other behavioral outcomes (19.23%, n = 50) and response latencies (13.46%, n = 35). As can be expected in experimental
paradigms, the majority of constructs were measured through the rating of a single item
(72.80%, n = 91) as compared to the rating of a scale (27.20%, n = 34). This reflects a common
decision in experimental designs to lower participant burden related to the repeated assessments.
Accordingly, this pattern changed when focusing on studies with few assessments, such as in
experiments assessing emotional states before and after confrontation with a stressor. However,
the majority of constructs were still assessed by single items (55.17%, n = 32) as compared to
scales (44.83%, n = 26). A notable exception to this practice is the Cyberball paradigm: studies using this task more commonly relied on multi-item scales than on single items.
Aggregation of trials
Most dependent variables with repeated assessments were aggregated before statistical analysis. This practice neglects important sources of systematic variation at the trial level, such as drifts over time due to fatigue or specific associations with experimental stimuli. It is important to keep in mind that we do not only sample participants, but also aspects of the experiment such as stimuli. Without modeling stimuli as a random factor, we are unable to generalize beyond the stimuli applied in the respective experiment. For extensive
discussions of this ‘stimuli-as-fixed-effect fallacy’ see Clark (1973) and Judd et al. (2012).
8 Note that we focused on behavioral dependent variables from the manuscripts. Fixations in eye-tracking studies, psychophysiological responding, or neural activation were not extracted.
Response latencies were mostly aggregated using the mean (36.36%, n = 12) or median
(21.21%, n = 7), with a substantial number of studies not describing the chosen method of
aggregation (36.36%, n = 12). In addition, most studies did not report transformation of response
latencies (87.88%, n = 29). These procedures neglect the underlying distribution of response latencies, which is typically right-skewed.
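As one possible remedy, the brief pandas sketch below (file and column names are hypothetical) applies a log transformation to the right-skewed latencies before aggregating, and documents the number of trials entering each participant-by-condition cell:

```python
# Sketch of a distribution-aware aggregation of response latencies.
# 'trial_level_data.csv' with columns 'subject', 'condition', 'rt_ms'
# is a hypothetical example file.
import numpy as np
import pandas as pd

trials = pd.read_csv("trial_level_data.csv")
trials["log_rt"] = np.log(trials["rt_ms"])    # tame the right skew

agg = (trials
       .groupby(["subject", "condition"])
       .agg(mean_log_rt=("log_rt", "mean"),   # mean on the log scale
            median_rt=("rt_ms", "median"),    # robust alternative
            n_trials=("rt_ms", "size")))      # report trials per cell
print(agg.head())
```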
Reliability
The reliability of experimental tasks was rarely reported (12.12%, n = 12). The relations between
reliability and statistical power, but also the implications of low measurement reliability and its
detrimental impact on the interpretability and comparison of results, have been discussed in detail elsewhere (e.g., Parsons et al., 2019).
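In the spirit of Parsons et al. (2019), a permutation-based split-half coefficient is one inexpensive way to estimate and report task reliability. The sketch below (column names are hypothetical) averages Spearman-Brown-corrected split-half correlations over random splits of each participant's trials:

```python
# Permutation-based split-half reliability for a trial-level outcome.
# Expects a data frame with hypothetical columns 'subject' and 'rt'.
import numpy as np
import pandas as pd

def split_half_reliability(trials: pd.DataFrame, n_splits: int = 1000,
                           seed: int = 1) -> float:
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_splits):
        halves = []
        for _, subj in trials.groupby("subject"):
            idx = rng.permutation(len(subj))          # random trial split
            half = len(subj) // 2
            halves.append((subj["rt"].iloc[idx[:half]].mean(),
                           subj["rt"].iloc[idx[half:]].mean()))
        a, b = np.array(halves).T
        r = np.corrcoef(a, b)[0, 1]
        estimates.append(2 * r / (1 + r))             # Spearman-Brown
    return float(np.mean(estimates))
```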
Inclusion of covariates
Some studies included covariates in their main analysis9 (23.30%, n = 24), mainly adjusting for socio-demographic variables. In addition, some of these studies adjusted for psychopathological variables (e.g., depression or anxiety; 37.50%, n = 9). The inclusion of covariates may be well justified but requires an a-priori definition of theoretically relevant covariates and needs to be considered in power estimations (Kraemer, 2015). Ideally, researchers who report analyses with covariates also report the results of those same models without the covariates.
9 Note that we did not count studies repeating their main analysis while controlling for different aspects.
Open data and materials
This point has been reiterated throughout the manuscript: a lack of available information and transparent reporting prevented the classification of additional variables of interest (e.g., number of repetitions per condition, statistical models). The benefits and values of open data and
materials for research transparency, reduction of data loss, and fostering of progress have been
discussed in great detail in the last years (Gewin, 2016; Munafò et al., 2017; Nosek et al., 2015).
Still, only a minority of experimental studies (9.09%, n = 9) provided open access to their de-
identified data and/or study materials using available repositories. Even in journals with a
mandated data policy we commonly found the statement that “data are available upon
reasonable request” (for a discussion, see Tedersoo et al., 2021; Wicherts et al., 2006).
Implications for future research
Based on the above literature review, we have identified six main implications for future experimental PD research. Several of these overlap with general recommendations for good scientific practice and increasing replicability in psychology, but we try to focus on their specificity for the study of PDs as much as possible.
1. Increase the breadth of represented RDoC constructs
As outlined above, several RDoC domains have been studied extensively using experimental
paradigms, while others require further investigation in future work. While laying out this
implication in more detail, we repeatedly refer to maladaptive personality traits as they are conceptualized in the DSM-5 AMPD and ICD-11.
As our review showed, there is a large body of research within the negative valence
systems domain. This is likely due to its close relation to maladaptive personality traits that are
observed frequently in PDs, such as affective instability (e.g., Trull et al., 2008). In line with
this, current dimensional models of PD suggest that PDs are characterized by pronounced
emotional reactivity to stress (Huprich, 2020; Oldham, 2015). The AMPD and ICD-11 PD
diagnoses subsume this under the maladaptive trait negative affectivity (Bach et al., 2018;
Hopwood et al., 2012). Two constructs within the negative valence systems domain that should
be targeted further in future work are potential threat (anxiety) and loss, as those with PD tend to experience both to a pronounced degree. With regard to PD, anxiousness is subsumed under the maladaptive personality trait negative affectivity in
DSM-5 AMPD and ICD-11 (Bach et al., 2018; Hopwood et al., 2012). Studying responses to
loss within an experimental paradigm could provide insight into the marked level of loneliness
reported by those with PD (Liebke et al., 2017) and the corresponding maladaptive personality
trait detachment in DSM-5 AMPD/ ICD-11 (Bach et al., 2018; Hopwood et al., 2012).
There is also a need for additional experimental studies on the positive valence systems
domain in PDs, and future work should strive to develop paradigms that can distinguish between
the constructs reward valuation, reward responsiveness, and reward learning. The RDoC
domain cognitive systems was studied extensively in (borderline) PD, using established
experimental paradigms that are also referenced in the RDoC matrix as suitable for investigating
cognitive control and memory. In the future, these findings should be replicated and re-evaluated in dimensionally recruited samples.
Within the RDoC domain social processes, the constructs social communication,
perception and understanding of self, and perception and understanding of others deserve
continued attention in future research, because they closely reflect the new diagnostic criteria
for PDs in ICD-11 and DSM-5 AMPD. The ICD-11 PD diagnosis details “problems in functioning
of aspects of the self (e.g., accuracy of self-view), and/or interpersonal dysfunction (e.g., the
ability to understand others' perspectives)”, and thus almost verbatim references these RDoC constructs. Additional work on prosocial behavior (which was investigated in several of the reviewed studies but is not currently an RDoC
construct) would complement this research (Hepp & Niedtfeld, 2022). An additional issue that
became evident when reviewing paradigms that were used to study the construct perception and
understanding of others is that of specificity. Future studies are needed to clarify whether the
paradigms that were used measure the same construct, or whether clustering them into sub-
constructs (affective theory of mind, cognitive theory of mind, empathy) would aid a more fine-grained understanding.
2. Increase the representativeness and diversity of the recruited samples
This implication has several elements. First, there is an urgent need for studies that include
individuals with personality pathology beyond BPD. As outlined above, the current body of
experimental work almost exclusively studied BPD, with little evidence generated for other PDs.
However, we do not argue that what the field needs now is an equally large number of
experimental studies for all other categorical PDs. Following the shift to a dimensional PD
diagnosis in ICD-11 and DSM-5 AMPD, we would rather argue for a dimensional recruitment
strategy that samples individuals with different levels of maladaptive PD traits and various
levels of PD severity. This way, dimensional associations between the processes investigated in
experimental paradigms and maladaptive traits or overall PD severity could be established and
possible interventions could be tested for their effectiveness at different points of this
continuum.
Second, the reviewed studies showed marked bias for samples of younger cis-gender
women, and we deem it very likely that this bias extends to other variables that we did not
extract from all articles (e.g., education level, sexual orientation, religion). Strikingly, only a
minority of studies even assessed race or ethnicity so that we were unable to provide a
conclusive picture of the distribution of these variables across the reviewed studies. In addition
to the failure to report data on race and ethnicity, there was also only one study that assessed
gender identities other than female or male, showing that studies almost exclusively succumbed
to a concept of gender binary. Beyond the evident problem of a lack of representation, this
practice is also at odds with the distribution of PDs in the general population, where we see that
PDs are more prevalent among transgender and gender nonconforming individuals (Reisner et
al., 2016). In addition to the failure to assess diverse gender identities, the reviewed studies also markedly under-represented men. This is at odds with
findings that (except for antisocial personality disorder) most PDs show similar prevalence rates
among men and women (Grant et al., 2008; Lenzenweger et al., 2007). We would argue that the
field (explicitly including our own research) requires a strong change toward more diverse
samples, both because it is an ethical mandate, and because samples that better represent the
whole population that is affected by PDs will produce findings that are more generalizable and
applicable.
3. Increase statistical power to detect between-person effects
As discussed above, many of the reviewed studies lack the statistical power to detect between-
person effects of the group factor (in previous studies often BPD vs. HC). The easiest way to
remedy this is, of course, to increase sample size. We realize that this is easier said than done
and that the recruitment of PD samples is always effortful and time-consuming. Often, the
samples recruited for the reviewed studies were highly specific and imposed additional criteria beyond the PD diagnosis, further complicating recruitment. Collaborations between different labs and distributing recruitment efforts across several study sites could help achieve larger samples.
In line with our recommendation for sampling a wider range of PD pathology, moving
away from recruiting participants with a categorical PD may also be helpful for achieving larger
sample sizes (e.g., da Costa et al., 2018). In all likelihood, this would render the recruitment
process much easier as it automatically increases the pool of potential participants and affords
inclusion criteria that are easier to meet. For instance, participants who meet only three or four
BPD criteria have to be excluded from a study that uses the categorical BPD diagnosis as an
inclusion criterion, but could easily be included in a study measuring PD severity level and
maladaptive trait combinations dimensionally. Likewise, healthy control participants do not
have to be selected to be “super healthy” and entirely free of any psychopathology, but could
still contribute low levels of maladaptive traits in a dimensional sampling approach. Other
groups that are typically not represented in past studies, such as individuals with partially
remitted or mixed PD pathology, could also be included in studies that quantify PD severity
dimensionally. Lastly, a dimensional sampling approach can be beneficial for statistical power,
as continuous predictors generally afford higher statistical power than artificially categorized ones.
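As a toy demonstration of this power difference (all values assumed for illustration: a true correlation of r = .3, n = 100, alpha = .05), the following simulation compares testing a continuous predictor directly against testing it after a median split:

```python
# Simulated power of a continuous predictor versus the same predictor
# dichotomized at the median. All parameter values are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, r, alpha, n_sims = 100, 0.3, 0.05, 5000
hits_cont = hits_split = 0
for _ in range(n_sims):
    x = rng.standard_normal(n)
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)
    hits_cont += stats.pearsonr(x, y)[1] < alpha         # continuous test
    hi = x > np.median(x)                                # median split
    hits_split += stats.ttest_ind(y[hi], y[~hi])[1] < alpha
print(f"power, continuous predictor: {hits_cont / n_sims:.2f}")
print(f"power, median split:         {hits_split / n_sims:.2f}")
```

The median split discards within-group variation in the predictor and, under these assumptions, costs a substantial share of power at the same sample size.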
In either case, studies using experimental paradigms in PD samples would benefit from
using conservatively estimated power analyses or simulations to inform future sample sizes; conducting a-priori power analyses (which very few of the reviewed studies did) should go without saying. We are sympathetic that the recruitment of large samples to increase statistical power represents a serious challenge for clinical psychological research. However, in the long run, fewer but adequately powered studies will be more informative for the field.
As of now, the replicability of the majority of the reviewed findings is questionable. A solution
for this problem might lie in a much stronger adoption of open data repositories to aggregate
primary data from multiple sites (see implication 5 for a more detailed discussion).
Alternatively, PD research needs more concerted efforts to assess data across multiple labs,
which has become more common in recent years (Klein et al., 2018; Moshontz et al., 2018).
Such efforts would not only result in higher statistical power, but also address issues of
generalizability.
4. Increase the reliability of estimators
Our review revealed that some studies tended to implement a large number of within-factor
levels, which, if not accompanied by a proportional increase in trials per level, can result in
unreliable estimators. While we were unable to determine precisely how big this problem was
in the reviewed studies (because authors tended not to report reliability data and even trial
numbers were not readily available for all articles), we would argue that studies could generally
benefit from placing greater emphasis on the number of repetitions per within-factor
combination. In the simplest sense, this could mean to increase the overall number of repetitions
in the experimental paradigm. At the same time, researchers will want to consider participant
burden and avoid overly long experimental sessions. In the tradeoff between session duration
and trial repetitions, researchers must therefore carefully consider how many factor levels they
can implement, and consider whether the design they are thinking of is too complex for the trial numbers that are realistically feasible. Relatedly, multi-item assessment was rare in the reviewed studies in general, and this is despite its considerable importance for construct validity. Single items are
unlikely to represent the breadth of the theoretical concept of interest. As stated above, in most
cases researchers have to make a trade-off between participant burden and validity of measures,
thus there cannot be a clear-cut recommendation for future research. However, we would urge
researchers to consider whether multi-item scales are feasible. Additionally, it was striking how
many of the reviewed studies developed ever-new paradigms to investigate the same constructs.
While the development of new paradigms is not inherently problematic, most studies failed to
pilot these prior to their first application. Additionally, several studies used paradigms developed outside of clinical research.
Often, these paradigms were designed and optimized for investigating within-person effects.
Thus, they tend to produce variance at the within-person and not necessarily the between-person
(or even group) level. Ideally, whenever introducing or adopting a new experimental paradigm
to PD research, researchers should first run a pilot study in a convenience sample to establish
general reliability indices and ensure that the paradigm produces substantial between-person
variance.
5. Increase transparency of research
A single study reported a pre-registration for their hypotheses and analyses, and only a handful
of studies provided open data or materials. Nonetheless, we are hopeful that most researchers in
our field want to improve on this, and that some already pre-registered studies will be published
in the next few years. The last decade has seen a scientific reform movement starting from a
serious replicability crisis in psychology. Given the issues pointed out throughout the article, we
fear the same problem affects PD research. If published findings are not replicable, then progress
for the diagnosis and treatment of personality psychopathology is ultimately set back. While the
2010s have been described as a decade of ‘active confrontation’ for psychology in this regard
(Nosek et al., 2021), we have to note that our field still seems to be in a phase of ‘active denial’.
In this review, we provided several recommendations that, while advancing scientific rigor, also
require a great deal of additional effort from researchers. Other recommendations, such as the ones regarding increased transparency, require comparatively little extra effort.
Increased transparency refers particularly to providing open data and materials as part
of the publication. The value of open data and materials has been acknowledged by most scientific organizations, commonly
resulting in a best practice recommendation that research data should be openly available
(Gewin, 2016; Wilkinson et al., 2016). Similar sentiments are reflected by TOP guidelines,
which are increasingly adopted by scientific journals (Nosek et al., 2015). It goes without saying
that these calls for ‘open data’ should be followed while considering possible ethics or privacy
constraints. Too often, however, patient privacy constraints serve as a fig leaf for not taking action
and providing open access to data. Wider adoption of open science procedures would make our
research more efficient by facilitating the reuse of data and would allow for calculating
individual patient data (IPD) meta-analyses. This, in most cases, increases statistical power and
allows for adjustment and investigation of confounding factors at the participant level (Riley et
al., 2010), analyses for which most individual studies are not adequately powered.
In parallel to this call for ‘open data’, we ask researchers to consider publication of their
materials, especially the experimental setup and the code for statistical analyses. It is widely
recognized that the traditional article is insufficient for describing all aspects of the experimental
design, data preprocessing, and analysis (Munafò et al., 2017). This makes an informed
understanding and validation of most articles next to impossible and impedes the ability to
accurately judge these aspects or build on previous work. The availability of experimental and
analysis code makes it possible to fully understand these aspects and (combined with open data)
to reproduce key findings of the literature. A wider dissemination of ‘open materials’ might thus
address the obstacles we encountered during this literature review, such as the ones regarding a lack of transparent reporting of experimental designs and statistical models.
6. Adopt statistical models that capture trial-level variation
Despite the difficulties described above, there are some main takeaways from our literature
review regarding the use of statistical models. As of now, most studies aggregate their measures.
We already pointed out the issues associated with this common practice (i.e. stimuli-as-fixed-
effect fallacy). A wider adoption of linear mixed models would not only make the most of
repeated experimental assessments (Brown, 2021), but would also allow moving away from discrete factor levels. Instead of presenting stimuli at a few discrete levels (e.g., mild, moderate, and maximum intensity) and aggregating the data for each level, researchers could opt to manipulate variables continuously (e.g., using
stimuli sampled along the full range of intensity). Randomly sampling stimuli along an intensity
continuum and modeling them as random effects would enable new insights.
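As a hedged sketch of what such a model could look like (not a prescription; the file and column names are hypothetical), Python's statsmodels can fit crossed random intercepts for participants and stimuli through its variance-components interface:

```python
# Trial-level linear mixed model with crossed random intercepts for
# participants and stimuli, addressing the stimuli-as-fixed-effect fallacy.
# 'trial_level_data.csv' and all column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

trials = pd.read_csv("trial_level_data.csv")
trials["all"] = 1   # a single group spanning the whole data set

model = smf.mixedlm(
    "response ~ intensity * pd_severity",   # continuous within x between
    data=trials,
    groups="all",                           # crossed-effects trick:
    re_formula="0",                         # no random effect for the
    vc_formula={"participant": "0 + C(participant)",   # dummy group itself
                "stimulus": "0 + C(stimulus)"},
)
print(model.fit().summary())
```

Because stimuli enter as a random factor, the fixed effects generalize beyond the particular stimulus set, and the within-factor can stay continuous instead of being collapsed into levels.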
Furthermore, a substantial number of studies adjusted their analyses for different socio-
demographic or psychopathological variables. It has been shown before, and we cannot stress
this point enough, that post-hoc inclusion of covariates to adjust for group differences increases
the likelihood of false-positive results (Simmons et al., 2011). When still doing so (or being asked to do so during the review process), transparent reporting is required to evaluate the reliance of results on the presence of covariates; thus, both the unadjusted and adjusted models should be reported.
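A minimal sketch of that reporting practice (variable and file names are hypothetical): fit the model of interest once without and once with covariates, and report the focal coefficient from both:

```python
# Report both the unadjusted and the covariate-adjusted model so readers can
# judge how strongly the focal effect depends on the covariates.
# 'study_data.csv' and all variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")

models = {
    "unadjusted": smf.ols("outcome ~ pd_severity", data=df).fit(),
    "adjusted": smf.ols("outcome ~ pd_severity + age + depression",
                        data=df).fit(),
}
for label, fit in models.items():
    print(f"{label}: b = {fit.params['pd_severity']:.3f}, "
          f"p = {fit.pvalues['pd_severity']:.3f}")
```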
Conclusion
We reviewed 99 articles published between 2017 and 2021 that report findings from experimental paradigms in PD research, and identified several limitations of this body of work. We would like to underline that our own work is not free of these limitations and is explicitly included in all criticism we presented. In addition, our selection of thirteen clinical target
journals might have influenced the results and conclusions of this review. Nonetheless, we
conclude that future research could benefit from: (1) an expansion in content to currently under-represented RDoC constructs, (2) the recruitment of more representative, diverse samples covering the full continuum of PD severity, (3) increasing the statistical power to detect between-person effects, (4) carefully examining and increasing the between- and within-person reliability of the employed paradigms, (5) adopting statistical tests that adequately model trial-level
variations related to experimental stimuli, and (6) embracing open science practices
(preregistration, open data, open materials, open code) to increase transparency, reproducibility,
and replicability.
References
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., Fiedler, S., Funder, D. C., Kliegl, R., & Nosek, B. A. (2016). Recommendations for increasing replicability in psychology. European Journal of Personality. https://doi.org/10.1002/per.1919
Bach, B., Sellbom, M., Skjernov, M., & Simonsen, E. (2018). ICD-11 and DSM-5 personality trait domains capture categorical personality disorders: Finding a common ground. Australian & New Zealand Journal of Psychiatry, 52(5), 425-434. https://doi.org/10.1177/0004867417727867
Bloom, H. S. (1995). Minimum detectable effects: A simple way to report the statistical power of experimental designs. Evaluation Review, 19(5), 547-556. https://doi.org/10.1177/0193841X9501900504
Brown, V. A. (2021). An introduction to linear mixed-effects modeling in R. Advances in Methods and Practices in Psychological Science, 4(1). https://doi.org/10.1177/2515245920960351
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335-359. https://doi.org/10.1016/S0022-5371(73)80014-3
da Costa, H. P., et al. (2018). … personality traits are associated with the ability to understand the emotional states of others. Journal of Research in Personality. https://doi.org/10.1016/j.jrp.2018.05.001
Domes, G., Schulze, L., & Herpertz, S. C. (2009). Emotion recognition in borderline personality disorder: A review of the literature. Journal of Personality Disorders, 23(1), 6-19. https://doi.org/10.1521/pedi.2009.23.1.6
Fernandez, K. C., Jazaieri, H., & Gross, J. J. (2016). Emotion regulation: A transdiagnostic
perspective on a new RDoC domain. Cognitive Therapy and Research, 40(3), 426-440.
https://doi.org/10.1007/s10608-016-9772-2
Fossati, A., Somma, A., Borroni, S., Markon, K. E., & Krueger, R. F. (2018). Executive … Journal of Psychopathology and Behavioral Assessment. https://doi.org/10.1007/s10862-018-9645-y
Gewin, V. (2016). Data sharing: An open mind on open data. Nature, 529(7584), 117-119.
https://doi.org/10.1038/nj7584-117a
Grant, B. F., Chou, S. P., Goldstein, R. B., Huang, B., Stinson, F. S., Saha, T. D., Smith, S. M., Dawson, D. A., Pulay, A. J., & Pickering, R. P. (2008). Prevalence, correlates, disability, and comorbidity of DSM-IV borderline personality disorder: Results from the Wave 2 National Epidemiologic Survey on Alcohol and Related Conditions. Journal of Clinical Psychiatry, 69(4), 533-545.
Haeny, A. M., Holmes, S. C., & Williams, M. T. (2021). The need for shared nomenclature on racism and related terminology in psychology. Perspectives on Psychological Science, 16(5), 886-892.
Hanegraaf, L., van Baal, S., Hohwy, J., & Verdejo-Garcia, A. (2021). A systematic review and meta-analysis of ‘Systems for Social Processes’ in borderline personality and substance use disorders. Neuroscience and Biobehavioral Reviews, 127, 572-592. https://doi.org/10.1016/j.neubiorev.2021.04.013
Hepp, J., & Niedtfeld, I. (2022). Prosociality in personality disorders: Status quo and research … Current Opinion in Psychology. https://doi.org/10.1016/j.copsyc.2021.09.013
Hepp, J., Niedtfeld, I., & Schulze, L. (2022, May 13). Experimental paradigms in personality disorder research … directions. OSF. https://doi.org/10.17605/OSF.IO/XMWE9
Hopwood, C. J., Thomas, K. M., Markon, K. E., Wright, A. G., & Krueger, R. F. (2012). DSM-5 personality traits and DSM-IV personality disorders. Journal of Abnormal Psychology, 121(2), 424-432.
Huprich, S. K. (2020). Personality disorders in the ICD-11: opportunities and challenges for
advancing the diagnosis of personality pathology. Current Psychiatry Reports, 22, 1-7.
https://doi.org/10.1007/s11920-020-01161-4
Jeung, H., Schwieren, C., & Herpertz, S. C. (2016). Rationality and self-interest as economic-exchange strategy in borderline personality disorder: Game theory, social preferences, and interpersonal behavior. Neuroscience and Biobehavioral Reviews, 71, 849-864. https://doi.org/10.1016/j.neubiorev.2016.10.030
Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54-69. https://doi.org/10.1037/a0028347
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams Jr, R. B., Alper, S., Aveyard, M., Axt, J. R., Babalola, M. T., & Bahník, Š. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443-490.
Koudys, J. W., Traynor, J. M., Rodrigo, A. H., Carcone, D., & Ruocco, A. C. (2019). The NIMH research domain criteria (RDoC) initiative and its implications for research on personality disorders. Current Psychiatry Reports, 21, 37. https://doi.org/10.1007/s11920-019-1023-2
Kraemer, H. C. (2015). A source of false findings in published research studies: Adjusting for covariates. JAMA Psychiatry, 72(10), 961-962. https://doi.org/10.1001/jamapsychiatry.2015.1178
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710. https://doi.org/10.1002/ejsp.2023
Lenzenweger, M. F., Lane, M. C., Loranger, A. W., & Kessler, R. C. (2007). DSM-IV personality disorders in the National Comorbidity Survey Replication. Biological Psychiatry, 62(6), 553-564.
Liebke, L., Bungert, M., Thome, J., Hauschild, S., Gescher, D. M., Schmahl, C., Bohus, M., & Lis, S. (2017). Loneliness, social networks, and social functioning in borderline personality disorder. Personality Disorders: Theory, Research, and Treatment, 8(4), 349-356. https://doi.org/10.1037/per0000208
Morris, S. E., & Cuthbert, B. N. (2012). Research Domain Criteria: Cognitive systems, neural circuits, and dimensions of behavior. Dialogues in Clinical Neuroscience, 14(1), 29-37. https://doi.org/10.31887/DCNS.2012.14.1/smorris
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., Grahe, J. E., McCarthy, R. J., Musser, E. D., & Antfolk, J. (2018). The Psychological Science Accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501-515. https://doi.org/10.1177/2515245918797607
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. https://doi.org/10.1038/s41562-016-0021
Myers, A., & Hansen, C. H. (2011). Experimental psychology (7th ed.). Wadsworth Cengage
Learning.
National Institute of Mental Health. (2022). Domain: Negative valence systems. U.S. Department of Health and Human Services. https://www.nimh.nih.gov/research/research-funded-by-nimh/rdoc/constructs/negative-valence-systems
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., … & Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422-1425. https://doi.org/10.1126/science.aab2374
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Almenberg, A. D., Fidler, F., Hilgard, J., Kline, M., & Nuijten, M. B. (2021). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719-748. https://doi.org/10.1146/annurev-psych-020821-114157
Oldham, J. M. (2015). The alternative DSM-5 model for personality disorders. World Psychiatry, 14(2), 234-236.
Papousek, I., Aydin, N., Rominger, C., Feyaerts, K., Schmid-Zalaudek, K., Lackner, H. K., Fink, A., Schulter, G., & Weiss, E. M. (2018). DSM-5 personality trait domains and … Biological Psychology. https://doi.org/10.1016/j.biopsycho.2017.11.010
Parsons, S., Kruijt, A.-W., & Fox, E. (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Advances in Methods and Practices in Psychological Science, 2(4), 378-395. https://doi.org/10.1177/2515245919879695
Perugini, M., Gallucci, M., & Costantini, G. (2018). A practical primer to power analysis for simple experimental designs. International Review of Social Psychology, 31(1), 20. https://doi.org/10.5334/irsp.181
Reisner, S. L., Poteat, T., Keatley, J., Cabral, M., Mothopeng, T., Dunham, E., Holland, C. E., Max, R., & Baral, S. D. (2016). Global health burden and needs of transgender populations: A review. The Lancet, 388(10042), 412-436. https://doi.org/10.1016/S0140-6736(16)00684-X
Riley, R. D., Lambert, P. C., & Abo-Zaid, G. (2010). Meta-analysis of individual participant data: Rationale, conduct, and reporting. BMJ, 340, c221.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Stanley, T. D., Carter, E. C., & Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin, 144(12), 1325-1346. https://doi.org/10.1037/bul0000169
Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., Pedaste, M., Raju, M., Astapova, A., & Lukner, H. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data, 8, 192. https://doi.org/10.1038/s41597-021-00981-0
Trull, T. J., Solhan, M. B., Tragesser, S. L., Jahng, S., Wood, P. K., Piasecki, T. M., & Watson, D. (2008). Affective instability: Measuring a core feature of borderline personality disorder with ecological momentary assessment. Journal of Abnormal Psychology, 117(3), 647-661.
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726-728. https://doi.org/10.1037/0003-066X.61.7.726
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., & Bourne, P. E. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.
Zimmermann, J., Kerber, A., Rek, K., Hopwood, C. J., & Krueger, R. F. (2019). A brief but comprehensive review of research on the alternative DSM-5 model for personality disorders. Current Psychiatry Reports, 21, 92. https://doi.org/10.1007/s11920-019-1079-z