Author Note
Early versions of this paper were presented at the annual conference of the Society for Industrial
and Organizational Psychology. Participant payments and graduate research hours in this study
were funded by Revelian Pty Ltd, and RNL became a compensated member of Revelian’s
Scientific Advisory Board mid-project. MBA is now at Google, ABC is now at Facebook, SM is
Citation
Landers, R. N., Armstrong, M. B., Collmus, A. B., Mujcic, S., & Blaik, J. (in press). Theory-
Abstract
Games, which can be defined as an externally structured, goal-directed type of play, are
increasingly being used in high-stakes testing contexts to measure targeted constructs for use in
the selection and promotion of employees. Despite this increasing popularity, little is known
about how theory-driven game-based assessments (GBA), those designed to reflect a targeted
construct, should be designed, or their potential for achieving their simultaneous goals of
develop a theory of GBA design by integrating game design and development theory from
to this theory to measure latent general intelligence (g). Using an academic sample with GPA
data (N=633), we demonstrate convergence between latent GBA performance and g (ρ = .97).
Adding an organizational sample with supervisory ratings of job performance (N=49), we show
GBA prediction of both GPA (r=.16) and supervisory ratings (r=.29). We also show incremental
prediction of GPA using unit-weighted composites of the g test battery beyond that of the g-
GBA battery but not the reverse. We also show the presence of similar adverse impact for both
the traditional test battery and GBA but the absence of differential prediction of criteria.
Reactions were more positive across all measures for the g-GBA compared to the traditional test
battery. Overall, results support GBA design theory as a promising foundation from which to
In recent years, there has been a marked increase in interest among assessment
practitioners in the application of game-thinking, the use of game design theory to improve the
assessment experience (Armstrong, Landers & Collmus, 2016). In the practice of employee
selection, the term game-thinking encompasses two more specific concepts: assessment
gamification and standalone game-based assessment (GBA; Armstrong, Ferrell, et al., 2016).
Whereas assessment gamification modifies existing assessments by adding game elements (e.g., Attali & Arieli-Attali, 2015; Collmus & Landers,
2019), GBA is a distinct method of measurement (cf. Arthur & Villado, 2008), and GBAs might
reflect either the result of gamification or of a dedicated game design and development. Much
like surveys, simulations, and structured interviews, GBAs can be created with the intent of
assessing any construct of interest. In the current assessment marketplace, there are GBAs
marketed as assessing general mental ability (g), personality, skills, and various competencies
(Handler, 2018); however, scientific evidence evaluating the measurement quality of GBAs is
scant in the assessment literature and missing in the high-stakes assessment literature (Chamorro-
In the popular press and in assessment company marketing materials, GBAs are
commonly described as providing two distinct advantages over traditional assessments. The first is an improved applicant experience (Armstrong, Ferrell, et al., 2016). Whereas job applicants generally consider traditional survey-based
assessment to be ordinary and expected (Anderson et al., 2010; Steiner & Gilliland, 1996), GBAs
offer a promise of fun and excitement. Although such claims remain largely untested, applicant reactions theory supports the general concept: fun and excitement during the application process should lead to better organizational hiring outcomes (cf. Hausknecht et
al., 2004). However, even if true, given the high cost of GBA development, it is unknown if any
gains realized by developing and implementing GBA would ultimately result in positive utility.
Additionally, the type of fun experienced in games may only be tenuously related to the kind of
fun, if any, that a job applicant wants during high-stakes assessment (cf. Mollick & Rothbard,
2014). The second purported gain attributed to GBA is improved measurement. This claim
takes many specific forms, such as reducing the impact of human biases (e.g., Ip, 2018), yet this is a well-explored and hardly new concept in the assessment literature (e.g., Kuncel et al., 2013). Another
aspect of GBAs that could enable superior construct measurement is the analysis of the trace
data, such as click and mouse movement data, an area of research called computational
psychometrics (von Davier, 2017). This field is in its infancy and primarily exists in the
assessment of learning (e.g., Olsen et al., 2017), so its potential in the selection context is
completely unknown.
The more immediate concerns for assessment researchers are the development process
and measurement characteristics of current, deployed GBAs designed so that they do not require such techniques. Theory-driven GBAs function like traditional psychometric assessment in that the GBA is designed to assess a targeted
construct based upon assessee behaviors within the assessment. Where theory-driven GBAs
differ from other theory-driven assessment development approaches is that they involve the
creation and collection of scores from a game, a complex concept with a rich history in
interdisciplinary research outside of the assessment literature which spans the humanities, social sciences, and computer science. It is the interdisciplinary design methodologies developed and refined in games research over the last century that theory-driven GBA draws from in pursuit of an improved assessment experience.
Given this landscape and the increasing popularity of GBA in the assessment
marketplace, and given calls to improve applied psychology’s integration of design into its
theories (Landers & Marin, 2021), the purpose of the present article is to introduce games, and
more specifically theory-driven GBA and the game design theories used to create them, to the
high-stakes assessment literature. We create and present a theory of GBA design through an
interdisciplinary integration of literatures across assessment and software design through the lens
academic sample and a much smaller one in an organizational sample, comparing and
contrasting reactions, validity, and adverse impact of a traditional g test battery versus a theory-
driven GBA designed to assess g using the process described by our GBA design theory.
Finally, we provide practical recommendations for the development of theory-driven GBA and
Games, which have been a part of culture across all recorded human history and likely
much further back (Huizinga, 2014), have historically been difficult to define; “what is the definition
of game?” has been the subject of hundreds of articles in the games research literature. Stenros
(2017) attempted to synthesize this literature, identifying 60 distinct definitions presented since
the 1930s differing on 10 dimensions of description. To simplify this in the GBA context, it is tempting to narrow our treatment at least to digital games, which is to say games played on a computing device and, in the present day, most typically delivered over the internet.
However, the term game describes a range of analog experiences as well, such as classics like
Monopoly, Scrabble, Chess, and Duck-Duck-Goose. For our purposes in the present article,
game will be defined as explored and defended by Landers et al. (2019): “an externally
structured, goal-directed type of play.” As they explain, fundamental to this definition, like almost all in the literature, is that players have a high degree of flexibility in terms of how they go about achieving goals either enabled or imposed by the game’s design. Further, a game engages
a player in its core gameplay loop, an iterative experience created by the interaction of
potentially thousands of game elements, all designed to encourage players to experience the loop repeatedly.
Building upon this background, GBAs thus utilize games as a platform from which to derive assessment scores, in contrast to traditional tests that rely upon multiple-choice questions. In theory-driven GBA, this occurs explicitly, by designing in-game activities to create meaningful scores estimating targeted constructs. Many modern digital games already incorporate explicit assessments like this, quantifying player behaviors
(e.g., goal achievement, in-game activities), although at a less rigorous standard of measurement
than is common in the assessment literature. For example, simply counting the number of goals
achieved among a provided list and providing feedback on this list to players may be sufficient to
create an emotionally compelling gameplay experience. The fluid nature of games’ “multiple
interacting aspects of knowledge and skill; construct-irrelevant variation from game features;
dependencies among actions across time points; [and] different situations arising for different
players as they interact with a game” (Mislevy et al., 2014, p.10) pose significant challenges and
deviations from standard assessment development practice for the more rigorous requirements of high-stakes assessment. Specifically, activities must be designed that provide players with significant freedom in pursuing game goals while simultaneously designing the experience such that variance in measurable player behaviors reflects targeted constructs. If there is insufficient freedom to play, the game feels prescribed and
mandatory, and thus the experience loses its gamefulness (see Mollick & Rothbard, 2014;
Landers et al., 2019), reducing the potential added value of GBA over much less costly
assessment methods.
Designing a GBA given this challenge is potentially easier in the case of the
measurement of g than for the measurement of skills or non-cognitive traits. Because g can be measured with a wide range of cognitively loaded activities (Schneider & McGrew, 2012), the goal of g-GBA development can be relatively simple: creating multiple game activities that are cognitively loaded, such as solving interactive puzzles or pursuing complex in-game goals, and scoring those activities according to
their cognitive aspects with a traditional g scoring model. Thus, development of a g-GBA can
resemble development of a traditional g assessment but with the significant financial overhead of
game development. Skill GBAs are similarly straightforward from a measurement perspective
but add concerns regarding accurate simulation of the targeted skill construct in terms of physical
and functional fidelity (Hamstra et al., 2014); skill GBAs are designed to require the player to
engage in the skill to be measured, or a proxy, and those behaviors are then scored according to
the degree of skill exhibited, which might be done by human raters or algorithmically (e.g., see CyberNEXS in Nagarajan et al., 2012). The most complex design case is for non-
cognitive traits where there is no “correct” answer indicating higher construct standing, such as
personality. In this context, activities must be crafted such that players have the freedom to
engage in a range of actions reflecting a range of a targeted trait but not have so much freedom
that those actions could reflect a non-targeted trait. If a player might behave in a scored way in a
GBA because of high agreeableness or because of high conscientiousness, that behavior will
likely not be a particularly good measure of either trait when using classical psychometric
approaches. For example, a player might be given the freedom to choose whether to help a
virtual character in need, yet such helping behavior could reflect a desire to be friendly (i.e., an
agreeableness signal), a desire to complete all tasks provided (i.e., a conscientiousness signal), or some other motive entirely. Game design and development is an enormously complex process approached with myriad methodologies; all are a combination of art and science, and the specific methodology adopted shapes not only the aesthetic experience of a resulting GBA but also its psychometric properties. Most modern digital games are the product of thousands of individual design decisions. Thus, much as with the development of other complex media like film, extant
theory provides some “rules” for effective game design (Salen et al., 2004), such as the criticality
of iterative engineering practices in which prototypes are gradually improved through repeated
data collection and action planning (Kultima, 2015), but much is still based upon the intuition
and development experience of the game’s director in terms of how to develop a compelling
experience for players, as there is rarely empirical research to consult regarding individual design
decisions.
In parsing the complexity of games, the games research community has developed numerous major theoretical frameworks for understanding the implications of different game designs; the choice among them generally reflects the disciplinary lens of the researcher attempting design or analysis (Egenfeldt-Nielsen et al., 2013). A popular
approach within human-computer interaction, which tends to favor empirical approaches with a design focus, is the mechanics, dynamics, and aesthetics (MDA) framework (Hunicke et al., 2004), which blends perspectives from the fields of “game design and development, game criticism, and technical game research” (p. 1722). In MDA, mechanics are
defined as the algorithms and systems that establish how a game functions on a technical level,
such as scoring systems or avatar control. Dynamics are defined as the real-time interactions that emerge as mechanics interact with players and with each other during play, and aesthetics are defined as the affective experiences evoked in players by a game’s mechanics and dynamics. Importantly, the only aspects of a game that actually “exist” in a technical sense are the mechanics; the experience recognized as a game consists of the dynamics – the result of mechanics interacting with each other and with players in real time. It is impossible for
game programmers to program dynamics directly. Instead, they must program mechanics with
the goal of those mechanics interacting with the player and with each other in such a way that
desirable dynamics, and subsequently aesthetics, emerge as a result of gameplay. This idea
reflects the theory of motivational affordances, which describes how the technical characteristics
of systems have only the potential to motivate people to act (Deterding, 2011; Zhang, 2008);
mechanics have affordances that are designed to but do not necessarily lead to desired dynamics.
Thus, causally speaking, mechanics cause dynamics which in turn cause aesthetics, but game
developers only have direct control over mechanics despite targeting desirable aesthetics, such as
a “fun” GBA.
The design challenge in GBA incorporates all these concerns yet is still more complex
because targeting desirable aesthetics is a secondary goal. Whereas the primary goal of commercial games is to create positive affective experiences for players, motivating them to play more and discuss their experiences positively with friends and family so that the games are more commercially successful, GBAs must first produce
scores with trustworthy psychometric properties and secondarily also create a positive affective
experience. If a GBA does not achieve a degree of measurement quality that is recognizable by
common standards (e.g., Nunnally, 1978; Society for Industrial and Organizational Psychology,
2003), it is not a GBA; it is only a game that outputs numbers. Similarly, if a GBA does not
create a sense of play, it is not a game; it is only an assessment marketed as if it is a game, which
is instead a type of gamification called “game-framing” (Collmus & Landers, 2019). Perhaps
most importantly, if the GBA cannot achieve both goals simultaneously, there are far more
inexpensive assessment methods that can achieve one or the other goal alone. Thus, GBA
designers face the added complexity of combining traditional game design, such as through the MDA framework, with rigorous psychometric development. Currently, GBA developers tend to integrate literatures based upon their own locally-defined priorities, because there is little theory to guide GBA design. The present study is intended in part to address this gap.
Modern digital application development processes are generally based upon design
thinking, a design and development process theory focusing upon iterative redevelopment and
reconceptualization of both the problem being addressed and the product being designed to
address it (Plattner et al., 2011; Rowe, 1986) that was popularized by the Stanford Design School
(Plattner, 2011; Bjogvinsson et al., 2012). Design and development process theories are rarely
seen in applied psychology, but they describe how to best align design and development
processes with intended outcomes, which generally determines the quality of the final product
being developed in relation to organizational goals related to that product (Landers & Marin,
2021). In design thinking, there are five key stages to the development of a designed product;
however, designers may jump backwards to any prior stage if challenges in later stages
necessitate it. A visualization of these stages as applied to GBA development appears in Figure 1.
The first stage, empathizing, requires designers to collect data on the problem their product is
designed to solve; in the theory-based GBA design context, this involves construct specification
and development of a shared mental model among assessment experts, game designers, and
software development teams. The second stage, definition, refers to a meta-cognitive process in
which designers attempt to anticipate difficulties ahead based upon the problem and design
product as they understand it at that time. For GBAs, this principally involves consideration of
the ultimate context in which the GBA will be deployed, including issues like supported device
types, proctoring, and technical limitations. The third stage, ideating, involves brainstorming
and prioritization of developed ideas needed to develop a useful prototype, which generally
involves building a shared mental model between the assessment team and technical teams
working on the GBA. The fourth stage, prototyping, involves the creation of a planned product,
which includes anything from paper representations of GBAs (i.e., low fidelity prototypes) to
partially functional but unfinalized digital versions (i.e., high fidelity prototypes; see Figure 2).
The fifth stage, testing, involves assessment of the success of that product in addressing the
original problem statement, which for GBAs typically includes both aesthetic and psychometric
goals. These stages are also often repeated in various combinations after a tested product has
already been deployed for use; for example, in an assessment consultancy, after the first version
of a GBA is deployed for one client, lessons learned from that GBA may be used to inform
future empathizing, defining, ideation, and/or prototyping while that first version of the GBA
continues to be used. When creating a GBA under a design thinking model, assessment
specialists generally work much more closely with software developers than is typical for
assessment specialists in more traditional measure development projects given the thousands of
iterations involved, each of which may have implications for high-quality measurement; for
example, the distance between the two images in Figure 2 alone was several hundred prototypes,
each tested in different ways depending upon development goals relevant at that point in time.
Given this increased overall process complexity, agile development of a GBA occurs in a much more
organic and emergent manner than in traditional assessment design (Mitchell, 2016). Although
the empathizing stage is similar, in that it involves identification of target constructs through
needs analysis (Taylor et al., 1998), job analysis (Morgeson et al., 2019), or some other
traditional process used to define research problems, the remaining stages are different. In GBA
design, the defining stage is minimized as a way of managing risk; it is generally assumed that
developing a prototype and learning about the GBA’s performance from data will be more
efficient than spending more time in the initial defining stage. More dramatically, the ideating,
prototyping, and testing stages are reordered and repeated as suitable for the combination of
problems, designers, and developers in the project, to maximize quality while minimizing effort (Moran, 2014; Beck, 1999).
Figure 2. Early low-fidelity paper prototype (left) and screenshot from a functional high-fidelity digital prototype (right) of one of the current study’s GBA mini-games, Gridlock
The success of these efforts will be driven not only by technical
competence but also in large part by traditional antecedents of team effectiveness, such as team
its strong association with job performance and other outcomes of interest across a wide variety
of employment contexts (Schmidt, 2002; Schmidt & Hunter, 2004). In I-O psychology, g is commonly conceptualized through Cattell-Horn-Carroll (CHC) theory of cognitive ability and operationalized as shared variance across cognitively-loaded tests selected to sample across the cognitive ability domain (Lang, Kersting, Hulsheger & Lang,
2010). This operationalization is possible due to CHC theory’s hierarchical modeling of g, such
that g is the topmost and broadest construct, reflecting shared variance among specific abilities,
including fluid reasoning, visual processing, and processing speed, among others. CHC theory
was chosen as the foundation for measurement in the present study’s GBA for two primary
reasons. First, the CHC model is today the clearly dominant model of g in the intelligence
research literature and has a rich history (Schneider & McGrew, 2012) combining two prominent
models of human cognitive abilities, Horn-Cattell Gf-Gc theory (Horn & Noll, 1997) and
Carroll’s three-stratum theory (Carroll, 1993). Within the context of assessment and personnel selection, g-loaded measures have been shown to predict performance across a wide array of job types (Bertua et al., 2005; Salgado et al., 2003; Schmidt & Hunter, 1998). Second, CHC theory’s taxonomy of broad human cognitive abilities includes rich
theoretical foundations for its many components, allowing for targeted game design.
Specifically, rather than focusing on developing multiple general assessments of g, CHC’s broad
level factors provide specific details from which specific game mechanics can be identified and designed.
The hierarchical nature of cognitive ability is such that g, which exists at the highest and
most general level, explains the majority of variance across more specific ability tests regardless
of the specific domains of those tests, a concept Spearman called indifference of the indicator. Consistent with this, individual tests typically show very strong factor loadings and relatively little residual variance in factor analyses of cognitively-loaded test batteries used for employee selection (Carretta & Ree, 1996; Ree &
Carretta, 1994; Stauffer et al., 1996). One well-validated framework in the broader literature on g, the cognitive design system approach (Embretson, 1994), provides valuable guidance on the design of such measures. This framework addressed what was at the time a generally atheoretical approach to constructing
ability tests by grounding item construction in cognitive theory, recommending the targeting of two goals: a) construct representation, which refers to alignment of the latent constructs involved in solving an ability item with the cognitive theory of that solving, and b) nomothetic span, which
refers to expected relationships within the nomological net surrounding the measure given that
cognitive theory. Primi (2014) demonstrated both the value and complexity of this approach in
the creation of a fluid intelligence measure, linking specific item design features, such as
perceptual complexity, with targeted components of fluid intelligence. This type of linking
procedure, between cognitive theory and the properties of each item, thus serves as a strong
conceptual basis for the development of a g-GBA by using these techniques in the same general manner. Researchers have already attempted to measure g using scores across video games. Quiroga et al. (2015) administered 11 puzzle mini-games taken
from a commercially-available “brain training” video game for the Nintendo Wii, 1 computer-
based maze navigation game, and 11 cognitive ability tests to 188 undergraduates, finding a
correlation of .96 between latent cognitive ability and latent game performance. Although
groundbreaking in its demonstration of the potential for video games to provide valid
measurement of g, the study was limited in its generalizability and practicality in an employee
selection scenario due to its focus on commercially available video games designed for the
purpose of entertainment (Bhatia & Ryan, 2018), an unclear theoretical basis and design process
for the selection of relevant mini-games, and a complete focus on latent variables without
exploration of observed scores. Quiroga et al. (2019) replicated and extended this study,
addressing some of these concerns, by proposing a concept called video games general
performance (gVG). Although gVG was never explicitly defined, it was treated as the shared variance among performance scores produced by video games. This study was also somewhat more explicit about its game screening process, administering a battery of games plus a cognitive ability test battery to an ultimate sample of 134 participants in a lab environment over three 90-minute sessions. Quiroga et al. modeled scores obtained from
these games in a similar fashion as Quiroga et al. (2015), this time finding a latent correlation with latent cognitive ability of .79, substantially lower than observed in the earlier study; comparing R² across studies, (.922 − .624) / .922 suggests a 32.32% reduction in convergence for unclear reasons.
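Expressed in full, this comparison uses the squared latent correlations from the two studies (a worked restatement of the figures above, not a new result):

\[ \frac{.96^2 - .79^2}{.96^2} = \frac{.922 - .624}{.922} \approx .32 \]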
Because gVG was never defined explicitly, the implicit presentation of gVG as reflecting broad ability across all video games is questionable. Specifically, there exist numerous game genres with distinct skills required for success within each (Apperley,
2006), and Quiroga et al. (2019) not only sampled a subset of these genres, but also selected
within that subset for appropriate features using an ambiguous selection process. A player with
extensive experience with game mechanics within a particular genre focused upon within a g-
GBA assessment is likely to have developed some generalized skill within that genre’s typical
mechanics that does not generalize to other genres; for example, if a person plays many slide
puzzle games, they are likely to have greater skill at future slide puzzle games, because solving
slide puzzles is a skill than can be learned (cf. Colzato, van Leeuwen, van den Wildenberg, &
Hommel, 2010). Thus, if a g-GBA was designed as a slide puzzle game, latent game performance would partially reflect slide puzzle solving skill, contaminating measurement of g. Thus, in the case of the present study, we sought
to assess the degree to which a particular g-GBA’s design was successful in avoiding the
contamination seen in gVG, or put differently, to what degree the present g-GBA’s design
process, focused entirely upon the creation of a measure of g within an employee selection
hypothesis will suggest unique value for theory-driven GBAs, and the design theory used to
research exploring this is across the academic and employment domains. In the context of employment, supervisory ratings of job performance have numerous advantages and disadvantages as a criterion measure (Viswesvaran & Ones, 2000).
The primary advantages to the use of supervisory ratings for criterion validation are convenience
and availability, as most organizations collect at least annual performance reviews, and
ecological validity, as organizational decision-making is often based upon these ratings. The
ratings are often not high-quality reflections of the job performance construct, showing low
reliability and often collapsing orthogonal dimensions of job performance into individual scores
(Borman & Motowidlo, 1997). Despite these flaws, g tests have been consistently shown to
predict such scores, and the relationship between g and supervisory ratings is well-explored (e.g., Hunter & Hunter, 1984, found operational validities ranged between .31 and .73,
In the education domain, the most commonly studied performance outcome is college
grade-point average (Kuncel et al., 2004). Several properties of grade point averages make them
useful for validation research such as the present study. First, grade point averages are
extensively studied in the education literature, which has provided meta-analyses containing
useful benchmarks; the correlation between g and grade-point average is well-explored (ρ =.21,
N =7820, k =35; the relationship is stronger for highly-cognitively loaded tests such as ρ=.33,
N=22289, k=29 for the relationship between the SAT and grade-point average; Richardson,
Abraham, & Bond, 2012). Second, they are conceptually similar to supervisory ratings of job
performance (Meade & Fetzer, 2009), in that both classes of variable represent the outcome of
interactions between individual difference and situational variables. Much as supervisory ratings
are a convenient although imperfect proxy for actual job performance, which is itself the
behavioral outcome of knowledge, skills, motivation, and situational factors (Campbell et al.,
1993), grade-point average is a similarly convenient although imperfect proxy for academic
performance, itself the outcome of learned knowledge, acquired skills, motivation to learn, and
situational factors (Kuncel et al., 2001). Third, grade-point averages are conveniently available,
as college student grade-point averages are readily attainable, with permission, from university
records.
Although the weighting of task and contextual performance likely differs between the two, in both cases we would expect the correlation between latent game performance and these criteria to be positive to the extent that the minigames are in fact measuring g. In the interest of validating g-GBA scores from multiple perspectives, we hypothesized relationships with both criteria:
Hypothesis 2a. Latent g-GBA performance will positively predict college grade-point
average.
Hypothesis 2b. Latent g-GBA performance will positively predict supervisory ratings of
job performance.
The use of latent construct scores is not generally possible when making selection decisions; instead, test
scores are either combined into operational composites or used as predictors in regression
models. Each of these approaches brings its own strengths and drawbacks depending upon
context (Potosky et al., 2005). In the GBA context, although g is theorized to represent most of the reliable variance in minigame scores, variance unique to each minigame is likely to be contained within minigame scores as well. In the more specific case of the present GBA, the seven minigames were developed to principally target four broad traits in different combinations: Quantitative Knowledge (Gq), Reading and Writing Ability (Grw), Fluid Reasoning (Gf), and Visual Processing (Gv).
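To make the operational composite approach concrete, a unit-weighted composite of the kind referenced in the Abstract is typically formed by standardizing each observed score and summing with equal weights; the following is a minimal R sketch under assumed column names, not the authors' scoring code.

# Hypothetical sketch: form a unit-weighted composite by standardizing each
# observed test (or minigame) score and summing with equal (unit) weights.
tests <- c("visual", "fluid", "speed", "quant", "verbal")  # assumed column names
dat$g_composite <- rowSums(scale(dat[, tests]))            # 'dat' is an assumed data frame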
There may also be additional sources of GBA-specific method variance contained within
individual game scores. For example, existing research suggests that well-designed games can
decrease or suspend fears and anxieties by engrossing players in gameplay (Isbister et al., 2012;
Barnett & Storm, 1981; Landers et al., 2019). In the context of GBAs, this may improve
attention and concentration given research linking testing anxiety with test performance (Moran,
2016). Furthermore, because GBAs are much more behaviorally complex than traditional g
tests, GBAs may capture criterion-relevant behaviors beyond those captured by traditional
assessments. Although this would reflect poorer construct validity, it might result in improved prediction, reflecting a common trade-off also seen in the arguments for the use of machine learning in assessment. Because several such mechanisms could plausibly operate in GBAs, and because we knew we would be unable to isolate any of these theoretical mechanisms
given the nature of GBA development, we approached the issue of incremental prediction as a
general question of interest with the intent of exploring game performance as both a latent variable and an observed composite, to better understand the relationship between this GBA and performance in relation to traditionally-measured g.
When considering the use of g measures in the employee selection context, the presence
and magnitude of adverse impact is of significant concern (Hough et al., 2001). Adverse impact
occurs when a test is biased by subgroup membership within a protected class defined by
national or local laws, resulting in different success rates by subgroup despite a consistently
applied testing standard. Protected classes vary by jurisdiction but may include race, sex, color,
religion, national origin, disability, age, or any other legally defined classifier. Additionally,
specific cutoffs are sometimes defined at which point adverse impact legally occurs (Equal
In the context of g, adverse impact research generally focuses on the “Black-White test
score gap” due to the relatively large population of Black people in the United States relative to
other racial and ethnic minorities, as well as the commonly observed mean score difference of
roughly one standard deviation between White and Black people on common g measures (Roth
et al., 2001). If the present g-GBA is indeed a measure of Spearmanian g, a difference of similar magnitude would be expected. Although such differences are primarily driven by persistent construct-level differences between populations (Reeve & Hakel, 2002), these differences can be widened by certain characteristics of both test design and the
context in which testing takes place. For example, the use of racially-biased items within a
questionnaire measure of a construct that is not itself racially biased can still lead to the
appearance of a racially biased test (Drasgow, 1987). Further, because the internal consistency
reliability of g measures is frequently quite high, relatively lower GBA reliability could also
attenuate observed racial differences. Thus, different characteristics of GBA could increase or
decrease observed differences beyond what is suggested by construct effects. In the context of
racial differences, there is no evidence to suggest that gameplay experience or game skill related
to a g-GBA would differ beyond what is already suggested by construct-level differences; thus,
there is no reason to expect that the Black-White test score gap would be enlarged by use of
GBA, although it could be attenuated due to relatively lower reliability. Unfortunately, due to a
low base rate of Black test-takers in our applied sample, this hypothesis test was limited to our
academic sample alone, which applies to all further race-related hypotheses and research
questions as well. Roth et al. (2001) found a Black-White test score difference of .69 standard
deviations among college students, so we used this estimated population effect as the basis for
our hypothesis.
Hypothesis 3. Black test-takers and White test-takers will differ in g-GBA performance
such that Black test-takers will score approximately .69 SD lower than White test-takers.
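As an illustration of the comparison implied by Hypothesis 3, the following minimal R sketch computes a standardized mean difference (d) in g-GBA scores between the two subgroups using a pooled standard deviation; the data object and variable names are assumptions for illustration, not the authors' analysis code.

# Hypothetical sketch of the subgroup comparison in Hypothesis 3.
white <- dat$gba_score[dat$race == "White"]   # assumed columns in 'dat'
black <- dat$gba_score[dat$race == "Black"]
pooled_sd <- sqrt(((length(white) - 1) * var(white) + (length(black) - 1) * var(black)) /
                    (length(white) + length(black) - 2))
d <- (mean(white) - mean(black)) / pooled_sd  # Hypothesis 3 expects d near .69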
Where theory does suggest g-GBAs might introduce additional bias is related to a
different set of protected classes, sex and gender, due to associated differences in game-playing
habits. As described by Brown (2017), the base rate of women who play games in the United
States is slightly lower than that of men (48% and 58%, respectively), suggesting less experience
among women, on average, playing video games. However, these rates vary greatly by genre.
The genres that the present g-GBA minigames most closely resemble vary. Whereas some
games more closely resemble classical cognitive tasks, others may be more firmly placed in a
particular genre. Of the set, the game called Shortcuts most closely resembles the puzzle genre;
within that genre, a 72% base rate for women is observed compared to a 52% base rate for men
(Brown, 2017), suggesting that, on average, women have greater puzzle-playing experience than
men. Thus, an overall difference is expected such that men perform better across the minigames,
yet women may regardless perform better on the types of minigames with which they have the
most prior relevant experience, such as those most clearly resembling prototypical puzzle video
games. Beyond experience alone, other knowledge or skills, such as psychomotor ability, could
also influence performance differently by gender across genres. Given the paucity of research on
the relationship between gameplay experience by genre and gameplay performance by genre, the
magnitude and even direction of likely bias across genres is unclear. Thus, we concluded that we could not reasonably predict the particular pattern of gender differences across minigames a priori.
Hypothesis 4. Self-identified males and females will differ in g-GBA performance such
that female test-takers will score lower than male test-takers, on average, across g-GBA tests.
Beyond the concept of adverse impact is test fairness, a legal classification that allows for
the adoption of a test showing adverse impact under certain conditions (Schmidt & Hunter,
1974). In the United States, test fairness is most commonly evaluated by examining differential
prediction of a meaningful criterion across subgroups (Newman et al., 2007). Within the Cleary
(1968) model of fairness, the dominant framework for evaluating fairness in the United States
legal system (Aguinis & Smith, 2007), differential prediction across subgroups is evaluated in
terms of differences in slopes, intercepts, and error variances (Schmidt, 1988). Although Cleary
stated that any such differences suggested an unfair test, more recent interpretations suggest a
test is fair if a common regression line fits all subgroups equally well (Meade & Tonidandel, 2010). For example, if cognitive ability is directly and causally related to job performance,
adverse impact in the prediction of job performance from a cognitive ability measure without
differential prediction could still be considered fair if it accurately predicts criterion differences. Differences in slopes reflect differences in the validity of test scores between subgroups and are the most
legally problematic; intercept differences without slope differences are generally considered fair
(Meade & Fetzer, 2009). As a result, the disentanglement of slope and intercept differences is
generally viewed as the most critical issue in evaluating fairness, and slope differences are used as the primary evidence of unfairness.
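In practice, a Cleary-model analysis of this kind is typically conducted as a moderated regression; the following is a minimal R sketch under assumed variable names (criterion, test_score, group), offered purely as an illustration and not as the authors' analysis script.

# Hypothetical illustration of a Cleary-model differential prediction test.
# 'dat' is assumed to contain a criterion, a test score, and a subgroup factor.
dat$group <- factor(dat$group)
common  <- lm(criterion ~ test_score, data = dat)          # common regression line
intonly <- lm(criterion ~ test_score + group, data = dat)  # adds intercept differences
full    <- lm(criterion ~ test_score * group, data = dat)  # adds slope differences
anova(common, intonly)  # test of intercept differences
anova(intonly, full)    # test of slope differences (most legally problematic)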
In the g-GBA context, if differential prediction were observed, it must arise from one of two sources: either the g construct or an aspect of the GBA method as designed to measure g.
Differential prediction from the g construct itself is unlikely; prior research has demonstrated
that although subgroup intercepts for the prediction of job performance from g differ, slopes
generally do not differ by subgroup unless there are test-specific causes, such as biased questions
(Kehoe, 2002). Thus, any observed differential prediction in the present study is more likely
attributable to the GBA method itself or more indirectly, the development process of the g-GBA
under study. The introduction of varying slopes by racial subgroup for a g-GBA is unlikely in
the context of race for the reasons described earlier regarding test bias, namely that there is no
evidence suggesting racial differences in game experience or game attitudes that would
contribute to greater trait diagnosticity for one race versus another. However, if women’s greater
mean experience with puzzle games was associated with decreased reliability for men, this
would attenuate slopes for the male subgroup, biasing those slopes towards zero. Other designs
that could elicit differential effects by group include the specific abilities being measured, the specific game mechanics employed, and other design artifacts.
Test-taker reactions are an important consideration when introducing new forms of assessments, especially in high-stakes testing contexts such as employee selection
(Hausknecht et al., 2004). Generally, test-taker reactions refer to the “attitudes, affect, or
cognitions” an individual might have about the testing process (Ryan & Ployhart, 2000, p. 566).
Specifically, test developers and researchers are concerned with test-taker perceptions of test
fairness (i.e., distributive justice, procedural justice), test-taker motivation, test-taker anxiety, and
general attitudes of the test-taker toward the test itself (Hausknecht et al., 2004). Test-taker
perceptions and reactions can impact a variety of outcomes, including actual test performance and intentions to accept job offers (Hausknecht et al., 2004), although these downstream effects tend
to be small (Ryan & Huth, 2008). In general, test-taker reactions to g assessments, particularly in
employee selection contexts, are slightly unfavorable, although g assessments vary in this regard
on many design dimensions, including novelty, immersion, features, and time pressure.
We contend that reactions to a g-GBA, if designed well for its audience, should be more
positive than to a traditional g test battery across most reaction dimensions given evidence from
the games literature and research on existing assessments incorporating game elements for five
reasons. First, there is empirical evidence supporting that adding game elements to assessments
can improve reactions. For example, Attali and Arieli-Attali (2015) gamified a computerized
assessment of mathematics knowledge by awarding points for correct answers and speedy
responses, which students found more enjoyable and motivating. Second, some gamified
assessments, which incorporate one or more game elements like animation, sound effects,
instantaneous feedback, varying difficulty, progress bars, and narrative contexts, have been
found to be perceived as face valid by authentic job applicants (Ferrell et al., 2015). Third, video
games more broadly have been found to decrease anxiety; for example, Mavridis and Tsiatsos
(2017) used GBA on a game-based learning platform to decrease test anxiety for graduate
students who reported that they did not feel like they were being tested despite rationally
knowing otherwise. Fourth, because GBAs are more behavioral in nature relative to multiple
choice methods, GBAs should provide an increased sense of an opportunity to perform (i.e.,
procedural justice). Fifth, animation and instant feedback on performance should improve the
sense of interpersonal warmth that traditional g assessments tend to lack (Anderson et al., 2010).
Method
The complete experimental protocol and analytic plan for this study were originally
submitted for ethics review to and deemed exempt by the Old Dominion University College of
Assessment"). A secondary protocol covering access and analysis of data collected under that
exemption was later submitted and deemed exempt by the University of Minnesota
author.
Participants
Academic Sample
Undergraduate students were recruited from a large public university in the middle-
Atlantic region of the United States for this study. Participants were recruited through traditional university-wide in-person recruiting, which included live recruiting at the campus quad, and through the Psychology department research participant pool. From university-wide recruiting, 394 people
volunteered and participated in the study. Each was compensated US$20 for approximately 2
hours of effort. Additionally, 428 students signed up through the psychology department research
participant pool and were compensated with course credit. Of these 822 students who began the study, data screening procedures recommended by Meade and Craig (2012) were applied. First, three bogus items were included
throughout the test (i.e., 82.7% correctly answered Disagree or Strongly Disagree to “I have
traveled to the Moon,” 84.5% to “I was on board the Titanic,” and 84.3% to “Select the option
that is at the left end of the scale for this question”). Second, participants responded to a
question directly asking if they had put an honest effort into all parts of the study and tried to
follow instructions, which was endorsed by 91.8% of participants. Excluding any case that
failed at least one of these four tests would have removed 29% of cases, so we adopted a slightly
less stringent criterion by excluding only cases that failed at least two of these tests, which
eliminated only 14% of cases. We also post-hoc compared our statistical tests between these
approaches, finding few differences. Ultimately, 633 cases (84.5% of valid participants; 76.6% overall) were retained for analysis.
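To illustrate the exclusion rule described above, a minimal R sketch follows; the column names are hypothetical and this is not the authors' screening script.

# Hypothetical illustration: a case is excluded if it fails at least two of the
# four checks (three bogus items plus the honest-effort item). Each assumed
# column in 'dat' is coded 1 = failed the check, 0 = passed.
fail_cols <- c("failed_bogus1", "failed_bogus2", "failed_bogus3", "failed_effort")
n_failed  <- rowSums(dat[, fail_cols])
retained  <- dat[n_failed < 2, ]   # keep cases failing fewer than two checks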
In this final sample, the mean age was 21.35 years (SD = 4.14). 42.0% self-reported as
biracial, and 17.2% reported another race or combination of races. 41.9% self-reported as male,
57.7% as female, 0.2% as transgender and 0.3% as other. 10.3% self-reported full-time
Organizational Sample
Employees of a single multinational organization, working across several small workgroups in Brazil, Canada, Colombia, India, Mexico, and the United States,
participated in a concurrent validation study. Due to privacy requirements within the company
for this type of data sharing, a specific breakdown by country was not made available to the
research team. Participation was voluntary for all employees, and the data collection process was conducted online. In total, 127 employees completed at least one mini-game. Within this group, supervisor ratings of
performance were available for 49 (41.5%). According to the organization sharing data,
supervisory ratings were missing for two reasons: 1) employees had either not worked at the
organization long enough to have participated in the annual review process or 2) less
commonly, their manager was noncompliant and chose not to participate in the mandatory
annual review process for any of their direct reports. This provides some rationale to support that
missingness in this sample was at random and not reflective of the characteristics of individual
employees. Additionally, the organization reported that they did not use highly cognitively-
loaded measures for either employment or promotion, suggesting only weak indirect range
restriction in cognitive ability in the sample. Within this sample, the mean age was 24.69 years
(SD=2.15), ranging from 22 to 32 years, and 28 (59.3%) were female. Employees had worked
Measures
outcome measure was captured. In the organizational sample, gender, race, and age were
traditional cognitive ability test battery, a reactions battery, and a demographics questionnaire
Specific cognitive ability tests were chosen from Stanek and Ones’ (2018) compendium of cognitive ability measures.
Visual processing. The first test, intended to measure visual processing, was the General
Aptitude Battery’s test of Spatial Aptitude (Hunter, 1980). In this test, participants were
presented with a series of figures illustrating a flat piece of paper with dashed lines indicating
fold points. Participants were asked to identify which of four three-dimensional shapes the flat piece of paper would look like when folded. This test had a 4-minute time limit and 10
items. Scores were the number of correct answers. Due to negative skew, the scores from this
test were reversed and logarithmically transformed. Coefficient alpha of the untransformed
Fluid reasoning. The second test, also adopted from the Educational Testing Service’s
Kit of Factor-Referenced Cognitive Tests (Ekstrom et al., 1976; Widaman, 1982), was the
locations test. In this test, participants were presented with five rows of dashes and gaps between
the dashes. In each of the first four rows, an X was placed over one of the dashes, creating a
pattern from row to row. Participants were asked to indicate which of five dashes in the fifth row
would need to be marked with an X to match the pattern on the other rows. This test had a 6-minute time limit.
Processing speed. Part 1 of the Chicago Non-Verbal Examination (Brown et al., 1936)
was included in the cognitive battery to assess processing speed. This test is a digit symbol test
(Salthouse, 1996; Conway et al., 2002), in which participants were given 12 different symbols
matched with the numbers 1 through 12 as a legend. Each item was one of the twelve symbols
from a continuously visible legend, and participants had to identify which number was associated
with each symbol presented as quickly as possible. This test lasted 2 minutes and 30 seconds and
Quantitative ability. The quantitative ability test was the General Aptitude Battery’s test
of Quantitative Reasoning (Hunter, 1980). It included 5 quantity comparison items in which two
quantities were presented. Participants were asked to determine whether quantity A was larger,
quantity B was larger, the two quantities were equal, or if the relationship between the two could
not be determined. This test had a 5-minute time limit. Coefficient alpha was .37.
Verbal ability. The verbal ability test was a set of practice questions from the verbal section of the Graduate Record Examination (Conrad et al., 1977; Burton et al., 2009; Kuncel et al., 2001; Kuncel et al., 2010), which involved 6 sentence-completion items with one or two
blanks. Participants selected one or two words from three to five choices available to complete
the sentence logically. This test had a 6-minute time limit. Due to positive skew, the verbal test
was logarithmically transformed. Coefficient alpha of the untransformed items was 0.60.
The focal g-GBA of this study was an early version of Cognify, which was developed using the
design theory outlined in the introduction of the present paper. Cognify as used in this data
collection effort consisted of seven distinct web-based minigames intended to assess general
cognitive ability by targeting combinations of second-stratum abilities within the CHC model.
Screenshots from each of these minigames appear in Figure 3. All games relied on user input via
pointing and clicking with a mouse. The entirety of the g-GBA took about 20 minutes to
complete, with each game ending automatically after a certain amount of time, creating some
processing speed requirements across all games, although the intensity of this requirement varied
by game. Many test-takers did not finish all possible levels and items within each game, as
expected during a speeded test. In the development of these games, it was recognized early in the
development process that targeting specific second-stratum abilities while ignoring others was
not feasible from a game development perspective without sacrificing desirable game mechanics,
so overlap was permitted. The ultimate specific abilities theoretically targeted varied as indicated
below.
Numbubbles (Gq/Gv). In Numbubbles, players were presented with a target value (e.g.,
7) and then a sequence of bubbles with each bubble containing a numerical formula (e.g., 3 + 4
or 12 x 2). The bubbles had a short lifespan of a few seconds before disappearing. The goal of
the game was to “pop” the bubbles that equaled the target value before they disappeared, while avoiding popping bubbles that did not equal the target value, across 10 rounds of varying difficulty and a 20-second time limit per round. Numbubbles was scored as the number of correct
pops, minus the number of incorrect pops, weighted by average time to a correct pop.
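As a concrete illustration of this scoring rule, the following minimal R sketch assumes a simple inverse-time weight; the exact weighting function used in Cognify is not reported here, so the function below is an assumption for illustration only.

# Hypothetical sketch of the Numbubbles scoring rule described above:
# (correct pops - incorrect pops), weighted by average time to a correct pop.
score_numbubbles <- function(correct_pops, incorrect_pops, mean_time_to_correct) {
  (correct_pops - incorrect_pops) * (1 / mean_time_to_correct)  # assumed weighting
}
score_numbubbles(correct_pops = 18, incorrect_pops = 3, mean_time_to_correct = 2.4)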
Figure 3. Cognify g-GBA minigames. Clockwise from top-left: Proof It, Tally Up, Resemble,
Numbubbles, Shortcuts, Colour Pop, Grid Lock. Screenshots of Shortcuts and ColourPop are
Resemble (Gf/Gv). In Resemble, players re-created a pattern, shown on one side of the
screen, by dragging puzzle elements onto a grid shown on the opposite side of the screen. Puzzle
elements, once dragged onto the grid, could be rotated, and solving each problem required
dragging the correct pieces to the correct positions and applying correct rotations. The test-taker
had a total of three minutes in which to complete as many puzzles as possible, up to a total of
nine. The timed nature of the puzzle also introduced a speed component. Resemble is the
revision of an earlier developed game which was itself inspired by the Block Design tests of the
Wechsler Adult Intelligence Scales. Resemble was scored as the number of game levels
completed.
Grid Lock (Gf/Gv). In Grid Lock, players assembled a set of puzzle pieces to mimic a
larger novel shape. There were nine rounds of varying difficulty and an overall three-minute time
limit. The size of the grid and the number of pieces that needed to be placed increased over time,
and in later levels, grid components needed to be rotated. Grid Lock was scored as the number of rounds completed.
Proof It (Gc/Grw). In Proof It, players were asked to identify, by tapping them, as many textual errors as possible in a set of sample texts across five rounds, with 5 minutes permitted across all rounds. Incorrectly tapping three different error-free locations resulted in progression to the next round. No points were deducted for incorrect taps. Proof It was scored as the number of errors correctly identified.
Tally Up (Gq/Gv). In Tally Up, players were presented with two sets of tokens across 35
rounds of varying difficulty with a time limit of 5 seconds per round. In each round, players were
asked to quantify the number of tokens on two sides of the screen and identify their relationship
to each other, which became more complex as the game progressed. It resembled Quantitative
Comparison questions from the Graduate Record Examination, with the addition of token value
modifiers and per-item time pressure. Tally Up was scored as the number of rounds with correct
responses.
Colour Pop (Gf). In Colour Pop, players completed a gamified version of a Stroop
(1935) task in which they observed a grid of colored words and were asked to identify all tiles
with words that matched the color of the target, e.g. red, irrespective of the color of the tiles.
There were 20 rounds which varied by the number of valid answers and the proportion of word
and color mismatches. The time limit for each round was 4 seconds, with an overall time limit of
approximately 2 minutes. Game elements most noticeably included spinning tiles, progress bars,
star counts and both audio and visual feedback mechanisms. Colour Pop was scored as the total
number of correct pops minus the number of incorrect pops minus the number of misses.
Short Cuts (Gf/Gq/Gv). In Short Cuts, players determined the shortest path to roll a
blue marble to a goal across seven puzzles over a maximum of four minutes. Each puzzle
consisted of a network of paths, with the distance of each path denoted by a numerical value,
where higher values indicated greater distance. Determining the shortest path required the user to
plan a path forward while considering multiple interacting factors, the complexity of which
varied across puzzles. Short Cuts was scored as “distance traveled,” which quantifies the number of points “spent” to solve all puzzles given all path values chosen. Final scores were reversed so that higher values reflected better performance.
Reactions
Seven different reaction measures were collected once for the GBA and once for the
traditional g battery across two broad categories. Each set of reactions measures was
administered immediately upon finishing either the traditional cognitive ability battery or the g-
GBA and adapted to inquire either about the “test” or the “game”. Intrinsic motivation and test anxiety were assessed with 7-point agreement scales; all others were assessed with 5-point agreement scales.
Affective outcomes. Intrinsic motivation was measured using the Interest/Enjoyment subscale of the Intrinsic Motivation Inventory (Ryan & Deci, 2000). A sample item was “These tests (games) were fun to do”. Test motivation was measured with a 5-
item scale (Arvey et al., 1990). An example item was, “I thought the tests (games) were fun.”
Test anxiety was measured using a 6-item scale (Arvey et al., 1990). An example item was “I
probably wouldn’t do as well as most other people who took these tests (played these games).”
Attitudinal outcomes. Distributive justice was measured using a 3-item scale (Smither et
al., 1993). An example item was “The test (game) results would accurately reflect how well I
performed on the examination (in the games)”. Procedural justice was measured using a 4-item
“chance to perform” measure (Bauer et al., 2001). An example item was “I could really show my
skills and abilities through these tests (games).” Job relatedness was measured with a 2-item
measure (Bauer et al., 2001). An example item was “It would be clear to anyone that these tests
(games) are related to the job.” Test propriety was measured using a 3-item measure (Bauer et
al., 2001). An example item was, “The content of the tests seemed appropriate.”
Outcome Variables
Outcome variables differed by sample. In the academic sample, GPA was captured from
historical university records. In the organizational sample, each employee’s most recent annual
supervisory ratings of job performance were collected which asked supervisors to assess the
extent to which employees met their goals set at the beginning of the previous year on a 4-point
scale. Supervisors were not required to match a predefined rating distribution; historically, approximately 67% of employees were rated "Fully Meets Expectations," less than 25% "Exceeds Expectations," and the remainder "Partially Meets Expectations." These ratings had been used historically and were actively being used by the organization.
Demographics
Basic demographic information was collected from the academic sample including age,
gender, race, ethnicity, employment status, and contact information for participant compensation.
In the organizational sample, only age, gender, race, and ethnicity were collected since all participants were current employees of the organization.
Procedure
Academic Sample
Two university computer labs were identified on campus for use in this study. The two
labs were chosen because they were verified to contain computers with sufficient processing
power to run the g-GBA at the intended speed and were in designated quiet spaces on campus.
As a result, all participants used nearly-identical computers in highly similar physical spaces.
Participants were instructed via email to use one of these computers and to navigate to a specific
webpage. Once on that webpage, participants reviewed informed consent documentation and
entered their student identification number before proceeding with the study via the Qualtrics
survey platform. The overall research design was a two-cell within-subjects design
counterbalanced by test completion order. Specifically, participants were assigned to either take
the g-GBA first or the traditional cognitive battery first but ultimately completed both. After
each test battery, participants completed a reactions measure battery. After both assessments and their associated reactions measures were completed, participants provided demographic information.
Organizational Sample
Employees at the participating organization were contacted and asked to complete the g-GBA and to release access to their performance data for
research purposes. Their supervisors were also contacted and asked to encourage their
employees to participate.
Results
Results in this section can be reproduced using the anonymized dataset and R (R Core Team, 2020).
Measurement
To evaluate the convergence of latent GBA performance with latent g, a series of confirmatory factor analyses (CFA) and larger structural equation
models (SEM) were fitted in sequence to test specific assumptions and measurement hypotheses.
Approximate standards for the evaluation of model fit were taken from Hu and Bentler (1999; p > .05, CFI > .96, RMSEA < .06, SRMR < .08). Absolute and relative fit indices were jointly interpreted given limitations in both (Marsh et al., 2009), and both compliance with and violations of these cut-offs were interpreted holistically rather than as strict decision rules (Markland, 2007). Because the size of the organizational sample was too small to
support CFA, all measurement analyses were conducted on the academic sample. A correlation
matrix calculated from the data used for these analyses appears in Table 1.
First, a CFA was conducted to determine if g was adequately measured by the five
specific ability tests identified. As shown in Figure 4, all absolute and relative fit index standards were achieved (χ²(5) = 5.81, p = .325, CFI = 1.00, RMSEA = .02,
SRMR = .02) and factor loadings were comparable to expected effect sizes given prior research
utilizing CFA on g measurement using specific ability tests (see Tucker-Drob & Salthouse,
2009); thus, we concluded that latent g was effectively represented. We also regressed GPA on
latent g, finding a standardized weight of .28, nearly equal to previously reported meta-analytic means.
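As an illustration of this step, the following is a minimal sketch in R using the lavaan package as one possible tool; the data frame dat, the columns test1 through test5 standing in for the five specific ability tests, and gpa are placeholder names rather than the variables used in the actual study.

library(lavaan)

# One-factor CFA: latent g indicated by the five specific ability tests
cfa_g <- 'g =~ test1 + test2 + test3 + test4 + test5'
fit_g <- cfa(cfa_g, data = dat, std.lv = TRUE)
fitMeasures(fit_g, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))

# Structural model regressing GPA on latent g to obtain the standardized weight
sem_gpa <- '
  g =~ test1 + test2 + test3 + test4 + test5
  gpa ~ g
'
fit_gpa <- sem(sem_gpa, data = dat)
standardizedSolution(fit_gpa)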
Next, latent game performance was modeled with CFA. In this case, however, we a priori
suspected that a traditional CFA, which is based upon the assumptions of classical test theory,
would demonstrate poor model fit. In the case of g measurement, commonly used tests generally
comply well with these assumptions; for example, each item on a well-constructed cognitive
ability test can generally be considered a randomly drawn item from a potential universe of items
assessing that ability. In the case of the g-GBA, by contrast, the development team was aware of residual cross-loadings
during development but could find no way to eliminate them entirely without removing many of
the “gameful” aspects of the experience. Knowing this a priori, we conducted this step of
analysis interactively, using diagnosis of modification indices, freeing notable item residual
covariances, and observing associated changes in fit. The initial model, a standard one-factor
CFA with uncorrelated errors, showed slightly poor fit as expected (χ²(14) = 55.63, p < .001, CFI = .92, RMSEA = .07, SRMR = .04). We next used modification indices to free the item residual covariance contributing the greatest misfit, one model re-fit at a time, until adequate fit was achieved on all indices, which occurred after freeing 2 of the 21 available covariances, as shown in Figure 5 (χ²(12) = 17.37, p = .14, CFI = .99, RMSEA = .03, SRMR
= .02), which was theoretically consistent with gameplay as described earlier; whereas both
Resemble and Gridlock required the mental manipulation of presented visuals, both Tally Up and
Colour Pop required quick quantitative reasoning. Nevertheless, the use of modification indices necessarily led to a model overfitted to some degree. To gauge the effect of these modifications, we conducted our later test of Hypothesis 1 both with and without these modifications.
Table 1
Correlation matrix
Figure 5. Final confirmatory factor analysis of latent GBA performance. All estimates are standardized.
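The iterative use of modification indices described above can be sketched as follows, again in R with lavaan and placeholder minigame column names; the two freed residual covariances shown correspond to the pairs described in the text, but the sketch is illustrative rather than the authors' actual syntax.

# One-factor CFA of the seven minigame scores (column names are placeholders)
cfa_game <- 'game =~ numbubbles + proof_it + resemble + gridlock + tally_up + colour_pop + short_cuts'
fit_game <- cfa(cfa_game, data = dat, std.lv = TRUE)

# Inspect the largest modification indices among residual covariances
mi <- modificationIndices(fit_game, sort. = TRUE)
subset(mi, op == "~~")

# Free one residual covariance at a time and re-fit until fit is adequate
cfa_game2 <- paste(cfa_game,
                   "resemble ~~ gridlock",
                   "tally_up ~~ colour_pop",
                   sep = "\n")
fit_game2 <- cfa(cfa_game2, data = dat, std.lv = TRUE)
fitMeasures(fit_game2, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))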
Finally, to enable our formal test of Hypothesis 1, we combined both CFAs into a single
SEM predicting latent GBA performance from latent g as measured by the traditional g test
battery. Slightly poor fit was observed (χ²(51) = 129.94, p < .001, CFI = .93, RMSEA = .05,
SRMR = .04), so modification indices were again examined, which revealed several potential
changes to improve model fit, but none were clearly theoretically consistent with the nature of
gameplay as in the previous model fitting. We nevertheless attempted to fit additional models
based upon the largest suggested freed disturbance covariances to observe the effects, but these
revised models did not meaningfully affect the core effect size of interest between g and game
performance. Thus, we decided to proceed with interpretation of the unmodified model to test Hypothesis 1; the exploratory re-fitting just described was only done to enable the following confirmatory test of the convergence between latent game performance and latent g.
In the final model, as shown in Figure 6, the relationship between latent game
performance and latent g was equal to .97, suggesting near-unity between the latent constructs
underlying both the cognitive ability test battery and the GBA's suite of minigames; only 6.9% of the variance in latent game performance was left unexplained. To test the hypothesis of unity formally, we created a new model constraining the variance of the
disturbance of latent gameplay to zero and compared these two nested models with a chi-squared
difference test. We found no significant difference between models (χ²diff = 131.09 - 129.94 =
1.15; p = .716), suggesting the simpler model constraining the relationship to unity should be
retained. Thus, we concluded that the latent game performance construct underlying
performance across Cognify minigames was in fact g. Hypothesis 1 was therefore supported.
As a post-hoc test of model robustness, we also re-ran the model shown in Figure 6 constraining
all mini-game disturbance covariances to zero, finding worse fit but no meaningful change to the
test of H1 (standardized path = .95; χ²(53) = 186.11, p < .001; CFI = .90, RMSEA = .06, SRMR = .04).
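A sketch of the combined structural model and the nested test of unity, using the same placeholder variable names as before; the constraint fixes the disturbance variance of latent game performance to zero, and the two models are compared with a chi-squared difference test.

# Combined SEM: latent game performance regressed on latent g
sem_unity <- '
  g    =~ test1 + test2 + test3 + test4 + test5
  game =~ numbubbles + proof_it + resemble + gridlock + tally_up + colour_pop + short_cuts
  resemble ~~ gridlock
  tally_up ~~ colour_pop
  game ~ g
'
fit_free <- sem(sem_unity, data = dat)

# Constrain the disturbance variance of latent game performance to zero
fit_constrained <- sem(paste(sem_unity, "game ~~ 0*game", sep = "\n"), data = dat)

# Chi-squared difference test between the nested models
anova(fit_free, fit_constrained)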
Performance Prediction
The finding that latent game performance was indistinguishable from latent g precluded
the need to assess the correlation between latent game performance and outcomes. Because
latent game performance on Cognify and latent g are in effect the same construct,
mathematically speaking, the relationship of latent game performance vs. GPA and latent g vs.
GPA must be extremely similar in magnitude if modeled using SEM. Thus, it was known a
priori that Hypotheses 2a and 2b would be supported using this approach. Given this, we instead
turned our attention to ways that these tests might be operationalized. Specifically, because we
had no a priori hypotheses about specific games, we chose to focus the remainder of our
hypothesis testing on unit-weighted composite scores, which addresses more practical concerns. Latent variable scores are rarely used in operational selection, whereas unit-weighted composites are common and easily applied (Oswald et al., 2014). Thus, to provide additional support for Hypotheses 2a and 2b, we examined the prediction of criteria from unit-weighted composites of each battery.
In the case of predicting GPA in the academic sample, Hypothesis 2a was supported.
The relationship was positive (r = .16 [.08, .24], p < .001), although somewhat smaller in magnitude than the cognitive ability test composite's relationship (r = .22 [.15, .30],
p < .001).
Figure 6. Final model predicting latent GBA performance from latent g, N = 633
Hypothesis 2b was also supported; supervisor ratings of job performance were also predicted (r = .29 [.00, .53], p = .047). In both cases, effect sizes were well within credibility
intervals and at similar magnitude as mean meta-analytic estimates previously observed for these
relationships (cf. Richardson et al., 2012, who found a mean observed correlation with GPA
across studies of .20; Hunter, 1984, who found a comparable mean observed correlation across studies with supervisory ratings of job performance). To address Research Question 1, we next conducted regression analyses to determine incremental prediction of GPA of each test battery composite score over
the other. These analyses appear in Table 2. Both the traditional test battery composite and the
GBA composite predicted the criterion. However, incremental prediction was only observed in
one direction, of the traditional test composite beyond the GBA. In combination with the results
from the test of Hypotheses 1 and 2, it appears that although latent g is reflected similarly in both
the traditional test battery and the GBA, the GBA contains additional criterion-irrelevant
information not contained within the traditional test battery that is attenuating its relationship
with the GPA criterion. Thus, it seems that although the GBA is clearly a g measure, it is a less
"pure" measure than the traditional cognitive ability test battery. The precise source and nature of this additional variance remain unknown.
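These composite-level analyses can be sketched as follows, assuming unit-weighted composites formed as means of standardized scores; the column names remain placeholders.

# Unit-weighted composites: means of standardized test and minigame scores
trad_cols <- c("test1", "test2", "test3", "test4", "test5")
gba_cols  <- c("numbubbles", "proof_it", "resemble", "gridlock",
               "tally_up", "colour_pop", "short_cuts")
dat$comp_trad <- rowMeans(scale(dat[, trad_cols]))
dat$comp_gba  <- rowMeans(scale(dat[, gba_cols]))

# Zero-order relationships with GPA
cor.test(dat$comp_gba, dat$gpa)
cor.test(dat$comp_trad, dat$gpa)

# Incremental prediction in each direction
anova(lm(gpa ~ comp_gba, data = dat),  lm(gpa ~ comp_gba + comp_trad, data = dat))
anova(lm(gpa ~ comp_trad, data = dat), lm(gpa ~ comp_trad + comp_gba, data = dat))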
To test Hypothesis 3, we compared GBA composite scores of the 184 Black and 275 White test-takers in the academic sample with a Welch two-sample t-test. The difference was statistically significant (t(409.34) = -8.08, p < .001, d = -0.77 [.57, .96]) and the predicted value of .69 from the Roth et al. (2001) meta-analysis was well within its confidence
interval, supporting Hypothesis 3. As a post-hoc test to triangulate upon this result, we next
examined racial differences in the g test composite scores, to see if this same pattern of findings
was observed. In this test, we observed a larger difference (t(427.77) = -10.23, p < .001, d = -
0.95 [.75, 1.15]), 0.19 standard deviations larger for the composite score calculated from the
traditional g test battery than from the composite score calculated from the GBA.
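The group-difference tests reported above can be sketched as Welch t-tests with an accompanying standardized mean difference; race is assumed to be a factor variable, and the effsize package is shown as one option for computing d.

# Welch two-sample t-test of Black-White differences on the GBA composite
bw <- droplevels(subset(dat, race %in% c("Black", "White")))
t.test(comp_gba ~ race, data = bw)   # Welch test is the default (var.equal = FALSE)

# Cohen's d with a 95% confidence interval
library(effsize)
cohen.d(comp_gba ~ race, data = bw)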
To test Hypothesis 4, we compared GBA composite scores between the 265 male and 365 female participants in the academic sample with another Welch t-test. The difference was statistically significant (t(499.99) = 5.72, p < .001, d = -0.48 [.32, .64]),
supporting Hypothesis 4. As with our test of Hypothesis 3, we post-hoc compared this to gender
differences on the g test composite, this time finding an effect slightly smaller than for that of the
GBA (t(535.77) = 3.78, p < .001, d = -0.31 [.15, .47]). Given the statistically and practically significant gender effect on the traditional test battery, it appears that most of the observed gender differences are attributable to the population studied rather than to the GBA method itself. To further explore Research Question 2, we examined gender differences on individual minigames, and these results appear
in Table 3. Gender differences disadvantaging female participants appeared for five games, with
effects ranging from small to large, whereas the other two games showed no advantage for either
men or women.
A nested regression approach to evaluating differential prediction was used for both racial and gender fairness. The results are displayed in Tables 4 and 5, respectively. In each Model 1, the criterion is regressed upon the composite; in each Model 2, upon class membership; and in each Model 3, upon the composite, class membership, and their interaction term. Using this method, up to three comparisons between nested models are conducted to determine fairness, as described below.
Table 2
Criterion-related validity
*p<.05 **p<.01
Table 3
Minigame M SD M SD t df p d LL UL
Colour Pop 0.68 0.38 0.68 0.39 0.09 579.47 .929 0.01 -0.17 0.15
Numbubbles 14.12 4.40 11.11 3.73 -9.01 510.98 <.001 -0.75 0.58 0.91
Resemble 5.74 1.92 4.74 1.80 -6.71 546.10 <.001 -0.55 0.39 0.71
Proof It 32.08 10.35 32.97 10.33 1.07 568.31 .286 0.09 -0.25 0.07
Short Cuts 140.46 41.92 132.86 43.25 -2.22 578.98 .027 -0.18 0.02 0.34
Gridlock 5.44 1.41 4.15 1.22 -2.75 519.01 .006 -0.23 0.07 0.39
Tally Up 22.73 4.02 21.56 3.69 -3.73 538.95 <.001 -0.31 0.15 0.46
N = 630; Female participant N = 365, Male participant N = 265; LL and UL are lower and upper limits of 95% confidence intervals surrounding d.
Table 4
Comparisons of Black-White differential prediction for cognitive ability tests versus GBAs
Table 5
Comparisons of female-male differential prediction for cognitive ability tests versus GBAs
First, Model 1 is compared to Model 3 to determine whether any differential prediction is present. Second, if that comparison is statistically and practically significant, either slope differences or slope and intercept differences in combination are present. Third, if the second step revealed slope differences, an additional test can be conducted to determine if intercept differences were observed in addition to slope differences. As shown in Table 4, which concerns racial differences, both the traditional
test battery and GBA met the fairness standard; both exhibited differential prediction by intercept
but not by slope. In Table 5, which concerns gender differences, the same pattern emerged.
Thus, it appears that the GBA is a fair test, across both classes of interest in this research
question. Additionally, this evidence supports the above-stated conjecture that gender
differences in the GBA may be mostly attributable to study population characteristics rather than
to GBA characteristics; it does not appear as if using a GBA removed existing construct-related
intercept differences.
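A sketch of the nested regression models, with group standing in for the two-level class variable (race or gender); the particular sequence of comparisons shown is one standard way of separating overall, intercept, and slope differences and is illustrative rather than the authors' exact procedure.

# Model 1: composite only; Model 2: class membership only; Model 3: both plus interaction
m1 <- lm(gpa ~ comp_gba, data = dat)
m2 <- lm(gpa ~ group, data = dat)
m3 <- lm(gpa ~ comp_gba * group, data = dat)

# Omnibus test of any differential prediction (Model 1 vs. Model 3)
anova(m1, m3)

# Intercept differences: does class membership add beyond the composite?
anova(m1, update(m1, . ~ . + group))

# Slope differences: does the interaction add beyond composite and class membership?
anova(update(m1, . ~ . + group), m3)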
Reactions
Paired-samples t-tests were conducted comparing reactions to the GBA and the traditional g test battery. To address missingness within the reactions data, single imputation was used (i.e., Amelia II; Honaker et al., 2011). These results are presented in Table 6. Universally, the GBA was preferred to the g test battery across all measures.
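These comparisons can be sketched as follows, assuming a data frame reactions of numeric scale scores with placeholder columns such as im_gba and im_test for the two administrations of the intrinsic motivation scale.

# Single imputation of missing reaction scores using Amelia II
library(Amelia)
imp <- amelia(reactions, m = 1)
r1  <- imp$imputations[[1]]

# Paired t-test for one scale (intrinsic motivation shown as an example)
t.test(r1$im_gba, r1$im_test, paired = TRUE)

# Cohen's d standardized by the average of the two variances (one of several options)
d <- (mean(r1$im_gba) - mean(r1$im_test)) /
     sqrt((var(r1$im_gba) + var(r1$im_test)) / 2)
d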
Discussion
This study is the first to integrate current theories of game design, taken from human-computer interaction and related fields, into a framework for the identification and development of GBAs. With a g-GBA built from that
theory for use in hiring decisions, we rigorously explored its measurement characteristics,
predictive accuracy, fairness, and reactions. Most centrally, we have demonstrated that a g-GBA likely can be designed and developed (i.e., engineered) to meet the same conceptual and psychometric standards (Sackett et al., 2017) as other more traditional assessments of those same constructs, and that participants preferred this g-GBA to the traditional battery. This furthermore provides a theoretically and empirically supported design process for creating new theory-driven GBAs.
Table 6
Measure α M SD α M SD t p d LL UL
Motivational
Intrinsic Motivation .92 4.52 1.42 .92 3.03 1.56 20.34 <.001 1.00 0.88 1.11
Test Motivation .80 3.78 0.56 .85 3.70 0.69 3.03 .003 0.13 0.05 0.22
Test Anxiety .83 2.82 0.88 .80 3.07 0.89 -8.04 <.001 -0.29 -0.36 -0.22
Attitudinal
Distributive Justice .68 3.21 0.84 .74 2.96 0.94 6.37 <.001 0.28 0.19 0.37
Procedural Justice .88 2.77 0.96 .90 2.49 0.98 8.22 <.001 0.30 0.23 0.37
Job Relatedness .64 3.97 0.75 .67 3.78 0.84 5.60 <.001 0.23 0.15 0.32
Test Propriety .83 2.67 1.03 .83 2.44 1.01 5.69 <.001 0.22 0.14 0.30
Note. t-tests and Cohen's d are calculated for paired comparisons. Positive t and d indicate greater scores for the GBA. The first set of α, M, and SD columns refers to the g-GBA and the second to the traditional test battery. LL and UL are lower and upper limits of 95% confidence intervals surrounding d.
It should not be inferred from this study, however, that GBAs as a class of methods are uniformly well designed or valid.
Current GBA vendors vary in their reliance upon both design theories and psychological
theories, and existing GBAs also vary widely in their implementation of specific game elements.
For example, whereas Cognify does not heavily integrate any sort of story, narrative, or fantasy
elements, these are commonly implemented in other GBAs currently in use in organizations to
unknown effect. Further research is needed on a much broader range of GBAs and GBA design
strategies before any firm conclusions can be drawn about GBAs in general. Much like Arthur
and Villado (2008), we emphasize the critical differences between methods and predictors in the
space of employee assessment; because GBA is a method, the specifics of design and
implementation are critical to understanding its best role in employee selection and should not be
ignored. Much as a survey measure can be well-designed or not in relation to its measurement
goals, so can a GBA. Further research is needed to understand if the GBA design theory
proposed here can serve as a foundation for high quality GBAs across constructs and contexts.
Even if so, more nuanced design theory will likely be needed for different measurement purposes and contexts. Just as psychometrics was born of a need to better apply statistics to the measurement of latent
psychological constructs (Buckhalt, 2002), new domain-embedded GBA design theories may be
needed to develop the highest quality psychometric assessments appropriate for selection
contexts (Ployhart et al., 2017). Thus, future research on GBA must explicitly consider and
explore design and development processes (Landers & Marin, 2021) in any GBA being
evaluated.
What we can safely conclude given the present results is that the design process studied
here resulted in a GBA of similar psychometric quality to a traditional g test battery. This draws
a theoretical distinction between a g-GBA’s latent performance construct and the only other
latent game performance construct in the research literature, gVG. Whereas Quiroga et al.
(2019) hand-picked a selection of commercially available video games to best reflect g, the
present study demonstrates how a design and development process can be used to create a novel
g-GBA for employee selection. In doing so, we also found a stronger relationship between latent
game performance and g than did Quiroga et al. (2019), with an estimate more similar to Quiroga
et al.'s (2015) result when focusing upon "brain training" Nintendo games. The difference in
results between these studies, especially in contrast to the present study, suggests that design
characteristics like the ones studied here are likely critical to understanding why and how GBAs
can measure traits. Although outside the scope of the present work, an interesting possibility
raised by Quiroga et al.’s work is the existence of a true gVG across all possible video games, a
set of skills or abilities associated with success in video games broadly. We encourage
researchers to continue down this theoretical path, as it might shed additional light on potential
trait confounds when using video games for measurement of any psychological construct.
The observation of adverse impact by race of similar magnitude as traditional g tests was
as predicted but was also disappointing. The idea that GBA somehow “removes” bias is a
common assertion among some GBA proponents in industry (e.g., Hak, 2019) which this study
directly informs. This also provides context for the approach if not the rhetoric of many GBA
vendors; for example, one of the largest GBA vendors, Pymetrics, claims its GBA to be “bias-
free”, explaining “we use a reference set of tens of thousands of people to check for any potential
biases, and we deweight inputs in our model until we produce a bias-free algorithm that is
compliant with the 4/5ths rule” (pymetrics.com, 2019). Thus, rather than their GBA somehow
“removing” adverse impact through some design tactic, it is done post-hoc by reducing the
influence of or dropping individual predictors showing adverse impact in the machine learning
algorithms that they develop. This reaffirms that inclusion of scores from cognitively loaded
tests within a selection battery, at least given the world’s current social and economic state, will
generally lead to adverse impact by race (Kuncel & Hezlett, 2010). GBA appears to neither solve nor worsen this problem.
Of greater concern was the observation of adverse impact by gender. Although the mean
effect across minigames disadvantaged women, most of this difference was also reflected in gender differences in the traditional cognitive ability test battery (dGBA = -0.48 vs. dTrad = -0.31).
We suspect the remaining gender difference (d = -0.17) is attributable to differences in the
visual-spatial nature of gameplay in some of the games, given prior work suggesting gender
differences in spatial abilities (Voyer et al., 1995). Specifically, the games in which women did
worse were also more heavily visual-spatial in their gameplay than the games demonstrating
parity. Because visual-spatial ability was represented in the traditional g battery through only a single test, this may have led to the observed difference in gender differences between
the traditional battery and the g-GBA. Although a new composite could be created in the present
dataset utilizing only those tests showing no difference, this would capitalize upon chance to
some degree, and any observed lack of gender effect of such a composite would be of unknown
generalizability. Most importantly, even with the existing set of minigames, there was no
evidence of differential prediction of GPA by gender, by either slope or intercept; thus, despite
the observation of adverse impact, the prediction of GPA from both the traditional cognitive ability test battery and the GBA appears comparably fair for men and women.
These results in combination with our analysis of RQ1 raise new theoretical questions
about g-GBA test construction. Specifically, because there was incremental prediction of GPA
by the cognitive ability test composite beyond the g-GBA test composite but not the reverse, this
suggests that although the GBA composite score contains the same information about g that the
cognitive ability test battery composite does, it is also contaminated to a degree by gender-
relevant (and g-irrelevant) variance; in short, there is evidence of construct contamination but not deficiency. Future research should explore which game mechanics, dynamics, and aesthetics are most likely to exacerbate gender differences in the measurement of cognitive ability and, conversely, what might be done to remove them. Assuming
that the pattern of gender differences among minigames was due to legitimate differences in the specific abilities assessed, alternative approaches to forming predictor composites could be used to reduce these gender differences (Sackett & Ellingson, 1997). Future research
should therefore also explore how scores are best used in practice to make actual selection
decisions, and if such strategies have other unintended consequences, such as decreased validity.
Although reactions to the GBA were universally more positive than to the g test battery,
effect sizes varied and were generally small to moderate. Whereas intrinsic motivation was 1.00
standard deviations more positive, other improvements were more modest, ranging from 0.13 to
0.30. A key limiting factor in this study may be the nature of g testing, to which reactions are
already generally poor (Hausknecht et al., 2004). Because evaluation of g requires identification
of correct answers, frustration when unable to determine a likely correct response and move
forward to the next question may negatively influence g test reactions (Chan et al., 1997). In a
GBA, there is still feedback as to correct answers, but there is less time for assessees to ruminate on incorrect answers if the game is designed to, and succeeds in, absorbing assessees in the flow of game demands. In GBAs designed to assess constructs that lack "correct answers,"
such as personality testing, reactions to GBAs may be more positive; however, the precise effect
of the presence or absence of such gameplay flow is unclear. Further research is required, which
should investigate such interactive effects between constructs targeted and GBA design features,
as well as how specific game design decisions do or do not contribute within any particular GBA.
These results also raise a key issue unique to GBAs when deployed within the employee selection context: cost. Because a GBA is likely far more expensive to develop than a traditional g test, positive utility for the use of GBAs is of concern. Given the resources required, it would rarely make sense for a single organization intending to develop its own internal selection tools to create its own GBAs. Given the
expense, it is unlikely that any reactions benefit from such a move would outweigh the
development costs. However, an independent consultancy licensing such a test could see utility
if deployed to a broad range of organizations. Thus, we contend that the most likely context for
positive utility from GBAs for at least the next several years will be in consultancies providing that
GBA to many different firms, which in turn implies that most GBAs will be intended to assess
broadly useful individual differences where there is significant demand, such as cognitive ability,
styles, teamwork, or self-directed learning. In this way, GBA more directly compares with
assessment centers than other assessment methods in terms of key strengths, but without the
logistics costs and overhead typically associated with assessment centers. For companies
choosing to adopt GBAs, the potential utility gains are more obvious – if a g-GBA can be
administered at the same cost, with the same psychometric strengths, and with better applicant
reactions in comparison to a traditional g test battery, there are few compelling reasons to
continue using traditional g test batteries in practice. Organizations should consider all such
dimensions of utility, both in terms of immediate predictive gains and larger-scale strategic
business concerns, when making such adoption decisions (Roth & Bobko, 1997).
The findings here also relate to the nascent literature on assessment gamification (e.g.,
Georgiou et al., 2019; Landers et al., 2020). As described in the Measures section, some of
Cognify’s games began as traditional cognitive tasks and were gamified (i.e., Resemble, Colour
Pop), some were directly inspired by existing games (i.e., Numbubbles, Gridlock, Tally Up), and some were creative interpretations of other concepts (i.e., Short Cuts, Proof It). In some
ways, this made the gamified assessments less challenging to develop than others, in that the
basic concepts of gameplay were inferred from the existing task structure as a starting point for
game design, but in other ways were more challenging to develop due to the restrictions that
existing task definitions created. For example, because Colour Pop began as a Stroop test, the
design team wanted to maintain the classic Stroop elements regardless of other changes
suggested through user assessments in iterative prototyping; in contrast, because Short Cuts was
based on a novel idea, there was no aspect of the game that was “off limits” for changes during
development. In this way, the science of gamification (Landers, 2018) might inform some
aspects of GBA design just as GBA design might inform gamification; it is thus important for
progress in both that the two literatures do not grow completely independently.
Practical Implications
In stark contrast to a few decades ago, a key concern in modern assessment design is
maximizing positive applicant reactions, and GBA design makes very explicit the central role of user
experience in the test development process. Specifically, game design and thus GBA design
prioritize consumer expectations and experience in a way not commonly seen in traditional
assessment design. The introduction of internet technologies has flattened job application
pathways such that for many organizations, the application process has become a bidirectional
transaction (Singh & Finn, 2003). User experience and brand reputation now play an important
role in attracting talent. Because cutting-edge technologies have been found to positively affect applicant perceptions of organizational image (Bartram & Hambleton, 2006; Sinar et al.,
2003), GBA design can enable rigorous measurement while improving applicant perceptions and
organizational impressions. Further iteration upon the design of the present GBA might result in
further improved perceptions. Such benefits from either initial deployment or redesign are not
guaranteed, however, and both require significant investment in high quality design and
development processes.
Developing a GBA requires a high degree of effort from diverse, multidisciplinary teams and stakeholders outside of the traditional assessment community. Across disciplinary perspectives, values and methods vary greatly, creating new challenges in relation to process losses, team coordination, and conflict. Game designers, engineers, artists, and others may all be deeply invested in assessment development, which if not
carefully managed can create significant problems with team cohesion and team commitment.
For example, in the experience of the present authors, game designers typically prioritize “fun,”
engineers typically prioritize system sustainability, artists typically prioritize aesthetics, and
assessment designers typically prioritize psychometric rigor. If properly managed, the resulting
frictions can lead to higher quality assessments in both the psychometric sense and in terms of
applicant perceptions, but the specific “best” path to achieve that remains unclear. In the present
article, we have described the design theory that drove the organization that developed Cognify,
but there is no guarantee that another assessment firm would find the execution of a design
strategy from that theory as effective. Furthermore, there is no guarantee that the same
development process would work equally well even for the same developer if assessing a different construct. Thus, the present study does not provide a single set of best practices but instead provides guidance on
how to reduce risk through a cautious marriage of game design theory and classic test
development practices.
In GBA development, technical design related to the assessment delivery platform is significantly more important for assessment practitioners than in traditional assessment development. Whereas technical teams often are
tasked with “implementation” in traditional development, such that the “assessment team”
creates the assessment and the “technical team” is responsible for placing the assessment online
and collecting data, all members in GBA development teams need to develop expertise in not
only psychometrics but also software architecture and game design theory. This is likely to push
many traditional assessment experts far outside of their core expertise, yet developing new
expertise is critical to ensure psychometric rigor in GBA. Complex technical concerns can arise,
such as the specific equipment required to maintain the number of interactions per second
necessary to ensure the integrity of collected data given a particular GBA design. With
insufficient technical expertise, a traditional assessment specialist might not even realize why
such a restriction could harm the psychometric properties of the GBA. Thus, any assessment
firm seeking to develop GBAs should carefully evaluate if they have not only adequate technical
resources but also the necessary resources to train key personnel across disciplinary lines.
An important set of practical caveats for the application of these results concerns 1) the generalizability of the design theory to other constructs, 2) the impact of development strategies, and 3) the impact of deployment strategies. First, we have presented here a design
theory which can feasibly be used to develop any theory-driven GBA in which a specific
construct has been a priori targeted by game design. However, there is generally a much greater
and more comprehensive literature on g than on most other traits, which gave the developers a
more solid foundation for design in the CHC model than might be found when targeting other
constructs with these design theories. As such, we caution practitioners against
considering this GBA as prototypical. There are likely to be many challenges in design
unexplored here inherent to any such effort, and another game developed using this approach is
not guaranteed to be successful. For example, it is unclear at this time as to the specific cause of
increases in applicant reactions; this could have been caused by the novelty of the experience,
improved affective reactions to the interface, the quick gameplay, or any of many other game
characteristics acting in concert. Second, the present study only explored one development approach, theory-driven GBA design. Although data-driven GBAs likely share many of the same design influences, the psychometric concerns and iterative strategies necessary for successful
measurement are likely quite different in data-driven GBAs than in theory-driven GBAs. The
lack of a priori focus on constructs in data-driven GBA does not necessarily condemn such
methods, but it may increase the analytic burden on game developers in ways not explored here.
Within the g construct, this GBA focused upon speeded tests for practical reasons related to
reducing cheating (Arthur et al., 2010), but this choice could have led to additional confounds
that affected validity in numerous ways (Lu & Sireci, 2007). Further, in pursuit of more exciting
gameplay, the GBA did not enable clean separation of measurement occasions, which precluded
meaningful estimation of internal consistency reliability. For this reason, and based upon our own experiences in industry, we believe most GBA developers rely instead upon test-retest reliability estimates, yet this requires additional data collection efforts or novel calculation
strategies (Weiner & Sanchez, 2020). Alternative game designs and scoring models could also
avoid this problem. Third, in practice, due to concerns about re-testing effects (Villado et al.,
2016), the company that developed this GBA in its practice does not allow it to be administered
to the same job applicant more than once every twelve months, and the present study did not
examine re-test effects. There is no research exploring if GBA methods amplify or attenuate retest
effects in relation to such issues, or if such concerns could be engineered out during game
development. Across these concerns, there is still much remaining about which we simply have
little data. Current caution and further research, especially in applied settings, are needed on all
of these issues.
One limitation to the current empirical study is the generalizability of the measurement
properties of the Cognify GBA to the measurement properties of other GBAs. As GBAs could
theoretically be built to assess any construct (cf. Arthur & Villado, 2008), with an indefinite
number of potential design processes, no “prototypical” GBA exists or ever will, much as there
can be no “prototypical” questionnaire. One GBA design might approach cognitive ability
measurement through the gameplay of a first-person shooter whereas another might approach it
through puzzles. A design intended to measure a personality trait might be completely different.
This is not a limitation unique to GBA; for example, most researchers would not expect validity evidence from a single questionnaire to generalize to all "questionnaires." Instead, much as has been done for questionnaires, a body of evidence
regarding GBAs must be curated. Having said that, researchers must also be careful not to
assume that this justifies a case study approach to GBAs with traditional assessment outcomes.
For example, the mere existence of a GBA that produces scores that correlate with an outcome is
not theoretically interesting unless there is some evidence as to the underlying reason. For the
present GBA, we were able to provide construct validity evidence supporting g as that reason
both due to the GBA’s design process and the data collected. For other GBAs, we hope to see
similar types of evidence, along with detailed descriptions of the design methods that produced
them. In the case of data-driven GBAs, more atypical forms of validation evidence might be
useful, such as evidence from response processes through think-alouds during gameplay.
A second limitation to the present study is the composition of the two samples and the
constraints associated with each. Practical constraints limited our larger data collection effort to
an academic sample, which served as the principal sample for hypothesis testing, plus only a small supplemental organizational sample. GPA is derived from non-organizational data, and the organizational sample only provides criterion-related
validity evidence with wide confidence intervals. This effect, due to the small organizational
sample size used to measure it, should not be over-interpreted; it is unlikely this correlation
would generalize to other organizations, and the stability of this estimate even as an estimate of
the population effect for the studied organization is poor. Additionally, both samples likely
exhibit some degree of range restriction that may have attenuated some observed relationships,
particularly those with criteria, below their true score values (Sackett & Yang, 2000). Given this,
we recommend the results from the organizational sample be viewed as tentative and have based
most of our conclusions upon results from the academic sample, yet the generalizability of
results from the academic sample to an organizational context is unknown. Future researchers
should prioritize seeking out larger samples with authentic employees to better understand under what conditions these results generalize.
A third limitation of the present study is in the generalizability of the design method
employed to GBA design more broadly. Although we focused here on theory-driven GBA, this
is only one type of GBA currently in the assessment marketplace. The other major type, as
discussed earlier, relies upon computational psychometrics to develop its measurement models.
We call this data-driven GBA, and it typically takes a very different development process.
Specifically, the initial development of data-driven GBAs tends to be based upon content
validation; an assessment designer has a holistic idea for a game, either borrowed from the
research literature (e.g., neuroscience) or driven by marketplace needs (e.g., a client requests a
“leadership game”), and a game is created based upon that idea. Then, all data collected,
including both scores and trace data like mouse clicks, are inputted into either unsupervised
machine learning models to develop categories of players or into supervised machine learning
models to predict some outcome of interest directly from these messy data. This type of
assessment data mining is much more common in educational GBAs (Mislevy et al., 2012)
where precise explication of the constructs being measured is somewhat less important than in
the employment context. It is currently unknown how the present results would have changed if
a data-driven GBA had been used instead, and it is also unknown what additional insights could
have been gained from applying these techniques to the trace data produced by the present GBA.
Both are compelling directions for future research and perhaps critical to the evolution of GBA as an assessment method.
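As a purely illustrative sketch of this data-driven approach, and not an implementation of any particular vendor's method, hypothetical per-player trace features could be clustered to derive categories of players and also used to predict an external criterion.

# 'trace' is a hypothetical data frame of per-player features (e.g., click counts,
# response latencies); 'perf' is a hypothetical external criterion
set.seed(1)

# Unsupervised: derive categories of players from trace features
clusters <- kmeans(scale(trace), centers = 4)
table(clusters$cluster)

# Supervised: predict the criterion directly from the trace features
summary(lm(perf ~ ., data = data.frame(perf = perf, trace)))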
Conclusion
In general, there are great possibilities for the value of such currently-untapped data streams. Whereas traditional assessments present each item as a distinct task to complete, GBAs integrate challenges, problems, and high-complexity tasks more
seamlessly into a continued experience as part of a narrative or in pursuit of a broader end goal
(Mislevy et al., 2014). In other words, data analysts examining the results of a traditional
assessment can see a response to a question, but the process by which the assessee arrived at that
response is neither captured nor considered. In more complex and interactive games, players
self-direct efforts and pursue individual choices in relation to how they want to progress,
navigate through space, investigate and accomplish goals (Shute, 2011), and GBAs can provide
the facility to record and monitor changes in candidate temporal micro-patterns or strategic
shifts, as well as the context in which these changes occur. The richness of such data provides
significant promise for the future development of GBAs beyond what traditional assessment is capable of by providing evidence of assessees' thinking, which can itself be designed to meet the assumptions of psychometric models (Plass et al., 2012). For example, Rupp et al. (2012) described a game-based assessment in which students worked through the configuration of a computer network. Simple traditional outcome measures, such as how many
mistakes or correct responses students made, were not as valuable at distinguishing student
proficiency as various combinations of latent metrics from the data set. Students' metacognitive skills along the way, their approach (e.g., time taken, number of commands input, proportions of commands), efficiency, and strategy usage (e.g., switching between computer devices) provided
deeper insight over and above the evidence of their final solutions (Rupp et al., 2012). Future
research is needed to explore the potential of GBAs to provide rich data regarding the automated assessment of such response processes.
References
Aguinis, H., & Smith, M.A. (2007). Understanding the impact of test validity and bias on
selection errors and adverse impact in human resource selection. Personnel Psychology,
60, 165-199.
Anderson, N., Salgado, J. F., & Hülsheger, U. R. (2010). Applicant reactions in selection:
Apperley, T. H. (2006). Genre and game studies: Toward a critical approach to video games.
Armstrong, M.B., Ferrell, J., Collmus, A. B., & Landers, R. N. (2016). Correcting
misconceptions about gamification of assessment: More than SJTs and badges. Industrial
Armstrong, M. B., Landers, R. N., & Collmus, A. B. (2016). Gamifying recruitment, selection,
Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of
ability and personality. International Journal of Selection and Assessment, 18(1), 1–16.
Arthur, W. & Villado, A. J. (2008). The importance of distinguishing between constructs and
methods when comparing predictors in personnel selection research and practice. Journal
Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test
Bartram, D. & Hambleton, R. K. (2006). Computer-based testing and the internet: Issues and
Bauer, T. N., Truxillo, D. M., Sanchez, R. J., Craig, J. M., Ferrara, P., & Campion, M. A. (2001).
Beck, Kent (1999). "Embracing Change with Extreme Programming". Computer. 32 (10): 70–
77. doi:10.1109/2.796139.
Bertua, C., Anderson, N., & Salgado, J.F. (2005). The predictive validity of cognitive ability
387-409.
Bhatia, S., & Ryan, A. M. (2018). Hiring for the win: Game-based assessment in employee
selection. In The brave new world of eHRM 2.0. (pp. 81–110). IAP Information Age
Publishing.
Bjogvinsson, E., Ehn, P., & Hillgren, P.-A. (2012). Design things and design thinking:
Borman, W. C. & Motowidlo, S. J. (1997). Task performance and contextual performance: The
Brown, A. (2017). Younger men play video games, but so do a diverse group of Americans.
tank/2017/09/11/younger-men-play-video-games-but-so-do-a-diverse-group-of-other-
americans/
Brown, A. W., Stein, S., & Rohrer, P. L. (1936). Chicago non-verbal examination. Psychological
Corporation.
Burton, N. W., Welsh, C., Kostin, I., & VanEssen, T. (2009). Toward a definition of verbal
doi:10.1002/j.2333-8504.2009.tb02190.x
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of
Carretta, T. R., & Ree, M. J. (1996). Factor structure of the Air Force Officer Qualifying Test:
Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York:
Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A. & Hogan, R. (2016). New talent
signals: Shiny new objects or a brave new world? Industrial and Organizational
Psychology, 9, 621-640.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Delbridge, K. (1997). Reactions to
cognitive ability tests: The relationships between race, test performance, face validity
perceptions, and test taking motivation. Journal of Applied Psychology, 82, 300-310.
Cleary, T.A. (1968). Test bias: Prediction of grades of Negro and White students in integrated
Colzato, L. S., van Leeuwen, P. J. A., van den Wildenberg, W. P. M., & Hommel, B. (2010).
https://www.frontiersin.org/articles/10.3389/fpsyg.2010.00008/full. doi:
10.3389/fpsyg.2010.00008
Conrad, L., Trismen, D., & Miller, R. (Eds.). (1977). Graduate Record Examinations technical
Diehl, V. A. (2014). Using real-world and standardized spatial imagery tasks: Convergence,
imagery realism, and gender differences. Applied Cognitive Psychology, 28, 789-798.
doi:10.1002/acp.3061
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests.
Egenfeldt-Nielsen, S., Smith, J. H., & Tosca, S. P. (2013). Understanding video games: The
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Kit of factor-referenced
Springer US.
Gee, J.P. (2007). What video games have to teach us about learning and literacy (2nd ed.). New
York: Palgrave.
Georgiou, K., Gouras, A., & Nikolaou, I. (2019). Gamification in employee selection: The
Gustafsson, J. E., & Balke, G. (1993). General and specific abilities as predictors of school
Hak, A. (2019). How to remove hiring bias through gamification. The Next Web. Retrieved from
https://thenextweb.com/work2030/2019/05/20/how-to-remove-hiring-bias-through-
gamification/
Hamstra, S. J., Brydges, R., Hatala, R., Zendejas, B., & Cook, D. A. (2014). Reconsidering
Handler, C. (2018, June 19). The truth about game-based talent assessments. Retrieved from
https://www.ere.net/the-truth-about-game-based-talent-assessments/
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal
Horn, J.L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D.P. Flanagan, J.L.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Huizinga, J. (2016). Homo ludens: A study of the play-element in culture. Kettering, OH:
Angelico Press.
Hunicke, R., LeBlanc, M., & Zubek, R. (2004). MDA: A formal approach to game design and
Hunter, J. E. (1980). Test validation for 12,000 jobs: An application of synthetic validity and
validity generalizations to the General Aptitude Test Battery (GATB). Washington, DC:
Hunter, J. E. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and
Ip, C. (2018, May 4). To find a job, play these games. Engadget. Retrieved from
https://www.engadget.com/2018/05/04/pymetrics-gamified-recruitment-behavioral-tests/
Jones, S. E. (2008). The meaning of video games: Gaming and textual strategies. New York,
NY: Routledge.
Kehoe, J. F. (2002). General mental ability and selection in private sector organizations: A
(Ed.), Proceedings of the 19th International Academic MindTrek Conference (pp. 26-32).
Kuncel, N. R. & Hezlett, S. A. (2010). Fact and fiction in cognitive ability testing for admissions
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the
doi:10.1037//Q033-2909.127.1.162
Kuncel, N. R., Klieger, D. M., Connelly, B. S., & Ones, D. S. (2013). Mechanical versus clinical
Kuncel, N. R., Wee, S., Serafin, L., & Hezlett, S. A. (2010). The validity of the Graduate Record
doi:10.1177/0013164409344508
Landers, R. N., Auer, E. M., Collmus, A. B., & Armstrong, M. B. (2018). Gamification science,
its history and future: Definitions and a research agenda. Simulation & Gaming, 49(3),
315–337.
Landers, R. N., Auer, E. M., & Abraham, J. D. (2020). Gamifying a situational judgment test
with immersion and control game elements: Effects on applicant reactions and construct
review of technology integration paradigms and their effects on the validity of theory.
Landers, R. N., Tondello, G. F., Kappen, D. L., Collmus, A. B., Mekler, E. D., & Nacke, L. E.
Replacing the term ‘gamefulness’ with three distinct constructs. International Journal of
Human-Computer Studies.
Lang, J. W. B., Kersting, M., Hulsheger, U. R., & Lang, J. (2010). General mental ability,
narrower cognitive abilities, and job performance: The perspective of the nested-factors
analysis for examining hypotheses about test bias in prediction. Applied Psychological
Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement:
doi:10.1016/j.paid.2010.05.010
Mathieu, J. E., Hollenbeck, J. R., van Knippenberg, D., & Ilgen, D. R. (2017). A century of
102(3), 452.
Mavridis, A. & Tsiatsos, T. (2017). Game-based assessment: Investigating the impact on test
anxiety and exam performance. Journal of Computer Assisted Learning, 33, 137-150.
doi:10.1111/jcal.12170
Meade, A. W. & Fetzer, M. (2009). Test bias, differential prediction, and a revised approach for
Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in
Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A., Hao, J., … John, M. (2014).
https://www.envisionexperience.com/~/media/files/blog/glasslab-
psychometrics.pdf?la=en
Oswald, F. L., Putka, D. J., & Ock, J. (2014). Weight a minute... What you see in a weighted
composite is probably not what you get. In C. E. Lance & R. J. Vandenberg (Eds.), More
Statistical and Methodological Myths and Urban Legends: Doctrine, Verity and Fable in
Plattner, H. (2011). Foreward. In H. Plattner, C. Meinel, & L. Leifer (Eds)., Design Thinking:
Plattner, H., Meinel, C., & Leifer, L. (2011). Design thinking: Understand, improve, apply.
Ployhart, R. E., Schmitt, N., & Tippins, N. T. (2017). Solving the Supreme Problem: 100 years
Primi, R. (2014). Developing a fluid intelligence scale through a combination of Rasch modeling
Markland, D. (2007). The golden rule is that there are no golden rules: A commentary on Paul
Marsh, H. W., Hau, K.-T., & Wen, Z. (2009). In search of golden rules: Comment on
hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in
320-341.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Mitchell, I. (2016). Agile development in practice. Tamare House. p. 11. ISBN 978-1-908552-
49-5.
Mollick, E. R. & Rothbard, N. (2014). Mandatory fun: Consent, gamification and the impact of
games at work. The Wharton School Research Paper Series. Retrieved from
https://ssrn.com/abstract=2277103
Morgeson, F. P., Brannick, M. T., & Levine, E. L. (2019). Job and work analysis: Methods,
Mount, M. K., Oh, I.-S., & Burns, M. (2008). Incremental validity of perceptual speed and
Nagarajan, A., Allbeck, J. M., Sood, A., & Janssen, T. L. (2012). Exploring game design for
https://doi.org/10.1109/CYBER.2012.6392562
Newman, D. A., Hanges, P. J., & Outtz, J. L. (2007). Racial groups and test fairness, considering
Olsen, J., Aleven, V. & Rummel, N. (2017). Statistically modeling individual students’ learning
Oswald, F. L., Saad, S. & Sackett, P. R. (2000). The homogenity assumption in differential
prediction analysis: Does it really matter? Journal of Applied Psychology, 85, 536-541.
Plass, J. L., Homer, B. D., Kinzer, C. K. & Perlin, K. (2012). Games for learning institution
https://gamesandimpact.org/wp-content/uploads/2012/09/PlassNYU-Ideas-for-Impact-
Games-2.pdf
Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and
alternative measures to predict job performance and reduce adverse impact: Corrected
13,
Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods: New
https://doi.org/10.1177/1094428117697041
Quiroga, M. A., Diaz, A., Román, F. J., Privado, J., & Colom, R. (2019). Intelligence and video
Quiroga, M. Á., Escorial S., Román F. J., Morillo D., Jarabo A., Privado J., et al. (2015). Can we
reliably measure the general factor of intelligence (g) through commercial video games?
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation
Ree, M. J., & Carretta, T. R. (1994). Factor analysis of the ASVAB: Confirming a Vernon-like
Reeve, C. L. & Hakel, M. D. (2002). Asking the right questions about g. Human Performance,
15, 47-74.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences
Roth, P. L., & Bobko, P. (1997). A research agenda for muti-attribute utility analysis in human
Rupp, A. A., DiCerbo, K. E., Sweet, S. J., Crawford, A. V., Calico, T., … Behrends, J. T. (2012).
Putting ECD into practice: The interplay of theory and data in evidence models within a
Ryan, R. M. & Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic definitions and
Ryan, A. M., & Huth, M. (2008). Not much more than platitudes? A critical look at the utility of
https://doi.org/10.1016/j.hrmr.2008.07.004
decisions: A critical review and agenda for the future. Journal of Management, 26(3),
565-606.
Ryan, R. M., Rigby, C. S., & Przybylski, A. (2006). The motivational pull of video games: A
doi:10.1007/s11031-006-9051-8
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on
Sackett, P. R., Lievens, F., Van Iddekinge, C. H., & Kuncel, N. R. (2017). Individual differences
Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology.
Salgado, J.F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J.P. (2003). A
meta-analytic study of general mental ability validity for different occupations in the
Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment
Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there
Schmidt, F. L. & Hunter, J. E. (1974). Racial and ethnic bias in psychological tests: Divergent
Schmidt, F.L. & Hunter, J. E. (1998). The validity and utility of selection methods in personnel
Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model of intelligence. In, D.
Salen, K., Tekinbas, K. S., & Zimmerman, E. (2004). Rules of play: Game design fundamentals.
Schmidt, F. L. & Hunter, J. (2004). General mental ability in the world of work: Occupational
attainment and job performance. Journal of Personality and Social Psychology, 86(1),
162-173. doi:10.1037/0022-3514.86.1.162
Sinar, E. F., Reynolds, D. H., & Paquet, S. L. (2003). Nothing but ‘net? Corporate image and
Singh, P. & Finn. D. (2003). The effects of information technology on recruitment. Journal of
Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant
6570.1993.tb00867.x
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and
http://www.siop.org/_principles/principles.pdf
Stanek, K. C. & Ones, D. S. (2018). Taxonomies and compendia of cognitive ability and
Psychology and Employee Performance (pp. 366-407). Thousand Oaks, CA, SAGE
Publications Ltd.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item
497-508. doi:10.1037/0021-9010.89.3.497
Stauffer, J. M., Ree, M. J., & Carretta, T. R. (1996). Cognitive-Components Tests Are Not Much
France and the United States. Journal of Applied Psychology, 81, 134-141.
Stenros, J. (2017). The game definition game: A review. Games and Culture, 12, 499-520.
Taylor, P. J., Driscoll, M. P. O., & Binning, J. F. (1998). A new integrated framework for
Villado, A. J., Randall, J. G., & Zimmer, C. U. (2016). The effect of method characteristics on
retest score gains and criterion-related validity. Journal of Business and Psychology,
31(2), 233–248.
Von Stumm, S. (2013). Investment traits and intelligence in adulthood: Assessment and
0001/a000101
Voyer, D., Voyer, S. & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A
270.
Weiner, E. J., & Sanchez, D. R. (2020). Cognitive ability in virtual reality: Validity evidence for
215–235. https://doi.org/10.1111/ijsa.12295
Workman, J. E., & Lee, S-H. (2004). A cross-cultural comparison of the apparel spatial
visualization test and paper folding test. Clothing and Textiles Research Journal, 22(1/2),
22-30.
Zhang, P. (2008). Motivational affordances: Fundamental reasons for ICT design and use.