Author Note
Early versions of this paper were presented at the annual conference of the Society for Industrial
and Organizational Psychology. Participant payments and graduate research hours in this study
were funded by Revelian Pty Ltd, and RNL became a compensated member of Revelian’s
Scientific Advisory Board mid-project. MBA is now at Google, ABC is now at Facebook, SM is
Citation
Landers, R. N., Armstrong, M. B., Collmus, A. B., Mujcic, S., & Blaik, J. (in press). Theory-
Abstract
Games, which can be defined as an externally structured, goal-directed type of play, are
increasingly being used in high-stakes testing contexts to measure targeted constructs for use in
the selection and promotion of employees. Despite this increasing popularity, little is known
about how theory-driven game-based assessments (GBA), those designed to reflect a targeted
construct, should be designed, or their potential for achieving their simultaneous goals of
develop a theory of GBA design by integrating game design and development theory from
to this theory to measure latent general intelligence (g). Using an academic sample with GPA
data (N=633), we demonstrate convergence between latent GBA performance and g (ρ = .97).
Adding an organizational sample with supervisory ratings of job performance (N=49), we show
GBA prediction of both GPA (r=.16) and supervisory ratings (r=.29). We also show incremental
prediction of GPA using unit-weighted composites of the g test battery beyond that of the g-
GBA battery but not the reverse. We also show the presence of similar adverse impact for both
the traditional test battery and GBA but the absence of differential prediction of criteria.
Reactions were more positive across all measures for the g-GBA compared to the traditional test
battery. Overall, results support GBA design theory as a promising foundation from which to
In recent years, there has been a marked increase in interest among assessment
practitioners in the application of game-thinking, the use of game design theory to improve the
assessment experience (Armstrong, Landers & Collmus, 2016). In the practice of employee
selection, the term game-thinking encompasses two more specific concepts: assessment
gamification and standalone game-based assessment (GBA; Armstrong, Ferrell, et al., 2016).
Whereas assessment gamification modifies existing assessments by adding game elements (e.g., Attali & Arieli-Attali, 2015; Collmus & Landers,
2019), GBA is a distinct method of measurement (cf. Arthur & Villado, 2008), and GBAs might
reflect either the result of gamification or of a dedicated game design and development. Much
like surveys, simulations, and structured interviews, GBAs can be created with the intent of
assessing any construct of interest. In the current assessment marketplace, there are GBAs
marketed as assessing general mental ability (g), personality, skills, and various competencies
(Handler, 2018); however, scientific evidence evaluating the measurement quality of GBAs is
scant in the assessment literature and missing in the high-stakes assessment literature (Chamorro-
In the popular press and in assessment company marketing materials, GBAs are
commonly described as providing two distinct advantages over traditional assessments. The first is an improved applicant experience (Armstrong, Ferrell, et al., 2016). Whereas job applicants generally consider traditional survey-based
assessment to be ordinary and expected (Anderson et al., 2010; Steiner & Gilliland, 1996), GBAs
offer a promise of fun and excitement. Although such claims remain largely untested, applicant reactions theory supports the general concept: fun and excitement during the application process should lead to better organizational hiring outcomes (cf. Hausknecht et
al., 2004). However, even if true, given the high cost of GBA development, it is unknown if any
gains realized by developing and implementing GBA would ultimately result in positive utility.
Additionally, the type of fun experienced in games may only be tenuously related to the kind of
fun, if any, that a job applicant wants during high-stakes assessment (cf. Mollick & Rothbard,
2014). The second purported gain attributed to GBA is improved measurement. This claim
takes many specific forms, such as reducing the impact of human biases (e.g., Ip, 2018), yet this is a well-explored and hardly new concept in the assessment literature (e.g., Kuncel et al., 2013). Another
aspect of GBAs that could enable superior construct measurement is the analysis of the trace
data, such as click and mouse movement data, an area of research called computational
psychometrics (von Davier, 2017). This field is in its infancy and primarily exists in the
assessment of learning (e.g., Olsen et al., 2017), so its potential in the selection context is
completely unknown.
The more immediate concerns for assessment researchers are the development process
and measurement characteristics of current, deployed GBAs designed so that they do not require such techniques. Theory-driven GBAs function like traditional psychometric assessment in that the GBA is designed to assess a targeted
construct based upon assessee behaviors within the assessment. Where theory-driven GBAs
differ from other theory-driven assessment development approaches is that they involve the
creation and collection of scores from a game, a complex concept with a rich history in
interdisciplinary research outside of the assessment literature which spans the humanities, social sciences, and computer science. It is the interdisciplinary design methodologies developed and refined in games research over the last century that theory-driven GBA draws from in pursuit of an improved assessment experience.
Given this landscape and the increasing popularity of GBA in the assessment
marketplace, and given calls to improve applied psychology’s integration of design into its
theories (Landers & Marin, 2021), the purpose of the present article is to introduce games, and
more specifically theory-driven GBA and the game design theories used to create them, to the
high-stakes assessment literature. We create and present a theory of GBA design through an
interdisciplinary integration of literatures across assessment and software design through the lens
academic sample and a much smaller one in an organizational sample, comparing and
contrasting reactions, validity, and adverse impact of a traditional g test battery versus a theory-
driven GBA designed to assess g using the process described by our GBA design theory.
Finally, we provide practical recommendations for the development of theory-driven GBA and
Games, which have been a part of culture across all recorded human history and likely
much further back (Huizinga, 2014), have historically been difficult to define; “what is the definition
of game?” has been the subject of hundreds of articles in the games research literature. Stenros
(2017) attempted to synthesize this literature, identifying 60 distinct definitions presented since
the 1930s differing on 10 dimensions of description. To simplify this in the GBA context, it is tempting to narrow our treatment at least to digital games, which is to say games played on a computing device and, in the present day, most typically delivered over the internet.
However, the term game describes a range of analog experiences as well, such as classics like
Monopoly, Scrabble, Chess, and Duck-Duck-Goose. For our purposes in the present article,
game will be defined as explored and defended by Landers et al. (2019): “an externally
structured, goal-directed type of play.” As they explain, fundamental to this definition, like almost all in the literature, is that players have a high degree of flexibility in terms of how they go about achieving goals either enabled or imposed by the game’s design. Further, a game engages
a player in its core gameplay loop, an iterative experience created by the interaction of
potentially thousands of game elements, all designed to encourage players to experience the loop repeatedly.
Building upon this background, GBAs thus utilize games as a platform from which to derive assessment scores, in contrast to traditional tests that rely upon multiple-choice questions. In theory-driven GBA, this occurs explicitly, by designing in-game activities to create meaningful scores estimating targeted constructs. Many modern digital games already incorporate explicit assessments like this, quantifying player behaviors
(e.g., goal achievement, in-game activities), although at a less rigorous standard of measurement
than is common in the assessment literature. For example, simply counting the number of goals
achieved among a provided list and providing feedback on this list to players may be sufficient to
create an emotionally compelling gameplay experience. The fluid nature of games’ “multiple
interacting aspects of knowledge and skill; construct-irrelevant variation from game features;
dependencies among actions across time points; [and] different situations arising for different
players as they interact with a game” (Mislevy et al., 2014, p.10) pose significant challenges and
deviations from standard assessment development practice for the more rigorous requirements of high-stakes assessment. Specifically, activities must be designed that provide players with significant freedom in pursuing game goals while simultaneously designing the experience such that variance in measurable player behaviors reflects targeted constructs. If there is insufficient freedom to play, the game feels prescribed and
mandatory, and thus the experience loses its gamefulness (see Mollick & Rothbard, 2014;
Landers et al., 2019), reducing the potential added value of GBA over much less costly
assessment methods.
Designing a GBA given this challenge is potentially easier in the case of the
measurement of g than for the measurement of skills or non-cognitive traits. Because g can be measured with a wide range of cognitively loaded activities (Schneider & McGrew, 2012), the goal of g-GBA development can be relatively simple: creating multiple game activities that are cognitively loaded, such as solving interactive puzzles or pursuing complex in-game goals, and scoring those activities according to
their cognitive aspects with a traditional g scoring model. Thus, development of a g-GBA can
resemble development of a traditional g assessment but with the significant financial overhead of
game development. Skill GBAs are similarly straightforward from a measurement perspective
but add concerns regarding accurate simulation of the targeted skill construct in terms of physical
and functional fidelity (Hamstra et al., 2014); skill GBAs are designed to require the player to
engage in the skill to be measured, or a proxy, and those behaviors are then scored according to
the degree of skill exhibited, which might be done by human raters or algorithmically (e.g., see CyberNEXS in Nagarajan et al., 2012). The most complex design case is for non-
cognitive traits where there is no “correct” answer indicating higher construct standing, such as
personality. In this context, activities must be crafted such that players have the freedom to
engage in a range of actions reflecting a range of a targeted trait but not have so much freedom
that those actions could reflect a non-targeted trait. If a player might behave in a scored way in a
GBA because of high agreeableness or because of high conscientiousness, that behavior will
likely not be a particularly good measure of either trait when using classical psychometric
approaches. For example, a player might be given the freedom to choose whether to help a
virtual character in need, yet such helping behavior could reflect a desire to be friendly (i.e., an
agreeableness signal), a desire to complete all tasks provided (i.e., a conscientiousness signal), or some other motive entirely. Game design and development is an enormously complex process approached with myriad methodologies; all are a combination of art and science, and the specific methodology adopted shapes not only the aesthetic experience of a resulting GBA but also its psychometric properties. Most modern digital games are the product of thousands of individual design decisions. Thus, much as with the development of other complex media like film, extant
theory provides some “rules” for effective game design (Salen et al., 2004), such as the criticality
of iterative engineering practices in which prototypes are gradually improved through repeated
data collection and action planning (Kultima, 2015), but much is still based upon the intuition
and development experience of the game’s director in terms of how to develop a compelling
experience for players, as there is rarely empirical research to consult regarding individual design
decisions.
In parsing the complexity of games, the games research community has developed numerous major theoretical frameworks for understanding the implications of different game designs; the choice among them generally reflects the disciplinary lens of the researcher attempting design or analysis (Egenfeldt-Nielsen et al., 2013). A popular
approach within human-computer interaction, which tends to favor empirical approaches with a design focus, is the mechanics, dynamics, and aesthetics (MDA) framework (Hunicke et al., 2004), which blends perspectives from the fields of “game design and development, game criticism, and technical game research” (p. 1722). In MDA, mechanics are
defined as the algorithms and systems that establish how a game functions on a technical level,
such as scoring systems or avatar control. Dynamics are defined as the real-time interactions that emerge as mechanics interact with players and with each other during play, and aesthetics are defined as the affective experiences evoked in players by a game’s mechanics and dynamics. Importantly, the only aspects of a game that actually “exist” in a technical sense are the mechanics; the experience recognized as a game consists of the dynamics – the result of mechanics interacting with each other and with players in real time. It is impossible for
game programmers to program dynamics directly. Instead, they must program mechanics with
the goal of those mechanics interacting with the player and with each other in such a way that
desirable dynamics, and subsequently aesthetics, emerge as a result of gameplay. This idea
reflects the theory of motivational affordances, which describes how the technical characteristics
of systems have only the potential to motivate people to act (Deterding, 2011; Zhang, 2008);
mechanics have affordances that are designed to but do not necessarily lead to desired dynamics.
Thus, causally speaking, mechanics cause dynamics which in turn cause aesthetics, but game
developers only have direct control over mechanics despite targeting desirable aesthetics, such as
a “fun” GBA.
The design challenge in GBA incorporates all these concerns yet is still more complex
because targeting desirable aesthetics is a secondary goal. Whereas the primary goal of commercial games is to create positive affective experiences for players, motivating them to play more and discuss their experiences positively with friends and family so that the games are more commercially successful, GBAs must first produce
scores with trustworthy psychometric properties and secondarily also create a positive affective
experience. If a GBA does not achieve a degree of measurement quality that is recognizable by
common standards (e.g., Nunnally, 1978; Society for Industrial and Organizational Psychology,
2003), it is not a GBA; it is only a game that outputs numbers. Similarly, if a GBA does not
create a sense of play, it is not a game; it is only an assessment marketed as if it is a game, which
is instead a type of gamification called “game-framing” (Collmus & Landers, 2019). Perhaps
most importantly, if the GBA cannot achieve both goals simultaneously, there are far more
inexpensive assessment methods that can achieve one or the other goal alone. Thus, GBA
designers face the added complexity of combining traditional game design, such as through the MDA framework, with rigorous psychometric development. Currently, GBA developers tend to integrate literatures based upon their own locally-defined priorities, because there is little theory to guide GBA design. The present study is intended in part to address this gap.
Modern digital application development processes are generally based upon design
thinking, a design and development process theory focusing upon iterative redevelopment and
reconceptualization of both the problem being addressed and the product being designed to
address it (Plattner et al., 2011; Rowe, 1986) that was popularized by the Stanford Design School
(Plattner, 2011; Bjogvinsson et al., 2012). Design and development process theories are rarely
seen in applied psychology, but they describe how to best align design and development
processes with intended outcomes, which generally determines the quality of the final product
being developed in relation to organizational goals related to that product (Landers & Marin,
2021). In design thinking, there are five key stages to the development of a designed product;
however, designers may jump backwards to any prior stage if challenges in later stages
necessitate it. A visualization of these stages as applied to GBA development appears in Figure 1.
The first stage, empathizing, requires designers to collect data on the problem their product is
designed to solve; in the theory-based GBA design context, this involves construct specification
and development of a shared mental model among assessment experts, game designers, and
software development teams. The second stage, definition, refers to a meta-cognitive process in
which designers attempt to anticipate difficulties ahead based upon the problem and design
product as they understand it at that time. For GBAs, this principally involves consideration of
the ultimate context in which the GBA will be deployed, including issues like supported device
types, proctoring, and technical limitations. The third stage, ideating, involves brainstorming
and prioritization of developed ideas needed to develop a useful prototype, which generally
involves building a shared mental model between the assessment team and technical teams
working on the GBA. The fourth stage, prototyping, involves the creation of a planned product,
which includes anything from paper representations of GBAs (i.e., low fidelity prototypes) to
partially functional but unfinalized digital versions (i.e., high fidelity prototypes; see Figure 2).
The fifth stage, testing, involves assessment of the success of that product in addressing the
original problem statement, which for GBAs typically includes both aesthetic and psychometric
goals. These stages are also often repeated in various combinations after a tested product has
already been deployed for use; for example, in an assessment consultancy, after the first version
of a GBA is deployed for one client, lessons learned from that GBA may be used to inform
future empathizing, defining, ideation, and/or prototyping while that first version of the GBA
continues to be used. When creating a GBA under a design thinking model, assessment
specialists generally work much more closely with software developers than is typical for
assessment specialists in more traditional measure development projects given the thousands of
iterations involved, each of which may have implications for high-quality measurement; for
example, the distance between the two images in Figure 2 alone was several hundred prototypes,
each tested in different ways depending upon development goals relevant at that point in time.
Given this increased overall process complexity, agile development of a GBA occurs in a much more
organic and emergent manner than in traditional assessment design (Mitchell, 2016). Although
the empathizing stage is similar, in that it involves identification of target constructs through
needs analysis (Taylor et al., 1998), job analysis (Morgeson et al., 2019), or some other
traditional process used to define research problems, the remaining stages are different. In GBA
design, the defining stage is minimized as a way of managing risk; it is generally assumed that
developing a prototype and learning about the GBA’s performance from data will be more
efficient than spending more time in the initial defining stage. More dramatically, the ideating,
prototyping, and testing stages are reordered and repeated as suitable for the combination of
problems, designers, and developers in the project, to maximize quality while minimizing effort (Moran, 2014; Beck, 1999).
Figure 2. Early low-fidelity paper prototype (left) and screenshot from a functional high-fidelity digital prototype (right) of one of the current study’s GBA mini-games, Gridlock
The success of these efforts will be driven not only by technical
competence but also in large part by traditional antecedents of team effectiveness, such as team
its strong association with job performance and other outcomes of interest across a wide variety
of employment contexts (Schmidt, 2002; Schmidt & Hunter, 2004). In I-O psychology, g is commonly conceptualized through Cattell-Horn-Carroll (CHC) theory of cognitive ability and operationalized as shared variance across cognitively-loaded tests selected to sample across the cognitive ability domain (Lang, Kersting, Hulsheger & Lang,
2010). This operationalization is possible due to CHC theory’s hierarchical modeling of g, such
that g is the topmost and broadest construct, reflecting shared variance among specific abilities,
including fluid reasoning, visual processing, and processing speed, among others. CHC theory
was chosen as the foundation for measurement in the present study’s GBA for two primary
reasons. First, the CHC model is today the clearly dominant model of g in the intelligence
research literature and has a rich history (Schneider & McGrew, 2012) combining two prominent
models of human cognitive abilities, Horn-Cattell Gf-Gc theory (Horn & Noll, 1997) and
Carroll’s three-stratum theory (Carroll, 1993). Within the context of assessment and personnel selection, g-loaded measures have been shown to predict performance across a wide array of job types (Bertua et al., 2005; Salgado et al., 2003; Schmidt & Hunter, 1998). Second, CHC theory’s taxonomy of broad human cognitive abilities includes rich
theoretical foundations for its many components, allowing for targeted game design.
Specifically, rather than focusing on developing multiple general assessments of g, CHC’s broad
level factors provide specific details from which specific game mechanics can be identified and designed.
The hierarchical nature of cognitive ability is such that g, which exists at the highest and
most general level, explains the majority of variance across more specific ability tests regardless
of the specific domains of those tests, a concept Spearman called indifference of the indicator. Consistent with this, individual tests typically show very strong factor loadings and relatively little residual variance in factor analyses of cognitively-loaded test batteries used for employee selection (Carretta & Ree, 1996; Ree &
Carretta, 1994; Stauffer et al., 1996). One well-validated framework in the broader literature on g, the cognitive design system approach (Embretson, 1994), provides valuable guidance on the design of such measures. This framework addressed what was at the time a generally atheoretical approach to constructing
ability tests by grounding item construction in cognitive theory, recommending the targeting of two goals: a) construct representation, which refers to alignment of the latent constructs involved in solving an ability item with the cognitive theory of that solving, and b) nomothetic span, which
refers to expected relationships within the nomological net surrounding the measure given that
cognitive theory. Primi (2014) demonstrated both the value and complexity of this approach in
the creation of a fluid intelligence measure, linking specific item design features, such as
perceptual complexity, with targeted components of fluid intelligence. This type of linking
procedure, between cognitive theory and the properties of each item, thus serves as a strong
conceptual basis for the development of a g-GBA by using these techniques in the same general manner. Researchers have already attempted to measure g using scores across video games. Quiroga et al. (2015) administered 11 puzzle mini-games taken
from a commercially-available “brain training” video game for the Nintendo Wii, 1 computer-
based maze navigation game, and 11 cognitive ability tests to 188 undergraduates, finding a
correlation of .96 between latent cognitive ability and latent game performance. Although
groundbreaking in its demonstration of the potential for video games to provide valid
measurement of g, the study was limited in its generalizability and practicality in an employee
selection scenario due to its focus on commercially available video games designed for the
purpose of entertainment (Bhatia & Ryan, 2018), an unclear theoretical basis and design process
for the selection of relevant mini-games, and a complete focus on latent variables without
exploration of observed scores. Quiroga et al. (2019) replicated and extended this study,
addressing some of these concerns, by proposing a concept called video games general
performance (gVG). Although gVG was never explicitly defined, it was treated as the shared variance among performance scores produced by video games. This study was also somewhat more explicit about its game screening process, administering a battery of games plus a cognitive ability test battery to an ultimate sample of 134 participants in a lab environment over three 90-minute sessions. Quiroga et al. modeled scores obtained from
these games in a similar fashion as Quiroga et al. (2015), this time finding a latent correlation with latent cognitive ability of .79, substantially lower than observed in the earlier study; comparing R² across studies, (.922 − .624) / .922 suggests a 32.32% reduction in convergence for unclear reasons.
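Expressed in full, this comparison uses the squared latent correlations from the two studies (a worked restatement of the figures above, not a new result):

\[ \frac{.96^2 - .79^2}{.96^2} = \frac{.922 - .624}{.922} \approx .32 \]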
Because gVG was never defined explicitly, the implicit presentation of gVG as reflecting broad ability across all video games is questionable. Specifically, there exist numerous game genres with distinct skills required for success within each (Apperley,
2006), and Quiroga et al. (2019) not only sampled a subset of these genres, but also selected
within that subset for appropriate features using an ambiguous selection process. A player with
extensive experience with game mechanics within a particular genre focused upon within a g-
GBA assessment is likely to have developed some generalized skill within that genre’s typical
mechanics that does not generalize to other genres; for example, if a person plays many slide
puzzle games, they are likely to have greater skill at future slide puzzle games, because solving
slide puzzles is a skill than can be learned (cf. Colzato, van Leeuwen, van den Wildenberg, &
Hommel, 2010). Thus, if a g-GBA was designed as a slide puzzle game, latent game performance would partially reflect slide puzzle solving skill, contaminating measurement of g. Thus, in the case of the present study, we sought
to assess the degree to which a particular g-GBA’s design was successful in avoiding the
contamination seen in gVG, or put differently, to what degree the present g-GBA’s design
process, focused entirely upon the creation of a measure of g within an employee selection
hypothesis will suggest unique value for theory-driven GBAs, and the design theory used to
research exploring this is across the academic and employment domains. In the context of employment, supervisory ratings of job performance have numerous advantages and disadvantages as a criterion measure (Viswesvaran & Ones, 2000).
The primary advantages to the use of supervisory ratings for criterion validation are convenience
and availability, as most organizations collect at least annual performance reviews, and
ecological validity, as organizational decision-making is often based upon these ratings. The
ratings are often not high-quality reflections of the job performance construct, showing low
reliability and often collapsing orthogonal dimensions of job performance into individual scores
(Borman & Motowidlo, 1997). Despite these flaws, g tests have been consistently shown to
predict such scores, and the relationship between g and supervisory ratings is well-explored (e.g., Hunter & Hunter, 1984, found operational validities ranged between .31 and .73,
In the education domain, the most commonly studied performance outcome is college
grade-point average (Kuncel et al., 2004). Several properties of grade point averages make them
useful for validation research such as the present study. First, grade point averages are
extensively studied in the education literature, which has provided meta-analyses containing
useful benchmarks; the correlation between g and grade-point average is well-explored (ρ =.21,
N =7820, k =35; the relationship is stronger for highly-cognitively loaded tests such as ρ=.33,
N=22289, k=29 for the relationship between the SAT and grade-point average; Richardson,
Abraham, & Bond, 2012). Second, they are conceptually similar to supervisory ratings of job
performance (Meade & Fetzer, 2009), in that both classes of variable represent the outcome of
interactions between individual difference and situational variables. Much as supervisory ratings
are a convenient although imperfect proxy for actual job performance, which is itself the
behavioral outcome of knowledge, skills, motivation, and situational factors (Campbell et al.,
1993), grade-point average is a similarly convenient although imperfect proxy for academic
performance, itself the outcome of learned knowledge, acquired skills, motivation to learn, and
situational factors (Kuncel et al., 2001). Third, grade-point averages are conveniently available,
as college student grade-point averages are readily attainable, with permission, from university
records.
Although the weighting of task and contextual performance likely differs between the two, in both cases we would expect the correlation between latent game performance and these criteria to be positive to the extent that the minigames are in fact measuring g. In the interest of validating g-GBA scores from multiple perspectives, we hypothesized relationships with both criteria:
Hypothesis 2a. Latent g-GBA performance will positively predict college grade-point
average.
Hypothesis 2b. Latent g-GBA performance will positively predict supervisory ratings of
job performance.
The use of latent construct scores is not generally possible when making selection decisions; instead, test
scores are either combined into operational composites or used as predictors in regression
models. Each of these approaches brings its own strengths and drawbacks depending upon
context (Potosky et al., 2005). In the GBA context, although g is theorized to represent most of the reliable variance in minigame scores, variance unique to each minigame is likely to be contained within minigame scores as well. In the more specific case of the present GBA, the seven minigames were developed to principally target four broad traits in different combinations: Quantitative Knowledge (Gq), Reading and Writing Ability (Grw), Fluid Reasoning (Gf), and Visual Processing (Gv).
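To make the operational composite approach concrete, a unit-weighted composite of the kind referenced in the Abstract is typically formed by standardizing each observed score and summing with equal weights; the following is a minimal R sketch under assumed column names, not the authors' scoring code.

# Hypothetical sketch: form a unit-weighted composite by standardizing each
# observed test (or minigame) score and summing with equal (unit) weights.
tests <- c("visual", "fluid", "speed", "quant", "verbal")  # assumed column names
dat$g_composite <- rowSums(scale(dat[, tests]))            # 'dat' is an assumed data frame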
There may also be additional sources of GBA-specific method variance contained within
individual game scores. For example, existing research suggests that well-designed games can
decrease or suspend fears and anxieties by engrossing players in gameplay (Isbister et al., 2012;
Barnett & Storm, 1981; Landers et al., 2019). In the context of GBAs, this may improve
attention and concentration given research linking testing anxiety with test performance (Moran,
2016). Furthermore, because GBAs are much more behaviorally complex than traditional g
tests, GBAs may capture criterion-relevant behaviors beyond those captured by traditional
assessments. Although this would reflect poorer construct validity, it might result in improved prediction, reflecting a common trade-off also seen in the arguments for the use of machine learning in assessment. Because several such mechanisms could plausibly operate in GBAs, and because we knew we would be unable to isolate any of these theoretical mechanisms
given the nature of GBA development, we approached the issue of incremental prediction as a
general question of interest with the intent of exploring game performance as both a latent variable and an observed composite, to better understand the relationship between this GBA and performance in relation to traditionally-measured g.
When considering the use of g measures in the employee selection context, the presence
and magnitude of adverse impact is of significant concern (Hough et al., 2001). Adverse impact
occurs when a test is biased by subgroup membership within a protected class defined by
national or local laws, resulting in different success rates by subgroup despite a consistently
applied testing standard. Protected classes vary by jurisdiction but may include race, sex, color,
religion, national origin, disability, age, or any other legally defined classifier. Additionally,
specific cutoffs are sometimes defined at which point adverse impact legally occurs (Equal
In the context of g, adverse impact research generally focuses on the “Black-White test
score gap” due to the relatively large population of Black people in the United States relative to
other racial and ethnic minorities, as well as the commonly observed mean score difference of
roughly one standard deviation between White and Black people on common g measures (Roth
et al., 2001). If the present g-GBA is indeed a measure of Spearmanian g, a difference of similar magnitude would be expected. Although such differences are primarily driven by persistent construct-level differences between populations (Reeve & Hakel, 2002), these differences can be widened by certain characteristics of both test design and the
context in which testing takes place. For example, the use of racially-biased items within a
questionnaire measure of a construct that is not itself racially biased can still lead to the
appearance of a racially biased test (Drasgow, 1987). Further, because the internal consistency
reliability of g measures is frequently quite high, relatively lower GBA reliability could also
attenuate observed racial differences. Thus, different characteristics of GBA could increase or
decrease observed differences beyond what is suggested by construct effects. In the context of
racial differences, there is no evidence to suggest that gameplay experience or game skill related
to a g-GBA would differ beyond what is already suggested by construct-level differences; thus,
there is no reason to expect that the Black-White test score gap would be enlarged by use of
GBA, although it could be attenuated due to relatively lower reliability. Unfortunately, due to a
low base rate of Black test-takers in our applied sample, this hypothesis test was limited to our
academic sample alone, which applies to all further race-related hypotheses and research
questions as well. Roth et al. (2001) found a Black-White test score difference of .69 standard
deviations among college students, so we used this estimated population effect as the basis for
our hypothesis.
Hypothesis 3. Black test-takers and White test-takers will differ in g-GBA performance
such that Black test-takers will score approximately .69 SD lower than White test-takers.
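As an illustration of the comparison implied by Hypothesis 3, the following minimal R sketch computes a standardized mean difference (d) in g-GBA scores between the two subgroups using a pooled standard deviation; the data object and variable names are assumptions for illustration, not the authors' analysis code.

# Hypothetical sketch of the subgroup comparison in Hypothesis 3.
white <- dat$gba_score[dat$race == "White"]   # assumed columns in 'dat'
black <- dat$gba_score[dat$race == "Black"]
pooled_sd <- sqrt(((length(white) - 1) * var(white) + (length(black) - 1) * var(black)) /
                    (length(white) + length(black) - 2))
d <- (mean(white) - mean(black)) / pooled_sd  # Hypothesis 3 expects d near .69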
Where theory does suggest g-GBAs might introduce additional bias is related to a
different set of protected classes, sex and gender, due to associated differences in game-playing
habits. As described by Brown (2017), the base rate of women who play games in the United
States is slightly lower than that of men (48% and 58%, respectively), suggesting less experience
among women, on average, playing video games. However, these rates vary greatly by genre.
The genres that the present g-GBA minigames most closely resemble vary. Whereas some
games more closely resemble classical cognitive tasks, others may be more firmly placed in a
particular genre. Of the set, the game called Shortcuts most closely resembles the puzzle genre;
within that genre, a 72% base rate for women is observed compared to a 52% base rate for men
(Brown, 2017), suggesting that, on average, women have greater puzzle-playing experience than
men. Thus, an overall difference is expected such that men perform better across the minigames,
yet women may regardless perform better on the types of minigames with which they have the
most prior relevant experience, such as those most clearly resembling prototypical puzzle video
games. Beyond experience alone, other knowledge or skills, such as psychomotor ability, could
also influence performance differently by gender across genres. Given the paucity of research on
the relationship between gameplay experience by genre and gameplay performance by genre, the
magnitude and even direction of likely bias across genres is unclear. Thus, we concluded that we could not reasonably predict the particular pattern of gender differences across minigames a priori.
Hypothesis 4. Self-identified males and females will differ in g-GBA performance such
that female test-takers will score lower than male test-takers, on average, across g-GBA tests.
Beyond the concept of adverse impact is test fairness, a legal classification that allows for
the adoption of a test showing adverse impact under certain conditions (Schmidt & Hunter,
1974). In the United States, test fairness is most commonly evaluated by examining differential
prediction of a meaningful criterion across subgroups (Newman et al., 2007). Within the Cleary
(1968) model of fairness, the dominant framework for evaluating fairness in the United States
legal system (Aguinis & Smith, 2007), differential prediction across subgroups is evaluated in
terms of differences in slopes, intercepts, and error variances (Schmidt, 1988). Although Cleary
stated that any such differences suggested an unfair test, more recent interpretations suggest a
test is fair if a common regression line fits all subgroups equally well (Meade & Tonidandel, 2010). For example, if cognitive ability is directly and causally related to job performance,
adverse impact in the prediction of job performance from a cognitive ability measure without
differential prediction could still be considered fair if it accurately predicts criterion differences. Differences in slopes reflect differences in the validity of test scores between subgroups and are the most
legally problematic; intercept differences without slope differences are generally considered fair
(Meade & Fetzer, 2009). As a result, the disentanglement of slope and intercept differences is
generally viewed as the most critical issue in evaluating fairness, and slope differences are used as the primary evidence of unfairness.
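In practice, a Cleary-model analysis of this kind is typically conducted as a moderated regression; the following is a minimal R sketch under assumed variable names (criterion, test_score, group), offered purely as an illustration and not as the authors' analysis script.

# Hypothetical illustration of a Cleary-model differential prediction test.
# 'dat' is assumed to contain a criterion, a test score, and a subgroup factor.
dat$group <- factor(dat$group)
common  <- lm(criterion ~ test_score, data = dat)          # common regression line
intonly <- lm(criterion ~ test_score + group, data = dat)  # adds intercept differences
full    <- lm(criterion ~ test_score * group, data = dat)  # adds slope differences
anova(common, intonly)  # test of intercept differences
anova(intonly, full)    # test of slope differences (most legally problematic)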
In the g-GBA context, if differential prediction were observed, it must arise from one of two sources: either the g construct or an aspect of the GBA method as designed to measure g.
Differential prediction from the g construct itself is unlikely; prior research has demonstrated
that although subgroup intercepts for the prediction of job performance from g differ, slopes
generally do not differ by subgroup unless there are test-specific causes, such as biased questions
(Kehoe, 2002). Thus, any observed differential prediction in the present study is more likely
attributable to the GBA method itself or more indirectly, the development process of the g-GBA
under study. The introduction of varying slopes by racial subgroup for a g-GBA is unlikely in
the context of race for the reasons described earlier regarding test bias, namely that there is no
evidence suggesting racial differences in game experience or game attitudes that would
contribute to greater trait diagnosticity for one race versus another. However, if women’s greater
mean experience with puzzle games was associated with decreased reliability for men, this
would attenuate slopes for the male subgroup, biasing those slopes towards zero. Other designs
that could elicit differential effects by group include the specific abilities being measured, the specific game mechanics employed, and other design artifacts.
Test-taker reactions are an important consideration when introducing new forms of assessments, especially in high-stakes testing contexts such as employee selection
(Hausknecht et al., 2004). Generally, test-taker reactions refer to the “attitudes, affect, or
cognitions” an individual might have about the testing process (Ryan & Ployhart, 2000, p. 566).
Specifically, test developers and researchers are concerned with test-taker perceptions of test
fairness (i.e., distributive justice, procedural justice), test-taker motivation, test-taker anxiety, and
general attitudes of the test-taker toward the test itself (Hausknecht et al., 2004). Test-taker
perceptions and reactions can impact a variety of outcomes, including actual test performance and intentions to accept job offers (Hausknecht et al., 2004), although these downstream effects tend
to be small (Ryan & Huth, 2008). In general, test-taker reactions to g assessments, particularly in
employee selection contexts, are slightly unfavorable, although g assessments vary in this regard
on many design dimensions, including novelty, immersion, features, and time pressure.
We contend that reactions to a g-GBA, if designed well for its audience, should be more
positive than to a traditional g test battery across most reaction dimensions given evidence from
the games literature and research on existing assessments incorporating game elements for five
reasons. First, there is empirical evidence supporting that adding game elements to assessments
can improve reactions. For example, Attali and Arieli-Attali (2015) gamified a computerized
assessment of mathematics knowledge by awarding points for correct answers and speedy
responses, which students found more enjoyable and motivating. Second, some gamified
assessments, which incorporate one or more game elements like animation, sound effects,
instantaneous feedback, varying difficulty, progress bars, and narrative contexts, have been
found to be perceived as face valid by authentic job applicants (Ferrell et al., 2015). Third, video
games more broadly have been found to decrease anxiety; for example, Mavridis and Tsiatsos
(2017) used GBA on a game-based learning platform to decrease test anxiety for graduate
students who reported that they did not feel like they were being tested despite rationally
knowing otherwise. Fourth, because GBAs are more behavioral in nature relative to multiple
choice methods, GBAs should provide an increased sense of an opportunity to perform (i.e.,
procedural justice). Fifth, animation and instant feedback on performance should improve the
sense of interpersonal warmth that traditional g assessments tend to lack (Anderson et al., 2010).
Method
The complete experimental protocol and analytic plan for this study were originally
submitted for ethics review to and deemed exempt by the Old Dominion University College of
Assessment"). A secondary protocol covering access and analysis of data collected under that
exemption was later submitted and deemed exempt by the University of Minnesota
author.
Participants
Academic Sample
Undergraduate students were recruited from a large public university in the middle-
Atlantic region of the United States for this study. Participants were recruited through traditional university-wide in-person recruiting, which included live recruiting at the campus quad, and through the Psychology department research participant pool. From university-wide recruiting, 394 people
volunteered and participated in the study. Each was compensated US$20 for approximately 2
hours of effort. Additionally, 428 students signed up through the psychology department research
participant pool and were compensated with course credit. Of these 822 students who began the study, data screening procedures recommended by Meade and Craig (2012) were applied. First, three bogus items were included
throughout the test (i.e., 82.7% correctly answered Disagree or Strongly Disagree to “I have
traveled to the Moon,” 84.5% to “I was on board the Titanic,” and 84.3% to “Select the option
that is at the left end of the scale for this question”). Second, participants responded to a
question directly asking if they had put an honest effort into all parts of the study and tried to
follow instructions, which was endorsed by 91.8% of participants. Excluding any case that
failed at least one of these four tests would have removed 29% of cases, so we adopted a slightly
less stringent criterion by excluding only cases that failed at least two of these tests, which
eliminated only 14% of cases. We also post-hoc compared our statistical tests between these
approaches, finding few differences. Ultimately, 633 cases (84.5% of valid participants; 76.6% overall) were retained for analysis.
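To illustrate the exclusion rule described above, a minimal R sketch follows; the column names are hypothetical and this is not the authors' screening script.

# Hypothetical illustration: a case is excluded if it fails at least two of the
# four checks (three bogus items plus the honest-effort item). Each assumed
# column in 'dat' is coded 1 = failed the check, 0 = passed.
fail_cols <- c("failed_bogus1", "failed_bogus2", "failed_bogus3", "failed_effort")
n_failed  <- rowSums(dat[, fail_cols])
retained  <- dat[n_failed < 2, ]   # keep cases failing fewer than two checks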
In this final sample, the mean age was 21.35 years (SD = 4.14). 42.0% self-reported as
biracial, and 17.2% reported another race or combination of races. 41.9% self-reported as male,
57.7% as female, 0.2% as transgender and 0.3% as other. 10.3% self-reported full-time
Organizational Sample
Employees of a single multinational organization, working across several small workgroups in Brazil, Canada, Colombia, India, Mexico, and the United States,
participated in a concurrent validation study. Due to privacy requirements within the company
for this type of data sharing, a specific breakdown by country was not made available to the
research team. Participation was voluntary for all employees, and the data collection process was conducted online. In total, 127 employees completed at least one mini-game. Within this group, supervisor ratings of
performance were available for 49 (41.5%). According to the organization sharing data,
supervisory ratings were missing for two reasons: 1) employees had either not worked at the
organization long enough to have participated in the annual review process or 2) less
commonly, their manager was noncompliant and chose not to participate in the mandatory
annual review process for any of their direct reports. This provides some rationale to support that
missingness in this sample was at random and not reflective of the characteristics of individual
employees. Additionally, the organization reported that they did not use highly cognitively-
loaded measures for either employment or promotion, suggesting only weak indirect range
restriction in cognitive ability in the sample. Within this sample, the mean age was 24.69 years
(SD=2.15), ranging from 22 to 32 years, and 28 (59.3%) were female. Employees had worked
Measures
outcome measure was captured. In the organizational sample, gender, race, and age were
traditional cognitive ability test battery, a reactions battery, and a demographics questionnaire
Specific cognitive ability tests were chosen from Stanek and Ones’ (2018) compendium of cognitive ability measures.
Visual processing. The first test, intended to measure visual processing, was the General
Aptitude Battery’s test of Spatial Aptitude (Hunter, 1980). In this test, participants were
presented with a series of figures illustrating a flat piece of paper with dashed lines indicating
fold points. Participants were asked to identify which of four three-dimensional shapes the flat piece of paper would look like when folded. This test had a 4-minute time limit and 10
items. Scores were the number of correct answers. Due to negative skew, the scores from this
test were reversed and logarithmically transformed. Coefficient alpha of the untransformed
Fluid reasoning. The second test, also adopted from the Educational Testing Service’s
Kit of Factor-Referenced Cognitive Tests (Ekstrom et al., 1976; Widaman, 1982), was the
locations test. In this test, participants were presented with five rows of dashes and gaps between
the dashes. In each of the first four rows, an X was placed over one of the dashes, creating a
pattern from row to row. Participants were asked to indicate which of five dashes in the fifth row
would need to be marked with an X to match the pattern on the other rows. This test had a 6-minute time limit.
Processing speed. Part 1 of the Chicago Non-Verbal Examination (Brown et al., 1936)
was included in the cognitive battery to assess processing speed. This test is a digit symbol test
(Salthouse, 1996; Conway et al., 2002), in which participants were given 12 different symbols
matched with the numbers 1 through 12 as a legend. Each item was one of the twelve symbols
from a continuously visible legend, and participants had to identify which number was associated
with each symbol presented as quickly as possible. This test lasted 2 minutes and 30 seconds and
Quantitative ability. The quantitative ability test was the General Aptitude Battery’s test
of Quantitative Reasoning (Hunter, 1980). It included 5 quantity comparison items in which two
quantities were presented. Participants were asked to determine whether quantity A was larger,
quantity B was larger, the two quantities were equal, or if the relationship between the two could
not be determined. This test had a 5-minute time limit. Coefficient alpha was .37.
Verbal ability. The verbal ability test was a set of practice questions from the verbal section of the Graduate Record Examination (Conrad et al., 1977; Burton et al., 2009; Kuncel et al., 2001; Kuncel et al., 2010), which involved 6 sentence-completion items with one or two
blanks. Participants selected one or two words from three to five choices available to complete
the sentence logically. This test had a 6-minute time limit. Due to positive skew, the verbal test
was logarithmically transformed. Coefficient alpha of the untransformed items was 0.60.
The focal g-GBA of this study was an early version of Cognify, which was developed using the
design theory outlined in the introduction of the present paper. Cognify as used in this data
collection effort consisted of seven distinct web-based minigames intended to assess general
cognitive ability by targeting combinations of second-stratum abilities within the CHC model.
Screenshots from each of these minigames appear in Figure 3. All games relied on user input via
pointing and clicking with a mouse. The entirety of the g-GBA took about 20 minutes to
complete, with each game ending automatically after a certain amount of time, creating some
processing speed requirements across all games, although the intensity of this requirement varied
by game. Many test-takers did not finish all possible levels and items within each game, as
expected during a speeded test. In the development of these games, it was recognized early in the
development process that targeting specific second-stratum abilities while ignoring others was
not feasible from a game development perspective without sacrificing desirable game mechanics,
so overlap was permitted. The ultimate specific abilities theoretically targeted varied as indicated
below.
Numbubbles (Gq/Gv). In Numbubbles, players were presented with a target value (e.g.,
7) and then a sequence of bubbles with each bubble containing a numerical formula (e.g., 3 + 4
or 12 x 2). The bubbles had a short lifespan of a few seconds before disappearing. The goal of
the game was to “pop” the bubbles that equaled the target value before they disappeared, while avoiding popping bubbles that did not equal the target value, across 10 rounds of varying difficulty and a 20-second time limit per round. Numbubbles was scored as the number of correct
pops, minus the number of incorrect pops, weighted by average time to a correct pop.
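As a concrete illustration of this scoring rule, the following minimal R sketch assumes a simple inverse-time weight; the exact weighting function used in Cognify is not reported here, so the function below is an assumption for illustration only.

# Hypothetical sketch of the Numbubbles scoring rule described above:
# (correct pops - incorrect pops), weighted by average time to a correct pop.
score_numbubbles <- function(correct_pops, incorrect_pops, mean_time_to_correct) {
  (correct_pops - incorrect_pops) * (1 / mean_time_to_correct)  # assumed weighting
}
score_numbubbles(correct_pops = 18, incorrect_pops = 3, mean_time_to_correct = 2.4)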
Figure 3. Cognify g-GBA minigames. Clockwise from top-left: Proof It, Tally Up, Resemble,
Numbubbles, Shortcuts, Colour Pop, Grid Lock. Screenshots of Shortcuts and ColourPop are
Resemble (Gf/Gv). In Resemble, players re-created a pattern, shown on one side of the
screen, by dragging puzzle elements onto a grid shown on the opposite side of the screen. Puzzle
elements, once dragged onto the grid, could be rotated, and solving each problem required
dragging the correct pieces to the correct positions and applying correct rotations. The test-taker
had a total of three minutes in which to complete as many puzzles as possible, up to a total of
nine. The timed nature of the puzzle also introduced a speed component. Resemble is the
revision of an earlier developed game which was itself inspired by the Block Design tests of the
Wechsler Adult Intelligence Scales. Resemble was scored as the number of game levels
completed.
Grid Lock (Gf/Gv). In Grid Lock, players assembled a set of puzzle pieces to mimic a
larger novel shape. There were nine rounds of varying difficulty and an overall three-minute time
limit. The size of the grid and the number of pieces that needed to be placed increased over time,
and in later levels, grid components needed to be rotated. Grid Lock was scored as the number of rounds completed.
Proof It (Gc/Grw). In Proof It, players were asked to identify, by tapping them, as many textual errors as possible in a set of sample texts across five rounds, with 5 minutes permitted across all rounds. Incorrectly tapping three different error-free locations resulted in progression to the next round. No points were deducted for incorrect taps. Proof It was scored as the number of errors correctly identified.
Tally Up (Gq/Gv). In Tally Up, players were presented with two sets of tokens across 35
rounds of varying difficulty with a time limit of 5 seconds per round. In each round, players were
asked to quantify the number of tokens on two sides of the screen and identify their relationship
to each other, which became more complex as the game progressed. It resembled Quantitative
Comparison questions from the Graduate Record Examination, with the addition of token value
modifiers and per-item time pressure. Tally Up was scored as the number of rounds with correct
responses.
Colour Pop (Gf). In Colour Pop, players completed a gamified version of a Stroop
(1935) task in which they observed a grid of colored words and were asked to identify all tiles
with words that matched the color of the target, e.g. red, irrespective of the color of the tiles.
There were 20 rounds which varied by the number of valid answers and the proportion of word
and color mismatches. The time limit for each round was 4 seconds, with an overall time limit of
approximately 2 minutes. Game elements most noticeably included spinning tiles, progress bars,
star counts and both audio and visual feedback mechanisms. Colour Pop was scored as the total
number of correct pops minus the number of incorrect pops minus the number of misses.
Short Cuts (Gf/Gq/Gv). In Short Cuts, players determined the shortest path to roll a
blue marble to a goal across seven puzzles over a maximum of four minutes. Each puzzle
consisted of a network of paths, with the distance of each path denoted by a numerical value,
where higher values indicated greater distance. Determining the shortest path required the user to
plan a path forward while considering multiple interacting factors, the complexity of which
varied across puzzles. Short Cuts was scored as “distance traveled,” which quantifies the number of points “spent” to solve all puzzles given all path values chosen. Final scores were reversed so that higher values reflected better performance.
Reactions
Seven different reaction measures were collected once for the GBA and once for the
traditional g battery across two broad categories. Each set of reactions measures was
administered immediately upon finishing either the traditional cognitive ability battery or the g-
GBA and adapted to inquire either about the “test” or the “game”. Intrinsic motivation and test anxiety were assessed with 7-point agreement scales; all others were assessed with 5-point agreement scales.
Affective outcomes. Intrinsic motivation was measured using the Interest/Enjoyment subscale of the Intrinsic Motivation Inventory (Ryan & Deci, 2000). A sample item was “These tests (games) were fun to do”. Test motivation was measured with a 5-
item scale (Arvey et al., 1990). An example item was, “I thought the tests (games) were fun.”
Test anxiety was measured using a 6-item scale (Arvey et al., 1990). An example item was “I
probably wouldn’t do as well as most other people who took these tests (played these games).”
Attitudinal outcomes. Distributive justice was measured using a 3-item scale (Smither et
al., 1993). An example item was “The test (game) results would accurately reflect how well I
performed on the examination (in the games)”. Procedural justice was measured using a 4-item
“chance to perform” measure (Bauer et al., 2001). An example item was “I could really show my
skills and abilities through these tests (games).” Job relatedness was measured with a 2-item
measure (Bauer et al., 2001). An example item was “It would be clear to anyone that these tests
(games) are related to the job.” Test propriety was measured using a 3-item measure (Bauer et
al., 2001). An example item was, “The content of the tests seemed appropriate.”
Outcome Variables
Outcome variables differed by sample. In the academic sample, GPA was captured from
historical university records. In the organizational sample, each employee’s most recent annual
supervisory ratings of job performance were collected which asked supervisors to assess the
extent to which employees met their goals set at the beginning of the previous year on a 4-point
scale. Supervisors were not required to match a predefined rating distribution; historically, approximately 67% of employees were rated "Fully Meets Expectations," less than 25% "Exceeds Expectations," and the remainder "Partially Meets Expectations." These ratings had been used historically and were actively being used by the organization.
Demographics
Basic demographic information was collected from the academic sample including age,
gender, race, ethnicity, employment status, and contact information for participant compensation.
In the organizational sample, only age, gender, race, and ethnicity were collected since all participants were current employees of the organization.
Procedure
Academic Sample
Two university computer labs were identified on campus for use in this study. The two
labs were chosen because they were verified to contain computers with sufficient processing
power to run the g-GBA at the intended speed and were in designated quiet spaces on campus.
As a result, all participants used nearly-identical computers in highly similar physical spaces.
Participants were instructed via email to use one of these computers and to navigate to a specific
webpage. Once on that webpage, participants reviewed informed consent documentation and
entered their student identification number before proceeding with the study via the Qualtrics
survey platform. The overall research design was a two-cell within-subjects design
counterbalanced by test completion order. Specifically, participants were assigned to either take
the g-GBA first or the traditional cognitive battery first but ultimately completed both. After
each test battery, participants completed a reactions measure battery. After both assessments and their associated reactions measures were completed, participants provided demographic information.
Organizational Sample
Employees at the participating organization were contacted and asked to complete the g-GBA and to release access to their performance data for
research purposes. Their supervisors were also contacted and asked to encourage their
employees to participate.
Results
Results in this section can be reproduced using the anonymized dataset and R (R Core Team, 2020).
Measurement
To evaluate the convergence of latent GBA performance with latent g, a series of confirmatory factor analyses (CFA) and larger structural equation
models (SEM) were fitted in sequence to test specific assumptions and measurement hypotheses.
Approximate standards for the evaluation of model fit were taken from Hu and Bentler (1999; p > .05, CFI > .96, RMSEA < .06, SRMR < .08). Absolute and relative fit indices were jointly interpreted given limitations in both (Marsh et al., 2009), and both compliance with and violations of these cut-offs were interpreted holistically rather than as strict decision rules (Markland, 2007). Because the size of the organizational sample was too small to
support CFA, all measurement analyses were conducted on the academic sample. A correlation
matrix calculated from the data used for these analyses appears in Table 1.
First, a CFA was conducted to determine if g was adequately measured by the five
specific ability tests identified. As shown in Figure 4, all absolute and relative fit index standards were achieved (χ²(5) = 5.81, p = .325, CFI = 1.00, RMSEA = .02,
SRMR = .02) and factor loadings were comparable to expected effect sizes given prior research
utilizing CFA on g measurement using specific ability tests (see Tucker-Drob & Salthouse,
2009); thus, we concluded that latent g was effectively represented. We also regressed GPA on
latent g, finding a standardized weight of .28, nearly equal to previously reported meta-analytic means.
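As an illustration of this step, the following is a minimal sketch in R using the lavaan package as one possible tool; the data frame dat, the columns test1 through test5 standing in for the five specific ability tests, and gpa are placeholder names rather than the variables used in the actual study.

library(lavaan)

# One-factor CFA: latent g indicated by the five specific ability tests
cfa_g <- 'g =~ test1 + test2 + test3 + test4 + test5'
fit_g <- cfa(cfa_g, data = dat, std.lv = TRUE)
fitMeasures(fit_g, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))

# Structural model regressing GPA on latent g to obtain the standardized weight
sem_gpa <- '
  g =~ test1 + test2 + test3 + test4 + test5
  gpa ~ g
'
fit_gpa <- sem(sem_gpa, data = dat)
standardizedSolution(fit_gpa)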
Next, latent game performance was modeled with CFA. In this case, however, we a priori
suspected that a traditional CFA, which is based upon the assumptions of classical test theory,
would demonstrate poor model fit. In the case of g measurement, commonly used tests generally
comply well with these assumptions; for example, each item on a well-constructed cognitive
ability test can generally be considered a randomly drawn item from a potential universe of items
assessing that ability. In the case of the g-GBA, by contrast, the development team was aware of residual cross-loadings
during development but could find no way to eliminate them entirely without removing many of
the “gameful” aspects of the experience. Knowing this a priori, we conducted this step of
analysis interactively, using diagnosis of modification indices, freeing notable item residual
covariances, and observing associated changes in fit. The initial model, a standard one-factor
CFA with uncorrelated errors, showed slightly poor fit as expected (χ²(14) = 55.63, p < .001, CFI = .92, RMSEA = .07, SRMR = .04). We next used modification indices to free the item residual covariance contributing the greatest misfit, one model re-fit at a time, until adequate fit was achieved on all indices, which occurred after freeing 2 of the 21 available covariances, as shown in Figure 5 (χ²(12) = 17.37, p = .14, CFI = .99, RMSEA = .03, SRMR
= .02), which was theoretically consistent with gameplay as described earlier; whereas both
Resemble and Gridlock required the mental manipulation of presented visuals, both Tally Up and
Colour Pop required quick quantitative reasoning. Nevertheless, the use of modification indices necessarily led to a model overfitted to some degree. To gauge the effect of these modifications, we conducted our later test of Hypothesis 1 both with and without these modifications.
Table 1
Correlation matrix
Figure 5. Final confirmatory factor analysis of latent GBA performance. All estimates are standardized.
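The iterative use of modification indices described above can be sketched as follows, again in R with lavaan and placeholder minigame column names; the two freed residual covariances shown correspond to the pairs described in the text, but the sketch is illustrative rather than the authors' actual syntax.

# One-factor CFA of the seven minigame scores (column names are placeholders)
cfa_game <- 'game =~ numbubbles + proof_it + resemble + gridlock + tally_up + colour_pop + short_cuts'
fit_game <- cfa(cfa_game, data = dat, std.lv = TRUE)

# Inspect the largest modification indices among residual covariances
mi <- modificationIndices(fit_game, sort. = TRUE)
subset(mi, op == "~~")

# Free one residual covariance at a time and re-fit until fit is adequate
cfa_game2 <- paste(cfa_game,
                   "resemble ~~ gridlock",
                   "tally_up ~~ colour_pop",
                   sep = "\n")
fit_game2 <- cfa(cfa_game2, data = dat, std.lv = TRUE)
fitMeasures(fit_game2, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))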
Finally, to enable our formal test of Hypothesis 1, we combined both CFAs into a single
SEM predicting latent GBA performance from latent g as measured by the traditional g test
battery. Slightly poor fit was observed (χ²(51) = 129.94, p < .001, CFI = .93, RMSEA = .05,
SRMR = .04), so modification indices were again examined, which revealed several potential
changes to improve model fit, but none were clearly theoretically consistent with the nature of
gameplay as in the previous model fitting. We nevertheless attempted to fit additional models
based upon the largest suggested freed disturbance covariances to observe the effects, but these
revised models did not meaningfully affect the core effect size of interest between g and game
performance. Thus, we decided to proceed with interpretation of the unmodified model to test Hypothesis 1; the exploratory re-fitting just described was only done to enable the following confirmatory test of the convergence between latent game performance and latent g.
In the final model, as shown in Figure 6, the relationship between latent game
performance and latent g was equal to .97, suggesting near-unity between the latent constructs
underlying both the cognitive ability test battery and the GBA's suite of minigames; only 6.9% of the variance in latent game performance was left unexplained. To test the hypothesis of unity formally, we created a new model constraining the variance of the
disturbance of latent gameplay to zero and compared these two nested models with a chi-squared
difference test. We found no significant difference between models (χ²diff = 131.09 - 129.94 =
1.15; p = .716), suggesting the simpler model constraining the relationship to unity should be
retained. Thus, we concluded that the latent game performance construct underlying
performance across Cognify minigames was in fact g. Hypothesis 1 was therefore supported.
As a post-hoc test of model robustness, we also re-ran the model shown in Figure 6 constraining
all mini-game disturbance covariances to zero, finding worse fit but no meaningful change to the
test of H1 (standardized path = .95; χ²(53) = 186.11, p < .001; CFI = .90, RMSEA = .06, SRMR = .04).
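A sketch of the combined structural model and the nested test of unity, using the same placeholder variable names as before; the constraint fixes the disturbance variance of latent game performance to zero, and the two models are compared with a chi-squared difference test.

# Combined SEM: latent game performance regressed on latent g
sem_unity <- '
  g    =~ test1 + test2 + test3 + test4 + test5
  game =~ numbubbles + proof_it + resemble + gridlock + tally_up + colour_pop + short_cuts
  resemble ~~ gridlock
  tally_up ~~ colour_pop
  game ~ g
'
fit_free <- sem(sem_unity, data = dat)

# Constrain the disturbance variance of latent game performance to zero
fit_constrained <- sem(paste(sem_unity, "game ~~ 0*game", sep = "\n"), data = dat)

# Chi-squared difference test between the nested models
anova(fit_free, fit_constrained)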
Performance Prediction
The finding that latent game performance was indistinguishable from latent g precluded
the need to assess the correlation between latent game performance and outcomes. Because
latent game performance on Cognify and latent g are in effect the same construct,
mathematically speaking, the relationship of latent game performance vs. GPA and latent g vs.
GPA must be extremely similar in magnitude if modeled using SEM. Thus, it was known a
priori that Hypotheses 2a and 2b would be supported using this approach. Given this, we instead
turned our attention to ways that these tests might be operationalized. Specifically, because we
had no a priori hypotheses about specific games, we chose to focus the remainder of our
hypothesis testing on unit-weighted composite scores, which addresses more practical concerns. Latent variable scores are rarely used in operational selection, whereas unit-weighted composites are common and easily applied (Oswald et al., 2014). Thus, to provide additional support for Hypotheses 2a and 2b, we examined the prediction of criteria from unit-weighted composites of each battery.
In the case of predicting GPA in the academic sample, Hypothesis 2a was supported.
The relationship was positive (r = .16 [.08, .24], p < .001), although somewhat smaller in magnitude than the cognitive ability test composite's relationship (r = .22 [.15, .30],
p < .001).
Figure 6. Final model predicting latent GBA performance from latent g, N = 633
Hypothesis 2b was also supported; supervisor ratings of job performance were also predicted (r = .29 [.00, .53], p = .047). In both cases, effect sizes were well within credibility
intervals and at similar magnitude as mean meta-analytic estimates previously observed for these
relationships (cf. Richardson et al., 2012, who found a mean observed correlation with GPA
across studies of .20; Hunter, 1984, who found a comparable mean observed correlation across studies with supervisory ratings of job performance). To address Research Question 1, we next conducted regression analyses to determine incremental prediction of GPA of each test battery composite score over
the other. These analyses appear in Table 2. Both the traditional test battery composite and the
GBA composite predicted the criterion. However, incremental prediction was only observed in
one direction, of the traditional test composite beyond the GBA. In combination with the results
from the test of Hypotheses 1 and 2, it appears that although latent g is reflected similarly in both
the traditional test battery and the GBA, the GBA contains additional criterion-irrelevant
information not contained within the traditional test battery that is attenuating its relationship
with the GPA criterion. Thus, it seems that although the GBA is clearly a g measure, it is a less
"pure" measure than the traditional cognitive ability test battery. The precise source and nature of this additional variance remain unknown.
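These composite-level analyses can be sketched as follows, assuming unit-weighted composites formed as means of standardized scores; the column names remain placeholders.

# Unit-weighted composites: means of standardized test and minigame scores
trad_cols <- c("test1", "test2", "test3", "test4", "test5")
gba_cols  <- c("numbubbles", "proof_it", "resemble", "gridlock",
               "tally_up", "colour_pop", "short_cuts")
dat$comp_trad <- rowMeans(scale(dat[, trad_cols]))
dat$comp_gba  <- rowMeans(scale(dat[, gba_cols]))

# Zero-order relationships with GPA
cor.test(dat$comp_gba, dat$gpa)
cor.test(dat$comp_trad, dat$gpa)

# Incremental prediction in each direction
anova(lm(gpa ~ comp_gba, data = dat),  lm(gpa ~ comp_gba + comp_trad, data = dat))
anova(lm(gpa ~ comp_trad, data = dat), lm(gpa ~ comp_trad + comp_gba, data = dat))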
To test Hypothesis 3, we compared GBA composite scores of the 184 Black and 275 White test-takers in the academic sample with a Welch two-sample t-test. The difference was statistically significant (t(409.34) = -8.08, p < .001, d = -0.77 [.57, .96]) and the predicted value of .69 from the Roth et al. (2001) meta-analysis was well within its confidence
interval, supporting Hypothesis 3. As a post-hoc test to triangulate upon this result, we next
examined racial differences in the g test composite scores, to see if this same pattern of findings
was observed. In this test, we observed a larger difference (t(427.77) = -10.23, p < .001, d = -
0.95 [.75, 1.15]), 0.19 standard deviations larger for the composite score calculated from the
traditional g test battery than from the composite score calculated from the GBA.
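The group-difference tests reported above can be sketched as Welch t-tests with an accompanying standardized mean difference; race is assumed to be a factor variable, and the effsize package is shown as one option for computing d.

# Welch two-sample t-test of Black-White differences on the GBA composite
bw <- droplevels(subset(dat, race %in% c("Black", "White")))
t.test(comp_gba ~ race, data = bw)   # Welch test is the default (var.equal = FALSE)

# Cohen's d with a 95% confidence interval
library(effsize)
cohen.d(comp_gba ~ race, data = bw)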
To test Hypothesis 4, we compared GBA composite scores between the 265 male and 365 female participants in the academic sample with another Welch t-test. The difference was statistically significant (t(499.99) = 5.72, p < .001, d = -0.48 [.32, .64]),
supporting Hypothesis 4. As with our test of Hypothesis 3, we post-hoc compared this to gender
differences on the g test composite, this time finding an effect slightly smaller than for that of the
GBA (t(535.77) = 3.78, p < .001, d = -0.31 [.15, .47]). Given the statistically and practically significant gender effect on the traditional test battery, it appears that most of the observed gender differences are attributable to the population studied rather than to the GBA method itself. To further explore Research Question 2, we examined gender differences on individual minigames, and these results appear
in Table 3. Gender differences disadvantaging female participants appeared for five games, with
effects ranging from small to large, whereas the other two games showed no advantage for either
men or women.
A nested regression approach to evaluating differential prediction was used for both racial and gender fairness. The results are displayed in Tables 4 and 5, respectively. In each Model 1, the criterion is regressed upon the composite; in each Model 2, upon class membership; and in each Model 3, upon the composite, class membership, and their interaction term. Using this method, up to three comparisons between nested models are conducted to determine fairness, as described below.
Table 2
Criterion-related validity
*p<.05 **p<.01
Table 3
Minigame M SD M SD t df p d LL UL
Colour Pop 0.68 0.38 0.68 0.39 0.09 579.47 .929 0.01 -0.17 0.15
Numbubbles 14.12 4.40 11.11 3.73 -9.01 510.98 <.001 -0.75 0.58 0.91
Resemble 5.74 1.92 4.74 1.80 -6.71 546.10 <.001 -0.55 0.39 0.71
Proof It 32.08 10.35 32.97 10.33 1.07 568.31 .286 0.09 -0.25 0.07
Short Cuts 140.46 41.92 132.86 43.25 -2.22 578.98 .027 -0.18 0.02 0.34
Gridlock 5.44 1.41 4.15 1.22 -2.75 519.01 .006 -0.23 0.07 0.39
Tally Up 22.73 4.02 21.56 3.69 -3.73 538.95 <.001 -0.31 0.15 0.46
N = 630; Female participant N = 365, Male participant N = 265; LL and UL are lower and upper limits of 95% confidence intervals surrounding d.
Table 4
Comparisons of Black-White differential prediction for cognitive ability tests versus GBAs
Table 5
Comparisons of female-male differential prediction for cognitive ability tests versus GBAs
First, Model 1 is compared to Model 3 to determine whether any differential prediction is present. Second, if that comparison is statistically and practically significant, either slope differences or slope and intercept differences in combination are present. Third, if the second step revealed slope differences, an additional test can be conducted to determine if intercept differences were observed in addition to slope differences. As shown in Table 4, which concerns racial differences, both the traditional
test battery and GBA met the fairness standard; both exhibited differential prediction by intercept
but not by slope. In Table 5, which concerns gender differences, the same pattern emerged.
Thus, it appears that the GBA is a fair test, across both classes of interest in this research
question. Additionally, this evidence supports the above-stated conjecture that gender
differences in the GBA may be mostly attributable to study population characteristics rather than
to GBA characteristics; it does not appear as if using a GBA removed existing construct-related
intercept differences.
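A sketch of the nested regression models, with group standing in for the two-level class variable (race or gender); the particular sequence of comparisons shown is one standard way of separating overall, intercept, and slope differences and is illustrative rather than the authors' exact procedure.

# Model 1: composite only; Model 2: class membership only; Model 3: both plus interaction
m1 <- lm(gpa ~ comp_gba, data = dat)
m2 <- lm(gpa ~ group, data = dat)
m3 <- lm(gpa ~ comp_gba * group, data = dat)

# Omnibus test of any differential prediction (Model 1 vs. Model 3)
anova(m1, m3)

# Intercept differences: does class membership add beyond the composite?
anova(m1, update(m1, . ~ . + group))

# Slope differences: does the interaction add beyond composite and class membership?
anova(update(m1, . ~ . + group), m3)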
Reactions
Paired-samples t-tests were conducted comparing reactions to the GBA and the traditional g test battery. To address missingness within the reactions data, single imputation was used (i.e., Amelia II; Honaker et al., 2011). These results are presented in Table 6. Universally, the GBA was preferred to the g test battery across all measures.
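These comparisons can be sketched as follows, assuming a data frame reactions of numeric scale scores with placeholder columns such as im_gba and im_test for the two administrations of the intrinsic motivation scale.

# Single imputation of missing reaction scores using Amelia II
library(Amelia)
imp <- amelia(reactions, m = 1)
r1  <- imp$imputations[[1]]

# Paired t-test for one scale (intrinsic motivation shown as an example)
t.test(r1$im_gba, r1$im_test, paired = TRUE)

# Cohen's d standardized by the average of the two variances (one of several options)
d <- (mean(r1$im_gba) - mean(r1$im_test)) /
     sqrt((var(r1$im_gba) + var(r1$im_test)) / 2)
d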
Discussion
This study is the first to integrate current theories of game design, taken from human-computer interaction and related fields, into a framework for the identification and development of GBAs. With a g-GBA built from that
theory for use in hiring decisions, we rigorously explored its measurement characteristics,
predictive accuracy, fairness, and reactions. Most centrally, we have demonstrated that a g-GBA likely can be designed and developed (i.e., engineered) to meet the same conceptual and psychometric standards (Sackett et al., 2017) as other more traditional assessments of those same constructs, and that participants preferred this g-GBA to the traditional battery. This furthermore provides a theoretically and empirically supported design process for creating new theory-driven GBAs.
Table 6
Measure α M SD α M SD t p d LL UL
Motivational
Intrinsic Motivation .92 4.52 1.42 .92 3.03 1.56 20.34 <.001 1.00 0.88 1.11
Test Motivation .80 3.78 0.56 .85 3.70 0.69 3.03 .003 0.13 0.05 0.22
Test Anxiety .83 2.82 0.88 .80 3.07 0.89 -8.04 <.001 -0.29 -0.36 -0.22
Attitudinal
Distributive Justice .68 3.21 0.84 .74 2.96 0.94 6.37 <.001 0.28 0.19 0.37
Procedural Justice .88 2.77 0.96 .90 2.49 0.98 8.22 <.001 0.30 0.23 0.37
Job Relatedness .64 3.97 0.75 .67 3.78 0.84 5.60 <.001 0.23 0.15 0.32
Test Propriety .83 2.67 1.03 .83 2.44 1.01 5.69 <.001 0.22 0.14 0.30
Note. t-tests and Cohen's d are calculated for paired comparisons. Positive t and d indicate greater scores for the GBA. The first set of α, M, and SD columns refers to the g-GBA and the second to the traditional test battery. LL and UL are lower and upper limits of 95% confidence intervals surrounding d.
It should not be inferred from this study, however, that GBAs as a class of methods are uniformly well designed or valid.
Current GBA vendors vary in their reliance upon both design theories and psychological
theories, and existing GBAs also vary widely in their implementation of specific game elements.
For example, whereas Cognify does not heavily integrate any sort of story, narrative, or fantasy
elements, these are commonly implemented in other GBAs currently in use in organizations to
unknown effect. Further research is needed on a much broader range of GBAs and GBA design
strategies before any firm conclusions can be drawn about GBAs in general. Much like Arthur
and Villado (2008), we emphasize the critical differences between methods and predictors in the
space of employee assessment; because GBA is a method, the specifics of design and
implementation are critical to understanding its best role in employee selection and should not be
ignored. Much as a survey measure can be well-designed or not in relation to its measurement
goals, so can a GBA. Further research is needed to understand if the GBA design theory
proposed here can serve as a foundation for high quality GBAs across constructs and contexts.
Even if so, more nuanced design theory will likely be needed for different measurement purposes and contexts. Just as psychometrics was born of a need to better apply statistics to the measurement of latent
psychological constructs (Buckhalt, 2002), new domain-embedded GBA design theories may be
needed to develop the highest quality psychometric assessments appropriate for selection
contexts (Ployhart et al., 2017). Thus, future research on GBA must explicitly consider and
explore design and development processes (Landers & Marin, 2021) in any GBA being
evaluated.
What we can safely conclude given the present results is that the design process studied
here resulted in a GBA of similar psychometric quality to a traditional g test battery. This draws
a theoretical distinction between a g-GBA’s latent performance construct and the only other
latent game performance construct in the research literature, gVG. Whereas Quiroga et al.
(2019) hand-picked a selection of commercially available video games to best reflect g, the
present study demonstrates how a design and development process can be used to create a novel
g-GBA for employee selection. In doing so, we also found a stronger relationship between latent
game performance and g than did Quiroga et al. (2019), with an estimate more similar to Quiroga
et al.'s (2015) result when focusing upon "brain training" Nintendo games. The difference in
results between these studies, especially in contrast to the present study, suggests that design
characteristics like the ones studied here are likely critical to understanding why and how GBAs
can measure traits. Although outside the scope of the present work, an interesting possibility
raised by Quiroga et al.’s work is the existence of a true gVG across all possible video games, a
set of skills or abilities associated with success in video games broadly. We encourage
researchers to continue down this theoretical path, as it might shed additional light on potential
trait confounds when using video games for measurement of any psychological construct.
The observation of adverse impact by race of similar magnitude as traditional g tests was
as predicted but was also disappointing. The idea that GBA somehow “removes” bias is a
common assertion among some GBA proponents in industry (e.g., Hak, 2019) which this study
directly informs. This also provides context for the approach if not the rhetoric of many GBA
vendors; for example, one of the largest GBA vendors, Pymetrics, claims its GBA to be “bias-
free”, explaining “we use a reference set of tens of thousands of people to check for any potential
biases, and we deweight inputs in our model until we produce a bias-free algorithm that is
compliant with the 4/5ths rule” (pymetrics.com, 2019). Thus, rather than their GBA somehow
“removing” adverse impact through some design tactic, it is done post-hoc by reducing the
influence of or dropping individual predictors showing adverse impact in the machine learning
algorithms that they develop. This reaffirms that inclusion of scores from cognitively loaded
tests within a selection battery, at least given the world’s current social and economic state, will
generally lead to adverse impact by race (Kuncel & Hezlett, 2010). GBA appears to neither solve nor worsen this problem.
Of greater concern was the observation of adverse impact by gender. Although the mean
effect across minigames disadvantaged women, most of this difference was also reflected in gender differences in the traditional cognitive ability test battery (dGBA = -0.48 vs. dTrad = -0.31).
We suspect the remaining gender difference (d = -0.17) is attributable to differences in the
visual-spatial nature of gameplay in some of the games, given prior work suggesting gender
differences in spatial abilities (Voyer et al., 1995). Specifically, the games in which women did
worse were also more heavily visual-spatial in their gameplay than the games demonstrating
parity. Because visual-spatial ability was represented in the traditional g battery through only a single test, this may have led to the observed difference in gender differences between
the traditional battery and the g-GBA. Although a new composite could be created in the present
dataset utilizing only those tests showing no difference, this would capitalize upon chance to
some degree, and any observed lack of gender effect of such a composite would be of unknown
generalizability. Most importantly, even with the existing set of minigames, there was no
evidence of differential prediction of GPA by gender, by either slope or intercept; thus, despite
the observation of adverse impact, the prediction of GPA from both the traditional cognitive ability test battery and the GBA appears comparably fair for men and women.
These results in combination with our analysis of RQ1 raise new theoretical questions
about g-GBA test construction. Specifically, because there was incremental prediction of GPA
by the cognitive ability test composite beyond the g-GBA test composite but not the reverse, this
suggests that although the GBA composite score contains the same information about g that the
cognitive ability test battery composite does, it is also contaminated to a degree by gender-
relevant (and g-irrelevant) variance; in short, there is evidence of construct contamination but not deficiency. Future research should explore which game mechanics, dynamics, and aesthetics are most likely to exacerbate gender differences in the measurement of cognitive ability and, conversely, what might be done to remove them. Assuming
that the pattern of gender differences among minigames was due to legitimate differences in the specific abilities assessed, alternative approaches to forming predictor composites could be used to reduce these gender differences (Sackett & Ellingson, 1997). Future research
should therefore also explore how scores are best used in practice to make actual selection
decisions, and if such strategies have other unintended consequences, such as decreased validity.
Although reactions to the GBA were universally more positive than to the g test battery,
effect sizes varied and were generally small to moderate. Whereas intrinsic motivation was 1.00
standard deviations more positive, other improvements were more modest, ranging from 0.13 to
0.30. A key limiting factor in this study may be the nature of g testing, to which reactions are
already generally poor (Hausknecht et al., 2004). Because evaluation of g requires identification
of correct answers, frustration when unable to determine a likely correct response and move
forward to the next question may negatively influence g test reactions (Chan et al., 1997). In a
GBA, there is still feedback as to correct answers, but there is less time for assessees to ruminate on incorrect answers if the game is designed to, and succeeds in, absorbing assessees in the flow of game demands. In GBAs designed to assess constructs that lack "correct answers,"
such as personality testing, reactions to GBAs may be more positive; however, the precise effect
of the presence or absence of such gameplay flow is unclear. Further research is required, which
should investigate such interactive effects between constructs targeted and GBA design features,
as well as how specific game design decisions do or do not contribute within any particular GBA.
These results also raise a key issue unique to GBAs when deployed within the employee selection context: cost. Because a GBA is likely far more expensive to develop than a traditional g test, positive utility for the use of GBAs is of concern. Given the resources required, it would rarely make sense for a single organization intending to develop its own internal selection tools to create its own GBAs. Given the
expense, it is unlikely that any reactions benefit from such a move would outweigh the
development costs. However, an independent consultancy licensing such a test could see utility
if deployed to a broad range of organizations. Thus, we contend that the most likely context for
positive utility from GBAs for at least the next several years will be in consultancies providing that
GBA to many different firms, which in turn implies that most GBAs will be intended to assess
broadly useful individual differences where there is significant demand, such as cognitive ability,
styles, teamwork, or self-directed learning. In this way, GBA more directly compares with
assessment centers than other assessment methods in terms of key strengths, but without the
logistics costs and overhead typically associated with assessment centers. For companies
choosing to adopt GBAs, the potential utility gains are more obvious – if a g-GBA can be
administered at the same cost, with the same psychometric strengths, and with better applicant
reactions in comparison to a traditional g test battery, there are few compelling reasons to
continue using traditional g test batteries in practice. Organizations should consider all such
dimensions of utility, both in terms of immediate predictive gains and larger-scale strategic
business concerns, when making such adoption decisions (Roth & Bobko, 1997).
The findings here also relate to the nascent literature on assessment gamification (e.g.,
Georgiou et al., 2019; Landers et al., 2020). As described in the Measures section, some of
Cognify’s games began as traditional cognitive tasks and were gamified (i.e., Resemble, Colour
Pop), some were directly inspired by existing games (i.e., Numbubbles, Gridlock, Tally Up), and some were creative interpretations of other concepts (i.e., Short Cuts, Proof It). In some
ways, this made the gamified assessments less challenging to develop than others, in that the
basic concepts of gameplay were inferred from the existing task structure as a starting point for
game design, but in other ways were more challenging to develop due to the restrictions that
existing task definitions created. For example, because Colour Pop began as a Stroop test, the
design team wanted to maintain the classic Stroop elements regardless of other changes
suggested through user assessments in iterative prototyping; in contrast, because Short Cuts was
based on a novel idea, there was no aspect of the game that was “off limits” for changes during
development. In this way, the science of gamification (Landers, 2018) might inform some
aspects of GBA design just as GBA design might inform gamification; it is thus important for
progress in both that the two literatures do not grow completely independently.
Practical Implications
In stark contrast to a few decades ago, a key concern in modern assessment design is
maximizing positive applicant reactions, and GBA design makes very explicit the central role of user
experience in the test development process. Specifically, game design and thus GBA design
prioritize consumer expectations and experience in a way not commonly seen in traditional
assessment design. The introduction of internet technologies has flattened job application
pathways such that for many organizations, the application process has become a bidirectional
transaction (Singh & Finn, 2003). User experience and brand reputation now play an important
role in attracting talent. Because cutting-edge technologies have been found to positively affect applicant perceptions of organizational image (Bartram & Hambleton, 2006; Sinar et al.,
2003), GBA design can enable rigorous measurement while improving applicant perceptions and
organizational impressions. Further iteration upon the design of the present GBA might result in
further improved perceptions. Such benefits from either initial deployment or redesign are not
guaranteed, however, and both require significant investment in high quality design and
development processes.
Developing a GBA requires a high degree of effort from diverse, multidisciplinary teams and stakeholders outside of the traditional assessment community. Across disciplinary perspectives, values and methods vary greatly, creating new challenges in relation to process losses, team coordination, and conflict. Game designers, engineers, artists, and others may all be deeply invested in assessment development, which if not
carefully managed can create significant problems with team cohesion and team commitment.
For example, in the experience of the present authors, game designers typically prioritize “fun,”
engineers typically prioritize system sustainability, artists typically prioritize aesthetics, and
assessment designers typically prioritize psychometric rigor. If properly managed, the resulting
frictions can lead to higher quality assessments in both the psychometric sense and in terms of
applicant perceptions, but the specific “best” path to achieve that remains unclear. In the present
article, we have described the design theory that drove the organization that developed Cognify,
but there is no guarantee that another assessment firm would find the execution of a design
strategy from that theory as effective. Furthermore, there is no guarantee that the same
development process would work equally well even for the same developer if assessing a different construct. Thus, the present study does not provide a single set of best practices but instead provides guidance on
how to reduce risk through a cautious marriage of game design theory and classic test
development practices.
In GBA development, technical design related to the assessment delivery platform is significantly more important for assessment practitioners than in traditional assessment development. Whereas technical teams often are
tasked with “implementation” in traditional development, such that the “assessment team”
creates the assessment and the “technical team” is responsible for placing the assessment online
and collecting data, all members in GBA development teams need to develop expertise in not
only psychometrics but also software architecture and game design theory. This is likely to push
many traditional assessment experts far outside of their core expertise, yet developing new
expertise is critical to ensure psychometric rigor in GBA. Complex technical concerns can arise,
such as the specific equipment required to maintain the number of interactions per second
necessary to ensure the integrity of collected data given a particular GBA design. With
insufficient technical expertise, a traditional assessment specialist might not even realize why
such a restriction could harm the psychometric properties of the GBA. Thus, any assessment
firm seeking to develop GBAs should carefully evaluate if they have not only adequate technical
resources but also the necessary resources to train key personnel across disciplinary lines.
An important set of practical caveats for the application of these results concerns 1) the generalizability of the design theory to other constructs, 2) the impact of development strategies, and 3) the impact of deployment strategies. First, we have presented here a design
theory which can feasibly be used to develop any theory-driven GBA in which a specific
construct has been a priori targeted by game design. However, there is generally a much greater
and more comprehensive literature on g than on most other traits, which gave the developers a
more solid foundation for design in the CHC model than might be found when targeting other
constructs with these design theories. As such, we caution practitioners against
considering this GBA as prototypical. There are likely to be many challenges in design
unexplored here inherent to any such effort, and another game developed using this approach is
not guaranteed to be successful. For example, it is unclear at this time as to the specific cause of
increases in applicant reactions; this could have been caused by the novelty of the experience,
improved affective reactions to the interface, the quick gameplay, or any of many other game
characteristics acting in concert. Second, the present study only explored one development approach, theory-driven GBA design. Although data-driven GBAs likely share many of the same design influences, the psychometric concerns and iterative strategies necessary for successful
measurement are likely quite different in data-driven GBAs than in theory-driven GBAs. The
lack of a priori focus on constructs in data-driven GBA does not necessarily condemn such
methods, but it may increase the analytic burden on game developers in ways not explored here.
Within the g construct, this GBA focused upon speeded tests for practical reasons related to
reducing cheating (Arthur et al., 2010), but this choice could have led to additional confounds
that affected validity in numerous ways (Lu & Sireci, 2007). Further, in pursuit of more exciting
gameplay, the GBA did not enable clean separation of measurement occasions, which precluded
meaningful estimation of internal consistency reliability. For this reason, and based upon our own experiences in industry, we believe most GBA developers rely instead upon test-retest reliability estimates, yet this requires additional data collection efforts or novel calculation
strategies (Weiner & Sanchez, 2020). Alternative game designs and scoring models could also
avoid this problem. Third, in practice, due to concerns about re-testing effects (Villado et al.,
2016), the company that developed this GBA in its practice does not allow it to be administered
to the same job applicant more than once every twelve months, and the present study did not
examine re-test effects. There is no research exploring if GBA methods amplify or attenuate retest
effects in relation to such issues, or if such concerns could be engineered out during game
development. Across these concerns, there is still much remaining about which we simply have
little data. Current caution and further research, especially in applied settings, are needed on all
of these issues.
One limitation to the current empirical study is the generalizability of the measurement
properties of the Cognify GBA to the measurement properties of other GBAs. As GBAs could
theoretically be built to assess any construct (cf. Arthur & Villado, 2008), with an indefinite
number of potential design processes, no “prototypical” GBA exists or ever will, much as there
can be no “prototypical” questionnaire. One GBA design might approach cognitive ability
measurement through the gameplay of a first-person shooter whereas another might approach it
through puzzles. A design intended to measure a personality trait might be completely different.
This is not a limitation unique to GBA; for example, most researchers would not expect validity evidence from a single questionnaire to generalize to all "questionnaires." Instead, much as has been done for questionnaires, a body of evidence
regarding GBAs must be curated. Having said that, researchers must also be careful not to
assume that this justifies a case study approach to GBAs with traditional assessment outcomes.
For example, the mere existence of a GBA that produces scores that correlate with an outcome is
not theoretically interesting unless there is some evidence as to the underlying reason. For the
present GBA, we were able to provide construct validity evidence supporting g as that reason
both due to the GBA’s design process and the data collected. For other GBAs, we hope to see
similar types of evidence, along with detailed descriptions of the design methods that produced
them. In the case of data-driven GBAs, more atypical forms of validation evidence might be
useful, such as evidence from response processes through think-alouds during gameplay.
A second limitation to the present study is the composition of the two samples and the
constraints associated with each. Practical constraints limited our larger data collection effort to
an academic sample, which served as the principal sample for hypothesis testing, plus only a small supplemental organizational sample. GPA is derived from non-organizational data, and the organizational sample only provides criterion-related
validity evidence with wide confidence intervals. This effect, due to the small organizational
sample size used to measure it, should not be over-interpreted; it is unlikely this correlation
would generalize to other organizations, and the stability of this estimate even as an estimate of
the population effect for the studied organization is poor. Additionally, both samples likely
exhibit some degree of range restriction that may have attenuated some observed relationships,
particularly those with criteria, below their true score values (Sackett & Yang, 2000). Given this,
we recommend the results from the organizational sample be viewed as tentative and have based
most of our conclusions upon results from the academic sample, yet the generalizability of
results from the academic sample to an organizational context is unknown. Future researchers
should prioritize seeking out larger samples with authentic employees to better understand under what conditions these results generalize.
A third limitation of the present study is in the generalizability of the design method
employed to GBA design more broadly. Although we focused here on theory-driven GBA, this
is only one type of GBA currently in the assessment marketplace. The other major type, as
discussed earlier, relies upon computational psychometrics to develop its measurement models.
We call this data-driven GBA, and it typically takes a very different development process.
Specifically, the initial development of data-driven GBAs tends to be based upon content
validation; an assessment designer has a holistic idea for a game, either borrowed from the
research literature (e.g., neuroscience) or driven by marketplace needs (e.g., a client requests a
“leadership game”), and a game is created based upon that idea. Then, all data collected,
including both scores and trace data like mouse clicks, are inputted into either unsupervised
machine learning models to develop categories of players or into supervised machine learning
models to predict some outcome of interest directly from these messy data. This type of
assessment data mining is much more common in educational GBAs (Mislevy et al., 2012)
where precise explication of the constructs being measured is somewhat less important than in
the employment context. It is currently unknown how the present results would have changed if
a data-driven GBA had been used instead, and it is also unknown what additional insights could
have been gained from applying these techniques to the trace data produced by the present GBA.
Both are compelling directions for future research and perhaps critical to the evolution of GBA as an assessment method.
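As a purely illustrative sketch of this data-driven approach, and not an implementation of any particular vendor's method, hypothetical per-player trace features could be clustered to derive categories of players and also used to predict an external criterion.

# 'trace' is a hypothetical data frame of per-player features (e.g., click counts,
# response latencies); 'perf' is a hypothetical external criterion
set.seed(1)

# Unsupervised: derive categories of players from trace features
clusters <- kmeans(scale(trace), centers = 4)
table(clusters$cluster)

# Supervised: predict the criterion directly from the trace features
summary(lm(perf ~ ., data = data.frame(perf = perf, trace)))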
Conclusion
In general, there are great possibilities for the value of such currently-untapped data streams. Whereas traditional assessments present each item as a distinct task to complete, GBAs integrate challenges, problems, and high-complexity tasks more
seamlessly into a continued experience as part of a narrative or in pursuit of a broader end goal
(Mislevy et al., 2014). In other words, data analysts examining the results of a traditional
assessment can see a response to a question, but the process by which the assessee arrived at that
response is neither captured nor considered. In more complex and interactive games, players
self-direct efforts and pursue individual choices in relation to how they want to progress,
navigate through space, investigate and accomplish goals (Shute, 2011), and GBAs can provide
the facility to record and monitor changes in candidate temporal micro-patterns or strategic
shifts, as well as the context in which these changes occur. The richness of such data provides
significant promise for the future development of GBAs beyond what traditional assessment is capable of by providing evidence of assessees' thinking, which can itself be designed to meet the assumptions of psychometric models (Plass et al., 2012). For example, Rupp et al. (2012) described a game-based assessment in which students worked through the configuration of a computer network. Simple traditional outcome measures, such as how many
mistakes or correct responses students made, were not as valuable at distinguishing student
proficiency as various combinations of latent metrics from the data set. Students' metacognitive skills along the way, their approach (e.g., time taken, number of commands input, proportions of commands), efficiency, and strategy usage (e.g., switching between computer devices) provided
deeper insight over and above the evidence of their final solutions (Rupp et al., 2012). Future
research is needed to explore the potential of GBAs to provide rich data regarding the automated assessment of such response processes.
References
Aguinis, H., & Smith, M.A. (2007). Understanding the impact of test validity and bias on
selection errors and adverse impact in human resource selection. Personnel Psychology,
60, 165-199.
Anderson, N., Salgado, J. F., & Hülsheger, U. R. (2010). Applicant reactions in selection:
Apperley, T. H. (2006). Genre and game studies: Toward a critical approach to video games.
Armstrong, M.B., Ferrell, J., Collmus, A. B., & Landers, R. N. (2016). Correcting
misconceptions about gamification of assessment: More than SJTs and badges. Industrial
Armstrong, M. B., Landers, R. N., & Collmus, A. B. (2016). Gamifying recruitment, selection,
Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of
ability and personality. International Journal of Selection and Assessment, 18(1), 1–16.
Arthur, W. & Villado, A. J. (2008). The importance of distinguishing between constructs and
methods when comparing predictors in personnel selection research and practice. Journal
Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test
Bartram, D. & Hambleton, R. K. (2006). Computer-based testing and the internet: Issues and
Bauer, T. N., Truxillo, D. M., Sanchez, R. J., Craig, J. M., Ferrara, P., & Campion, M. A. (2001).
Beck, Kent (1999). "Embracing Change with Extreme Programming". Computer. 32 (10): 70–
77. doi:10.1109/2.796139.
Bertua, C., Anderson, N., & Salgado, J.F. (2005). The predictive validity of cognitive ability
387-409.
Bhatia, S., & Ryan, A. M. (2018). Hiring for the win: Game-based assessment in employee
selection. In The brave new world of eHRM 2.0. (pp. 81–110). IAP Information Age
Publishing.
Bjogvinsson, E., Ehn, P., & Hillgren, P.-A. (2012). Design things and design thinking:
Borman, W. C. & Motowidlo, S. J. (1997). Task performance and contextual performance: The
Brown, A. (2017). Younger men play video games, but so do a diverse group of Americans.
tank/2017/09/11/younger-men-play-video-games-but-so-do-a-diverse-group-of-other-
americans/
Brown, A. W., Stein, S., & Rohrer, P. L. (1936). Chicago non-verbal examination. Psychological
Corporation.
Burton, N. W., Welsh, C., Kostin, I., & VanEssen, T. (2009). Toward a definition of verbal
doi:10.1002/j.2333-8504.2009.tb02190.x
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of
Carretta, T. R., & Ree, M. J. (1996). Factor structure of the Air Force Officer Qualifying Test:
Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York:
Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A. & Hogan, R. (2016). New talent
signals: Shiny new objects or a brave new world? Industrial and Organizational
Psychology, 9, 621-640.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Delbridge, K. (1997). Reactions to
cognitive ability tests: The relationships between race, test performance, face validity
perceptions, and test taking motivation. Journal of Applied Psychology, 82, 300-310.
Cleary, T.A. (1968). Test bias: Prediction of grades of Negro and White students in integrated
Colzato, L. S., van Leeuwen, P. J. A., van den Wildenberg, W. P. M., & Hommel, B. (2010).
https://www.frontiersin.org/articles/10.3389/fpsyg.2010.00008/full. doi:
10.3389/fpsyg.2010.00008
Conrad, L., Trismen, D., & Miller, R. (Eds.). (1977). Graduate Record Examinations technical
Diehl, V. A. (2014). Using real-world and standardized spatial imagery tasks: Convergence,
imagery realism, and gender differences. Applied Cognitive Psychology, 28, 789-798.
doi:10.1002/acp.3061
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests.
Egenfeldt-Nielsen, S., Smith, J. H., & Tosca, S. P. (2013). Understanding video games: The
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Kit of factor-referenced
Springer US.
Gee, J.P. (2007). What video games have to teach us about learning and literacy (2nd ed.). New
York: Palgrave.
Georgiou, K., Gouras, A., & Nikolaou, I. (2019). Gamification in employee selection: The
Gustafsson, J. E., & Balke, G. (1993). General and specific abilities as predictors of school
Hak, A. (2019). How to remove hiring bias through gamification. The Next Web. Retrieved from
https://thenextweb.com/work2030/2019/05/20/how-to-remove-hiring-bias-through-
gamification/
Hamstra, S. J., Brydges, R., Hatala, R., Zendejas, B., & Cook, D. A. (2014). Reconsidering
Handler, C. (2018, June 19). The truth about game-based talent assessments. Retrieved from
https://www.ere.net/the-truth-about-game-based-talent-assessments/
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal
Horn, J.L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D.P. Flanagan, J.L.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Huizinga, J. (2016). Homo ludens: A study of the play-element in culture. Kettering, OH:
Angelico Press.
Hunicke, R., LeBlanc, M., & Zubek, R. (2004). MDA: A formal approach to game design and
Hunter, J. E. (1980). Test validation for 12,000 jobs: An application of synthetic validity and
validity generalizations to the General Aptitude Test Battery (GATB). Washington, DC:
Hunter, J. E. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and
Ip, C. (2018, May 4). To find a job, play these games. Engadget. Retrieved from
https://www.engadget.com/2018/05/04/pymetrics-gamified-recruitment-behavioral-tests/
Jones, S. E. (2008). The meaning of video games: Gaming and textual strategies. New York,
NY: Routledge.
Kehoe, J. F. (2002). General mental ability and selection in private sector organizations: A
(Ed.), Proceedings of the 19th International Academic MindTrek Conference (pp. 26-32).
Kuncel, N. R. & Hezlett, S. A. (2010). Fact and fiction in cognitive ability testing for admissions
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the
doi:10.1037//Q033-2909.127.1.162
Kuncel, N. R., Klieger, D. M., Connelly, B. S., & Ones, D. S. (2013). Mechanical versus clinical
Kuncel, N. R., Wee, S., Serafin, L., & Hezlett, S. A. (2010). The validity of the Graduate Record
doi:10.1177/0013164409344508
Landers, R. N., Auer, E. M., Collmus, A. B., & Armstrong, M. B. (2018). Gamification science,
its history and future: Definitions and a research agenda. Simulation & Gaming, 49(3),
315–337.
Landers, R. N., Auer, E. M., & Abraham, J. D. (2020). Gamifying a situational judgment test
with immersion and control game elements: Effects on applicant reactions and construct
review of technology integration paradigms and their effects on the validity of theory.
Landers, R. N., Tondello, G. F., Kappen, D. L., Collmus, A. B., Mekler, E. D., & Nacke, L. E.
Replacing the term ‘gamefulness’ with three distinct constructs. International Journal of
Human-Computer Studies.
Lang, J. W. B., Kersting, M., Hulsheger, U. R., & Lang, J. (2010). General mental ability,
narrower cognitive abilities, and job performance: The perspective of the nested-factors
analysis for examining hypotheses about test bias in prediction. Applied Psychological
Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement:
doi:10.1016/j.paid.2010.05.010
Mathieu, J. E., Hollenbeck, J. R., van Knippenberg, D., & Ilgen, D. R. (2017). A century of
102(3), 452.
Mavridis, A. & Tsiatsos, T. (2017). Game-based assessment: Investigating the impact on test
anxiety and exam performance. Journal of Computer Assisted Learning, 33, 137-150.
doi:10.1111/jcal.12170
Meade, A. W. & Fetzer, M. (2009). Test bias, differential prediction, and a revised approach for
Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in
Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A., Hao, J., … John, M. (2014).
https://www.envisionexperience.com/~/media/files/blog/glasslab-
psychometrics.pdf?la=en
Oswald, F. L., Putka, D. J., & Ock, J. (2014). Weight a minute... What you see in a weighted
composite is probably not what you get. In C. E. Lance & R. J. Vandenberg (Eds.), More
Statistical and Methodological Myths and Urban Legends: Doctrine, Verity and Fable in
Plattner, H. (2011). Foreward. In H. Plattner, C. Meinel, & L. Leifer (Eds)., Design Thinking:
Plattner, H., Meinel, C., & Leifer, L. (2011). Design thinking: Understand, improve, apply.
Ployhart, R. E., Schmitt, N., & Tippins, N. T. (2017). Solving the Supreme Problem: 100 years
Primi, R. (2014). Developing a fluid intelligence scale through a combination of Rasch modeling
Markland, D. (2007). The golden rule is that there are no golden rules: A commentary on Paul
Marsh, H. W., Hau, K.-T., & Wen, Z. (2009). In search of golden rules: Comment on
hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in
320-341.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Mitchell, I. (2016). Agile development in practice. Tamare House. p. 11. ISBN 978-1-908552-
49-5.
Mollick, E. R. & Rothbard, N. (2014). Mandatory fun: Consent, gamification and the impact of
games at work. The Wharton School Research Paper Series. Retrieved from
https://ssrn.com/abstract=2277103
Morgeson, F. P., Brannick, M. T., & Levine, E. L. (2019). Job and work analysis: Methods,
Mount, M. K., Oh, I.-S., & Burns, M. (2008). Incremental validity of perceptual speed and
Nagarajan, A., Allbeck, J. M., Sood, A., & Janssen, T. L. (2012). Exploring game design for
https://doi.org/10.1109/CYBER.2012.6392562
Newman, D. A., Hanges, P. J., & Outtz, J. L. (2007). Racial groups and test fairness, considering
Olsen, J., Aleven, V. & Rummel, N. (2017). Statistically modeling individual students’ learning
Oswald, F. L., Saad, S. & Sackett, P. R. (2000). The homogenity assumption in differential
prediction analysis: Does it really matter? Journal of Applied Psychology, 85, 536-541.
Plass, J. L., Homer, B. D., Kinzer, C. K. & Perlin, K. (2012). Games for learning institution
https://gamesandimpact.org/wp-content/uploads/2012/09/PlassNYU-Ideas-for-Impact-
Games-2.pdf
Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and
alternative measures to predict job performance and reduce adverse impact: Corrected
13,
Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods: New
https://doi.org/10.1177/1094428117697041
Quiroga, M. A., Diaz, A., Román, F. J., Privado, J., & Colom, R. (2019). Intelligence and video
Quiroga, M. Á., Escorial S., Román F. J., Morillo D., Jarabo A., Privado J., et al. (2015). Can we
reliably measure the general factor of intelligence (g) through commercial video games?
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation
Ree, M. J., & Carretta, T. R. (1994). Factor analysis of the ASVAB: Confirming a Vernon-like
Reeve, C. L. & Hakel, M. D. (2002). Asking the right questions about g. Human Performance,
15, 47-74.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences
Roth, P. L., & Bobko, P. (1997). A research agenda for muti-attribute utility analysis in human
Rupp, A. A., DiCerbo, K. E., Sweet, S. J., Crawford, A. V., Calico, T., … Behrends, J. T. (2012).
Putting ECD into practice: The interplay of theory and data in evidence models within a
Ryan, R. M. & Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic definitions and
Ryan, A. M., & Huth, M. (2008). Not much more than platitudes? A critical look at the utility of
https://doi.org/10.1016/j.hrmr.2008.07.004
decisions: A critical review and agenda for the future. Journal of Management, 26(3),
565-606.
Ryan, R. M., Rigby, C. S., & Przybylski, A. (2006). The motivational pull of video games: A
doi:10.1007/s11031-006-9051-8
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on
Sackett, P. R., Lievens, F., Van Iddekinge, C. H., & Kuncel, N. R. (2017). Individual differences
Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology.
Salgado, J.F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J.P. (2003). A
meta-analytic study of general mental ability validity for different occupations in the
Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment
Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there
Schmidt, F. L. & Hunter, J. E. (1974). Racial and ethnic bias in psychological tests: Divergent
Schmidt, F.L. & Hunter, J. E. (1998). The validity and utility of selection methods in personnel
Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model of intelligence. In, D.
Salen, K., Tekinbas, K. S., & Zimmerman, E. (2004). Rules of play: Game design fundamentals.
Schmidt, F. L. & Hunter, J. (2004). General mental ability in the world of work: Occupational
attainment and job performance. Journal of Personality and Social Psychology, 86(1),
162-173. doi:10.1037/0022-3514.86.1.162
Sinar, E. F., Reynolds, D. H., & Paquet, S. L. (2003). Nothing but ‘net? Corporate image and
Singh, P. & Finn. D. (2003). The effects of information technology on recruitment. Journal of
Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant
6570.1993.tb00867.x
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and
http://www.siop.org/_principles/principles.pdf
Stanek, K. C. & Ones, D. S. (2018). Taxonomies and compendia of cognitive ability and
Psychology and Employee Performance (pp. 366-407). Thousand Oaks, CA, SAGE
Publications Ltd.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item
497-508. doi:10.1037/0021-9010.89.3.497
Stauffer, J. M., Ree, M. J., & Carretta, T. R. (1996). Cognitive-Components Tests Are Not Much
France and the United States. Journal of Applied Psychology, 81, 134-141.
Stenros, J. (2017). The game definition game: A review. Games and Culture, 12, 499-520.
Taylor, P. J., Driscoll, M. P. O., & Binning, J. F. (1998). A new integrated framework for
Villado, A. J., Randall, J. G., & Zimmer, C. U. (2016). The effect of method characteristics on
retest score gains and criterion-related validity. Journal of Business and Psychology,
31(2), 233–248.
Von Stumm, S. (2013). Investment traits and intelligence in adulthood: Assessment and
0001/a000101
Voyer, D., Voyer, S. & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A
270.
Weiner, E. J., & Sanchez, D. R. (2020). Cognitive ability in virtual reality: Validity evidence for
215–235. https://doi.org/10.1111/ijsa.12295
Workman, J. E., & Lee, S-H. (2004). A cross-cultural comparison of the apparel spatial
visualization test and paper folding test. Clothing and Textiles Research Journal, 22(1/2),
22-30.
Zhang, P. (2008). Motivational affordances: Fundamental reasons for ICT design and use.