


Theory-driven Game-based Assessment of General Cognitive Ability:

Design Theory, Measurement, Prediction of Performance, and Test Fairness

Richard N. Landers

University of Minnesota

Michael B. Armstrong and Andrew B. Collmus

Old Dominion University

Salih Mujcic and Jason Blaik

Revelian Pty Ltd

Author Note

Early versions of this paper were presented at the annual conference of the Society for Industrial

and Organizational Psychology. Participant payments and graduate research hours in this study

were funded by Revelian Pty Ltd, and RNL became a compensated member of Revelian’s

Scientific Advisory Board mid-project. MBA is now at Google, ABC is now at Facebook, SM is

now at Traitstack, and JB is now at Cappfinity.

Citation

Landers, R. N., Armstrong, M. B., Collmus, A. B., Mujcic, S., & Blaik, J. (in press). Theory-

driven game-based assessment of general cognitive ability: Design theory, measurement,

prediction of performance, and test fairness. Journal of Applied Psychology.



Abstract

Games, which can be defined as an externally structured, goal-directed type of play, are

increasingly being used in high-stakes testing contexts to measure targeted constructs for use in

the selection and promotion of employees. Despite this increasing popularity, little is known

about how theory-driven game-based assessments (GBA), those designed to reflect a targeted

construct, should be designed, or their potential for achieving their simultaneous goals of

positive reactions and high-quality psychometric measurement. In the present research, we

develop a theory of GBA design by integrating game design and development theory from

human-computer interaction with psychometric theory. Next, we test measurement

characteristics, prediction of performance, fairness, and reactions of a GBA designed according

to this theory to measure latent general intelligence (g). Using an academic sample with GPA

data (N=633), we demonstrate convergence between latent GBA performance and g (φ = .97).

Adding an organizational sample with supervisory ratings of job performance (N=49), we show

GBA prediction of both GPA (r=.16) and supervisory ratings (r=.29). We also show incremental

prediction of GPA using unit-weighted composites of the g test battery beyond that of the g-

GBA battery but not the reverse. We further show the presence of similar adverse impact for both

the traditional test battery and GBA but the absence of differential prediction of criteria.

Reactions were more positive across all measures for the g-GBA compared to the traditional test

battery. Overall, results support GBA design theory as a promising foundation from which to

build high quality theory-driven GBAs.

Keywords: game-based assessment; game design; validation; measurement; fairness



Theory-driven Game-based Assessment of General Cognitive Ability:

Design Theory, Measurement, Prediction of Performance, and Test Fairness

In recent years, there has been a marked increase in interest among assessment

practitioners in the application of game-thinking, the use of game design theory to improve the

assessment experience (Armstrong, Landers & Collmus, 2016). In the practice of employee

selection, the term game-thinking encompasses two more specific concepts: assessment

gamification and standalone game-based assessment (GBA; Armstrong, Ferrell, et al., 2016).

Whereas assessment gamification refers to design techniques employed to modify existing

assessments by adding game elements (e.g., Attali & Arieli-Attali, 2015; Collmus & Landers,

2019), GBA is a distinct method of measurement (cf. Arthur & Villado, 2008), and GBAs might

reflect either the result of gamification or of a dedicated game design and development. Much

like surveys, simulations, and structured interviews, GBAs can be created with the intent of

assessing any construct of interest. In the current assessment marketplace, there are GBAs

marketed as assessing general mental ability (g), personality, skills, and various competencies

(Handler, 2018); however, scientific evidence evaluating the quality of GBA-based measurement is

scant in the assessment literature and missing in the high-stakes assessment literature (Chamorro-

Premuzic et al., 2016).

In the popular press and in assessment company marketing materials, GBAs are

commonly described as providing two distinct advantages over traditional assessments. The

more prominent marketing claim is improved applicant reactions to assessment (Armstrong,

Ferrell, et al., 2016). Whereas job applicants generally consider traditional survey-based

assessment to be ordinary and expected (Anderson et al., 2010; Steiner & Gilliland, 1996), GBAs

offer a promise of fun and excitement. Although such claims are as yet generally untested, the general concept is clear from applicant reactions theory: fun and excitement during the application process should lead to better organizational hiring outcomes (cf. Hausknecht et

al., 2004). However, even if true, given the high cost of GBA development, it is unknown if any

gains realized by developing and implementing GBA would ultimately result in positive utility.

Additionally, the type of fun experienced in games may only be tenuously related to the kind of

fun, if any, that a job applicant wants during high-stakes assessment (cf. Mollick & Rothbard,

2014). The second purported gain attributed to GBA is improved measurement. This claim

takes many specific forms, such as reducing the impact of human biases (e.g., Ip, 2018), yet

comparison of algorithmic decision-making versus reliance upon human intuition is well explored and hardly a new concept in the assessment literature (e.g., Kuncel et al., 2013). Another

aspect of GBAs that could enable superior construct measurement is the analysis of the trace

data, such as click and mouse movement data, an area of research called computational

psychometrics (von Davier, 2017). This field is in its infancy and primarily exists in the

assessment of learning (e.g., Olsen et al., 2017), so its potential in the selection context is

completely unknown.

The more immediate concerns for assessment researchers are the development process

and measurement characteristics of current, deployed GBAs designed so that they do not require

computational psychometrics, which we refer to as theory-driven GBA. Theory-driven GBA is

like traditional psychometric assessment in that the GBA is designed to assess a targeted

construct as described by prior psychological research, producing scores representing that

construct based upon assessee behaviors within the assessment. Where theory-driven GBAs

differ from other theory-driven assessment development approaches is that they involve the

creation and collection of scores from a game, a complex concept with a rich history in

interdisciplinary research outside of the assessment literature which spans the humanities, social

science, computer science, and human-computer interaction (Jones, 2008). It is the

interdisciplinary design methodologies developed and refined in games research over the last

century that theory-driven GBA draws from in pursuit of an improved assessment experience

(Armstrong et al., 2016).

Given this landscape and the increasing popularity of GBA in the assessment

marketplace, and given calls to improve applied psychology’s integration of design into its

theories (Landers & Marin, 2021), the purpose of the present article is to introduce games, and

more specifically theory-driven GBA and the game design theories used to create them, to the

high-stakes assessment literature. We create and present a theory of GBA design through an

interdisciplinary integration of literatures across assessment and software design through the lens

of human-computer interaction. Next, we describe two empirical studies, a larger one in an

academic sample and a much smaller one in an organizational sample, comparing and

contrasting reactions, validity, and adverse impact of a traditional g test battery versus a theory-

driven GBA designed to assess g using the process described by our GBA design theory.

Finally, we provide practical recommendations for the development of theory-driven GBA and

describe future research directions.

A Theory of Games and GBA

Games, which have been a part of culture across all recorded human history and likely

much further (Huizinga, 2014), have historically been difficult to define; “what is the definition

of game?” has been the subject of hundreds of articles in the games research literature. Stenros

(2017) attempted to synthesize this literature, identifying 60 distinct definitions presented since

the 1930s differing on 10 dimensions of description. To simplify this in the GBA context, it is

tempting to narrow our treatment at least to digital games, which is to say games played

on a computing device and in the present day most typically delivered over the internet.

However, the term game describes a range of analog experiences as well, such as classics like

Monopoly, Scrabble, Chess, and Duck-Duck-Goose. For our purposes in the present article,

game will be defined as explored and defended by Landers et al. (2019): “an externally

structured, goal-directed type of play." As they explain, fundamental to this definition, like almost all in the literature, is that players have a high degree of flexibility in terms of how they go

about achieving goals either enabled or imposed by the game’s design. Further, a game engages

a player in its core gameplay loop, an iterative experience created by the interaction of

potentially thousands of game elements, all designed to encourage players to experience the loop

in a specific, targeted way (Landers et al., 2019).

Building upon this background, GBAs thus utilize games as a platform from which to

conduct meaningful measurement of psychological constructs, much as Likert-type measures

rely upon multiple-choice questions. In theory-driven GBA, this occurs explicitly, by designing

in-game activities to create meaningful scores estimating targeted constructs. Many modern

digital games already incorporate explicit assessments like this, quantifying player behaviors

(e.g., goal achievement, in-game activities), although at a less rigorous standard of measurement

than is common in the assessment literature. For example, simply counting the number of goals

achieved among a provided list and providing feedback on this list to players may be sufficient to

create an emotionally compelling gameplay experience. The fluid nature of games’ “multiple

interacting aspects of knowledge and skill; construct-irrelevant variation from game features;

dependencies among actions across time points; [and] different situations arising for different

players as they interact with a game” (Mislevy et al., 2014, p.10) pose significant challenges and

deviations from standard assessment development practice for the more rigorous requirements of

high-quality, modern psychometric measurement.

To overcome these challenges in the development of theory-driven GBAs, gameplay

activities must be designed that provide players with significant freedom in pursuing game goals while simultaneously shaping the experience such that variance in measurable player behaviors reflects targeted constructs. If there is insufficient freedom to play, the game feels prescribed and

mandatory, and thus the experience loses its gamefulness (see Mollick & Rothbard, 2014;

Landers et al., 2019), reducing the potential added value of GBA over much less costly

assessment methods.

Designing a GBA given this challenge is potentially easier in the case of the

measurement of g than for the measurement of skills or non-cognitive traits. Because g can be

operationally defined as the shared variance in success among varied cognitively-loaded

activities (Schneider & McGrew, 2012), the goal of g-GBA development can be relatively simple: creating multiple game activities that are cognitively loaded, such as solving interactive puzzles or pursuing complex in-game goals, and scoring those activities according to

their cognitive aspects with a traditional g scoring model. Thus, development of a g-GBA can

resemble development of a traditional g assessment but with the significant financial overhead of

game development. Skill GBAs are similarly straightforward from a measurement perspective

but add concerns regarding accurate simulation of the targeted skill construct in terms of physical

and functional fidelity (Hamstra et al., 2014); skill GBAs are designed to require the player to

engage in the skill to be measured, or a proxy, and those behaviors are then scored according to

the degree of skill exhibited, which might be done by human raters or algorithmically. For

example, in a cybersecurity skill GBA, players might be presented with a simulated

cybersecurity environment and asked to engage in a series of increasingly difficult challenges

(e.g., see CyberNEXS in Nagarajan et al., 2012). The most complex design case is for non-

cognitive traits where there is no “correct” answer indicating higher construct standing, such as

personality. In this context, activities must be crafted such that players have the freedom to

engage in a range of actions reflecting a range of a targeted trait but not have so much freedom

that those actions could reflect a non-targeted trait. If a player might behave in a scored way in a

GBA because of high agreeableness or because of high conscientiousness, that behavior will

likely not be a particularly good measure of either trait when using classical psychometric

approaches. For example, a player might be given the freedom to choose whether to help a

virtual character in need, yet such helping behavior could reflect a desire to be friendly (i.e., an

agreeableness signal), a desire to complete all tasks provided (i.e., a conscientiousness signal), or

some other rationale altogether.

GBA Design Theory

Modern game design, whether for the purpose of assessment or otherwise, is an

enormously complex process approached with myriad methodologies; all are a combination of

art and science, and the specific methodology adopted shapes not only the aesthetic experience

of a resulting GBA but also its psychometric properties. Most modern digital games are the

result of designing hundreds or thousands of interconnected subsystems that work in harmony to

create a targeted gameplay experience, each requiring hundreds or thousands of individual

design decisions. Thus, much as with the development of other complex media like film, extant

theory provides some “rules” for effective game design (Salen et al., 2004), such as the criticality

of iterative engineering practices in which prototypes are gradually improved through repeated

data collection and action planning (Kultima, 2015), but much is still based upon the intuition

and development experience of the game’s director in terms of how to develop a compelling

experience for players, as there is rarely empirical research to consult regarding individual design

decisions.

In parsing the complexity of games, the games research community has numerous major

theoretical frameworks, the choice of which generally reflects the disciplinary lens of the

researcher attempting design or analysis (Egenfeldt-Nielsen et al., 2013) and which provide structure for understanding the implications of different game designs. A popular

approach within human-computer interaction, which tends to favor empirical approaches with a

foundation in psychological theory, is the Mechanics, Dynamics, Aesthetics (MDA) framework

(Hunicke et al., 2004), which blends perspectives from the fields of “game design and

development, game criticism, and technical game research” (p. 1722). In MDA, mechanics are

defined as the algorithms and systems that establish how a game functions on a technical level,

such as scoring systems or avatar control. Dynamics are defined as the real-time interactions

between mechanics, which include both inter-mechanic dynamics and mechanic-player

dynamics. Aesthetics are defined as a player’s affective response caused by experiencing a

game’s mechanics and dynamics. Importantly, the only aspects of a game that actually “exist” in

a technical sense are the mechanics; the experience recognized as a game is the dynamics – the

result of mechanics interacting with each other and with players in real-time. It is impossible for

game programmers to program dynamics directly. Instead, they must program mechanics with

the goal of those mechanics interacting with the player and with each other in such a way that

desirable dynamics, and subsequently aesthetics, emerge as a result of gameplay. This idea

reflects the theory of motivational affordances, which describes how the technical characteristics

of systems have only the potential to motivate people to act (Deterding, 2011; Zhang, 2008);

mechanics have affordances that are designed to, but do not necessarily, lead to desired dynamics.

Thus, causally speaking, mechanics cause dynamics which in turn cause aesthetics, but game

developers only have direct control over mechanics despite targeting desirable aesthetics, such as

a “fun” GBA.

The design challenge in GBA incorporates all these concerns yet is still more complex

because targeting desirable aesthetics is a secondary goal. Whereas the primary goal of

commercial entertainment games is to craft an experience in which people have positive

affective experiences, motivating them to play more and discuss their experiences positively with

friends and family so that the games are more commercially successful, GBAs must first produce

scores with trustworthy psychometric properties and secondarily also create a positive affective

experience. If a GBA does not achieve a degree of measurement quality that is recognizable by

common standards (e.g., Nunnally, 1978; Society for Industrial and Organizational Psychology,

2003), it is not a GBA; it is only a game that outputs numbers. Similarly, if a GBA does not

create a sense of play, it is not a game; it is only an assessment marketed as if it is a game, which

is instead a type of gamification called “game-framing” (Collmus & Landers, 2019). Perhaps

most importantly, if the GBA cannot achieve both goals simultaneously, there are far more

inexpensive assessment methods that can achieve one or the other goal alone. Thus, GBA

designers face the added complexity of combining traditional game design, such as through the

lens of MDA, with the development of a psychometrically valid measurement instrument.

Currently, GBA developers tend to integrate literatures based upon their own locally-defined

priorities, because there is little theory to guide GBA design. The present study is intended in

part to address this gap.



Modern digital application development processes are generally based upon design

thinking, a design and development process theory focusing upon iterative redevelopment and

reconceptualization of both the problem being addressed and the product being designed to

address it (Plattner et al., 2011; Rowe, 1986) that was popularized by the Stanford Design School

(Plattner, 2011; Bjogvinsson et al., 2012). Design and development process theories are rarely

seen in applied psychology, but they describe how to best align design and development

processes with intended outcomes, which generally determines the quality of the final product

being developed in relation to organizational goals related to that product (Landers & Marin,

2021). In design thinking, there are five key stages to the development of a designed product;

however, designers may jump backwards to any prior stage if challenges in later stages

necessitate it. A visualization of these stages as applied to GBA development appears in Figure 1.

The first stage, empathizing, requires designers to collect data on the problem their product is

designed to solve; in the theory-based GBA design context, this involves construct specification

and development of a shared mental model among assessment experts, game designers, and

software development teams. The second stage, definition, refers to a meta-cognitive process in

which designers attempt to anticipate difficulties ahead based upon the problem and design

product as they understand it at that time. For GBAs, this principally involves consideration of

the ultimate context in which the GBA will be deployed, including issues like supported device

types, proctoring, and technical limitations. The third stage, ideating, involves brainstorming

and prioritization of developed ideas needed to develop a useful prototype, which generally

involves building a shared mental model between the assessment team and technical teams

working on the GBA. The fourth stage, prototyping, involves the creation of a planned product,

which includes anything from paper representations of GBAs (i.e., low fidelity prototypes) to

partially functional but unfinalized digital versions (i.e., high fidelity prototypes; see Figure 2).

The fifth stage, testing, involves assessment of the success of that product in addressing the

original problem statement, which for GBAs typically includes both aesthetic and psychometric

goals. These stages are also often repeated in various combinations after a tested product has

already been deployed for use; for example, in an assessment consultancy, after the first version

of a GBA is deployed for one client, lessons learned from that GBA may be used to inform

future empathizing, defining, ideation, and/or prototyping while that first version of the GBA

continues to be used. When creating a GBA under a design thinking model, assessment

specialists generally work much more closely with software developers than is typical for

assessment specialists in more traditional measure development projects given the thousands of

iterations involved, each of which may have implications for high-quality measurement; for

example, the distance between the two images in Figure 2 alone was several hundred prototypes,

each tested in different ways depending upon development goals relevant at that point in time.

As a result of the interdisciplinary expertise required, as well as the significantly

increased overall process complexity, agile development of a GBA occurs in a much more

organic and emergent manner than in traditional assessment design (Mitchell, 2016). Although

the empathizing stage is similar, in that it involves identification of target constructs through

needs analysis (Taylor et al., 1998), job analysis (Morgeson et al., 2019), or some other

traditional process used to define research problems, the remaining stages are different. In GBA

design, the defining stage is minimized as a way of managing risk; it is generally assumed that

developing a prototype and learning about the GBA’s performance from data will be more

efficient than spending more time in the initial defining stage. More dramatically, the ideating,

prototyping, and testing stages are reordered and repeated as suitable for the combination of problems, designers, and developers in the project, to maximize quality while minimizing effort (Moran, 2014; Beck, 1999). The success of these efforts will be driven not only by technical competence but also in large part by traditional antecedents of team effectiveness, such as team structure, leadership, planning and coordination, and so on (Mathieu et al., 2017).

Figure 1. Illustration of prototypical GBA development process adopting design thinking.

Figure 2. Early low-fidelity paper prototype (left) and screenshot from a functional high-fidelity digital prototype (right) of one of the current study's GBA mini-games, Gridlock.

Measuring g with a GBA

g is a stable individual difference construct commonly used in employee selection due to

its strong association with job performance and other outcomes of interest across a wide variety

of employment contexts (Schmidt, 2002; Schmidt & Hunter, 2004). In I-O psychology, g is

typically defined within a Spearmanian framework, such as the Cattell-Horn-Carroll (CHC)

theory of cognitive ability and operationalized as shared variance across cognitively-loaded tests

selected to sample across the cognitive ability domain (Lang, Kersting, Hulsheger & Lang,

2010). This operationalization is possible due to CHC theory’s hierarchical modeling of g, such

that g is the topmost and broadest construct, reflecting shared variance among specific abilities,

including fluid reasoning, visual processing, and processing speed, among others. CHC theory

was chosen as the foundation for measurement in the present study’s GBA for two primary

reasons. First, the CHC model is today the clearly dominant model of g in the intelligence

research literature and has a rich history (Schneider & McGrew, 2012) combining two prominent

models of human cognitive abilities, Horn-Cattell Gf-Gc theory (Horn & Noll, 1997) and

Carroll’s three-stratum theory (Carroll, 1993). Within the context of assessment and personnel

selection, g has also consistently demonstrated utility as a predictor of occupational success

across a wide array of job types (Bertua et al., 2005; Salgado et al., 2003; Schmidt & Hunter,

1998). Second, CHC theory’s taxonomy of broad human cognitive abilities included rich

theoretical foundations for its many components, allowing for targeted game design.

Specifically, rather than focusing on developing multiple general assessments of g, CHC’s broad

level factors provide specific details from which specific game mechanics can be identified and

targeted with individual short-form games.

The hierarchical nature of cognitive ability is such that g, which exists at the highest and

most general level, explains the majority of variance across more specific ability tests regardless

of the specific domains of those tests, a concept Spearman called indifference of the indicator.

Empirical support for this is most commonly observed in industrial-organizational psychology as

very strong factor loadings and relatively little residual variance in factor analyses of

cognitively-loaded test batteries used for employee selection (Carretta & Ree, 1996, Ree &

Carretta, 1994, Stauffer et al., 1996). One well-validated framework in the broader literature on

g, the cognitive design system approach (Embretson, 1994), provides valuable guidance on the

development of theoretically-consistent second stratum ability measures to develop g batteries.

This framework addressed what was at the time a generally atheoretical approach to constructing

ability tests with cognitive theory, recommending the targeting of two goals in item construction:

a) construct representation, which refers to alignment between the latent constructs involved in

solving an ability item with the cognitive theory of that solving, and b) nomothetic span, which

refers to expected relationships within the nomological net surrounding the measure given that

cognitive theory. Primi (2014) demonstrated both the value and complexity of this approach in

the creation of a fluid intelligence measure, linking specific item design features, such as

perceptual complexity, with targeted components of fluid intelligence. This type of linking

procedure, between cognitive theory and the properties of each item, thus serves as a strong

conceptual basis for the development of a g-GBA by using these techniques in the same general

fashion but within the context of game design.



Additionally, two research studies are of particular relevance to the measurement of g

using scores across video games. Quiroga et al. (2015) administered 11 puzzle mini-games taken

from a commercially-available “brain training” video game for the Nintendo Wii, 1 computer-

based maze navigation game, and 11 cognitive ability tests to 188 undergraduates, finding a

correlation of .96 between latent cognitive ability and latent game performance. Although

groundbreaking in its demonstration of the potential for video games to provide valid

measurement of g, the study was limited in its generalizability and practicality in an employee

selection scenario due to its focus on commercially available video games designed for the

purpose of entertainment (Bhatia & Ryan, 2018), an unclear theoretical basis and design process

for the selection of relevant mini-games, and a complete focus on latent variables without

exploration of observed scores. Quiroga et al. (2019) replicated and extended this study,

addressing some of these concerns, by proposing a concept called video games general

performance (gVG). Although gVG was never explicitly defined, it was treated as the shared

variance among performance scores produced by video games. This study was also somewhat more explicit about its game screening process, administering a battery

of games plus a cognitive ability test battery to an ultimate sample of 134 participants in a lab

environment over three 90-minute sessions. Quiroga et al. modeled scores obtained from

these games in a similar fashion as Quiroga et al. (2015), this time finding a latent correlation

with latent cognitive ability of .79, substantially lower than observed in the earlier study;

comparing R² across studies, (.922 - .624) / .922 suggests a 32.32% reduction in convergence for unclear reasons.
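For clarity, the two quantities in that comparison are simply the squared latent correlations reported by the two studies (.96 and .79), expressed as a proportional drop in shared variance:

\[ \frac{.96^2 - .79^2}{.96^2} = \frac{.922 - .624}{.922} \approx .323 \]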

An additional theoretical concern is the proposed gVG construct. Although never

defined explicitly, the implicit presentation of gVG as reflecting broad ability across all video

games is problematic, especially when followed by targeting subsets of games. Specifically,

there exist numerous game genres with distinct skills required for success within each (Apperley,

2006), and Quiroga et al. (2019) not only sampled a subset of these genres, but also selected

within that subset for appropriate features using an ambiguous selection process. A player with

extensive experience with game mechanics within a particular genre focused upon within a g-

GBA assessment is likely to have developed some generalized skill within that genre’s typical

mechanics that does not generalize to other genres; for example, if a person plays many slide

puzzle games, they are likely to have greater skill at future slide puzzle games, because solving

slide puzzles is a skill than can be learned (cf. Colzato, van Leeuwen, van den Wildenberg, &

Hommel, 2010). Thus, if a g-GBA was designed as a slide puzzle game, latent game

performance from a slide-puzzle-based g-GBA may contain variance related to slide-puzzle-

solving skill, contaminating measurement of g. Accordingly, in the case of the present study, we sought

to assess the degree to which a particular g-GBA’s design was successful in avoiding the

contamination seen in gVG, or put differently, to what degree the present g-GBA’s design

process, focused entirely upon the creation of a measure of g within an employee selection

context, successfully achieved parity with traditionally-measured g. If supported, this hypothesis would suggest unique value for theory-driven GBAs, and the design theory used to

create them, beyond scores produced by off-the-shelf video games.

Hypothesis 1. Latent game performance will converge with latent g as measured by a

battery of traditional g tests; specifically, g-GBA latent performance and traditionally-measured

latent g will correlate approximately 1.0.
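As a rough illustration of how such convergence can be examined outside of a full structural equation model, the Python sketch below correlates unit-weighted composites of the two batteries and corrects that correlation for unreliability. The column names are hypothetical placeholders, and this is an approximation rather than the latent-variable analysis reported in this paper.

import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # Cronbach's alpha for a set of subtest scores (one column per subtest)
    k = items.shape[1]
    item_var_sum = items.var(ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

def disattenuated_r(x: pd.DataFrame, y: pd.DataFrame) -> float:
    # Correlation between unit-weighted composites of x and y, corrected for
    # unreliability in both composites (a rough stand-in for a latent correlation)
    zx = (x - x.mean()) / x.std(ddof=1)  # standardize before unit weighting
    zy = (y - y.mean()) / y.std(ddof=1)
    r_obs = np.corrcoef(zx.mean(axis=1), zy.mean(axis=1))[0, 1]
    return r_obs / np.sqrt(cronbach_alpha(zx) * cronbach_alpha(zy))

# Hypothetical usage; these column names are placeholders, not the study's variables:
# games = df[["mg1", "mg2", "mg3", "mg4", "mg5", "mg6", "mg7"]]
# tests = df[["spatial", "locations", "digit_symbol", "quantitative", "verbal"]]
# print(disattenuated_r(games, tests))  # Hypothesis 1 expects a value near 1.0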



Prediction of Performance Outcomes

g consistently predicts a variety of performance outcomes, and the greatest quantity of

research exploring this is across the academic and employment domains. In the context of

employment, a common outcome is supervisory ratings of job performance, which brings

numerous advantages and disadvantages as a criterion measure (Viswesvaran & Ones, 2000).

The primary advantages to the use of supervisory ratings for criterion validation are convenience

and availability, as most organizations collect at least annual performance reviews, and

ecological validity, as organizational decision-making is often based upon these ratings. The

primary disadvantages are related to operationalization and measurement quality, as supervisory

ratings are often not high-quality reflections of the job performance construct, showing low

reliability and often collapsing orthogonal dimensions of job performance into individual scores

(Borman & Motowidlo, 1997). Despite these flaws, g tests have been consistently shown to

predict such scores, and the relationship between g and supervisory ratings is well explored (e.g., Hunter & Hunter, 1984, found operational ρ ranged between .31 and .73, which varied by job complexity).

In the education domain, the most commonly studied performance outcome is college

grade-point average (Kuncel et al., 2004). Several properties of grade point averages make them

useful for validation research such as the present study. First, grade point averages are

extensively studied in the education literature, which has provided meta-analyses containing

useful benchmarks; the correlation between g and grade-point average is well-explored (ρ = .21, N = 7,820, k = 35; the relationship is stronger for highly-cognitively loaded tests such as ρ = .33, N = 22,289, k = 29 for the relationship between the SAT and grade-point average; Richardson,

Abraham, & Bond, 2012). Second, they are conceptually similar to supervisory ratings of job

performance (Meade & Fetzer, 2009), in that both classes of variable represent the outcome of

interactions between individual difference and situational variables. Much as supervisory ratings

are a convenient although imperfect proxy for actual job performance, which is itself the

behavioral outcome of knowledge, skills, motivation, and situational factors (Campbell et al.,

1993), grade-point average is a similarly convenient although imperfect proxy for academic

performance, itself the outcome of learned knowledge, acquired skills, motivation to learn, and

situational factors (Kuncel et al., 2001). Third, grade-point averages are conveniently available,

as college student grade-point averages are readily attainable, with permission, from university

records.

Although the weighting of task and contextual performance likely differs between the

two, in both cases, we would expect the correlation between latent game performance and

criteria to be similar to the relationships between traditionally-measured g and criteria if g-GBAs

are in fact measuring g. In the interest of validating g-GBA scores from multiple perspectives,

we tested both relationships in different samples.

Hypothesis 2a. Latent g-GBA performance will positively predict college grade-point

average.

Hypothesis 2b. Latent g-GBA performance will positively predict supervisory ratings of

job performance.

In the practice of assessment, prediction from mathematically isolated

latent construct scores is not generally possible when making selection decisions; instead, test

scores are either combined into operational composites or used as predictors in regression

models. Each of these approaches brings its own strengths and drawbacks depending upon

context (Potosky et al., 2005). In the GBA context, although g is theorized to represent most of

the variance across the GBA’s minigames, performance-irrelevant and performance-relevant

variance unique to each minigame is likely to be contained within minigame scores. In the

more specific case of the present GBA, the seven minigames were developed to principally

target four broad traits in different combinations: Quantitative Knowledge (Gq), Reading and

Writing (Grw), Fluid Reasoning (Gf) and Processing Speed (Gs).

There may also be additional sources of GBA-specific method variance contained within

individual game scores. For example, existing research suggests that well-designed games can

decrease or suspend fears and anxieties by engrossing players in gameplay (Isbister et al., 2012;

Barnett & Storm, 1981; Landers et al., 2019). In the context of GBAs, this may improve

attention and concentration given research linking testing anxiety with test performance (Moran,

2016). Furthermore, because GBAs are much more behaviorally complex than traditional g

tests, GBAs may capture criterion-relevant behaviors beyond those captured by traditional

assessments. Although this would reflect poorer construct validity, it might result in improved

prediction, reflecting a common trade-off also seen in the arguments for the use of machine

learning in selection contexts (Putka et al., 2018).

Given the relatively unexplored nature of issues related to incremental prediction of

GBAs, and because we knew we would be unable to isolate any of these theoretical mechanisms

given the nature of GBA development, we approached the issue of incremental prediction as a

general question of interest with the intent of exploring game performance as both a latent

variable and by modeling performance on the g-GBA minigames individually to better

understand the relationship between this GBA and performance in relation to traditionally-

measured g.

Research Question 1. Does g-GBA performance predict GPA incrementally beyond

traditional g test performance (and vice-versa)?
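One minimal way to operationalize this research question is a hierarchical regression comparing R² with and without a GBA composite. The Python sketch below uses simulated data and hypothetical variable names purely for illustration; it is not the modeling approach reported later in this paper, and reversing the order of entry addresses the vice-versa half of the question.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
g_comp = rng.normal(size=n)                              # traditional battery composite (simulated)
gba_comp = 0.9 * g_comp + rng.normal(scale=0.4, size=n)  # minigame composite (simulated)
gpa = 0.2 * g_comp + rng.normal(size=n)                  # simulated criterion
df = pd.DataFrame({"gpa": gpa, "g_comp": g_comp, "gba_comp": gba_comp})

base = smf.ols("gpa ~ g_comp", data=df).fit()
full = smf.ols("gpa ~ g_comp + gba_comp", data=df).fit()
delta_r2 = full.rsquared - base.rsquared               # increment from adding the GBA
f_stat, p_value, df_diff = full.compare_f_test(base)   # F test of that increment
print(round(delta_r2, 4), round(p_value, 3))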

Adverse Impact and Test Bias

When considering the use of g measures in the employee selection context, the presence

and magnitude of adverse impact is of significant concern (Hough et al., 2001). Adverse impact

occurs when a test is biased by subgroup membership within a protected class defined by

national or local laws, resulting in different success rates by subgroup despite a consistently

applied testing standard. Protected classes vary by jurisdiction but may include race, sex, color,

religion, national origin, disability, age, or any other legally defined classifier. Additionally,

specific cutoffs are sometimes defined at which point adverse impact legally occurs (Equal

Employment Opportunity Commission et al., 1978).

In the context of g, adverse impact research generally focuses on the “Black-White test

score gap” due to the relatively large population of Black people in the United States relative to

other racial and ethnic minorities, as well as the commonly observed mean score difference of

roughly one standard deviation between White and Black people on common g measures (Roth

et al., 2001). If the present g-GBA is indeed a measure of Spearmanian g, a difference of similar

magnitude is expected (Kehoe, 2002). Although such observed differences appear to be

primarily driven by persistent construct-level differences between populations (Reeve & Hakel,

2002), these differences can be widened by certain characteristics of both test design and the

context in which testing takes place. For example, the use of racially-biased items within a

questionnaire measure of a construct that is not itself racially biased can still lead to the

appearance of a racially biased test (Drasgow, 1987). Further, because the internal consistency

reliability of g measures is frequently quite high, relatively lower GBA reliability could also

attenuate observed racial differences. Thus, different characteristics of GBA could increase or

decrease observed differences beyond what is suggested by construct effects. In the context of

racial differences, there is no evidence to suggest that gameplay experience or game skill related

to a g-GBA would differ beyond what is already suggested by construct-level differences; thus,

there is no reason to expect that the Black-White test score gap would be enlarged by use of

GBA, although it could be attenuated due to relatively lower reliability. Unfortunately, due to a

low base rate of Black test-takers in our applied sample, this hypothesis test was limited to our

academic sample alone, which applies to all further race-related hypotheses and research

questions as well. Roth et al. (2001) found a Black-White test score difference of .69 standard

deviations among college students, so we used this estimated population effect as the basis for

our hypothesis.

Hypothesis 3. Black test-takers and White test-takers will differ in g-GBA performance

such that Black test-takers will score approximately .69 SD lower than White test-takers.
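A standardized mean difference of this kind is conventionally computed as Cohen's d with a pooled standard deviation. The short Python sketch below uses simulated scores, not study data, to show the computation under an assumed population gap of .69 SD.

import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    # Standardized mean difference using the pooled SD; positive values mean a > b
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
white_scores = rng.normal(loc=0.0, size=300)    # simulated, for illustration only
black_scores = rng.normal(loc=-0.69, size=200)  # assumed population gap of .69 SD
print(round(cohens_d(white_scores, black_scores), 2))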

Where theory does suggest g-GBAs might introduce additional bias is related to a

different set of protected classes, sex and gender, due to associated differences in game-playing

habits. As described by Brown (2017), the base rate of women who play games in the United

States is slightly lower than that of men (48% and 58%, respectively), suggesting less experience

among women, on average, playing video games. However, these rates vary greatly by genre.

The genres that the present g-GBA minigames most closely resemble vary. Whereas some

games more closely resemble classical cognitive tasks, others may be more firmly placed in a

particular genre. Of the set, the game called Shortcuts most closely resembles the puzzle genre;

within that genre, a 72% base rate for women is observed compared to a 52% base rate for men

(Brown, 2017), suggesting that, on average, women have greater puzzle-playing experience than

men. Thus, an overall difference is expected such that men perform better across the minigames,

yet women may nonetheless perform better on the types of minigames with which they have the

most prior relevant experience, such as those most clearly resembling prototypical puzzle video

games. Beyond experience alone, other knowledge or skills, such as psychomotor ability, could

also influence performance differently by gender across genres. Given the paucity of research on

the relationship between gameplay experience by genre and gameplay performance by genre, the

magnitude and even direction of likely bias across genres is unclear. Thus, we concluded that we

could not reasonably predict the particular pattern of gender differences across minigames a

priori. Instead, we developed a hypothesis of interest regarding overall differences and an

exploratory research question regarding differences between minigames.

Hypothesis 4. Self-identified males and females will differ in g-GBA performance such

that female test-takers will score lower than male test-takers, on average, across g-GBA tests.

Research Question 2. Are male-female differences moderated by minigame design?

Fairness and Differential Prediction

Beyond the concept of adverse impact is test fairness, a legal classification that allows for

the adoption of a test showing adverse impact under certain conditions (Schmidt & Hunter,

1974). In the United States, test fairness is most commonly evaluated by examining differential

prediction of a meaningful criterion across subgroups (Newman et al., 2007). Within the Cleary

(1968) model of fairness, the dominant framework for evaluating fairness in the United States

legal system (Aguinis & Smith, 2007), differential prediction across subgroups is evaluated in

terms of differences in slopes, intercepts, and error variances (Schmidt, 1988). Although Cleary

stated that any such differences suggested an unfair test, more recent interpretations suggest a

test is fair if a common regression line fits all subgroups equally well (Meade & Tonidandel,

2010). For example, if cognitive ability is directly and causally related to job performance,

a cognitive ability measure showing adverse impact but no differential prediction could still be considered fair if it accurately predicts criterion differences, reflected as a specific type of subgroup intercept difference. In contrast, subgroup differences

in slopes reflect differences in the validity of test scores between subgroups and are the most

legally problematic; intercept differences without slope differences are generally considered fair

(Meade & Fetzer, 2009). As a result, the disentanglement of slope and intercept differences is

generally viewed as the most critical issue in evaluating fairness, and slope differences are used

as the primary indicator of unfairness (Stark et al., 2004).
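In practice, the Cleary approach described above is usually implemented as a moderated regression. The Python sketch below uses simulated data and hypothetical variable names (gpa, score, group) to show how slope and intercept differences are typically separated; it is illustrative only and does not reproduce the study's analyses.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 600
group = rng.choice(["A", "B"], size=n)
score = rng.normal(size=n) - 0.5 * (group == "B")   # simulated predictor with a mean gap
gpa = 0.3 * score + rng.normal(size=n)              # simulated criterion
df = pd.DataFrame({"gpa": gpa, "score": score, "group": group})

# The score:group interaction tests slope differences; given a common slope,
# the group main effect tests intercept differences.
slopes = smf.ols("gpa ~ score * C(group)", data=df).fit()
intercepts = smf.ols("gpa ~ score + C(group)", data=df).fit()
print(slopes.pvalues.filter(like="score:"))       # slope-difference term
print(intercepts.pvalues.filter(like="C(group)")) # intercept-difference term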

In the g-GBA context, if differential prediction were observed, it must stem from one of two sources: either the g construct or an aspect of the GBA method as designed to measure g.

Differential prediction from the g construct itself is unlikely; prior research has demonstrated

that although subgroup intercepts for the prediction of job performance from g differ, slopes

generally do not differ by subgroup unless there are test-specific causes, such as biased questions

(Kehoe, 2002). Thus, any observed differential prediction in the present study is more likely

attributable to the GBA method itself or more indirectly, the development process of the g-GBA

under study. The introduction of varying slopes by racial subgroup for a g-GBA is unlikely in

the context of race for the reasons described earlier regarding test bias, namely that there is no

evidence suggesting racial differences in game experience or game attitudes that would

contribute to greater trait diagnosticity for one race versus another. However, if women’s greater

mean experience with puzzle games was associated with decreased reliability for men, this

would attenuate slopes for the male subgroup, biasing those slopes towards zero. Other design features

that could elicit differential effects by group include the specific abilities being measured,

variance contributed by confounding constructs like psychomotor ability, or other psychometric

artifacts.

Research Question 3. Is differential prediction of GPA with a g-GBA similar to that of

traditionally-measured g for race and gender within the academic sample?

Reactions to Assessment Games

Test-taker reactions are an important consideration when developing and implementing

new forms of assessments, especially in high-stakes testing contexts such as employee selection

(Hausknecht et al., 2004). Generally, test-taker reactions refer to the “attitudes, affect, or

cognitions” an individual might have about the testing process (Ryan & Ployhart, 2000, p. 566).

Specifically, test developers and researchers are concerned with test-taker perceptions of test

fairness (i.e., distributive justice, procedural justice), test-taker motivation, test-taker anxiety, and

general attitudes of the test-taker toward the test itself (Hausknecht et al., 2004). Test-taker

perceptions and reactions can impact a variety of outcomes, including actual test performance,

self-efficacy, and, in the context of employee selection, organizational attractiveness and

intentions to accept job offers (Hausknecht et al., 2004), although these downstream effects tend

to be small (Ryan & Huth, 2008). In general, test-taker reactions to g assessments, particularly in

employee selection contexts, are slightly unfavorable, although g assessments vary in this regard

on many design dimensions, including novelty, immersion, features, and time pressure.

We contend that reactions to a g-GBA, if designed well for its audience, should be more

positive than to a traditional g test battery across most reaction dimensions given evidence from

the games literature and research on existing assessments incorporating game elements for five

reasons. First, there is empirical evidence supporting that adding game elements to assessments

can improve reactions. For example, Attali and Arieli-Attali (2015) gamified a computerized

assessment of mathematics knowledge by awarding points for correct answers and speedy

responses, which students found more enjoyable and motivating. Second, some gamified

assessments, which incorporate one or more game elements like animation, sound effects,

instantaneous feedback, varying difficulty, progress bars, and narrative contexts, have been

found to be perceived as face valid by authentic job applicants (Ferrell et al., 2015). Third, video

games more broadly have been found to decrease anxiety; for example, Mavridis and Tsiatsos

(2017) used GBA on a game-based learning platform to decrease test anxiety for graduate

students who reported that they did not feel like they were being tested despite rationally

knowing otherwise. Fourth, because GBAs are more behavioral in nature relative to multiple

choice methods, GBAs should provide an increased sense of an opportunity to perform (i.e.,

procedural justice). Fifth, animation and instant feedback on performance should improve the

sense of interpersonal warmth that traditional g assessments tend to lack (Anderson et al., 2010).

Hypothesis 5. Reactions to a g-GBA will be more positive than to a traditional g test

battery within the academic sample.

Method

The complete experimental protocol and analytic plan for this study were originally

submitted for ethics review to and deemed exempt by the Old Dominion University College of

Sciences Committee for Review of Human Subjects Research (981057-1; "Game-based

Assessment"). A secondary protocol covering access and analysis of data collected under that

exemption was later submitted and deemed exempt by the University of Minnesota

(STUDY00004558; "Game-based Assessment") upon a change in employment of the first

author.

Participants

Academic Sample

Undergraduate students were recruited from a large public university in the middle-

Atlantic region of the United States for this study. Participants were recruited through traditional

university-wide in-person recruiting, which included live recruiting at the campus quad,

distribution of flyers on campus, and online email announcements, as well as through a

Psychology department research participant pool. From university-wide recruiting, 394 people

volunteered and participated in the study. Each was compensated US$20 for approximately 2

hours of effort. Additionally, 428 students signed up through the psychology department research

participant pool and were compensated with course credit. Of these 822 students who began the

study, 746 (90.8%) reached the end.

To minimize the impact of careless responding on results, two detection techniques

recommended by Meade and Craig (2012) were applied. First, three bogus items were included

throughout the test (i.e., 82.7% correctly answered Disagree or Strongly Disagree to “I have

traveled to the Moon,” 84.5% to “I was on board the Titanic,” and 84.3% to “Select the option

that is at the left end of the scale for this question”). Second, participants responded to a

question directly asking if they had put an honest effort into all parts of the study and tried to

follow instructions, which was endorsed by 91.8% of participants. Excluding any case that

failed at least one of these four tests would have removed 29% of cases, so we adopted a slightly

less stringent criterion by excluding only cases that failed at least two of these tests, which

eliminated only 14% of cases. We also post-hoc compared our statistical tests between these

approaches, finding few differences. Ultimately, 633 cases (84.5% of valid participants; 76.6%

including study withdrawals) remained for analysis.
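The exclusion rule described above amounts to a simple count over the four checks. A minimal Python sketch follows, using hypothetical indicator columns and a tiny illustrative data frame rather than the actual data file.

import pandas as pd

# True means the participant failed that check: the three bogus items plus
# the self-reported effort question; small illustrative data frame only.
df = pd.DataFrame({
    "failed_moon":     [False, True, False, True],
    "failed_titanic":  [False, True, False, False],
    "failed_left_end": [False, False, False, True],
    "denied_effort":   [False, True, True, False],
})
checks = ["failed_moon", "failed_titanic", "failed_left_end", "denied_effort"]
df["n_checks_failed"] = df[checks].sum(axis=1)

# Retain cases failing fewer than two of the four checks, per the rule above.
analysis_sample = df[df["n_checks_failed"] < 2]
print(analysis_sample)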



In this final sample, the mean age was 21.35 years (SD = 4.14). 42.0% self-reported as

European American or White, 28.8% as African American or Black, 12.0% self-reported as

biracial, and 17.2% reported another race or combination of races. 41.9% self-reported as male,

57.7% as female, 0.2% as transgender and 0.3% as other. 10.3% self-reported full-time

employment, 43.6% part-time employment, and 46.1% no employment.

Organizational Sample

Current early-in-career employees of a major international healthcare company, across

several small workgroups in Brazil, Canada, Colombia, India, Mexico, and the United States,

participated in a concurrent validation study. Due to privacy requirements within the company

for this type of data sharing, a specific breakdown by country was not made available to the

research team. Participation was voluntary for all employees, and the data collection process

was managed by that organization’s internal industrial-organizational psychologists. In total,

127 employees completed at least one mini-game. Within this group, supervisor ratings of

performance were available for 49 (41.5%). According to the organization sharing data,

supervisory ratings were missing for two reasons: 1) employees had not worked at the organization long enough to have participated in the annual review process, or 2) less

commonly, their manager was noncompliant and chose not to participate in the mandatory

annual review process for any of their direct reports. This provides some rationale to support that missingness in this sample occurred at random and was not reflective of the characteristics of individual

employees. Additionally, the organization reported that they did not use highly cognitively-

loaded measures for either employment or promotion, suggesting only weak indirect range

restriction in cognitive ability in the sample. Within this sample, the mean age was 24.69 years

(SD=2.15), ranging from 22 to 32 years, and 28 (59.3%) were female. Employees had worked

for the company on average 1.82 years (SD = 1.01).

Measures

Measures differed by sample. In both samples, a g-GBA was administered, and an

outcome measure was captured. In the organizational sample, gender, race, and age were

requested on a self-report questionnaire in addition to the g-GBA. In the academic sample, a

traditional cognitive ability test battery, a reactions battery, and a demographics questionnaire

were administered in addition to the g-GBA.

Traditional Cognitive Ability Test Battery

Specific cognitive ability tests were chosen from Stanek and Ones’ (2018) compendium

of personality and cognitive ability measures as prototypically representative of that specific

ability, to minimize domain overlap between tests.

Visual processing. The first test, intended to measure visual processing, was the General

Aptitude Battery’s test of Spatial Aptitude (Hunter, 1980). In this test, participants were

presented with a series of figures illustrating a flat piece of paper with dashed lines indicating

fold points. Participants were asked to identify which the flat piece of paper would look like

when folded among four three-dimensional shapes. This test had a 4-minute time limit and 10

items. Scores were the number of correct answers. Due to negative skew, the scores from this

test were reversed and logarithmically transformed. Coefficient alpha of the untransformed

scores was .73.

Fluid reasoning. The second test, adopted from the Educational Testing Service’s

Kit of Factor-Referenced Cognitive Tests (Ekstrom et al., 1976; Widaman, 1982), was the

locations test. In this test, participants were presented with five rows of dashes and gaps between

the dashes. In each of the first four rows, an X was placed over one of the dashes, creating a

pattern from row to row. Participants were asked to indicate which of five dashes in the fifth row

would need to be marked with an X to match the pattern on the other rows. This test had a 6-

minute time limit and 14 items. Coefficient alpha was .50.

Processing speed. Part 1 of the Chicago Non-Verbal Examination (Brown et al., 1936)

was included in the cognitive battery to assess processing speed. This test is a digit symbol test

(Salthouse, 1996; Conway et al., 2002), in which participants were given 12 different symbols

matched with the numbers 1 through 12 as a legend. Each item was one of the twelve symbols

from a continuously visible legend, and participants had to identify which number was associated

with each symbol presented as quickly as possible. This test lasted 2 minutes and 30 seconds and

included 105 items. Coefficient alpha was .85.

Quantitative ability. The quantitative ability test was the General Aptitude Battery’s test

of Quantitative Reasoning (Hunter, 1980). It included 5 quantity comparison items in which two

quantities were presented. Participants were asked to determine whether quantity A was larger,

quantity B was larger, the two quantities were equal, or if the relationship between the two could

not be determined. This test had a 5-minute time limit. Coefficient alpha was .37.

Verbal ability. The verbal ability test was a section of practice Graduate Record Exam

questions (Conrad et al., 1977) from the verbal section of the test (Burton et al., 2009; Kuncel et al., 2001; Kuncel et al., 2010), which involved 6 sentence-completion items with one or two

blanks. Participants selected one or two words from three to five choices available to complete

the sentence logically. This test had a 6-minute time limit. Due to positive skew, the verbal test

was logarithmically transformed. Coefficient alpha of the untransformed items was 0.60.

Game-based Assessment of Cognitive Ability (g-GBA)

The focal g-GBA of this study was an early version of Cognify, which was developed using the

design theory outlined in the introduction of the present paper. Cognify as used in this data

collection effort consisted of seven distinct web-based minigames intended to assess general

cognitive ability by targeting combinations of second-stratum abilities within the CHC model.

Screenshots from each of these minigames appear in Figure 3. All games relied on user input via

pointing and clicking with a mouse. The entirety of the g-GBA took about 20 minutes to

complete, with each game ending automatically after a certain amount of time, creating some

processing speed requirements across all games, although the intensity of this requirement varied

by game. Many test-takers did not finish all possible levels and items within each game, as

expected during a speeded test. In the development of these games, it was recognized early that targeting specific second-stratum abilities while ignoring others was

not feasible from a game development perspective without sacrificing desirable game mechanics,

so overlap was permitted. The ultimate specific abilities theoretically targeted varied as indicated

below.

Numbubbles (Gq/Gv). In Numbubbles, players were presented with a target value (e.g.,

7) and then a sequence of bubbles with each bubble containing a numerical formula (e.g., 3 + 4

or 12 x 2). The bubbles had a short lifespan of a few seconds before disappearing. The goal of

the game was to “pop” the bubbles that equaled the target value before they disappeared, while avoiding popping bubbles that did not equal the target value, across 10 rounds of varying difficulty and a 20-second time limit per round. Numbubbles was scored as the number of correct

pops, minus the number of incorrect pops, weighted by average time to a correct pop.
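As a concrete illustration only, one plausible reading of this scoring rule is sketched below in R; the function, its arguments, and the exact form of the time weighting are assumptions rather than the production scoring algorithm.

# Illustrative sketch only: accuracy difference, down-weighted as average time to a correct pop increases
score_numbubbles <- function(correct_pops, incorrect_pops, mean_sec_to_correct) {
  (correct_pops - incorrect_pops) / mean_sec_to_correct
}
# Example: 18 correct pops, 3 incorrect pops, 1.4 s average time to a correct pop
score_numbubbles(18, 3, 1.4)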

Figure 3. Cognify g-GBA minigames. Clockwise from top-left: Proof It, Tally Up, Resemble,

Numbubbles, Shortcuts, Colour Pop, Grid Lock. Screenshots of Shortcuts and ColourPop are

cropped but still show complete gameplay area.



Resemble (Gf/Gv). In Resemble, players re-created a pattern, shown on one side of the

screen, by dragging puzzle elements onto a grid shown on the opposite side of the screen. Puzzle

elements, once dragged onto the grid, could be rotated, and solving each problem required

dragging the correct pieces to the correct positions and applying correct rotations. The test-taker

had a total of three minutes in which to complete as many puzzles as possible, up to a total of

nine. The timed nature of the puzzle also introduced a speed component. Resemble is a revision of an earlier game, which was itself inspired by the Block Design tests of the

Wechsler Adult Intelligence Scales. Resemble was scored as the number of game levels

completed.

Grid Lock (Gf/Gv). In Grid Lock, players assembled a set of puzzle pieces to mimic a

larger novel shape. There were nine rounds of varying difficulty and an overall three-minute time

limit. The size of the grid and the number of pieces that needed to be placed increased over time,

and in later levels, grid components needed to be rotated. Grid Lock was scored as the number

of game levels completed.

Proof It (Gc/Grw). In Proof It, players were asked to identify, by tapping them, as many textual errors as possible in a set of sample texts across five rounds, with 5 minutes permitted across all rounds. Incorrectly tapping three different places that contained no error resulted in progression to the next round. No points were deducted for incorrect taps. Proof It was scored as the number of

errors correctly identified across the game.

Tally Up (Gq/Gv). In Tally Up, players were presented with two sets of tokens across 35

rounds of varying difficulty with a time limit of 5 seconds per round. In each round, players were

asked to quantify the number of tokens on two sides of the screen and identify their relationship

to each other, which became more complex as the game progressed. It resembled Quantitative

Comparison questions from the Graduate Record Examination, with the addition of token value

modifiers and per-item time pressure. Tally Up was scored as the number of rounds with correct

responses.

Colour Pop (Gf). In Colour Pop, players completed a gamified version of a Stroop

(1935) task in which they observed a grid of colored words and were asked to identify all tiles

with words that matched the color of the target (e.g., red), irrespective of the color of the tiles.

There were 20 rounds which varied by the number of valid answers and the proportion of word

and color mismatches. The time limit for each round was 4 seconds, with an overall time limit of

approximately 2 minutes. Game elements most noticeably included spinning tiles, progress bars,

star counts and both audio and visual feedback mechanisms. Colour Pop was scored as the total

number of correct pops minus the number of incorrect pops minus the number of misses.

Short Cuts (Gf/Gq/Gv). In Short Cuts, players determined the shortest path to roll a

blue marble to a goal across seven puzzles over a maximum of four minutes. Each puzzle

consisted of a network of paths, with the distance of each path denoted by a numerical value,

where higher values indicated greater distance. Determining the shortest path required the user to

plan a path forward while considering multiple interacting factors, the complexity of which

varied across puzzles. Short Cuts was scored as “distance traveled,” which quantifies the number

of points “spent” to solve all puzzles given all path values chosen. Final scores were reversed for

reporting so that greater scores reflected superior performance.

Reactions

Seven different reaction measures were collected once for the GBA and once for the

traditional g battery across two broad categories. Each set of reactions measures was

administered immediately upon finishing either the traditional cognitive ability battery or the g-

GBA and adapted to inquire either about the “test” or the “game”. Intrinsic motivation and test

anxiety were assessed with 7-point agreement scales; all others were assessed with 5-point

agreement scales. Alpha reliabilities appear in Table 6.

Motivational outcomes. Intrinsic motivation was measured using the

Interest/Enjoyment subscale of the Intrinsic Motivation Inventory (Ryan & Deci, 2000). A

sample items was “These tests (games) were fun to do”. Test motivation was measured with a 5-

item scale (Arvey et al., 1990). An example item was, “I thought the tests (games) were fun.”

Test anxiety was measured using a 6-item scale (Arvey et al., 1990). An example item was “I

probably wouldn’t do as well as most other people who took these tests (played these games).”

Attitudinal outcomes. Distributive justice was measured using a 3-item scale (Smither et

al., 1993). An example item was “The test (game) results would accurately reflect how well I

performed on the examination (in the games)”. Procedural justice was measured using a 4-item

“chance to perform” measure (Bauer et al., 2001). An example item was “I could really show my

skills and abilities through these tests (games).” Job relatedness was measured with a 2-item

measure (Bauer et al., 2001). An example item was “It would be clear to anyone that these tests

(games) are related to the job.” Test propriety was measured using a 3-item measure (Bauer et

al., 2001). An example item was, “The content of the tests seemed appropriate.”

Outcome Variables

Outcome variables differed by sample. In the academic sample, GPA was captured from

historical university records. In the organizational sample, each employee’s most recent annual

supervisory ratings of job performance were collected, which asked supervisors to assess, on a 4-point scale, the extent to which employees met the goals set at the beginning of the previous year. Supervisors were not required to match a predefined rating distribution but were

encouraged to assign less than 5% to “Does Not Meet Expectations,” approximately 8% to “Partially Meets Expectations,” approximately 67% to “Fully Meets Expectations,” and less than 25% to “Exceeds Expectations.” These ratings had historically been used, and were actively being used, both administratively and developmentally as of the time of data collection.

Demographics

Basic demographic information was collected from the academic sample including age,

gender, race, ethnicity, employment status, and contact information for participant compensation.

In the organizational sample, only age, gender, race, and ethnicity were collected since all

employees sampled were full-time.

Procedure

Academic Sample

Two university computer labs were identified on campus for use in this study. The two

labs were chosen because they were verified to contain computers with sufficient processing

power to run the g-GBA at the intended speed and were in designated quiet spaces on campus.

As a result, all participants used nearly-identical computers in highly similar physical spaces.

Participants were instructed via email to use one of these computers and to navigate to a specific

webpage. Once on that webpage, participants reviewed informed consent documentation and

entered their student identification number before proceeding with the study via the Qualtrics

survey platform. The overall research design was a two-cell within-subjects design

counterbalanced by test completion order. Specifically, participants were assigned to either take

the g-GBA first or the traditional cognitive battery first but ultimately completed both. After

each test battery, participants completed a reactions measure battery. After both assessments and

both reactions measures were completed, participants completed a demographics survey.



Organizational Sample

Employees were emailed directly by the internal industrial-organizational psychology

team asking them to complete the g-GBA and to release access to their performance data for

research purposes. Their supervisors were also contacted and asked to encourage their

employees to participate.

Results

Results in this section can be reproduced using the anonymized dataset and R (R Core

Team, 2020) analytic files found at https://osf.io/dg6qm/

Measurement

To assess Hypothesis 1, which examined the convergence of latent game performance

with latent g, a series of confirmatory factor analyses (CFA) and larger structural equation

models (SEM) were fitted in sequence to test specific assumptions and measurement hypotheses.

Approximate standards for the evaluation of model fit were taken from Hu and Bentler (1999; p > .05, CFI > .96, RMSEA < .06, SRMR < .08). Absolute and relative fit indices were jointly interpreted given limitations in both (Marsh et al., 2009), and both compliance with and violation

of these standards were considered informative to model-building rather than as gold-standard

cut-offs (Markland, 2007). Because the size of the organizational sample was too small to

support CFA, all measurement analyses were conducted on the academic sample. A correlation

matrix calculated from the data used for these analyses appears in Table 1.

First, a CFA was conducted to determine if g was adequately measured by the five

specific ability tests identified. As shown in Figure 4, all absolute and relative fit index standards were achieved (χ2(5) = 5.81, p = .325, CFI = 1.00, RMSEA = .02, SRMR = .02) and factor loadings were comparable to expected effect sizes given prior research

utilizing CFA on g measurement using specific ability tests (see Tucker-Drob & Salthouse,

2009); thus, we concluded that latent g was effectively represented. We also regressed GPA on

latent g, finding a standardized weight of .28, nearly equal to the meta-analytic means provided

by Richardson et al. (2012), further supporting validity.
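Although the authors’ full analytic files are posted on OSF, a minimal R sketch of the form of this analysis, assuming the lavaan package and hypothetical column names, would look as follows.

library(lavaan)

# Single-factor CFA: latent g indicated by the five specific ability tests
cfa_g <- 'g =~ visual + quant + fluid + verbal + speed'
fit_g <- cfa(cfa_g, data = dat)
fitMeasures(fit_g, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))

# Regress GPA on latent g to examine the standardized weight
sem_gpa <- sem('
  g =~ visual + quant + fluid + verbal + speed
  gpa ~ g
', data = dat)
standardizedSolution(sem_gpa)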

Next, latent game performance was modeled with CFA. In this case, however, we a priori

suspected that a traditional CFA, which is based upon the assumptions of classical test theory,

would demonstrate poor model fit. In the case of g measurement, commonly used tests generally

comply well with these assumptions; for example, each item on a well-constructed cognitive

ability test can generally be considered a randomly drawn item from a potential universe of items

assessing that ability. For the g-GBA, in contrast, the development team was aware of residual cross-loadings

during development but could find no way to eliminate them entirely without removing many of

the “gameful” aspects of the experience. Knowing this a priori, we conducted this step of

analysis interactively, using diagnosis of modification indices, freeing notable item residual

covariances, and observing associated changes in fit. The initial model, a standard one-factor

CFA with uncorrelated errors, showed slightly poor fit as expected (χ2(14) = 55.63, p < .001, CFI = .92, RMSEA = .07, SRMR = .04). We next used modification indices to free paths between item residuals that were contributing to the greatest misfit, one model re-fit at a time, until adequate fit was achieved for all terms, which occurred after freeing 2 of the 21 available covariances, as shown in Figure 5 (χ2(12) = 17.37, p = .14, CFI = .99, RMSEA = .03, SRMR = .02), which was theoretically consistent with gameplay as described earlier; whereas both

Resemble and Gridlock required the mental manipulation of presented visuals, both Tally Up and

Colourpop required quick quantitative reasoning. Nevertheless, the use of modification indices

Table 1

Correlation matrix

Cognitive Ability Tests g-GBA Minigames Other


Variable Mean SD 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1. Visual Processing 1.01 0.37
2. Quantitative Knowledge 2.44 1.30 .29
3. Fluid Reasoning 3.20 1.54 .24 .17
4. Verbal Ability 0.72 0.57 .23 .27 .19
5. Processing Speed 4.30 1.00 .32 .30 .24 .22
6. ColourPop 0.68 0.39 .21 .20 .17 .20 .31
7. Numbubbles 12.39 4.29 .20 .30 .13 .29 .30 .25
8. Resemble 5.17 1.92 .40 .34 .25 .22 .37 .25 .29
9. Proof It 32.68 10.43 .21 .23 .16 .35 .28 .28 .35 .18
10. Shortcuts 136.17 42.79 .14 .03 .04 .11 .05 .07 .04 .10 .04
11. Gridlock 5.27 1.31 .25 .17 .17 .12 .25 .22 .21 .40 .19 .06
12. Tally Up 22.06 3.88 .30 .29 .10 .23 .36 .21 .39 .34 .29 .09 .26
13. GPA 3.05 0.67 .11 .12 .12 .19 .18 .11 .15 .04 .22 -.07 .06 .15
14. Race .41 .49 -.27 -.27 -.24 -.27 -.30 -.21 -.10 -.40 -.20 .04 -.28 -.25 -.21
15. Gender .58 .49 -.02 -.20 -.04 -.16 -.06 .00 -.35 -.26 .04 -.09 -.11 -.15 .06 .12
Note. Variables 1-5 are cognitive ability tests; Variables 6-12 are cognitive ability GBAs; Race is coded 1 = Black participant, 0 = White participant, and others missing; Gender is coded 1 = female participant, 0 = male participant, and others missing; N = 633 for all cells except pairs involving gender (N = 630) or race (N = 448)

Figure 4. Confirmatory factor analysis of general cognitive ability, N = 633.



Figure 5. Final confirmatory factor analysis of latent GBA performance. All estimates are

standardized, and two freed residual covariances are shown. N = 633



necessarily led to a model overfitted to some degree. To gauge the effect of these modifications,

we conducted our later test of Hypothesis 1 both with and without these modifications.
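A sketch of this interactive step, under the same assumptions (lavaan; hypothetical column names), with the two freed residual covariances corresponding to the Resemble-Grid Lock and Tally Up-Colour Pop pairs described above:

# One-factor CFA of the seven minigame scores
cfa_game <- 'game =~ numbubbles + resemble + gridlock + proofit + tallyup + colourpop + shortcuts'
fit0 <- cfa(cfa_game, data = dat)

# Inspect residual-covariance modification indices, largest first
mi <- modindices(fit0)
mi <- mi[mi$op == "~~", ]
head(mi[order(-mi$mi), ])

# Re-fit after freeing one residual covariance at a time until fit is adequate
fit1 <- cfa(paste(cfa_game, "resemble ~~ gridlock", sep = "\n"), data = dat)
fit2 <- cfa(paste(cfa_game, "resemble ~~ gridlock", "tallyup ~~ colourpop", sep = "\n"), data = dat)
fitMeasures(fit2, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))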

Finally, to enable our formal test of Hypothesis 1, we combined both CFAs into a single

SEM predicting latent GBA performance from latent g as measured by the traditional g test

battery. Slightly poor fit was observed (χ2(51) = 129.94, p < .001, CFI = .93, RMSEA = .05, SRMR = .04), so modification indices were again examined, which revealed several potential

changes to improve model fit, but none were clearly theoretically consistent with the nature of

gameplay as in the previous model fitting. We nevertheless attempted to fit additional models

based upon the largest suggested freed disturbance covariances to observe the effects, but these

revised models did not meaningfully affect the core effect size of interest between g and game

performance. Thus, we decided to proceed with interpretation of the unmodified model to test

study hypotheses. Developing an internal structure of Cognify in an exploratory fashion was

only done to enable the following confirmatory test of the convergence between latent game

performance and latent g, as a formal test of Hypothesis 1.

In the final model, as shown in Figure 6, the relationship between latent game

performance and latent g was equal to .97, suggesting near-unity between the latent constructs

underlying both the cognitive ability test battery and the GBA’s suite of minigames; only 6.9%

of variance in game performance remained unexplained. As a post-hoc follow-up, to test the

hypothesis of unity formally, we created a new model constraining the variance of the

disturbance of latent gameplay to zero and compared these two nested models with a chi-squared

difference test. We found no significant difference between models (χ2diff = 131.09 - 129.94 = 1.15; p = .716), suggesting the simpler model constraining the relationship to unity should be

retained. Thus, we concluded that the latent game performance construct underlying

performance across Cognify minigames was in fact g. Hypothesis 1 was therefore supported.

As a post-hoc test of model robustness, we also re-ran the model shown in Figure 6 constraining

all mini-game disturbance covariances to zero, finding worse fit but no meaningful change to the

test of H1 (standardized path = .95; χ2(53) = 186.11, p < .001; CFI = .90, RMSEA = .06, SRMR = .04).
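The structural model and the nested comparison used to test unity can be sketched in the same way (lavaan assumed; variable names hypothetical):

# Full SEM: latent g (five tests) predicting latent game performance (seven minigames),
# with the two residual covariances retained from the measurement step
model_h1 <- '
  g    =~ visual + quant + fluid + verbal + speed
  game =~ numbubbles + resemble + gridlock + proofit + tallyup + colourpop + shortcuts
  resemble ~~ gridlock
  tallyup ~~ colourpop
  game ~ g
'
fit_free <- sem(model_h1, data = dat)

# Nested model: fix the disturbance variance of latent game performance to zero,
# which constrains the g-to-game relationship to unity
fit_unity <- sem(paste(model_h1, "game ~~ 0*game", sep = "\n"), data = dat)

# Chi-squared difference test between the nested models
anova(fit_free, fit_unity)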

Performance Prediction

The finding that latent game performance was indistinguishable from latent g precluded

the need to assess the correlation between latent game performance and outcomes. Because

latent game performance on Cognify and latent g are in effect the same construct,

mathematically speaking, the relationships of latent game performance with GPA and of latent g with GPA must be extremely similar in magnitude if modeled using SEM. Thus, it was known a

priori that Hypotheses 2a and 2b would be supported using this approach. Given this, we instead

turned our attention to ways that these tests might be operationalized. Specifically, because we

had no a priori hypotheses about specific games, we chose to focus the remainder of our

hypothesis testing on unit-weighted composite scores, which addresses more practical concerns.

Specifically, latent variable modeling is uncommonly used in real-world employee selection,

whereas unit-weighted composites are common and easily applied (Oswald et al., 2014). Thus,

to provide additional support for Hypotheses 2a and 2b, we examined the prediction of criteria

using ordinary least squares regression and mean composite scores.
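For example, a unit-weighted composite can be formed by standardizing each test or minigame score and averaging; a minimal sketch with hypothetical column names:

# Unit-weighted composites: standardize each score, then take the mean
test_cols <- c("visual", "quant", "fluid", "verbal", "speed")
game_cols <- c("numbubbles", "resemble", "gridlock", "proofit", "tallyup", "colourpop", "shortcuts")

dat$test_composite <- rowMeans(scale(dat[, test_cols]))
dat$gba_composite  <- rowMeans(scale(dat[, game_cols]))

# Zero-order correlations with the criterion, with 95% confidence intervals
cor.test(dat$gba_composite, dat$gpa)
cor.test(dat$test_composite, dat$gpa)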

In the case of predicting GPA in the academic sample, Hypothesis 2a was supported.

The relationship was positive (r = .16 [.08, .24], p < .001), although of somewhat lesser magnitude than the cognitive ability test composite’s relationship (r = .22 [.15, .30],

p < .001). Hypothesis 2b was also supported; supervisor ratings of job performance were also

Figure 6. Final model predicting latent GBA performance from latent g, N = 633

predicted (r = .29 [.00, .53], p = .047). In both cases, effect sizes were well within credibility

intervals and at similar magnitude as mean meta-analytic estimates previously observed for these

relationships (cf. Richardson et al., 2012, who found a mean observed correlation with GPA

across studies of .20; Hunter, 1984, who found a mean observed correlation across studies with

supervisor ratings of .27), providing additional construct validity evidence.

To assess Research Question 1, we next conducted hierarchical multiple regression

analyses to determine the incremental prediction of GPA by each test battery composite score over the other. These analyses appear in Table 2. Both the traditional test battery composite and the GBA composite predicted the criterion. However, incremental prediction was only observed in

one direction, of the traditional test composite beyond the GBA. In combination with the results

from the test of Hypotheses 1 and 2, it appears that although latent g is reflected similarly in both

the traditional test battery and the GBA, the GBA contains additional criterion-irrelevant

information not contained within the traditional test battery that is attenuating its relationship

with the GPA criterion. Thus, it seems that although the GBA is clearly a g measure, it is a less

“pure” measure than the traditional cognitive ability test battery. The precise source and nature

of that criterion-irrelevant information is unclear from the present data.
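The incremental prediction tests reported in Table 2 correspond to comparisons among nested ordinary least squares models of the following form (sketch; hypothetical names):

# Model 1: traditional test composite only; Model 2: GBA composite only; Model 3: both
m1 <- lm(gpa ~ test_composite, data = dat)
m2 <- lm(gpa ~ gba_composite, data = dat)
m3 <- lm(gpa ~ test_composite + gba_composite, data = dat)

# Delta R-squared tests of each composite's increment over the other
anova(m2, m3)  # increment of the traditional composite beyond the GBA composite
anova(m1, m3)  # increment of the GBA composite beyond the traditional composite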

Test Bias and Differential Prediction

To assess Hypothesis 3, we examined differences in the composite game score between

the 184 Black and 275 White test-takers in the academic sample with a Welch two-sample t-test.

The difference was statistically significant (t(409.34) = -8.08, p<.001, d=-0.77 [.57, .96]) and the

predicted value of .69 from the Roth et al. (2001) meta-analysis was well within its confidence

interval, supporting Hypothesis 3. As a post-hoc test to triangulate upon this result, we next

examined racial differences in the g test composite scores, to see if this same pattern of findings

was observed. In this test, we did observe a larger difference (t(427.77) = -10.23, p<.001, d=-0.95 [.75, 1.15]), 0.19 standard deviations larger for the composite score calculated from the traditional g test battery than for the composite score calculated from the GBA.
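In form, these subgroup comparisons amount to Welch’s t-test plus a pooled-standard-deviation Cohen’s d; a sketch with hypothetical names (dedicated effect size packages could substitute for the hand computation):

# Welch two-sample t-test for Black-White differences on the GBA composite
black <- dat$gba_composite[dat$race == 1 & !is.na(dat$race)]
white <- dat$gba_composite[dat$race == 0 & !is.na(dat$race)]
t.test(black, white)  # Welch test is the default

# Cohen's d using the pooled standard deviation
sd_pooled <- sqrt(((length(black) - 1) * var(black) + (length(white) - 1) * var(white)) /
                  (length(black) + length(white) - 2))
(mean(black) - mean(white)) / sd_pooled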

Next, we assessed Hypothesis 4 by examining differences in composite game scores

between the 265 male and 365 female participants in the academic sample with another Welch t-

test. The difference was statistically significant (t(499.99) = 5.72, p < .001, d = -0.48 [.32, .64]),

supporting Hypothesis 4. As with our test of Hypothesis 3, we post-hoc compared this to gender

differences on the g test composite, this time finding an effect slightly smaller than that of the

GBA (t(535.77) = 3.78, p < .001, d = -0.31 [.15, .47]). Given both the statistical and practical

gender effect for this test, it appears that most of the observed gender differences are attributable

to a population effect rather than an assessment medium effect. To investigate Research

Question 2, we examined gender differences on individual minigames, and these results appear

in Table 3. Gender differences disadvantaging female participants appeared for five games, with

effects ranging from small to large, whereas the other two games showed no advantage for either

men or women.

To address Research Question 3, Lautenschlager and Mendoza’s (1986) method for

evaluating differential prediction was used for both racial and gender fairness. The results are

displayed in Tables 4 and 5, respectively. In each Model 1, the criterion is regressed upon the

composite; in each Model 2, upon the composite and class membership; and in each Model 3, upon the composite,

class membership, and their interaction term. Using this method, up to three comparisons

between nested models are conducted to determine fairness. First, Model 1 is compared to Model

3. A statistically and practically significant ΔR2 is treated as a positive omnibus test of differential prediction. Second, ΔR2 between Model 2 and Model 3 is examined. If



Table 2

Criterion-related validity

DV: GPA Model 1 Model 2 Model 3


Traditional Test Composite 0.24** 0.21**
GBA Composite 0.19** 0.05
Model F 33.43** 16.86** 17.01**
R2 .05 .03 .05
LL .02 .00 .02
UL .08 .05 .08
Adj R2 .05 .02 .05
ΔR2 (vs. Model 3) .00 .03**
N = 633. LL and UL are lower and upper limits of 95% confidence intervals surrounding R2;

*p<.05 **p<.01

Table 3

Standardized minigame performance by gender

Male Female Welch’s test Effect Size

Minigame M SD M SD t df p d LL UL

Colour Pop 0.68 0.38 0.68 0.39 0.09 579.47 .929 0.01 -0.17 0.15

Numbubbles 14.12 4.40 11.11 3.73 -9.01 510.98 <.001 -0.75 0.58 0.91

Resemble 5.74 1.92 4.74 1.80 -6.71 546.10 <.001 -0.55 0.39 0.71

Proof It 32.08 10.35 32.97 10.33 1.07 568.31 .286 0.09 -0.25 0.07

Short Cuts 140.46 41.92 132.86 43.25 -2.22 578.98 .027 -0.18 0.02 0.34

Gridlock 5.44 1.41 4.15 1.22 -2.75 519.01 .006 -0.23 0.07 0.39

Tally Up 22.73 4.02 21.56 3.69 -3.73 538.95 <.001 -0.31 0.15 0.46

N = 630; Female participant N = 365, Male participant N = 265; LL and UL are lower and upper

limits of 95% confidence intervals surrounding d



Table 4

Comparisons of Black-white differential prediction for cognitive ability tests versus GBAs

DV: GPA Traditional g Test Battery g-GBA


Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Traditional Test Composite 0.20** 0.13* 0.18**
GBA Composite 0.20** 0.14* 0.12
Race (Black=1, White=0) -0.20** -0.22** -0.22** -0.22**
Composite x Race Interaction -0.14 0.03
Model F 16.46** 12.83** 9.06** 13.83** 12.60** 8.40**
R2 .04 .06 .06 .03 .05 .05
LL .00 .01 .02 -.00 .01 .01
UL .07 .10 .10 .06 .09 .09
Adj R2 .03 .05 .05 .03 .05 .05
ΔR2 (vs. Model 3) .02** .00 .02** .00
N = 448; Black participant N = 182; White participant N = 266; LL and UL are lower and upper

limits of 95% confidence intervals surrounding R2; *p<.05 **p<.01



Table 5

Comparisons of female-male differential prediction for cognitive ability tests versus GBAs

DV: GPA Traditional g Test Battery g-GBA


Model 1 Model 2 Model 3 Model 1 Model 2 Model 3
Traditional Test Composite 0.237** 0.25** 0.22**
GBA Composite 0.19** 0.22** 0.15*
Gender (Female=1, Male=0) 0.13* 0.13* 0.14** 0.14*
Composite x Gender Interaction 0.06 0.14
Model F 32.79** 19.69** 13.28** 16.31** 11.59** 8.45**
R2 .05 .06 .06 .03 .04 .04
LL .02 .02 .02 .00 .01 .01
UL .08 .10 .10 .05 .06 .07
Adj R2 .05 .06 .06 .02 .03 .03
ΔR2 (vs. Model 3) .01* .00 .01* .00
N = 630; Female participant N = 365, Male participant N = 265; LL and UL are lower and upper

limits of 95% confidence intervals surrounding R2; *p<.05 **p<.01



statistically and practically significant, either slope differences or slope and intercept differences

in combination are present. Third, if the second comparison revealed slope differences, an

additional test can be conducted to determine if intercept differences were observed in addition

to slope differences. As shown in Table 4, which concerns racial differences, both the traditional

test battery and GBA met the fairness standard; both exhibited differential prediction by intercept

but not by slope. In Table 5, which concerns gender differences, the same pattern emerged.

Thus, it appears that the GBA is a fair test, across both classes of interest in this research

question. Additionally, this evidence supports the above-stated conjecture that gender

differences in the GBA may be mostly attributable to study population characteristics rather than

to GBA characteristics; it does not appear as if using a GBA removed existing construct-related

intercept differences.
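In regression terms, this sequence reduces to comparisons among the following nested models, sketched here for the GBA composite and race with hypothetical variable names:

# Model 1: composite only; Model 2: composite plus subgroup; Model 3: adds the interaction
fair1 <- lm(gpa ~ gba_composite, data = dat)
fair2 <- lm(gpa ~ gba_composite + race, data = dat)
fair3 <- lm(gpa ~ gba_composite * race, data = dat)

# Omnibus test of differential prediction (Model 1 vs. Model 3)
anova(fair1, fair3)
# Test isolating slope differences (Model 2 vs. Model 3)
anova(fair2, fair3)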

Reactions

Finally, we assessed Hypothesis 5 by conducting a series of paired-samples t-tests

comparing reactions to the GBA and the traditional g test battery. To address missingness within

the reactions data, single imputation was used (i.e., Amelia II; Honaker et al., 2011). These

results are presented in Table 6. Universally, the GBA was preferred to the g test battery across

all reaction measures. Thus, Hypothesis 5 was supported.
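A sketch of one such comparison, assuming the Amelia package for the single imputation and hypothetical variable names for the paired GBA and test-battery reaction scores:

library(Amelia)

# Single imputation (m = 1) of missing reaction-item responses
imp <- amelia(reactions_dat, m = 1)
completed <- imp$imputations[[1]]

# Paired-samples t-test comparing reactions to the GBA versus the g test battery
t.test(completed$intrinsic_gba, completed$intrinsic_test, paired = TRUE)

# One common paired-samples d: mean difference divided by the SD of the differences
diffs <- completed$intrinsic_gba - completed$intrinsic_test
mean(diffs) / sd(diffs)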

Discussion

This study is the first to integrate current theories of game design, taken from human-

computer interaction, into the organizational assessment literature, to provide a theoretical

framework for the identification and development of GBAs. With a g-GBA built from that

theory for use in hiring decisions, we rigorously explored its measurement characteristics,

predictive accuracy, fairness, and reactions. Most centrally, we have demonstrated that a g-GBA

Table 6

Reactions to g-GBA versus g Test Battery

g-GBA g Test Battery

Variable α Mean SD α Mean SD t p d LL UL

Motivational

Intrinsic Motivation .92 4.52 1.42 .92 3.03 1.56 20.34 <.001 1.00 0.88 1.11

Test Motivation .80 3.78 0.56 .85 3.70 0.69 3.03 .003 0.13 0.05 0.22

Test Anxiety .83 2.82 0.88 .80 3.07 0.89 -8.04 <.001 -0.29 -0.36 -0.22

Attitudinal

Distributive Justice .68 3.21 0.84 .74 2.96 0.94 6.37 <.001 0.28 0.19 0.37

Procedural Justice .88 2.77 0.96 .90 2.49 0.98 8.22 <.001 0.30 0.23 0.37

Job Relatedness .64 3.97 0.75 .67 3.78 0.84 5.60 <.001 0.23 0.15 0.32

Test Propriety .83 2.67 1.03 .83 2.44 1.01 5.69 <.001 0.22 0.14 0.30

Note. t-test and Cohen’s d are calculated for paired comparisons. Positive t and d indicate greater

scores for GBA. LL and UL are lower and upper limits of 95% confidence intervals surrounding

d. N = 632 (one case eliminated due to high missingness)



likely can be designed and developed (i.e., engineered) to meet the same conceptual and

psychometric standards (Sackett et al., 2017) as other more traditional assessments of those same

constructs, and that participants preferred this g-GBA to the traditional battery. This furthermore

provides a theoretically and empirically supported design process for creating new theory-driven

GBAs to assess any construct of interest at a high psychometric quality standard.

It should not be inferred from this study, however, that GBAs as a class of methods are

inherently superior or preferable to traditional assessment methods in any dimension studied.

Current GBA vendors vary in their reliance upon both design theories and psychological

theories, and existing GBAs also vary widely in their implementation of specific game elements.

For example, whereas Cognify does not heavily integrate any sort of story, narrative, or fantasy

elements, these are commonly implemented in other GBAs currently in use in organizations to

unknown effect. Further research is needed on a much broader range of GBAs and GBA design

strategies before any firm conclusions can be drawn about GBAs in general. Much like Arthur

and Villado (2008), we emphasize the critical differences between methods and predictors in the

space of employee assessment; because GBA is a method, the specifics of design and

implementation are critical to understanding its best role in employee selection and should not be

ignored. Much as a survey measure can be well-designed or not in relation to its measurement

goals, so can a GBA. Further research is needed to understand if the GBA design theory

proposed here can serve as a foundation for high quality GBAs across constructs and contexts.

Even if so, more nuanced design theory will likely be needed for different measurement

domains, just as is currently required of questionnaires (Embretson, 1994). Much as

psychometrics was born of a need to better apply statistics to the measurement of latent

psychological constructs (Buckhalt, 2002), new domain-embedded GBA design theories may be

needed to develop the highest quality psychometric assessments appropriate for selection

contexts (Ployhart et al., 2017). Thus, future research on GBA must explicitly consider and

explore design and development processes (Landers & Marin, 2021) in any GBA being

evaluated.

What we can safely conclude given the present results is that the design process studied

here resulted in a GBA of similar psychometric quality to a traditional g test battery. This draws

a theoretical distinction between a g-GBA’s latent performance construct and the only other

latent game performance construct in the research literature, gVG. Whereas Quiroga et al.

(2019) hand-picked a selection of commercially available video games to best reflect g, the

present study demonstrates how a design and development process can be used to create a novel

g-GBA for employee selection. In doing so, we also found a stronger relationship between latent

game performance and g than did Quiroga et al. (2019), with an estimate more similar to Quiroga

et al.’s (2014) result when focusing upon “brain training” Nintendo games. The difference in

results between these studies, especially in contrast to the present study, suggests that design

characteristics like the ones studied here are likely critical to understanding why and how GBAs

can measure traits. Although outside the scope of the present work, an interesting possibility

raised by Quiroga et al.’s work is the existence of a true gVG across all possible video games, a

set of skills or abilities associated with success in video games broadly. We encourage

researchers to continue down this theoretical path, as it might shed additional light on potential

trait confounds when using video games for measurement of any psychological construct.

The observation of adverse impact by race of similar magnitude as traditional g tests was

as predicted but was also disappointing. The idea that GBA somehow “removes” bias is a

common assertion among some GBA proponents in industry (e.g., Hak, 2019) which this study

directly informs. This also provides context for the approach if not the rhetoric of many GBA

vendors; for example, one of the largest GBA vendors, Pymetrics, claims its GBA to be “bias-

free”, explaining “we use a reference set of tens of thousands of people to check for any potential

biases, and we deweight inputs in our model until we produce a bias-free algorithm that is

compliant with the 4/5ths rule” (pymetrics.com, 2019). Thus, rather than their GBA somehow

“removing” adverse impact through some design tactic, it is done post-hoc by reducing the

influence of or dropping individual predictors showing adverse impact in the machine learning

algorithms that they develop. This reaffirms that inclusion of scores from cognitively loaded

tests within a selection battery, at least given the world’s current social and economic state, will

generally lead to adverse impact by race (Kuncel & Hezlett, 2010). GBA appears to neither solve

this problem nor exacerbate it.

Of greater concern was the observation of adverse impact by gender. Although the mean

effect across minigames disadvantaged women, most of this difference was also reflected in

gender differences in the traditional cognitive ability test battery (dGBA = -0.48 vs. dTrad = -0.31).

We suspect the remaining gender difference (d = -0.17) is attributable to differences in the

visual-spatial nature of gameplay in some of the games and given prior work suggesting gender

differences in spatial abilities (Voyer et al., 1995). Specifically, the games in which women did

worse were also more heavily visual-spatial in their gameplay than the games demonstrating

parity. Because visual-spatial ability was represented in the traditional g battery through only a single test, this may have led to the observed difference in gender differences between

the traditional battery and the g-GBA. Although a new composite could be created in the present

dataset utilizing only those tests showing no difference, this would capitalize upon chance to

some degree, and any observed lack of gender effect of such a composite would be of unknown

generalizability. Most importantly, even with the existing set of minigames, there was no

evidence of differential prediction of GPA by gender, by either slope or intercept; thus, despite

the observation of adverse impact, the prediction of GPA from both the traditional cognitive

ability measure and from the g-GBA appeared to be fair by gender.

These results in combination with our analysis of RQ1 raise new theoretical questions

about g-GBA test construction. Specifically, because there was incremental prediction of GPA

by the cognitive ability test composite beyond the g-GBA test composite but not the reverse, this

suggests that although the GBA composite score contains the same information about g that the

cognitive ability test battery composite does, it is also contaminated to a degree by gender-

relevant (and g-irrelevant) variance; in short, there is evidence of construct contamination but not

construct deficiency. We encourage future researchers to examine what specific game

mechanics, dynamics, and aesthetics are most likely to exacerbate gender differences in the

measurement of cognitive ability, and conversely, what might be done to remove it. Assuming

that the pattern of gender differences among minigames was due to legitimate differences in

gameplay, alternate weighting schemes or alternate inclusion/exclusion criteria for minigames

could be used to reduce these gender differences (Sackett & Ellingson, 1997). Future research

should therefore also explore how scores are best used in practice to make actual selection

decisions, and if such strategies have other unintended consequences, such as decreased validity.

Although reactions to the GBA were universally more positive than to the g test battery,

effect sizes varied and were generally small to moderate. Whereas intrinsic motivation was 1.00

standard deviations more positive, other improvements were more modest, ranging from 0.13 to

0.30. A key limiting factor in this study may be the nature of g testing, to which reactions are

already generally poor (Hausknecht et al., 2004). Because evaluation of g requires identification

of correct answers, frustration when unable to determine a likely correct response and move

forward to the next question may negatively influence g test reactions (Chan et al., 1997). In a

GBA, there is still feedback as to correct answers, but there is less time for assessees to ruminate

on incorrect answers if the game has been designed to absorb, and is successful in absorbing, assessees in the flow of game demands. In GBAs designed to assess constructs that lack “correct answers,”

such as personality testing, reactions to GBAs may be more positive; however, the precise effect

of the presence or absence of such gameplay flow is unclear. Further research is required, which

should investigate such interactive effects between constructs targeted and GBA design features,

as well as how specific game design decisions do or do not contribute within any particular

design and construct combination.

These results also raise a key issue unique to GBAs when deployed within the employee

selection context versus educational. Specifically, if a g-GBA is significantly more expensive to

develop than a traditional g test, positive utility for the use of GBAs is of concern. Given the

magnitude of effects we observed here, it is currently inadvisable for a private organization

intending to develop its own internal selection tools to create their own GBAs. Given the

expense, it is unlikely that any reactions benefit from such a move would outweigh the

development costs. However, an independent consultancy licensing such a test could see utility

if deployed to a broad range of organizations. Thus, we contend that the most likely context for

positive utility from GBAs for at least the next several years will be in consultancies serving that

GBA to many different firms, which in turn implies that most GBAs will be intended to assess

broadly useful individual differences where there is significant demand, such as cognitive ability,

personality, emotional intelligence, and broad skill-based competencies, such as leadership

styles, teamwork, or self-directed learning. In this way, GBA more directly compares with

assessment centers than other assessment methods in terms of key strengths, but without the

logistics costs and overhead typically associated with assessment centers. For companies

choosing to adopt GBAs, the potential utility gains are more obvious – if a g-GBA can be

administered at the same cost, with the same psychometric strengths, and with better applicant

reactions in comparison to a traditional g test battery, there are few compelling reasons to

continue using traditional g test batteries in practice. Organizations should consider all such

dimensions of utility, both in terms of immediate predictive gains and larger-scale strategic

business concerns, when making such adoption decisions (Roth & Bobko, 1997).

The findings here also relate to the nascent literature on assessment gamification (e.g.,

Georgiou et al., 2019; Landers et al., 2020). As described in the Measures section, some of

Cognify’s games began as traditional cognitive tasks and were gamified (i.e., Resemble, Colour

Pop), some were directly inspired from existing games (i.e., Numbubbles, Grid Lock, Tally Up),

and some were creative interpretations of other concepts (i.e., Shortcuts, Proof It). In some

ways, this made the gamified assessments less challenging to develop than others, in that the

basic concepts of gameplay were inferred from the existing task structure as a starting point for

game design, but in other ways were more challenging to develop due to the restrictions that

existing task definitions created. For example, because Colour Pop began as a Stroop test, the

design team wanted to maintain the classic Stroop elements regardless of other changes

suggested through user assessments in iterative prototyping; in contrast, because Shortcuts was

based on a novel idea, there was no aspect of the game that was “off limits” for changes during

development. In this way, the science of gamification (Landers, 2018) might inform some

aspects of GBA design just as GBA design might inform gamification; it is thus important for

progress in both that the two literatures do not grow completely independently.

Practical Implications

In stark contrast to a few decades ago, a key concern in modern assessment design is

maximizing applicant reactions, and GBA design makes very explicit the central role of user

experience in the test development process. Specifically, game design and thus GBA design

prioritize consumer expectations and experience in a way not commonly seen in traditional

assessment design. The introduction of internet technologies has flattened job application

pathways such that for many organizations, the application process has become a bidirectional

transaction (Singh & Finn, 2003). User experience and brand reputation now play an important

role in attracting talent. Because cutting-edge technologies have been found to positively influence applicant perceptions of the organizational image (Bartram & Hambleton, 2006; Sinar et al.,

2003), GBA design can enable rigorous measurement while improving applicant perceptions and

organizational impressions. Further iteration upon the design of the present GBA might result in

further improved perceptions. Such benefits from either initial deployment or redesign are not

guaranteed, however, and both require significant investment in high quality design and

development processes.

In great contrast to traditional assessment development, GBA development requires a

high degree of effort from diverse, multidisciplinary teams and stakeholders outside of the

traditional assessment community. Across disciplinary perspectives, values and methods vary

greatly, creating new challenges in relation to process losses and team coordination, as well as

greater expense, in relation to traditional assessment development. Game designers, software

engineers, artists, and others may all be deeply invested in assessment development, which if not

carefully managed can create significant problems with team cohesion and team commitment.

For example, in the experience of the present authors, game designers typically prioritize “fun,”

engineers typically prioritize system sustainability, artists typically prioritize aesthetics, and

assessment designers typically prioritize psychometric rigor. If properly managed, the resulting

frictions can lead to higher quality assessments in both the psychometric sense and in terms of

applicant perceptions, but the specific “best” path to achieve that remains unclear. In the present

article, we have described the design theory that drove the organization that developed Cognify,

but there is no guarantee that another assessment firm would find the execution of a design

strategy from that theory as effective. Furthermore, there is no guarantee that the same

development process would work equally well even for the same developer if assessing a

different construct. A degree of risk-taking is necessary in the current development of GBAs;

the present study does not provide a single set of best practices but instead provides guidance on

how to reduce risk through a cautious marriage of game design theory and classic test

development practices.

Regardless of the specific development strategy adopted, technical competency in GBA

design related to the assessment delivery platform is significantly more important for assessment

practitioners than in traditional assessment development. Whereas technical teams often are

tasked with “implementation” in traditional development, such that the “assessment team”

creates the assessment and the “technical team” is responsible for placing the assessment online

and collecting data, all members in GBA development teams need to develop expertise in not

only psychometrics but also software architecture and game design theory. This is likely to push

many traditional assessment experts far outside of their core expertise, yet developing new

expertise is critical to ensure psychometric rigor in GBA. Complex technical concerns can arise,

such as the specific equipment required to maintain the number of interactions per second

necessary to ensure the integrity of collected data given a particular GBA design. With

insufficient technical expertise, a traditional assessment specialist might not even realize why

such a restriction could harm the psychometric properties of the GBA. Thus, any assessment

firm seeking to develop GBAs should carefully evaluate if they have not only adequate technical

resources but also the necessary resources to train key personnel across disciplinary lines.

An important set of practical caveats for the application of these results concerns 1) the generalizability of this approach to other constructs, 2) the impact of specific development

strategies, and 3) the impact of deployment strategies. First, we have presented here a design

theory which can feasibly be used to develop any theory-driven GBA in which a specific

construct has been a priori targeted by game design. However, there is generally a much greater

and more comprehensive literature on g than on most other traits, which gave the developers a

more solid foundation for design in the CHC model than might be available when integrating these design theories to target other constructs. As such, we caution practitioners against

considering this GBA as prototypical. There are likely to be many challenges in design

unexplored here inherent to any such effort, and another game developed using this approach is

not guaranteed to be successful. For example, it is unclear at this time as to the specific cause of

increases in applicant reactions; this could have been caused by the novelty of the experience,

improved affective reactions to the interface, the quick gameplay, or any of many other game

characteristics acting in concert. Second, the present study only explored one particular development strategy, which had a variety of practical consequences. In terms of across-construct

influences, the psychometric concerns and iterative strategies necessary for successful

measurement are likely quite different in data-driven GBAs than in theory-driven GBAs. The

lack of a priori focus on constructs in data-driven GBA does not necessarily condemn such

methods, but it may increase the analytic burden on game developers in ways not explored here.

Within the g construct, this GBA focused upon speeded tests for practical reasons related to

reducing cheating (Arthur et al., 2010), but this choice could have led to additional confounds

that affected validity in numerous ways (Lu & Sireci, 2007). Further, in pursuit of more exciting

gameplay, the GBA did not enable clean separation of measurement occasions, which precluded

meaningful estimation of internal consistency reliability. For this reason, and from our own experiences in industry, we believe most GBA developers rely instead upon test-retest reliability estimates, yet this requires additional data collection efforts or novel calculation

strategies (Weiner & Sanchez, 2020). Alternative game designs and scoring models could also

avoid this problem. Third, in practice, due to concerns about re-testing effects (Villado et al.,

2016), the company that developed this GBA in its practice does not allow it to be administered

to the same job applicant more than once every twelve months, and the present study did not

examine re-test effects. There is no research exploring if GBA methods amplify or attenuate retest

effects in relation to such issues, or if such concerns could be engineered out during game

development. Across these concerns, there is still much remaining about which we simply have

little data. Current caution and further research, especially in applied settings, are needed on all

of these issues.

Limitations and Future Research Directions

One limitation to the current empirical study is the generalizability of the measurement

properties of the Cognify GBA to the measurement properties of other GBAs. As GBAs could

theoretically be built to assess any construct (cf. Arthur & Villado, 2008), with an indefinite

number of potential design processes, no “prototypical” GBA exists or ever will, much as there

can be no “prototypical” questionnaire. One GBA design might approach cognitive ability

measurement through the gameplay of a first-person shooter whereas another might approach it

through puzzles. A design intended to measure a personality trait might be completely different.

This is not a unique limitation to GBA; for example, most researchers would not expect a single

validation study of “questionnaires” to result in definitive conclusions regarding the validity of

“questionnaires.” Instead, much as has been done for questionnaires, a body of evidence

regarding GBAs must be curated. Having said that, researchers must also be careful not to

assume that this justifies a case study approach to GBAs with traditional assessment outcomes.

For example, the mere existence of a GBA that produces scores that correlate with an outcome is

not theoretically interesting unless there is some evidence as to the underlying reason. For the

present GBA, we were able to provide construct validity evidence supporting g as that reason

both due to the GBA’s design process and the data collected. For other GBAs, we hope to see

similar types of evidence, along with detailed descriptions of the design methods that produced

them. In the case of data-driven GBAs, more atypical forms of validation evidence might be

useful, such as evidence from response processes through think-alouds during gameplay.

A second limitation to the present study is the composition of the two samples and the

constraints associated with each. Practical constraints limited our larger data collection effort to

an academic sample, which served as the principal sample for hypothesis testing, plus only a

small organizational sample using an existing one-item supervisory rating to investigate

criterion-related validity. Our interpretations of reactions and psychometric characteristics are

from non-organizational data, and the organizational sample only provides criterion-related

validity evidence with wide confidence intervals. This effect, due to the small organizational

sample size used to measure it, should not be over-interpreted; it is unlikely this correlation

would generalize to other organizations, and the stability of this estimate even as an estimate of

the population effect for the studied organization is poor. Additionally, both samples likely

exhibit some degree of range restriction that may have attenuated some observed relationships,

particularly those with criteria, below their true score values (Sackett & Yang, 2000). Given this,

we recommend the results from the organizational sample be viewed as tentative and have based

most of our conclusions upon results from the academic sample, yet the generalizability of

results from the academic sample to an organizational context is unknown. Future researchers

should prioritize seeking out larger samples with authentic employees to better understand under

what conditions reactions and psychometric properties might differ.
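To illustrate the practical consequence of such attenuation, the following sketch applies Thorndike's Case II correction for direct range restriction on the predictor (see Sackett & Yang, 2000, for the full typology of corrections). The function and all values shown are purely hypothetical and are not estimates from either of our samples.

```r
# A minimal illustration (hypothetical values, not estimates from our samples):
# Thorndike's Case II correction for direct range restriction on the predictor.

correct_direct_rr <- function(r_obs, sd_restricted, sd_unrestricted) {
  u <- sd_unrestricted / sd_restricted  # ratio of unrestricted to restricted SDs
  (r_obs * u) / sqrt(1 + r_obs^2 * (u^2 - 1))
}

# An observed validity of .29 under 30% shrinkage of the predictor SD would
# imply a corrected validity of roughly .40.
correct_direct_rr(r_obs = .29, sd_restricted = 0.7, sd_unrestricted = 1.0)
```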

A third limitation of the present study is in the generalizability of the design method

employed to GBA design more broadly. Although we focused here on theory-driven GBA, this

is only one type of GBA currently in the assessment marketplace. The other major type, as

discussed earlier, relies upon computational psychometrics to develop its measurement models.

We call this data-driven GBA, and it typically takes a very different development process.

Specifically, the initial development of data-driven GBAs tends to be based upon content

validation; an assessment designer has a holistic idea for a game, either borrowed from the

research literature (e.g., neuroscience) or driven by marketplace needs (e.g., a client requests a

“leadership game”), and a game is created based upon that idea. Then, all data collected,

including both scores and trace data like mouse clicks, are inputted into either unsupervised

machine learning models to develop categories of players or into supervised machine learning

models to predict some outcome of interest directly from these messy data. This type of

assessment data mining is much more common in educational GBAs (Mislevy et al., 2012)

where precise explication of the constructs being measured is somewhat less important than in

the employment context. It is currently unknown how the present results would have changed if

a data-driven GBA had been used instead, and it is also unknown what additional insights could

have been gained from applying these techniques to the trace data produced by the present GBA.

Both are compelling directions for future research and perhaps critical to the evolution of

psychometrics (Mislevy et al., 2014).
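For readers unfamiliar with that workflow, the following sketch illustrates in R the general shape of a data-driven scoring pipeline. It is a schematic example only: the data frame, feature names, and models are hypothetical and do not describe the present GBA or any particular commercial product.

```r
# A schematic sketch of data-driven GBA scoring (this is NOT the theory-driven
# process used to build the present GBA). All data and feature names below are
# hypothetical.

set.seed(42)
trace <- data.frame(
  clicks       = rpois(200, lambda = 40),            # e.g., total mouse clicks
  mean_latency = rnorm(200, mean = 1.5, sd = 0.3),   # mean seconds per action
  retries      = rpois(200, lambda = 3),             # repeated attempts at a task
  rating       = rnorm(200, mean = 3.5, sd = 0.8)    # some outcome of interest
)

features <- scale(trace[, c("clicks", "mean_latency", "retries")])

# Unsupervised route: derive categories of players from the trace data alone.
player_types <- kmeans(features, centers = 3)$cluster
table(player_types)

# Supervised route: predict the outcome directly from the messy trace data.
fit <- lm(rating ~ clicks + mean_latency + retries, data = trace)
summary(fit)
```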

Conclusion

In general, such currently untapped data sources hold great potential value for GBAs. In contrast

to traditional assessments, where an individual is presented with a

distinct task to complete, GBAs integrate challenges, problems, and high-complexity tasks more

seamlessly into a continued experience as part of a narrative or in pursuit of a broader end goal

(Mislevy et al., 2014). In other words, data analysts examining the results of a traditional

assessment can see a response to a question, but the process by which the assessee arrived at that

response is neither captured nor considered. In more complex and interactive games, players

self-direct their efforts and make individual choices about how they want to progress, navigate

through space, investigate, and accomplish goals (Shute, 2011). GBAs can record and monitor

changes in candidates’ temporal micro-patterns or strategic shifts, as well as the context in which

these changes occur. The richness of such data holds significant promise for the future

development of GBAs beyond what traditional assessment is capable of: it provides evidence of

examinees’ thinking, which can itself be designed to meet the assumptions of psychometric

models (Plass et al., 2011). For example, Rupp et al. (2012)

investigated the quality of measured learning in a game-based learning activity involving

configuration of a computer network. Simple traditional outcome measures, such as how many

mistakes or correct responses students made, were not as valuable at distinguishing student

proficiency as various combinations of latent metrics from the data set. Students’ metacognitive

skills along the way, their approach (e.g., time taken, number of commands input, proportions of

command types), efficiency, and strategy usage (e.g., switching between computer devices) provided

deeper insight over and above their final solutions (Rupp et al., 2012). Future

research is needed to explore the potential of GBAs to provide rich data regarding the automated

measurement of process-oriented traits such as these.
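As a purely illustrative sketch of what such automated process measurement could look like, the R code below derives a few simple process metrics from a hypothetical gameplay event log, in the spirit of Rupp et al. (2012). The events and variable names are invented and do not come from the present GBA or from that study.

```r
# An illustrative sketch (hypothetical event log, invented variable names) of
# deriving process-oriented metrics from gameplay trace data.

events <- data.frame(
  player  = rep(1:3, each = 5),
  time    = c(0, 4, 9, 15, 22,   0, 2, 5, 7, 11,   0, 6, 14, 20, 31),
  command = c("move", "inspect", "move", "configure", "submit",
              "configure", "submit", "configure", "submit", "submit",
              "inspect", "inspect", "move", "configure", "submit")
)

process_metrics <- do.call(rbind, lapply(split(events, events$player), function(d) {
  data.frame(
    player         = d$player[1],
    time_taken     = max(d$time) - min(d$time),           # total time on task
    n_commands     = nrow(d),                             # number of commands input
    prop_configure = mean(d$command == "configure"),      # proportion of command types
    n_switches     = sum(d$command[-1] != d$command[-nrow(d)])  # strategic shifts
  )
}))
process_metrics
```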



References

Aguinis, H., & Smith, M.A. (2007). Understanding the impact of test validity and bias on

selection errors and adverse impact in human resource selection. Personnel Psychology,

60, 165-199.

Anderson, N., Salgado, J. F., & Hülsheger, U. R. (2010). Applicant reactions in selection:

Comprehensive meta-analysis into reaction generalization versus situational specificity.

International Journal of Selection and Assessment, 18, 291-304.

Apperley, T. H. (2006). Genre and game studies: Toward a critical approach to video games.

Simulation & Gaming, 37, 6-23.

Armstrong, M.B., Ferrell, J., Collmus, A. B., & Landers, R. N. (2016). Correcting

misconceptions about gamification of assessment: More than SJTs and badges. Industrial

and Organizational Psychology, 9, 671-677.

Armstrong, M. B., Landers, R. N., & Collmus, A. B. (2016). Gamifying recruitment, selection,

training, and performance management: Game-thinking in human resource management.

In D. Davis & H. Gangadharbatla (Eds.), Handbook of Research on Trends in

Gamification (pp. 140-165). Hershey, PA: Information Science Reference.

Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of

cheating and response distortion effects on unproctored internet-based tests of cognitive

ability and personality. International Journal of Selection and Assessment, 18(1), 1–16.

Arthur, W. & Villado, A. J. (2008). The importance of distinguishing between constructs and

methods when comparing predictors in personnel selection research and practice. Journal

of Applied Psychology, 93, 435-442.



Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test

taking. Personnel Psychology, 43.

Attali, Y. & Arieli-Attali, M. (2015). Gamification in assessment: Do points affect test

performance? Computers & Education, 83, 57-63. doi:10.1016/j.compedu.2014.12.012

Bartram, D. & Hambleton, R. K. (2006). Computer-based testing and the internet: Issues and

advances. New York, NY: John Wiley & Sons.

Bauer, T. N., Truxillo, D. M., Sanchez, R. J., Craig, J. M., Ferrara, P., & Campion, M. A. (2001).

Applicant reactions to selection: Development of the Selection Procedural Justice Scale

(SPJS). Personnel Psychology, 54, 387-419.

Beck, K. (1999). Embracing change with extreme programming. Computer, 32(10), 70-77.

doi:10.1109/2.796139

Bertua, C., Anderson, N., & Salgado, J.F. (2005). The predictive validity of cognitive ability

tests: A UK meta-analysis. Journal of Occupational and Organizational Psychology, 78,

387-409.

Bhatia, S., & Ryan, A. M. (2018). Hiring for the win: Game-based assessment in employee

selection. In The brave new world of eHRM 2.0. (pp. 81–110). IAP Information Age

Publishing.

Bjogvinsson, E., Ehn, P., & Hillgren, P.-A. (2012). Design things and design thinking:

Contemporary participatory design challenges. Design Issues, 28, 101-116.

Borman, W. C. & Motowidlo, S. J. (1997). Task performance and contextual performance: The

meaning for personnel selection research. Human Performance, 10, 99-109.

Brown, A. (2017). Younger men play video games, but so do a diverse group of Americans.

Pew Research Center. Retrieved from http://www.pewresearch.org/fact-



tank/2017/09/11/younger-men-play-video-games-but-so-do-a-diverse-group-of-other-

americans/

Brown, A. W., Stein, S., & Rohrer, P. L. (1936). Chicago non-verbal examination. Psychological

Corporation.

Buckhalt, J. A. (2002). A short history of g: Psychometrics’ most enduring and controversial

construct. Learning and Individual Differences, 13(2), 101–114.

Burton, N. W., Welsh, C., Kostin, I., & VanEssen, T. (2009). Toward a definition of verbal

reasoning in higher education. ETS Research Report Series, 2009(2), i-41.

doi:10.1002/j.2333-8504.2009.tb02190.x

Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of

performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations

(pp. 35-70). San Francisco, CA: Jossey-Bass.

Carretta, T. R., & Ree, M. J. (1996). Factor structure of the Air Force Officer Qualifying Test:

Analysis and comparison. Military Psychology, 8(1), 29.

Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York:

Cambridge University Press.

Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A. & Hogan, R. (2016). New talent

signals: Shiny new objects or a brave new world? Industrial and Organizational

Psychology, 9, 621-640.

Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Delbridge, K. (1997). Reactions to

cognitive ability tests: The relationships between race, test performance, face validity

perceptions, and test taking motivation. Journal of Applied Psychology, 82, 300-310.

Cleary, T.A. (1968). Test bias: Prediction of grades of Negro and White students in integrated

colleges. Journal of Educational Measurement, 5, 115-124.

Collmus, A. B. & Landers, R. N. (2019). Game-framing cognitive ability tests to improve

applicant perceptions. Journal of Personnel Psychology, 18, 157-162.

Colzato, L. S., van Leeuwen, P. J. A., van den Wildenberg, W. P. M., & Hommel, B. (2010).

DOOM’d to switch: Superior cognitive flexibility in players of first person shooter

games. Frontiers in Psychology. Retrieved from

https://www.frontiersin.org/articles/10.3389/fpsyg.2010.00008/full. doi:

10.3389/fpsyg.2010.00008

Conrad, L., Trismen, D., & Miller, R. (Eds.). (1977). Graduate Record Examinations technical

manual. Princeton, NJ: Educational Testing Service.

Deterding, S. (2011). Situated motivational affordances of game elements: A conceptual model.

Gamification: Using Game Design Elements in Non-Gaming Contexts, a Workshop at

CHI. Presented at CHI 2011, ACM, Vancouver, Canada.

Diehl, V. A. (2014). Using real-world and standardized spatial imagery tasks: Convergence,

imagery realism, and gender differences. Applied Cognitive Psychology, 28, 789-798.

doi:10.1002/acp.3061

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests.

Journal of Applied Psychology, 72(1), 19. https://doi.org/10.1037/0021-9010.72.1.19

Egenfeldt-Nielsen, S., Smith, J. H., & Tosca, S. P. (2013). Understanding video games: The

essential introduction (2nd ed.). New York, NY: Routledge.

Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Kit of factor-referenced

cognitive tests. Princeton, NJ: Educational Testing Service.



Embretson, S. (1994). Applications of cognitive design systems to test development. In C. R.

Reynolds (Ed.), Cognitive Assessment: A Multidisciplinary Perspective (pp. 107–135).

Springer US.

Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor,

& Department of Justice. (1978). Uniform guidelines on employee selection procedures.

Federal Register, 43, 38290–39315.

Gee, J.P. (2007). What video games have to teach us about learning and literacy (2nd ed.). New

York: Palgrave.

Georgiou, K., Gouras, A., & Nikolaou, I. (2019). Gamification in employee selection: The

development of a gamified assessment. International Journal of Selection and

Assessment, 27(2), 91–103.

Gustafsson, J. E., & Balke, G. (1993). General and specific abilities as predictors of school

achievement. Multivariate Behavioral Research, 28(4), 407-434.

Hak, A. (2019). How to remove hiring bias through gamification. The Next Web. Retrieved from

https://thenextweb.com/work2030/2019/05/20/how-to-remove-hiring-bias-through-

gamification/

Hamstra, S. J., Brydges, R., Hatala, R., Zendejas, B., & Cook, D. A. (2014). Reconsidering

fidelity in simulation-based training. Academic Medicine, 89, 387-392.

Handler, C. (2018, June 19). The truth about game-based talent assessments. Retrieved from

https://www.ere.net/the-truth-about-game-based-talent-assessments/

Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection

procedures: An updated model and meta-analysis. Personnel Psychology, 57, 639-683.



Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal

of Statistical Software, 45(7), 1-47.

Horn, J.L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D.P. Flanagan, J.L.

Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests

and issues (pp. 53-91). New York: Guilford Press.

Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration

of adverse impact in personnel selection procedures: Issues, evidence and lessons

learned. International Journal of Selection and Assessment, 9, 152-194.

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:

Conventional criteria versus new alternatives. Structural Equation Modeling: A

Multidisciplinary Journal, 6(1), 1-55.

Huizinga, J. (2016). Homo ludens: A study of the play-element in culture. Kettering, OH:

Angelico Press.

Hunicke, R., LeBlanc, M., & Zubek, R. (2004). MDA: A formal approach to game design and

game research. In Proceedings of the AAAI Workshop on Challenges in Game AI (Vol. 4,

No. 1, pp. 1722-1726).

Hunter, J. E. (1980). Test validation for 12,000 jobs: An application of synthetic validity and

validity generalizations to the General Aptitude Test Battery (GATB). Washington, DC:

U.S. Employment Service, Department of Labor.

Hunter, J. E. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and

supervisor ratings. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance

Measurement and Theory (pp. 257–266). Routledge.



Ip, C. (2018, May 4). To find a job, play these games. Engadget. Retrieved from

https://www.engadget.com/2018/05/04/pymetrics-gamified-recruitment-behavioral-tests/

Jones, S. E. (2008). The meaning of video games: Gaming and textual strategies. New York,

NY: Routledge.

Kehoe, J. F. (2002). General mental ability and selection in private sector organizations: A

commentary. Human Performance, 15, 97-106.

Kultima, A. (2015). Developers’ perspectives on iteration in game development. In M. Turunen

(Ed.), Proceedings of the 19th International Academic MindTrek Conference (pp. 26-32).

New York, NY: ACM.

Kuncel, N. R. & Hezlett, S. A. (2010). Fact and fiction in cognitive ability testing for admissions

and hiring decisions. Current Directions in Psychological Science, 19, 339-345.

Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the

predictive validity of the Graduate Record Examinations: Implications for graduate

student selection and performance. Psychological Bulletin, 127(1), 162-181.

doi:10.1037/0033-2909.127.1.162

Kuncel, N. R., Klieger, D. M., Connelly, B. S., & Ones, D. S. (2013). Mechanical versus clinical

data combination in selection and admissions decisions: A meta-analysis. Journal of

Applied Psychology, 98, 1060-1072.

Kuncel, N. R., Wee, S., Serafin, L., & Hezlett, S. A. (2010). The validity of the Graduate Record

Examination for master’s and doctoral programs: A meta-analytic investigation.

Educational and Psychological Measurement, 70(2), 340-352.

doi:10.1177/0013164409344508

Landers, R. N., Auer, E. M., Collmus, A. B., & Armstrong, M. B. (2018). Gamification science,

its history and future: Definitions and a research agenda. Simulation & Gaming, 49(3),

315–337.

Landers, R. N., Auer, E. M., & Abraham, J. D. (2020). Gamifying a situational judgment test

with immersion and control game elements: Effects on applicant reactions and construct

validity. Journal of Managerial Psychology, 35(4), 225–239.

Landers, R. N. & Marin, S. (2021). Theory and technology in organizational psychology: A

review of technology integration paradigms and their effects on the validity of theory.

Annual Review of Organizational Psychology and Organizational Behavior, 8, 235-258.

Landers, R. N., Tondello, G. F., Kappen, D. L., Collmus, A. B., Mekler, E. D., & Nacke, L. E.

(2019). Defining gameful experience as a psychological state caused by gameplay:

Replacing the term ‘gamefulness’ with three distinct constructs. International Journal of

Human-Computer Studies.

Lang, J. W. B., Kersting, M., Hulsheger, U. R., & Lang, J. (2010). General mental ability,

narrower cognitive abilities, and job performance: The perspective of the nested-factors

model of cognitive abilities. Personnel Psychology, 63, 595-640.

Lautenschlager, G. J. & Mendoza, J. L. (1986). A step-down hierarchical multiple regression

analysis for examining hypotheses about test bias in prediction. Applied Psychological

Measurement, 10, 133-139.

Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement:

Issues and Practice, 26(4), 29–37.

MacCann, C. (2010). Further examination of emotional intelligence as a standard intelligence: A

latent variable analysis of fluid intelligence, crystallized intelligence and emotional



intelligence. Personality and Individual Differences, 49, 490-496.

doi:10.1016/j.paid.2010.05.010

Mathieu, J. E., Hollenbeck, J. R., van Knippenberg, D., & Ilgen, D. R. (2017). A century of

work teams in the Journal of Applied Psychology. Journal of Applied Psychology,

102(3), 452.

Mavridis, A. & Tsiatsos, T. (2017). Game-based assessment: Investigating the impact on test

anxiety and exam performance. Journal of Computer Assisted Learning, 33, 137-150.

doi:10.1111/jcal.12170

Meade, A. W. & Fetzer, M. (2009). Test bias, differential prediction, and a revised approach for

determining the suitability of a predictor in a selection context. Organizational Research

Methods, 12, 738-761.

Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in

educational assessment: Evidence centered design, psychometrics, and data mining.

Journal of Educational Data Mining, 4, 11–48.

Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A., Hao, J., … John, M. (2014).

Psychometric considerations in game-based assessment. Retrieved from

https://www.envisionexperience.com/~/media/files/blog/glasslab-

psychometrics.pdf?la=en

Oswald, F. L., Putka, D. J., & Ock, J. (2014). Weight a minute... What you see in a weighted

composite is probably not what you get. In C. E. Lance & R. J. Vandenberg (Eds.), More

Statistical and Methodological Myths and Urban Legends: Doctrine, Verity and Fable in

Organizational and Social Sciences (pp. 187–205). Routledge.



Plattner, H. (2011). Foreword. In H. Plattner, C. Meinel, & L. Leifer (Eds.), Design Thinking:

Understand, Improve, Apply (pp. v-vi). Berlin, Germany: Springer-Verlag.

Plattner, H., Meinel, C., & Leifer, L. (2011). Design thinking: Understand, improve, apply.

Berlin, Germany: Springer-Verlag.

Ployhart, R. E., Schmitt, N., & Tippins, N. T. (2017). Solving the Supreme Problem: 100 years

of selection and recruitment at the Journal of Applied Psychology. Journal of Applied

Psychology, 102(3), 291–304.

Primi, R. (2014). Developing a fluid intelligence scale through a combination of Rasch modeling

and cognitive psychology. Psychological Assessment, 26(3), 774.

Markland, D. (2007). The golden rule is that there are no golden rules: A commentary on Paul

Barrett’s recommendations for reporting model fit in structural equation modeling.

Personality and Individual Differences, 42, 851-858.

Marsh, H. W., Hau, K.-T., & Wen, Z. (2009). In search of golden rules: Comment on

hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in

overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11,

320-341.

Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.

Psychological Methods, 17, 437-455.

Mitchell, I. (2016). Agile development in practice. Tamare House.

Mollick, E. R. & Rothbard, N. (2014). Mandatory fun: Consent, gamification and the impact of

games at work. The Wharton School Research Paper Series. Retrieved from

https://ssrn.com/abstract=2277103

Moran, A. (2014). Agile risk management. Springer Verlag.

Morgeson, F. P., Brannick, M. T., & Levine, E. L. (2019). Job and work analysis: Methods,

research, and applications for human resource management. SAGE Publications.

Mount, M. K., Oh, I.-S., & Burns, M. (2008). Incremental validity of perceptual speed and

accuracy over general mental ability. Personnel Psychology, 61, 113-139.

Nagarajan, A., Allbeck, J. M., Sood, A., & Janssen, T. L. (2012). Exploring game design for

cybersecurity training. 2012 IEEE International Conference on Cyber Technology in

Automation, Control, and Intelligent Systems (CYBER), 256–262.

https://doi.org/10.1109/CYBER.2012.6392562

Newman, D. A., Hanges, P. J., & Outtz, J. L. (2007). Racial groups and test fairness, considering

history and construct validity. American Psychologist, 62, 1082-1083.

Nintendo of America. (1989). Tetris. Redmond, WA: Nintendo of America.

Nunnally, J. (1978). Psychometric methods. New York, NY: McGraw Hill.

Olsen, J., Aleven, V. & Rummel, N. (2017). Statistically modeling individual students’ learning

over successive collaborative practice opportunities. Journal of Educational

Measurement, 54, 123-138.

Oswald, F. L., Saad, S. & Sackett, P. R. (2000). The homogeneity assumption in differential

prediction analysis: Does it really matter? Journal of Applied Psychology, 85, 536-541.

Plass, J. L., Homer, B. D., Kinzer, C. K. & Perlin, K. (2012). Games for learning institution

(G4LI) white paper: Ideas for impact games. Retrieved from

https://gamesandimpact.org/wp-content/uploads/2012/09/PlassNYU-Ideas-for-Impact-

Games-2.pdf

Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and

alternative measures to predict job performance and reduce adverse impact: Corrected

estimates and realistic expectations. International Journal of Selection and Assessment,

13,

Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods: New

perspectives on a common problem. Organizational Research Methods, 21(3), 689–732.

https://doi.org/10.1177/1094428117697041

Pymetrics.com. (2019). Pymetrics | The Science. Retrieved from http://pymetrics.com/science/

Quiroga, M. A., Diaz, A., Román, F. J., Privado, J., & Colom, R. (2019). Intelligence and video

games: Beyond “brain-games.” Intelligence, 75, 85-94.

Quiroga, M. Á., Escorial, S., Román, F. J., Morillo, D., Jarabo, A., Privado, J., et al. (2015). Can we

reliably measure the general factor of intelligence (g) through commercial video games?

Yes, we can! Intelligence, 53, 1-7. doi:10.1016/j.intell.2015.08.004

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation

for Statistical Computing. https://www.R-project.org

Ree, M. J., & Carretta, T. R. (1994). Factor analysis of the ASVAB: Confirming a Vernon-like

structure. Educational and Psychological Measurement, 54(2), 459-463.

Reeve, C. L. & Hakel, M. D. (2002). Asking the right questions about g. Human Performance,

15, 47-74.

Richardson, M., Abraham, C., & Bond, R. (2012). Psychological correlates of university

students’ academic performance: A systematic review and meta-analysis. Psychological

Bulletin, 138, 353-387.



Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences

in cognitive ability in employment and educational settings: A meta-analysis. Personnel

Psychology, 54(2), 297-330.

Roth, P. L., & Bobko, P. (1997). A research agenda for multi-attribute utility analysis in human

resource management. Human Resource Management Review, 7(3), 341–368.

Rowe, P. (1986). Design thinking. Cambridge, MA: The MIT Press.

Rupp, A. A., DiCerbo, K. E., Sweet, S. J., Crawford, A. V., Calico, T., … Behrends, J. T. (2012).

Putting ECD into practice: The interplay of theory and data in evidence models within a

digital learning environment. Journal of Educational Data Mining, 4, 49-110.

Ryan, R. M. & Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic definitions and

new directions. Contemporary Educational Psychology, 25, 54-67.

Ryan, A. M., & Huth, M. (2008). Not much more than platitudes? A critical look at the utility of

applicant reactions research. Human Resource Management Review, 18(3), 119–132.

https://doi.org/10.1016/j.hrmr.2008.07.004

Ryan, A. M. & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and

decisions: A critical review and agenda for the future. Journal of Management, 26(3),

565-606.

Ryan, R. M., Rigby, C. S., & Przybylski, A. (2006). The motivational pull of video games: A

self-determination theory approach. Motivation and Emotion, 30, 347-363.

doi:10.1007/s11031-006-9051-8

Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on

group differences and adverse impact. Personnel Psychology, 50(3), 707–721.



Sackett, P. R., Lievens, F., Van Iddekinge, C. H., & Kuncel, N. R. (2017). Individual differences

and their measurement: A review of 100 years of research. Journal of Applied

Psychology, 102(3), 254–273.

Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology.

Journal of Applied Psychology, 85(1), 112–118.

Salgado, J.F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J.P. (2003). A

meta-analytic study of general mental ability validity for different occupations in the

European community. The Journal of Applied Psychology, 88(6), 1068-1081.

Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment

selection. Journal of Vocational Behavior, 33, 272-292.

Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there

cannot be a debate. Human performance, 15(1-2), 187-210.

Schmidt, F. L. & Hunter, J. E. (1974). Racial and ethnic bias in psychological tests: Divergent

implications of two definitions of test bias. American Psychologist, 29, 1-8.

Schmidt, F.L. & Hunter, J. E. (1998). The validity and utility of selection methods in personnel

psychology: Practical and theoretical implications of 85 years of research findings.

Psychological Bulletin, 124, 262-274.

Schneider, W. J., & McGrew, K. (2012). The Cattell-Horn-Carroll model of intelligence. In D.

Flanagan & P. Harrison (Eds.), Contemporary Intellectual Assessment: Theories, Tests,

and Issues (3rd ed., pp. 99-144). New York: Guilford.



Salen, K., Tekinbas, K. S., & Zimmerman, E. (2004). Rules of play: Game design fundamentals.

Cambridge, MA: MIT Press.

Schmidt, F. L. & Hunter, J. (2004). General mental ability in the world of work: Occupational

attainment and job performance. Journal of Personality and Social Psychology, 86(1),

162-173. doi:10.1037/0022-3514.86.1.162

Sinar, E. F., Reynolds, D. H., & Paquet, S. L. (2003). Nothing but ‘net? Corporate image and

web-based testing. International Journal of Selection and Assessment, 11, 150-157.

Singh, P. & Finn. D. (2003). The effects of information technology on recruitment. Journal of

Labor Research, 24, 395-408.

Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant

reactions to selection procedures. Personnel Psychology, 46, 49-76. doi: 10.1111/j.1744-

6570.1993.tb00867.x

Society for Industrial and Organizational Psychology. (2003). Principles for the validation and

use of personnel selection procedures (4th ed.). Retrieved from

http://www.siop.org/_principles/principles.pdf

Stanek, K. C. & Ones, D. S. (2018). Taxonomies and compendia of cognitive ability and

personality constructs and measures relevant to industrial, work and organizational

psychology. In D. S. Ones, N. Anderson, C. Viswesvaran, & H. K. Sinangil (Eds.), The

SAGE Handbook of Industrial, Work and Organizational Psychology: Personnel

Psychology and Employee Performance (pp. 366-407). Thousand Oaks, CA, SAGE

Publications Ltd.

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item

(functioning and differential) test functioning on selection decisions: When are



statistically significant effects practically important? Journal of Applied Psychology, 89,

497-508. doi:10.1037/0021-9010.89.3.497

Stauffer, J. M., Ree, M. J., & Carretta, T. R. (1996). Cognitive-components tests are not much

more than g: An extension of Kyllonen’s analyses. Journal of General

Psychology, 123(3), 193.

Steiner, D. D. & Gilliland, S. W. (1996). Fairness reactions to personnel selection techniques in

France and the United States. Journal of Applied Psychology, 81, 134-141.

Stenros, J. (2017). The game definition game: A review. Games and Culture, 12, 499-520.

Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental

Psychology, 18, 643-662.

Taylor, P. J., Driscoll, M. P. O., & Binning, J. F. (1998). A new integrated framework for

training needs analysis. Human Resource Management Journal, 8(2), 29–50.

Tucker-Drob, E. M. & Salthouse, T. A. (2009). Confirmatory factor analysis and

multidimensional scaling for construct validation of cognitive abilities. International

Journal of Behavioral Development, 33, 277-285.

Villado, A. J., Randall, J. G., & Zimmer, C. U. (2016). The effect of method characteristics on

retest score gains and criterion-related validity. Journal of Business and Psychology,

31(2), 233–248.

Viswesvaran, C. & Ones, D. S. (2000). Perspectives on models of job performance. International

Journal of Selection and Assessment, 8, 216-226.

Von Davier, A. A. (2017). Computational psychometrics in support of collaborative educational

assessments. Journal of Educational Measurement, 54, 3-11.



Von Stumm, S. (2013). Investment traits and intelligence in adulthood: Assessment and

associations. Journal of Individual Differences, 34(2), 82-89. doi:10.1027/1614-

0001/a000101

Voyer, D., Voyer, S. & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A

meta-analysis and consideration of critical variables. Psychological Bulletin, 117, 250-

270.

Weiner, E. J., & Sanchez, D. R. (2020). Cognitive ability in virtual reality: Validity evidence for

VR game-based assessments. International Journal of Selection and Assessment, 28(3),

215–235. https://doi.org/10.1111/ijsa.12295

Widaman, K. F. (1982). Stability and construct validity of second-stratum factors of ability

(Dissertation). The Ohio State University.

Workman, J. E., & Lee, S-H. (2004). A cross-cultural comparison of the apparel spatial

visualization test and paper folding test. Clothing and Textiles Research Journal, 22(1/2),

22-30.

Zhang, P. (2008). Motivational affordances: Fundamental reasons for ICT design and use.

Communications of the ACM, 51, 145-157.
