In their attempt to achieve maximal test performance, students behave in ways that they perceive to be in line with their goal. However, their behaviour is guided by heuristics and biases, an interaction of which leads to either a beneficial or detrimental outcome. We collectively label heuristics and biases 'test strategies', explaining how these influence students' test behaviour for better or worse. For one of these strategies, we also provide empirical evidence from the 2009 edition of the Programme for International Student Assessment (?) to support our argument. We then present a new approach to test scoring that aims to uncover these strategies as much as possible to give us new insights about students. This also has the consequence of preserving test score validity.
2 KOH, MARIS AND COOMANS
result in a helpful or hindering outcome. This outcome can lead to very different implications for their final scores. Because students do the best they can to maximise their scores, and because their actions are guided by heuristics and biases, we have decided to collapse these two terms into one for convenience's sake. From this point onwards, we shall refer to them collectively as test strategies.

We now present two illustrative examples to sketch a picture of how strategies can lead to differential test outcomes.

Two Illustrations of Test Strategy Use

Consider two students with identical test scores. One has employed a set of adaptive test strategies while going through the questions, especially with regards to those questions to which the answer was not known. The other has not used these strategies (or, at least, has used them to a much lesser extent). This means that the latter's test score is (almost) fully reflective of ability in the tested subject. A casual observer cannot therefore conclude that both students are comparable in terms of their ability, because one of them has achieved this test score with strategic aids, in addition to ability.

On the flip side of the coin, consider two students with an identical ability level on some academic subject. One uses adaptive test strategies and the other does not, resulting in them getting different test scores. An observer who looks at both students' scores would likely conclude that they are not comparable in terms of academic ability, even though this is, in reality, the case.

What these examples demonstrate is that test strategies have the potential to distort how comparable students' test scores are when some students use them and others do not. In other words, these strategies are potential threats to test score validity.

Examples of Test Strategies

Before continuing, it is helpful to define the boundaries of our discussion. We shall limit our discussion to multiple-choice questions (MCQs). This is a test format that will be familiar to most people: for each question, there are several alternatives to choose from as the correct answer(s). We further limit our discussion by only considering the case where there is one correct answer to an MCQ, since there are variations that involve selecting more than one option. (For tests that are marked by machine, this is reflected in the shading of multiple lozenges on the optical response sheet.)

Three test strategies come to mind: answer guessing, answer switching and answer elimination. In a multiple-choice context, guessing and switching are self-explanatory. The former involves picking an answer from among the alternatives (either at random or through some systematic selection process, like elimination), while the latter involves going back to a previously-answered question and altering the response.

Answer elimination involves examining the given alternatives and crossing out those alternatives that are definitely incorrect. This can be used in a situation where one does not know what the correct answer is (in conjunction with the guessing strategy), or in a situation where one knows that a subset of the alternatives contains the correct answer but does not know which option (within the subset) to select.

Characteristics Influencing Test Strategy Use

We now have some idea of what test strategies are, but what influences students to use them differentially? We believe there are three characteristics of students that influence whether and how they use test strategies, at least on a multiple-choice test. These are (a) degree of confidence, (b) representation bias and (c) risk aversion. Figure 1 is a schematic diagram depicting how these student characteristics relate to test strategies and each other.

We devote the rest of the thesis to examining the above characteristics, providing empirical results as supporting evidence for representation bias. As we better understand how they influence students' use of strategies, we will be more able to see how these strategies threaten test score validity.

Confidence Levels and Probability Distributions

We begin our discussion by introducing the idea of students' levels of confidence when they attempt MCQs. Every test taker approaches a question with some degree of confidence attached to each option. This refers to how sure one is that a given option is correct compared to the others, and is influenced by prior content knowledge.

Ideally, students would have prepared before taking a test, but in reality not everyone does so. Even among those who do, there may be gaps in knowledge. The act of evaluating the response options to select the best one exposes these gaps. For students who studied and therefore have much content knowledge, discriminating among the options is usually trivial. For students with lesser amounts of content knowledge, discriminating is much more challenging, if at all possible. The degree of confidence attached to each option will differ between the students with more content knowledge and those with less.

At this point, we should stress that the confidence level we mention in this thesis is not the same as the one used in null-hypothesis significance testing. Our confidence levels relate strictly to students' subjective perceptions of which response option most appropriately answers the test question.

It is useful to characterise a student's level of confidence meaningfully. Such a characterisation would reflect the position of the student's internal state between being completely confident and completely unconfident. Thankfully, such a metric already exists in the form of probability values. In fact, one's confidence level can be thought of as a subjective probability – the more confident about a response option one is, the more probable it is that the option is the correct answer. (We shall use confidence levels and probabilities interchangeably.)

Probabilities contain all available information about a student's choices and can be decomposed into two further sources of information, known as ranks and order statistics. In doing this, we have more insightful pieces of information that give us a richer understanding of students' thought processes while answering MCQs.
EXAM STRATEGIES AND TEST SCORE VALIDITY 3
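The idea of treating confidence levels as subjective probabilities can be sketched in code. The following is a minimal Python illustration (the thesis's own analyses were run in R); the raw ratings are hypothetical values invented for the example, not data from the thesis:

```python
# Hypothetical raw confidence ratings for a four-option MCQ, normalised
# into a subjective probability distribution over the options.
raw = [3.0, 9.0, 6.0, 2.0]            # illustrative ratings, not thesis data
pi = [r / sum(raw) for r in raw]      # confidence levels as probabilities

print(pi)  # [0.15, 0.45, 0.3, 0.1]
```

The resulting vector is non-negative and sums to one, which is what lets confidence levels and probabilities be used interchangeably.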
Figure 1. Schematic of the thesis, starting from the node labelled 'Prior content knowledge', showing how the student characteristics (level of confidence, representation bias and risk aversion) relate to the test strategies students use.
Introducing Ranks and Order Statistics

Ranks and order statistics feature throughout the thesis, and are themselves important concepts in the area of non-parametric statistics. We explain what they are and how they work, taking reference from the first chapter of ?. After defining ranks and order statistics, we put them into context with an example.

Let us first agree that, in an MCQ, a student will have varying levels of confidence attached to each response option. (We discussed this in the previous subsection.) We do not know what the values of these confidence levels are, so we can generically write that a student will have X_1, X_2, . . . , X_c as his or her confidence levels over all the options. The constant c refers to the total number of response options available.

Ranks denote the order of preference that a particular response option is associated with. Compactly, we write

R(m) = k, (1)

where 1 ≤ m ≤ c and 1 ≤ k ≤ c, and m ∈ Z^+ and k ∈ Z^+.
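The mapping from a confidence vector to ranks and order statistics defined by Equation (1) can be sketched as follows (a minimal Python illustration using the worked example's values; ties are ignored for simplicity):

```python
# Ranks and order statistics for one MCQ, following Equation (1):
# rank 1 goes to the option with the smallest confidence level.
pi = [0.15, 0.45, 0.30, 0.10]    # confidence levels for options m = 1..4

order_stats = sorted(pi)         # o: the confidence levels in ascending order
ranks = [order_stats.index(p) + 1 for p in pi]   # r: position in that ordering

print(ranks)        # [2, 4, 3, 1]
print(order_stats)  # [0.1, 0.15, 0.3, 0.45]
```

Note that the most-preferred option (confidence 0.45) receives the largest rank, c = 4, which is the reverse-order convention discussed below.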
Table 1
An Example of a Student's Ranks and Order Statistics

m    π       r    o
1    0.15    2    0.10
2    0.45    4    0.15
3    0.30    3    0.30
4    0.10    1    0.45

Equation (1) states that option m has a rank of value k – all we have done is reorganise our existing information. We can now refer to any option in terms of its inverse rank, without needing to know what that option's specific index is. In later examples, this shall prove highly convenient.

The reader will notice that option 2, with the largest confidence level of 0.45, has a rank of 4, while option 4, with the smallest confidence level of 0.10, has a rank of 1. Going back to our verbal definition of a rank, this is rather non-intuitive as what we have here is a reverse order of preference. Nonetheless, this convention is used for mathematical convenience and will be better appreciated in subsequent works focusing on an axiomatic development of our content.

The Different Phenomena Associated With Ranks and Order Statistics

Over the course of the thesis, several testing phenomena will be mentioned. These refer to test-taking strategies, scoring rules and student characteristics. The reader is invited to refer to Figure 1 for a brief overview of where these phenomena lie in relation to each other and to the grand scheme of the thesis. Terms that are presently unfamiliar will be explained as we progress through the thesis.

We wish to end our brief introduction to ranks and order statistics by stating that different phenomena relate to these pieces of information in varying degrees. Hence, different issues in the testing process are useful for giving us specific information about either ranks or order statistics. We illustrate the associations between test phenomena and ranks and order statistics in Figure 2.

From the figure, there are two ways to obtain information about ranks and order statistics. The first is through students' levels of confidence, denoted by π. However, this is purely hypothetical as one cannot realistically expect students to precisely and accurately quantify the degree of confidence they attach to every response option. By imposing an elimination scoring rule (which will be explained in a subsequent section), the second method presents a more plausible means of learning about ranks and order statistics. Elimination scoring also lets us learn about students' levels of risk aversion, which then tell us something about how willing they are to guess on a question (pertinent to a formula scoring scenario).

Looking at ranks and order statistics separately, the former are associated with the number-right (NR) and formula (FS) scoring rules. The latter are associated with answer guessing and switching (both test strategies), and overconfidence (a student characteristic).

Fixed and Varying Probability Distributions

Having established the relevance of ranks and order statistics to this thesis, we return to our earlier discussion of probability distributions. A probability distribution can take two forms: fixed and varying.

A fixed distribution is essentially a uniform distribution, so that all options have the same probability value – or confidence level – attached to them. Ranks and order statistics are of little use to us in this instance, since they would all be identical. They are therefore not featured when we discuss fixed distributions.

A varying distribution is one where the probability values vary across the options, and also over students. We first examine the fixed distribution before moving to the varying distribution.

Fixed Probability Distributions

Consider three students who are answering a four-alternative MCQ in some test. For argument's sake, assume that the test assesses them on a subject that they know nothing about. As a result, each student is equally confident of each presented alternative.

π_Adam = (0.25, 0.25, 0.25, 0.25)
π_Beth = (0.25, 0.25, 0.25, 0.25)
π_Cate = (0.25, 0.25, 0.25, 0.25)

The above represents a 'probability distribution' of confidence levels over the available alternatives. Their respective probability distributions are denoted by π.

Put differently, Adam, Beth and Cate all have the same probability of getting the correct answer. Yet, after they answer the question, the following results are obtained:

Adam: Correct response
Beth: Incorrect response
Cate: Omitted response (marked incorrect)

What could have resulted in the differences in their responses? Why did Cate choose to leave her answer blank? We contend that these differences in responses can be traced to differences in the degree of representation bias among the students.

Representation Bias and Scoring Rules

Before diving into the matter at hand, we first take a step back to explain what we mean by 'representation'. The term refers to a student's personal schema of how much credit is awarded to a question response. In answering a multiple-choice question, one of three outcomes is possible: the given response is correct, incorrect or omitted (that is, skipped). The amount of credit awarded to each of these three outcomes will depend on the instructions specified by the test-maker. When a student's personal schema is at odds with the 'test-specified schema', we have an instance of representation bias.

This test-specified schema is known as a scoring rule. It determines how much credit a student receives for a correct, incorrect or skipped response, which is particularly pertinent to the latter two responses. Two examples of scoring rules should help illustrate the issue.

Number-Right Scoring

In the commonest and simplest instance of a scoring rule, students receive maximum credit for a correct response and zero credit for an incorrect or a skipped response. This scoring rule is known as number-right scoring (NR scoring). When such a rule is used, the probability of receiving credit for a skipped response is zero, whereas for a submitted response, the probability is 1/c, where c indicates the number of available choices (or alternatives). Hence, for a question that one has difficulty with, it makes sense to guess, since there is a higher probability of getting the question correct compared to leaving a blank.
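The incentive that the NR rule creates can be made concrete with a short sketch (Python is used purely for illustration):

```python
# Probability of receiving any credit under NR scoring for a student with
# no relevant knowledge: a blind guess succeeds with probability 1/c,
# while a skipped item yields credit with probability 0.
def p_credit(action, c):
    return 1 / c if action == "guess" else 0.0

print(p_credit("guess", 4))  # 0.25
print(p_credit("skip", 4))   # 0.0
```

Whatever the number of alternatives c, guessing strictly dominates skipping under this rule.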
Figure 2. Different phenomena associated with ranks and order statistics. π refers to students' levels of confidence, which are expressed as probability values.
Consider, for instance, the case of a multiple-choice test where there are four alternatives within each question. The probabilities of receiving credit for either guessing or omitting an answer are shown below.

Guess: Pr(correct) = 0.25, Pr(incorrect) = 0.75
Skip: Pr(correct) = 0, Pr(incorrect) = 1

In a situation where a student has absolutely no idea what the correct answer is, the above shows that the most rational course of action is to guess.

Of course, while guessing is good for students, it is a nuisance for test markers. A student's ability cannot be accurately assessed if he or she makes use of guessing, particularly so if the guesses result in correct answers. With guessing, a student's test score will reflect both ability and the positive effects of having used the test strategy – basically, the test score is higher than it should have been.

Having a scoring rule that helps to keep guessing behaviour in check might therefore be useful. This is where formula scoring comes in.

Formula Scoring

Formula scoring (FS scoring) is a common alternative to NR scoring. In this instance, one receives maximum credit for a correct response, partial (or zero) credit for an omitted response and zero (or negative) credit for an incorrect response. As an example, a correct response receives 1 point; a skipped response, 0 points; and an incorrect response, −0.33 points.

Formula scoring is set up to penalise guessing: It forces students to think twice about making a hasty response to a question to which they do not know the answer. As such, a student answering a formula-scored test question will have to carefully consider his or her degree of knowledge – or lack thereof – before deciding what kind of response to make.

That incorrect responses are penalised constitutes a 'stick' variant of formula scoring. A 'carrot' variant also exists where, in contrast to the earlier variant, the focus is on rewarding students for omitted answers (see ?). Sticking with the example, a correct response receives 1 point, but now a skipped response might receive 0.33 points and an incorrect response, 0 points.

Regardless of reward or penalty, the underlying logic remains the same: In FS scoring, incorrect responses always receive less credit than omitted ones. This thesis uses Traub and colleagues' version of the FS rule (that is, the 'carrot' variant).

Distinguishing Between Number-Right and Formula Scoring

We can summarise the NR and FS scoring rules as follows. In NR scoring, if one guesses, there is a 0.25 probability of getting a correct answer and, hence, 1 point, and a 0.75 probability of receiving 0 points (an incorrect answer). If one skips the question, 0 points are received with certainty. In FS scoring, if one guesses, there is a 0.25 probability of receiving 1 point, and a 0.75 probability of receiving −0.33 points. As with NR scoring, if one skips the question, 0 points are received with certainty.

The main difference between NR and FS scoring rules is in how these rules distinguish incorrect and omitted responses. In the former rule, there is no difference between incorrect and omitted responses – both yield zero credit. In the latter, an incorrect response receives less credit than an omitted response.

We borrow notation from expected utility theory (?) to express this more compactly as

NR scoring:
U_guess = u(1) · 0.25 + u(0) · 0.75 > u(0) · 1 = U_skip

FS scoring:
U_guess = u(1) · 0.25 + u(−0.33) · 0.75
U_skip = u(0) · 1.

U_guess and U_skip are understood as prospects (?). A prospect is a course of action that one could take, out of several in contention. When answering a multiple-choice question, this would be to either answer or omit the question. Whichever course of action one chooses, there is an outcome that occurs with some probability. In our context, the outcomes are either 1 or 0 (correct or incorrect).

The letter u represents the utility – framed in terms of a gain or loss of points on a test – of some outcome. This
Table 2
Scoring Rules and the (Whole Number) Weights, or Awarded Credit, Associated With Them

                   Incorrect    Omitted    Correct
Number correct     0            0          2
Formula            0            1          2
Reverse formula    1            0          2

of action that are contrary to what the theory predicts. Several violations of expected utility theory are mentioned in ?.

An alternative theory is proposed by Kahneman and Tversky (??) and referred to as prospect theory. It explains why people violate expected utility theory's predictions and is a descriptive model of choice behaviour. Using prospect theory, we can rewrite our previous presentation of prospects as

NR scoring:
V_guess = v(1) · w(0.25) + v(0) · w(0.75) > v(0) · w(1) = V_skip

FS scoring:
V_guess = v(1) · w(0.25) + v(−0.33) · w(0.75)
V_skip = v(0) · w(1).

Aside from U_guess/skip turning into V_guess/skip, and u(.) turning into v(.), notice also that the probabilities are now weighted by the function w(.). The functions v(.) and w(.) constitute the crux of prospect theory. The former refers to gains and losses – in a test, this would be the amount of credit awarded or penalised. The latter function is a decision weight, which 'reflects the impact of [the probability of an outcome occurring] on the over-all value of the prospect' (?). (V_skip could also have been written as V_omit, but we decided to go with the former for reasons of taste.)

Can we actually observe students who differ qualitatively in their guessing and omitting behaviour? How might we go about gathering evidence of this in practice?

Empirical Evidence: PISA

We sought to answer these questions by analysing publicly-available data from the 2009 edition of the Programme for International Student Assessment, or PISA (?). This is an international assessment of 15-year-old students in member and partner countries of the Organisation for Economic Co-operation and Development (OECD), administered once every three years.

Students are tested on three domains: mathematics, reading and science. Several test forms, or booklets, are available. For PISA 2009, 21 booklets were published. We only used 20 as the last one was meant for special education students and therefore contained items that were much easier than the others. The items in a test booklet were a mixture of both dichotomous and polytomous questions.

In a dichotomous item, a response is either entirely correct or entirely incorrect. Up until now, this has been the type of response we have been considering. A polytomous item, on the other hand, allows for partially correct responses. An easy example is an open-ended mathematics question, where it is possible to obtain partial credit for one's working even if the final answer is incorrect. This should not be confused with FS scoring – a response scored using this rule is still dichotomous. Whether partial credit is given, as in the 'carrot' variant of FS, depends on whether a response is omitted or incorrect and has nothing to do with a response's correctness.

non-zero). If one does not answer, one receives zero points with certainty. If one answers, one may or may not receive credit (but zero credit is no longer a certainty).

We make use of this connection when presenting our empirical example. In the example, we will also present another reason that we use the RF rule.

Misrepresenting the Scoring Rule

We focused only on dichotomous mathematics items (where 1 = correct and 0 = incorrect) and removed all polytomous items from analysis. Maths items were chosen because these were considered to be least likely to be affected by differences in translation (PISA tests are administered to students in their local languages). We then fitted a latent class model to the students' responses.

Such a model is also known as a finite mixture model and is used in the analysis of multivariate and categorical data. It is appropriate when a researcher believes that item response data belong to unobserved groups – termed 'classes' – which are distinct from each other (??, p. 166). After accounting for the dependence introduced by class membership, the responses within each class are independent of each other.

The RF and FS Latent Classes

For our purposes, we believed that students' PISA responses belonged to one of two scoring rule latent classes: RF or FS. Given our earlier points, including the FS rule is an expected choice, but the reader might be wondering why we opted for the RF rule despite talking about misrepresenting the number-right rule.

We used the RF rule for statistical convenience. The RF and FS scoring rules are modelled using partial credit models – this made the latent class analysis less complicated in contrast to using the NR and FS scoring rules. (The NR rule is modelled using a Rasch model, which is statistically different from a partial credit model.) As we pointed out earlier, since the RF scoring rule is conceptually similar to the NR rule, comparing the RF and FS rules is conceptually similar to comparing the NR and FS rules.

We explain – in passing – what the Rasch and partial credit models are before presenting our analysis results for our PISA data.

What Are the Rasch and Partial Credit Models?

These are statistical models that are used to determine the probability of an item response being correct. In our analysis, we were also interested in the probability of an item response being omitted or incorrect. The Rasch model (see ??) is based on the logistic function, which is written as

f(x) = e^x / (1 + e^x). (7)

The original argument x is replaced by the term θ − δ, so that we have the probability function

Pr(x = 1 | θ) = e^(θ−δ) / (1 + e^(θ−δ)), (8)

where x = 1 for a correct response and x = 0 for an incorrect response. θ indicates an individual's ability, while δ indicates an item's difficulty. These are measured on a common latent, or unobserved, scale, which will be further explained in the section on varying probability distributions and answer switching.

The term θ − δ thus tells us about the disparity between a student's ability and an item's difficulty. When the difference is positive (that is, when θ > δ), the greater the
disparity, the greater the probability that an item will be correctly answered. The reverse is true when the difference between θ and δ is negative (θ < δ) – the greater the disparity, the smaller the probability that an item will be correctly answered.

A graphical representation of the Rasch model is found in Figure 3. Just as we have a curve representing item correctness (solid line), we also have one representing item incorrectness (broken line). We would write the probability function for the 'incorrectness curve' as

Pr(x = 0 | θ) = e^(−(θ−δ)) / (1 + e^(−(θ−δ))). (9)

Figure 3. An illustration of a Rasch model, used to represent the NR scoring rule. The curves relate to the probabilities of a correct response (solid line) and incorrect response (broken line) as ability, θ, increases.

One issue that arises is that the above curve does not distinguish between incorrect and omitted responses. All it states is that the probability that an answer is incorrect decreases as a student's ability increases. However, it says nothing about the probability of omitted responses. Although these are marked as incorrect, their probability of occurrence does not necessarily decrease with ability. This is particularly true if a student internalises the FS scoring rule, in which case the probability of occurrence is equal for all levels of ability.

Past studies demonstrate that students who attempt tests under (test-specified) NR and FS rules are qualitatively different from each other (?; ?; see also ?). This then implies that we should not automatically assume that students with different internally-represented scoring rules are equivalent. PISA tests are scored in an NR fashion, which necessitates using the Rasch model. But because of students who do not comply with the test-specified scoring rule and differentiate between incorrect and omitted responses, there is a need to use a model that captures this difference.

Partial credit models, first introduced by ?, allow us to do just that. Unlike a Rasch model, a partial credit model consists of several probability functions that illustrate how likely a given response will occur for some category. Our categories, in this case, refer to the correct, incorrect and omitted response types. For a question with response types indexed by j, we write a partial credit model as

Pr(x = j | θ) = e^(jθ−δ_j) / Σ_{h=0}^{2} e^(hθ−δ_h). (10)

We stress that this is not the same as the polytomous PISA items we removed from analysis. Here, the (partial) credit associated with the response types can be thought of as personal utility. A correct response yields the maximum utility, while the type of internalised scoring rule determines the amount of utility elicited by omitted and incorrect responses.

The probability functions of the partial credit models characterising the RF and FS rules are shown in Figure 4. Two curves replace the 'incorrect curve' earlier seen in the Rasch model (broken and dotted lines). In RF scoring, the dotted line represents the 'omitted curve' while the broken line represents the incorrect curve. The reverse is true for FS scoring – the dotted line represents the incorrect curve and the broken line represents the omitted curve. In both cases, the solid line represents the probability of a correct response.

In layperson's terms, then, what we specified in our partial credit models was that under RF scoring, students of lower ability omit more than students of higher ability. This is because as ability, θ, increases, the probability of making an omission decreases, while the probability of making incorrect and correct answers (that is, answering the question as opposed to skipping it) increases. We specified the reverse in FS scoring, in effect stating that students of higher ability are split into those who omit and those who answer (correctly).

Results of the Latent Class Analysis

An algorithm based on the extended marginal Rasch model (?) was written in R (?, version 3.2.3) for this analysis and applied to all PISA booklets (excluding the one used for special education students). Test booklets fell into one of two broad categories: those with good latent class separation and those with poor separation. For the booklets in the second category, poor separation occurred because of insufficient maths items (the number of maths items differed across booklets), and so the algorithm was unable to decisively sort students into one class or the other.

Naturally, we shall only consider the booklets with good separation and, of these, will report results for only booklets 01 and 21, n = 40,018. These booklets shared identical maths items which were administered in the same order. (The other well-separating booklets were 05, 11, 23 and 25.) We chose these two booklets because the territories represented in booklet 01 did not overlap with those of booklet 21, so we ended up with a larger representation of students around the world.

As seen in Figure 5, most students were sorted into the RF class. This would suggest that the majority of students do not misrepresent the test-specified scoring rule as an FS rule. That said, a small but non-trivial proportion of students (circa 12 per cent) were classified as internalising the FS scoring rule.
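The item response functions above can be sketched numerically as follows. This is a minimal Python illustration of Equations (8) and (10); the difficulty values are made up for the example, and the thesis's actual analyses were run in R:

```python
import math

# Rasch probability of a correct response, Equation (8).
def rasch_p_correct(theta, delta):
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# Partial credit model of Equation (10): probabilities over categories
# j = 0, 1, 2 (e.g. the omitted, incorrect and correct response types),
# with one difficulty-like parameter delta_j per category.
def pcm_probs(theta, deltas):
    weights = [math.exp(j * theta - d) for j, d in enumerate(deltas)]
    return [w / sum(weights) for w in weights]

# When ability matches difficulty, a Rasch item is a 50/50 proposition.
print(rasch_p_correct(0.0, 0.0))            # 0.5

# The three category probabilities sum to one.
print(sum(pcm_probs(1.0, [0.0, 0.5, 1.0])))
```

Unlike the single Rasch curve, the partial credit model yields a separate curve per response category, which is what allows omitted and incorrect responses to be distinguished.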
Figure 4. An illustration of a partial credit model, with probability plotted against ability, θ. The curves relate to incorrect, omitted and correct responses. In the legend, these are abbreviated as I, O and C respectively.
Figure 5. Results of the latent class analysis, showing that the majority of students were sorted into class 0 (the RF scoring rule). The step function shown in the figure is not a single line, but rather is made up of thousands of data points representing the students who completed PISA booklets 01 and 21. That the function progresses almost instantaneously from Class 0 to Class 1 is indicative of good latent class separation.
Figure 6. Global heat map showing the proportion of students per country sorted into the FS latent class. The darker the shade of blue, the larger the proportion (range: 0 to 35 per cent of the sample). Territories with no data (that is, non-participants of PISA 2009) are shaded black.
Varying Probability Distributions and the Answer Guessing/Omitting Strategy

In the previous section, we discussed students' individual probability distributions in a simplistic setting. This allowed us to explain, in a clear and intuitive way, what makes some students guess and others omit. In the real world, students' thought processes are not identical replications of each other – each process is a function of a student's unique background and level of test preparation. Using fixed probability distributions to explain the differential use of test strategies among students thus becomes insufficient.

We now expound on a more realistic case where every student's probability distribution is unique to that person. Adam, Beth and Cate reprise their roles as our hypothetical students. We are still examining their probability distributions with regard to one question of a test. However, we now have it that the students possess varying degrees of preparation for the question. As such, their 'new' probability distributions can be written as follows:

πAdam = (0, 0.1, 0.4, 0.5)
πBeth = (0.01, 0.02, 0.07, 0.9)
πCate = (0.5, 0.4, 0.1, 0)

For this example, option 4 is the correct answer. It is immediately apparent that Adam and Beth will likely answer the question correctly. Cate, again, will not.

Ranks and Order Statistics Revisited

As stated earlier, a student's probability distribution generates information in the form of ranks and order statistics. We want to know more about these pieces of information, and test strategies and internalised scoring rules enable us to do so.

In this section, we will see that internalised scoring rules give us information about ranks and not order statistics, while guessing/omitting behaviour gives us information about order statistics but not ranks. The next section is also in the context of varying probability distributions. There, we will see how another test strategy, answer switching, likewise gives us information about order statistics but not ranks. These sections serve as elaborations of Figure 2.

Guessing/Omitting, Internalised Scoring Rules and Ranks

We see from the above that Adam and Beth's probability distributions have the same ranks but different order statistics. Recall that order statistics are the probabilities reflecting the magnitude of confidence across a question's response options, while ranks are the ordinal positions of these order statistics. Hence, Adam and Beth have the same vector of ranks,

rAdam = rBeth = (1, 2, 3, 4).

Considering Adam and Beth when they internalise the NR scoring rule is not so interesting to us, because of the lack of a penalty were they to answer the question incorrectly. However, if they internalise the FS rule, then we notice differences in their willingness to guess. (We use the credit values that we mentioned earlier: 1 point for a correct response, −0.33 points for an incorrect response and 0 points for an omitted response.) Under a formula scoring framework, the skipping prospect produces the same utility for both students,

VAdam = VBeth = v(0) · w(1).

Guessing, on the other hand, results in different utilities,

VAdam = v(1) · w(0.5) + v(−0.33) · w(0.5) and
VBeth = v(1) · w(0.9) + v(−0.33) · w(0.1).

If, for simplicity's sake, we assume that w(.) is an identity function, then VBeth > VAdam and Beth will be more willing than Adam to guess the answer to the question. (An identity function is a function whose value is the same as that of its argument. For example, in Adam's case we have w(0.5) = 0.5.)

Regardless of whether the internalised rule is NR or FS, the presence of one coerces Adam and Beth to provide some answer. In doing so, we learn about their most-preferred response option, r₄⁻¹. However, knowing this tells us nothing about the value of that rank's order statistic.

If an answer is guessed, we know that the value of the largest order statistic is large enough that it results in a utility value more favourable towards guessing than omitting. If an answer is omitted, then the converse occurs – the largest order statistic is unable to tilt the balance towards guessing, and omitting yields a larger utility. From this test strategy perspective, guessing or omitting tells us something about how sufficiently large an order statistic is, but does not associate it with a specific rank and, by extension, a specific response option.

Looking at Adam and Cate, we find that they have different ranks but the same order statistics. As such,

rAdam = (1, 2, 3, 4) while
rCate = (4, 3, 2, 1).

The utility for skipping is the same for Adam and Cate (see above) and, this time, the utility for guessing is also the same for both students. However, the framing of the students' guessing prospects is now different.

In Adam's case, the guessing prospect reads: select option 4 with a probability of 0.5 and earn 1 point; select one of the others and receive a penalty of −0.33 points. Cate's guessing prospect reads: select option 1 with a probability of 0.5 and earn 1 point; select one of the others and receive a penalty of −0.33 points. Adam and Cate's guessing prospects differ in the response option to which the greatest amount of confidence is assigned.

Added Realism: Answer Elimination as a Strategy

In addition to what we have presented, another strategy that will be familiar to many is that of answer elimination. This is particularly helpful for questions where one is not sure what the correct response option is, but can definitively rule out at least one of the distractor options.
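The quantities used in this discussion can be made concrete in a short sketch. Below, ranks and order statistics are recovered from a probability vector π, and the FS guessing utility is computed with identity v(.) and w(.), as assumed in the text; the helper names are ours:

```python
def order_statistics(pi):
    # ascending probabilities: (o1, o2, o3, o4)
    return tuple(sorted(pi))

def ranks(pi):
    # rank 1 = least-preferred option, rank 4 = most-preferred
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    r = [0] * len(pi)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return tuple(r)

def fs_guess_utility(pi, v=lambda x: x, w=lambda p: p):
    # guess the most-preferred option: +1 with prob max(pi), -0.33 otherwise
    p = max(pi)
    return v(1) * w(p) + v(-0.33) * w(1 - p)

pi_adam = (0, 0.1, 0.4, 0.5)
pi_beth = (0.01, 0.02, 0.07, 0.9)
# skipping is worth v(0)·w(1) = 0 for both; guessing is worth more for Beth
```

With these definitions, Adam and Beth share the rank vector (1, 2, 3, 4) while their guessing utilities differ, mirroring the calculations above.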
Through elimination, one ends up with a reduced set of options, wherein the utility associated with making an incorrect choice becomes smaller and the utility associated with making a correct choice larger. This in turn has implications for the overall utility value, which will help with deciding whether to guess or omit.

More Realistic Prospects, More Complexity

So far, we have added some realism to our example and shown that varying probability distributions make it more complex to determine values of utility, or V. This then makes it less straightforward to identify students who are more inclined to guess versus omit. For example, given a group of students who adopt an FS rule instead of a test-specified NR rule, we know that these students will have a greater proclivity to omit their answers compared to their more compliant peers.

However, these non-compliant students will also vary in their degree of omitting. For some, omitting answers will appreciably hinder their test scores, whereas for others, this will not be much of an issue. It could be that, even though the latter subgroup view an incorrect guess unfavourably, they are sufficiently prepared and so are highly confident of a specific response option being correct. (This is what we saw with Beth.)

What interests us is the former subgroup – those who omit to the point of hindering their scores. These are the ones who would benefit the most from some kind of intervention to correct their way of interpreting a test's scoring rule. Being able to determine the utility that students get from the guessing prospect is of great use in helping us identify this subgroup.

That said, determining the values of V is exactly the problem we face. While our example involving Adam, Beth and Cate conveniently contained the values of their individual probabilities (which are also the order statistics), this is not the case in reality. For argument's sake, we simplified the weighting function, assuming that it was an identity function. Again, we do not normally expect this to be so for actual students (see ?).

This therefore makes it important to develop a method through which we can learn the values of the ranks and order statistics. This was referred to in the introduction as elimination scoring, and we will discuss it at a later point.

Varying Probability Distributions and the Answer Switching Strategy

Having talked about guessing and omitting, we move on to another test strategy often used by students: answer switching. This refers to the act of answering a question, moving on to the rest of the test, going back to the initial question some time later and changing the answer.

Since the late 1920s, several studies have investigated when and why students switch answers on multiple-choice tests. Received wisdom has it that one's initial response to a question, or item, is likeliest to be correct. One should therefore be very cautious about switching one's answers. Time and time again, however, empirical research uncovers evidence to the contrary: for students who change their answers on a test, doing so is generally beneficial to their scores.

Convention establishes three categories of answer switches on a multiple-choice test: wrong to right (WR), wrong to wrong (WW) and right to wrong (RW). Previous studies display these categories as percentages of the total number of switches made. Unsurprisingly, the WR category is of interest, and a ratio of WR to RW switches (WR/RW) is calculated to conveniently express the degree to which answer switching is beneficial.

A review by ? summarised the proportions of WR switches from 28 studies conducted between 1929 and 1983. These proportions ranged from 44.5 to 71.3 per cent, with most studies having a proportion of WR switches greater than 50 per cent (only four did not).

? and ? are more recent examples where the majority of answer switches are WR. For all these studies, the WR/RW ratio was greater than one, indicating that there were more WR switches than RW ones. These results fly in the face of received wisdom, constituting 'empirical wisdom' that answer switching is not a test behaviour to be discouraged.

We now turn to our three hypothetical students to demonstrate the relation between answer switching and order statistics.

Order Statistics and Switching

Recall that, in the case of varying probability distributions, the confidence levels of Adam, Beth and Cate with regard to a four-option MCQ are as follows:

πAdam = (0, 0.1, 0.4, 0.5)
πBeth = (0.01, 0.02, 0.07, 0.9)
πCate = (0.5, 0.4, 0.1, 0)

Between Adam and Beth, both have the highest amount of confidence that option 4 is the correct answer (which we decided earlier that it was). We also mentioned that Beth is much more confident about her choice than Adam is. What are the implications of this?

Adam has assigned similar levels of confidence to options 3 and 4 (0.4 and 0.5 respectively). That the bulk of his confidence has gone to these options (0.4 + 0.5 = 0.9, as compared to 0 + 0.1 = 0.1) indicates that he is uncertain about which is correct. For instance, Adam could decide initially to shade option 3 as the correct answer, but then change his answer to option 4 some time later. This would be an instance of a WR switch. Alternatively, he could select option 4 at first but switch it later to option 3, which would be an RW switch.

Cate's order statistics are similar to Adam's; therefore she would encounter the same uncertainty that he does. However, her indecision is between options 1 and 2, both of which are incorrect. She may select option 1 initially and switch to option 2 or vice versa, but regardless, her course of action is a WW switch.

Through the above example, we contend that answer switching is most likely to occur when the order statistics are similar to each other. But there is a catch – the values of these order statistics are unavailable to us, so how might we go about testing our hypothesis?
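The conventional bookkeeping of switches described above can be sketched as follows; the answer records are invented for illustration:

```python
# Classify each recorded answer change as wrong-to-right (WR),
# wrong-to-wrong (WW) or right-to-wrong (RW), and report the WR/RW
# ratio used in the answer-switching literature.

def classify_switches(switches, key):
    """switches: [(item, first_answer, final_answer)]; key: {item: correct}."""
    counts = {"WR": 0, "WW": 0, "RW": 0}
    for item, first, final in switches:
        correct = key[item]
        if first != correct and final == correct:
            counts["WR"] += 1
        elif first == correct and final != correct:
            counts["RW"] += 1
        elif first != correct and final != correct:
            counts["WW"] += 1
    return counts

key = {1: "D", 2: "B", 3: "A"}
switches = [(1, "C", "D"), (2, "B", "A"), (3, "B", "C"), (1, "A", "D")]
counts = classify_switches(switches, key)
wr_rw_ratio = counts["WR"] / counts["RW"]  # > 1 means switching helped overall
```

A WR/RW ratio above one, as in most of the studies cited above, indicates that switching was beneficial on balance.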
Figure 7. An illustrative latent scale showing Adam's ability (θA) relative to three test items' difficulties (δ₁, δ₂ and δ₃).

L₁ = Pr(Data | H₁),
L₂ = Pr(Data | H₂),
L₃ = Pr(Data | H₃) and
L₄ = Pr(Data | H₄).

…for an item are sufficient statistics for student ability and item difficulty respectively (?). For item difficulty, the smaller the proportion of students correctly answering the item, the more difficult that item is.

A sufficient statistic is one whose estimate contains all available information about a parameter (?), thus providing a means of summarising the given data. In our case, the parameter(s) refer to student ability and item difficulty, while the data consist of students' test responses. Since sum scores and proportions of correct answers are sufficient statistics, we can then use these as our operational definitions of our afore-mentioned parameters.

A Possible Way of Testing Our Hypothesis

A shared feature of the studies in the above-mentioned literature is that the student samples used were relatively small in comparison to the samples we are presently able to work with. For example, when we conducted the latent class analysis earlier, we used a sample of 40,018 students who participated in PISA 2009. Given the size of the data sets available to us today, we are able to count the proportions of WR, WW and RW switches for entire cohorts of students.

For example, in the Netherlands, children end their primary education (basisonderwijs) around the age of 12. During this time, they sit for the Centrale Eindtoets (?), abbreviated the Cito test after its developer, the Central Institute for Educational Measurement (Centraal Instituut voor Toetsontwikkeling). This test is conducted over three days and assesses students' mathematics and language abilities. Their test results, together with personalised reports and recommendations from teachers, determine which educational track they attend for secondary school.

When students complete the Cito test, their scripts are marked by a machine that makes digital scans when instances of answer switching are detected. These scans, combined with knowledge about student ability and item difficulty, make it possible to see if answer switching indeed occurs when one's ability level is close to that of item difficulty.

While the second author of this thesis is an employee of Cito, privacy concerns preclude any analysis of Cito test data at this time. We look forward to embarking on a large-scale analysis of these data in the future, when this hurdle is cleared. For now, we end our discussion on varying probability distributions and move on to a new scoring rule that lets us learn more about certain test characteristics of students.

Elimination Scoring

We propose elimination scoring as a new scoring rule that allows us to learn what we could not from regular scoring rules: ranks and order statistics. In addition, this new rule gives us information about which students are over- or underconfident.

…all options that they consider incorrect. Points are awarded for making a correct choice (eliminating an incorrect option or retaining the correct one). Similarly, points are lost when an incorrect choice is made (eliminating the correct option or retaining an incorrect one).

The ways in which students eliminate and retain response options give us an opportunity to learn about their ranks and order statistics. Since these ways are basically courses of action, we refer to them as prospects as well. In a four-option MCQ, there are four prospects to be considered: eliminating none of the options (prospect A), eliminating one option (prospect B), eliminating two options (prospect C) and eliminating three options (prospect D). We now go through the prospects in sequence and explain the points system associated with each.

Prospect A is rendered below (throughout these illustrations, eliminated options are shown in brackets):

1   2   3   4

Since none of the options has been eliminated, a penalty of −2 points is imposed with certainty.

Prospect B involves the least-preferred option being eliminated, r₁⁻¹. (Recall that ranks – the subscript in r⁻¹ – are expressed in reverse order of preference.)

1   [2]   3   4        (option 2 = r₁⁻¹)

If the eliminated option is the correct one, a penalty of −4 points is levied with probability o₁ (this is the smallest value of all the order statistics). Otherwise, 0 points are awarded with probability o₂ + o₃ + o₄.

Prospect C involves the elimination of the two least-preferred options, r₁⁻¹ and r₂⁻¹.

1   [2]   [3]   4      (options 2 and 3 = r₁⁻¹ and r₂⁻¹)

If either r₁⁻¹ or r₂⁻¹ happens to be the correct option, a penalty of −2 points is imposed with probability o₁ + o₂. Otherwise, 2 points are awarded with probability o₃ + o₄.

Prospect D entails full compliance with the scoring rule's instructions, so that response options r₁⁻¹, r₂⁻¹ and r₃⁻¹ are eliminated.

[1]   [2]   [3]   4    (options 1, 2 and 3 = r₃⁻¹, r₁⁻¹ and r₂⁻¹)

If any of the eliminated options is the correct one, 0 points are awarded with probability o₁ + o₂ + o₃. Otherwise, the correct response has been identified and 4 points are awarded with probability o₄.

We can thus summarise the calculation of one's utility values over the four prospects as…
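The prospect-by-prospect payoffs just described imply an expected number of points for each prospect, which can be sketched as follows (the order statistics are taken in ascending order, o₁ ≤ o₂ ≤ o₃ ≤ o₄; the function name is ours):

```python
def expected_points(o):
    """Expected points for elimination-scoring prospects A-D, given the
    order statistics o in ascending order (o1, o2, o3, o4)."""
    o1, o2, o3, o4 = o
    return {
        "A": -2.0,                         # nothing eliminated: certain -2
        "B": -4 * o1,                      # -4 iff the eliminated option is correct
        "C": -2 * (o1 + o2) + 2 * (o3 + o4),
        "D": 4 * o4,                       # 4 iff the retained option is correct
    }

beth = expected_points((0.01, 0.02, 0.07, 0.90))
# a well-prepared student expects the most points from prospect D
```

For a student as well prepared as Beth, prospect D dominates; for a student with flat order statistics, every elimination is a gamble.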
An Amalgamation of the NR, FS and RF Scoring Rules

From the above, we see that elimination scoring combines the features of the scoring rules we have mentioned. Prospect A is similar to the RF scoring rule in that it discourages guessing by imposing a penalty for it. Prospects B and C are akin to formula scoring, where a penalty is imposed if the correct answer is eliminated. In FS terms, this is the same as selecting an option other than the correct one. Prospect D corresponds to the NR rule being complied with, as points are awarded for making the right decision and none otherwise.

What elimination scoring does, then, is discourage test behaviours like guessing and non-compliance with scoring instructions. It also provides an incentive to eliminate all incorrect options, since the payoff for the correct combination of eliminations is much higher than that for prospects B and C (4 points as opposed to 0 or 2).

…something about how confident they are. For example, if we have a student who always goes with prospect D, but frequently receives 0 points for her efforts, we have reason to believe that she may be overconfident. On the other hand, if another student always goes with prospect C, but frequently receives 2 points because the correct answer is not eliminated, we learn that he may be underconfident.

Interestingly, given that a student has made two initial correct eliminations, the choice between prospects C and D is reminiscent of a problem posed by the French economist Maurice Allais (?). This problem is also detailed in ? as a starting point for developing their prospect theory of behaviour. Allais presents his problem as a choice between two gambles:

Gamble 1A:
Pr(Receive $1 million) = 1.00
Table 3
Example of Adam's Most Optimal Prospects, Given His Order Statistics and Combinations of Weighting and Value Functions (Horizontally- and Vertically-Arranged Figures, Respectively)

Original         D   D   D
Ideal            D   D   D
Overconfident    C   C   C
Underconfident   D   D   D
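The weighting functions w(.) referenced in Table 3 need not be the identity. One common parametric form, taken from Tversky and Kahneman's cumulative prospect theory rather than from this thesis, overweights small probabilities and underweights moderate-to-large ones:

```python
def tk_weight(p, gamma=0.61):
    """Probability weighting function from cumulative prospect theory
    (Tversky & Kahneman, 1992): w(p) = p^g / (p^g + (1-p)^g)^(1/g).
    gamma = 0.61 is their estimate for gains; with gamma < 1, small
    probabilities are overweighted and large ones underweighted,
    while w(0) = 0 and w(1) = 1 are preserved."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

w_small = tk_weight(0.01)  # noticeably larger than 0.01
w_large = tk_weight(0.90)  # smaller than 0.90
```

A function of this shape is one way a student could end up in the 'overconfident' or 'underconfident' rows of a table like Table 3; the exact functional forms used in the thesis are not specified here.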
Table 4
Credit Awarding System in the Revised Version of Elimination Scoring
Prospect   Points   Action on test question   Illustrative order statistics, o
VA = (1 + 0/4) · w(1)
   = 1,

VB = (1 + 1/4) · w(0.02 + 0.07 + 0.90)
   = 1.2,

VC = (1 + 2/4) · w(0.07 + 0.90)
   = 1.5 and

VD = (1 + 3/4) · w(0.90)
   = 1.6.

Given her certainty about one of the response options, it makes sense to choose prospect D, which is what we see from the above.

Figure 8. The utility curve for when prospect C – eliminating two response options – is most preferred; utility (0 to 2) is plotted against the proportion p of eliminated options (0 to 1). The broken vertical lines illustrate the reduction in utility when the suboptimal prospects B and D are chosen. This curve holds for any number of response options, not just four.
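The four calculations above follow a single pattern, which can be sketched as below. The identity weighting function is assumed by default, as in the text; the function name and dictionary layout are ours:

```python
def revised_es_utilities(o, w=lambda p: p):
    """V_k = (1 + k/4) · w(probability that the correct option survives
    k eliminations), with order statistics o in ascending order."""
    labels = ["A", "B", "C", "D"]  # k = 0, 1, 2, 3 eliminated options
    return {labels[k]: (1 + k / 4) * w(sum(o[k:])) for k in range(4)}

beth = revised_es_utilities((0.01, 0.02, 0.07, 0.90))
ignorant = revised_es_utilities((0.25, 0.25, 0.25, 0.25))
# prospect D is optimal for Beth; prospect A for the ignorant student
```

Up to the one-decimal rounding used in the text, this reproduces the values shown, and it makes explicit why a confident student is drawn to prospect D while an ignorant one should skip.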
We shall consider one final example, where a test candidate is completely ignorant. That is, the student (a) is unable to eliminate any of the incorrect response options and (b) has no inkling of what the correct option might be. Such a student has a probability distribution of π = (0.25, 0.25, 0.25, 0.25), and the prospect utility values are

VA = (1 + 0/4) · w(1)
   = 1,

VB = (1 + 1/4) · w(0.25 + 0.25 + 0.25)
   = 0.9,

VC = (1 + 2/4) · w(0.25 + 0.25)
   = 0.8 and

VD = (1 + 3/4) · w(0.25)
   = 0.4.

For such a student, the best course of action is clearly prospect A – skip the question altogether.

It is worth noting that ES scoring is geared to penalise students for choosing a suboptimal prospect, in the form of reduced utility. This helps to dissuade students from guessing on questions to which they do not know the answer, consequently preserving test score validity.

Figure 8 graphically depicts our numerical examples. It shows how a student's utility peaks only when the proportion of correctly-eliminated response options is at its optimum point, p = 0.50 (marked with a solid vertical line), given that student's set of order statistics.

The points p = 0.25 and p = 0.75, marked with broken vertical lines, represent suboptimal proportions of eliminated options. The former occurs when a student attempts to eliminate fewer options than he or she should, and the latter occurs when more options are eliminated than should be. Beyond q₂, utility continues decreasing until it reaches zero at the point where all options are eliminated. This makes sense, as eliminating all possible options is akin to eliminating the correct option in terms of undesirability, and should be treated the same way.

The Interdisciplinarity of Elimination Scoring

The ES rule represents an intersection of psychometrics and behavioural economics. While behavioural economics yields information about what an individual student might do on a test, psychometrics links students' individual performances together and makes them comparable. This also provides information about test items and the students' abilities relative to the items.

We can use ES scoring to gain a better understanding of
test-taking students. Just as we used a latent class model to identify students who internalised an FS or RF scoring rule, we can do the same thing with ES scoring. For example, despite the emphasis given to not guessing, there may still be some students who decide to do so. These non-compliant test-takers might internalise a scoring rule such as NR. A latent class model will then be useful in helping us detect them.

Of course, such a model will be more complex than the one we used for the FS and RF rules. While the NR rule is modelled using a Rasch model, the ES rule is modelled with a partial credit model (with the four prospects as the categories). Nevertheless, the logic behind the latent class analysis is the same, and we can expect to find students in two separate classes if they exist.

Latent class analysis could also be applied to students' weighting and value functions. With reference to Table 3, we might specify different classes of weighting functions, different classes of value functions and even classes representing combinations of weighting and value functions. In the third scenario, this would be a 12-class model (3 w(.) classes × 4 v(.) classes).

By combining information about the scoring rule, weighting and value function classes, it is possible to create an individualised testing profile for each student. This allows educators to understand how every student is likely to behave on a test and to identify areas in which intervention or assistance is indicated. We believe that the ES rule will be a useful tool for both students and teachers, and lead to test scores that are more reflective of students' abilities.

Summary

We now summarise what we have covered in this thesis. First, we introduced test strategies and described why they are able to influence test score validity. We limited our discussion to multiple-choice questions (MCQs), proposing that students who attempt such questions form their own distribution of probability values, or levels of confidence. We then decomposed the information provided by these probability values into ranks and order statistics, two sources of information which are independent of each other.

With regard to students' probability distributions, we first introduced a simplified scenario where the levels of confidence were identical across all response options. Ranks and order statistics were not featured in this section. However, making use of a uniform distribution allowed us to demonstrate how prospect theory (?) explains the differential use of the answer guessing/omitting strategy among students. Deciding between guessing and omitting was seen to be driven by a student's internalised scoring rule, of which we introduced three – number right (NR), formula scoring (FS) and reverse formula scoring (RF).

We provided empirical evidence of students who subscribed to either the RF or FS scoring rule, using a latent class – or finite mixture – model to classify PISA students into one of two classes. Whether or not a student was sorted into the FS latent class depended on whether answers were omitted on the test. We visualised our results as a global heat map that displayed the proportions of students in the FS class across the participating countries. For the most part, students did not misrepresent the test-specified NR rule as an FS one. For those who did misrepresent the scoring rule, the country with the highest proportion of such students was found to be Kyrgyzstan.

Next, we considered the more realistic case of varying probability distributions, and how these affected two test strategies: answer guessing/omitting and answer switching. With the former, we stated that varying distributions added to the complexity of determining which choice of strategy students would use. For the latter, we contended that answer switching is likeliest when a student's order statistics are very close in value to each other.

Finally, we introduced elimination scoring as an extension of the FS rule and a means of learning about students' ranks and order statistics. We explained how it works and gave an example using a hypothetical student's order statistics. We explained why only prospects C and D appeared, and gave justification for why prospects A and B would never be selected. Recognising that the latter two prospects have their place in a testing situation, we revised our ES rule so that all prospects would be likely, depending on students' order statistics. We added examples to demonstrate how our revised scoring rule works and proposed that ES scoring could be a useful addition to the educator's toolbox.

Discussion and Conclusion

Today's educational systems are designed for the masses, resulting in a one-size-fits-all approach when it comes to testing students. The same test scripts and assessment criteria are applied to everyone, either within the whole system or within a certain educational track. Based on a student's test results, potentially substantial and long-lasting decisions are then made with regard to his or her academic future.

Given that students are unique products of an infinite combination of backgrounds, characteristics and experiences, we cannot expect them all to respond equally to the same assessment criteria. We would like students to undergo a bespoke testing process that caters to their relative strengths and weaknesses, but this arrangement is expensive, resource-intensive and ultimately unrealistic.

However, by combining ideas from psychometrics and behavioural economics, we have presented a testing approach that takes us some way towards this ideal. Through our (revised) elimination scoring rule, students are discouraged from using score-maximising strategies – like guessing – that are not based on their existing ability. This rule also benefits students who would normally be disadvantaged in NR- or FS-scored tests, for instance because they refuse to guess or switch their answers. These students always have an optimal prospect for their personal set of order statistics, and they always receive credit as long as they choose one of the four prospects.

By implementing the ES rule, we can learn about how students approach tests and, importantly, can more accurately measure their level of content knowledge. This is particularly helpful for students who are in educational systems where high-stakes testing is the norm.