In their attempt to achieve maximal test performance, students behave in ways that they perceive to be in line with their goal. However, their behaviour is guided by heuristics and biases, an interaction of which leads to either a beneficial or detrimental outcome. We collectively label heuristics and biases 'test strategies', explaining how these influence students' test behaviour for better or worse. For one of these strategies, we also provide empirical evidence from the 2009 edition of the Programme for International Student Assessment (?) to support our argument. We then present a new approach to test scoring that aims to uncover these strategies as much as possible to give us new insights about students. This also has the consequence of preserving test score validity.
2 KOH, MARIS AND COOMANS
result in a helpful or hindering outcome. This outcome can lead to very different implications for their final scores. Because students do the best they can to maximise their scores, and because their actions are guided by heuristics and biases, we have decided to collapse these two terms into one for convenience's sake. From this point onwards, we shall refer to them collectively as test strategies.

We now present two illustrative examples to sketch a picture of how strategies can lead to differential test outcomes.

Two Illustrations of Test Strategy Use

Consider two students with identical test scores. One has employed a set of adaptive test strategies while going through the questions, especially with regards to those questions to which the answer was not known. The other has not used these strategies (or, at least, has used them to a much lesser extent). This means that the latter's test score is (almost) fully reflective of ability in the tested subject. A casual observer cannot therefore conclude that both students are comparable in terms of their ability, because one of them has achieved this test score with strategic aids, in addition to ability.

On the flip side of the coin, consider two students with an identical ability level on some academic subject. One uses adaptive test strategies and the other does not, resulting in them getting different test scores. An observer who looks at both students' scores would likely conclude that they are not comparable in terms of academic ability, even though this is, in reality, the case.

What these examples demonstrate is that test strategies have the potential to distort how comparable students' test scores are when some students use them and others do not. In other words, these strategies are potential threats to test score validity.

Examples of Test Strategies

Before continuing, it is helpful to define the boundaries of our discussion. We shall limit our discussion to multiple-choice questions (MCQs). This is a test format that will be familiar to most people: for each question, there are several alternatives to choose from as the correct answer(s). We further limit our discussion by only considering the case where there is one correct answer to an MCQ, since there are variations that involve selecting more than one option. (For tests that are marked by machine, this is reflected in the shading of multiple lozenges on the optical response sheet.)

Three test strategies come to mind: answer guessing, answer switching and answer elimination. In a multiple-choice context, guessing and switching are self-explanatory. The former involves picking an answer from among the alternatives (either at random or through some systematic selection process, like elimination), while the latter involves going back to a previously-answered question and altering the response.

Answer elimination involves examining the given alternatives and crossing out those alternatives that are definitely incorrect. This can be used in a situation where one does not know what the correct answer is (in conjunction with the guessing strategy), or in a situation where one knows that a subset of the alternatives contains the correct answer but does not know which option (within the subset) to select.

Characteristics Influencing Test Strategy Use

We now have some idea of what test strategies are, but what influences students to use them differentially? We believe there are three characteristics of students that influence whether and how they use test strategies, at least on a multiple-choice test. These are (a) degree of confidence, (b) representation bias and (c) risk aversion. Figure 1 is a schematic diagram depicting how these student characteristics relate to test strategies and each other.

We devote the rest of the thesis to examining the above characteristics, providing empirical results as supporting evidence for representation bias. As we better understand how they influence students' use of strategies, we will be more able to see how these strategies threaten test score validity.

Confidence Levels and Probability Distributions

We begin our discussion by introducing the idea of students' levels of confidence when they attempt MCQs. Every test taker approaches a question with some degree of confidence attached to each option. This refers to how sure one is that a given option is correct compared to the others, and is influenced by prior content knowledge.

Ideally, students would have prepared before taking a test, but in reality not everyone does so. Even among those who do, there may be gaps in knowledge. The act of evaluating the response options to select the best one exposes these gaps. For students who studied and therefore have much content knowledge, discriminating among the options is usually trivial. For students with lesser amounts of content knowledge, discriminating is much more challenging, if at all possible. The degree of confidence attached to each option will differ between the students with more content knowledge and those with less.

At this point, we should stress that the confidence level we mention in this thesis is not the same as the one used in null-hypothesis significance testing. Our confidence levels relate strictly to students' subjective perceptions of which response option most appropriately answers the test question.

It is useful to characterise a student's level of confidence meaningfully. Such a characterisation would reflect the position of the student's internal state between being completely confident and completely unconfident. Thankfully, such a metric already exists in the form of probability values. In fact, one's confidence level can be thought of as a subjective probability – the more confident about a response option one is, the more probable it is that the option is the correct answer. (We shall use confidence levels and probabilities interchangeably.)

Probabilities contain all available information about a student's choices and can be decomposed into two further sources of information, known as ranks and order statistics. In doing this, we have more insightful pieces of information that give us a richer understanding of students' thought processes while answering MCQs.
EXAM STRATEGIES AND TEST SCORE VALIDITY 3
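The idea of treating confidence levels as subjective probabilities can be sketched in code. The following is a minimal Python illustration (the thesis's own analyses were run in R); the raw ratings are hypothetical values invented for the example, not data from the thesis:

```python
# Hypothetical raw confidence ratings for a four-option MCQ, normalised
# into a subjective probability distribution over the options.
raw = [3.0, 9.0, 6.0, 2.0]            # illustrative ratings, not thesis data
pi = [r / sum(raw) for r in raw]      # confidence levels as probabilities

print(pi)  # [0.15, 0.45, 0.3, 0.1]
```

The resulting vector is non-negative and sums to one, which is what lets confidence levels and probabilities be used interchangeably.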
Figure 1. Schematic of the thesis, starting from the node labelled 'Prior content knowledge', showing how the student characteristics (level of confidence, representation bias and risk aversion) relate to the test strategies students use.
Introducing Ranks and Order Statistics

Ranks and order statistics feature throughout the thesis, and are themselves important concepts in the area of non-parametric statistics. We explain what they are and how they work, taking reference from the first chapter of ?. After defining ranks and order statistics, we put them into context with an example.

Let us first agree that, in an MCQ, a student will have varying levels of confidence attached to each response option. (We discussed this in the previous subsection.) We do not know what the values of these confidence levels are, so we can generically write that a student will have X_1, X_2, . . . , X_c as his or her confidence levels over all the options. The constant c refers to the total number of response options available.

Ranks denote the order of preference that a particular response option is associated with. Compactly, we write

R(m) = k, (1)

where 1 ≤ m ≤ c and 1 ≤ k ≤ c, and m ∈ Z^+ and k ∈ Z^+.
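The mapping from a confidence vector to ranks and order statistics defined by Equation (1) can be sketched as follows (a minimal Python illustration using the worked example's values; ties are ignored for simplicity):

```python
# Ranks and order statistics for one MCQ, following Equation (1):
# rank 1 goes to the option with the smallest confidence level.
pi = [0.15, 0.45, 0.30, 0.10]    # confidence levels for options m = 1..4

order_stats = sorted(pi)         # o: the confidence levels in ascending order
ranks = [order_stats.index(p) + 1 for p in pi]   # r: position in that ordering

print(ranks)        # [2, 4, 3, 1]
print(order_stats)  # [0.1, 0.15, 0.3, 0.45]
```

Note that the most-preferred option (confidence 0.45) receives the largest rank, c = 4, which is the reverse-order convention discussed below.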
Table 1
An Example of a Student's Ranks and Order Statistics

m    π       r    o
1    0.15    2    0.10
2    0.45    4    0.15
3    0.30    3    0.30
4    0.10    1    0.45

Equation (1) states that option m has a rank of value k – all we have done is reorganise our existing information. We can now refer to any option in terms of its inverse rank, without needing to know what that option's specific index is. In later examples, this shall prove highly convenient.

The reader will notice that option 2, with the largest confidence level of 0.45, has a rank of 4, while option 4, with the smallest confidence level of 0.10, has a rank of 1. Going back to our verbal definition of a rank, this is rather non-intuitive as what we have here is a reverse order of preference. Nonetheless, this convention is used for mathematical convenience and will be better appreciated in subsequent works focusing on an axiomatic development of our content.

The Different Phenomena Associated With Ranks and Order Statistics

Over the course of the thesis, several testing phenomena will be mentioned. These refer to test-taking strategies, scoring rules and student characteristics. The reader is invited to refer to Figure 1 for a brief overview of where these phenomena lie in relation to each other and to the grand scheme of the thesis. Terms that are presently unfamiliar will be explained as we progress through the thesis.

We wish to end our brief introduction to ranks and order statistics by stating that different phenomena relate to these pieces of information in varying degrees. Hence, different issues in the testing process are useful for giving us specific information about either ranks or order statistics. We illustrate the associations between test phenomena and ranks and order statistics in Figure 2.

From the figure, there are two ways to obtain information about ranks and order statistics. The first is through students' levels of confidence, denoted by π. However, this is purely hypothetical as one cannot realistically expect students to precisely and accurately quantify the degree of confidence they attach to every response option. By imposing an elimination scoring rule (which will be explained in a subsequent section), the second method presents a more plausible means of learning about ranks and order statistics. Elimination scoring also lets us learn about students' levels of risk aversion, which then tell us something about how willing they are to guess on a question (pertinent to a formula scoring scenario).

Looking at ranks and order statistics separately, the former are associated with the number-right (NR) and formula (FS) scoring rules. The latter are associated with answer guessing and switching (both test strategies), and overconfidence (a student characteristic).

Fixed and Varying Probability Distributions

Having established the relevance of ranks and order statistics to this thesis, we return to our earlier discussion of probability distributions. A probability distribution can take two forms: fixed and varying.

A fixed distribution is essentially a uniform distribution, so that all options have the same probability value – or confidence level – attached to them. Ranks and order statistics are of little use to us in this instance, since they would all be identical. They are therefore not featured when we discuss fixed distributions.

A varying distribution is one where the probability values vary across the options, and also over students. We first examine the fixed distribution before moving to the varying distribution.

Fixed Probability Distributions

Consider three students who are answering a four-alternative MCQ in some test. For argument's sake, assume that the test assesses them on a subject that they know nothing about. As a result, each student is equally confident of each presented alternative.

π_Adam = (0.25, 0.25, 0.25, 0.25)
π_Beth = (0.25, 0.25, 0.25, 0.25)
π_Cate = (0.25, 0.25, 0.25, 0.25)

The above represents a 'probability distribution' of confidence levels over the available alternatives. Their respective probability distributions are denoted by π.

Put differently, Adam, Beth and Cate all have the same probability of getting the correct answer. Yet, after they answer the question, the following results are obtained:

Adam: Correct response
Beth: Incorrect response
Cate: Omitted response (marked incorrect)

What could have resulted in the differences in their responses? Why did Cate choose to leave her answer blank? We contend that these differences in responses can be traced to differences in the degree of representation bias among the students.

Representation Bias and Scoring Rules

Before diving into the matter at hand, we first take a step back to explain what we mean by 'representation'. The term refers to a student's personal schema of how much credit is awarded to a question response. In answering a multiple-choice question, one of three outcomes is possible: the given response is correct, incorrect or omitted (that is, skipped). The amount of credit awarded to each of these three outcomes will depend on the instructions specified by the test-maker. When a student's personal schema is at odds with the 'test-specified schema', we have an instance of representation bias.

This test-specified schema is known as a scoring rule. It determines how much credit a student receives for a correct, incorrect or skipped response, which is particularly pertinent to the latter two responses. Two examples of scoring rules should help illustrate the issue.

Number-Right Scoring

In the commonest and simplest instance of a scoring rule, students receive maximum credit for a correct response and zero credit for an incorrect or a skipped response. This scoring rule is known as number-right scoring (NR scoring). When such a rule is used, the probability of receiving credit for a skipped response is zero, whereas for a submitted response, the probability is 1/c, where c indicates the number of available choices (or alternatives). Hence, for a question that one has difficulty with, it makes sense to guess, since there is a higher probability of getting the question correct compared to leaving a blank.
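The incentive that the NR rule creates can be made concrete with a short sketch (Python is used purely for illustration):

```python
# Probability of receiving any credit under NR scoring for a student with
# no relevant knowledge: a blind guess succeeds with probability 1/c,
# while a skipped item yields credit with probability 0.
def p_credit(action, c):
    return 1 / c if action == "guess" else 0.0

print(p_credit("guess", 4))  # 0.25
print(p_credit("skip", 4))   # 0.0
```

Whatever the number of alternatives c, guessing strictly dominates skipping under this rule.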
Figure 2. Different phenomena associated with ranks and order statistics. π refers to students' levels of confidence, which are expressed as probability values.
Consider, for instance, the case of a multiple-choice test where there are four alternatives within each question. The probabilities of receiving credit for either guessing or omitting an answer are shown below.

Guess: Pr(correct) = 0.25, Pr(incorrect) = 0.75
Skip: Pr(correct) = 0, Pr(incorrect) = 1

In a situation where a student has absolutely no idea what the correct answer is, the above shows that the most rational course of action is to guess.

Of course, while guessing is good for students, it is a nuisance for test markers. A student's ability cannot be accurately assessed if he or she makes use of guessing, particularly so if the guesses result in correct answers. With guessing, a student's test score will reflect both ability and the positive effects of having used the test strategy – basically, the test score is higher than it should have been.

Having a scoring rule that helps to keep guessing behaviour in check might therefore be useful. This is where formula scoring comes in.

Formula Scoring

Formula scoring (FS scoring) is a common alternative to NR scoring. In this instance, one receives maximum credit for a correct response, partial (or zero) credit for an omitted response and zero (or negative) credit for an incorrect response. As an example, a correct response receives 1 point; a skipped response, 0 points; and an incorrect response, −0.33 points.

Formula scoring is set up to penalise guessing: It forces students to think twice about making a hasty response to a question to which they do not know the answer. As such, a student answering a formula-scored test question will have to carefully consider his or her degree of knowledge – or lack thereof – before deciding what kind of response to make.

That incorrect responses are penalised constitutes a 'stick' variant of formula scoring. A 'carrot' variant also exists where, in contrast to the earlier variant, the focus is on rewarding students for omitted answers (see ?). Sticking with the example, a correct response receives 1 point, but now a skipped response might receive 0.33 points and an incorrect response, 0 points.

Regardless of reward or penalty, the underlying logic remains the same: In FS scoring, incorrect responses always receive less credit than omitted ones. This thesis uses Traub and colleagues' version of the FS rule (that is, the 'carrot' variant).

Distinguishing Between Number-Right and Formula Scoring

We can summarise the NR and FS scoring rules as follows. In NR scoring, if one guesses, there is a 0.25 probability of getting a correct answer and, hence, 1 point, and a 0.75 probability of receiving 0 points (an incorrect answer). If one skips the question, 0 points are received with certainty. In FS scoring, if one guesses, there is a 0.25 probability of receiving 1 point, and a 0.75 probability of receiving −0.33 points. As with NR scoring, if one skips the question, 0 points are received with certainty.

The main difference between NR and FS scoring rules is in how these rules distinguish incorrect and omitted responses. In the former rule, there is no difference between incorrect and omitted responses – both yield zero credit. In the latter, an incorrect response receives less credit than an omitted response.

We borrow notation from expected utility theory (?) to express this more compactly as

NR scoring:
U_guess = u(1) · 0.25 + u(0) · 0.75 > u(0) · 1 = U_skip

FS scoring:
U_guess = u(1) · 0.25 + u(−0.33) · 0.75
U_skip = u(0) · 1.

U_guess and U_skip are understood as prospects (?). A prospect is a course of action that one could take, out of several in contention. When answering a multiple-choice question, this would be to either answer or omit the question. Whichever course of action one chooses, there is an outcome that occurs with some probability. In our context, the outcomes are either 1 or 0 (correct or incorrect).

The letter u represents the utility – framed in terms of a gain or loss of points on a test – of some outcome. This
Table 2
Scoring Rules and the (Whole Number) Weights, or Awarded Credit, Associated With Them

                   Incorrect    Omitted    Correct
Number correct     0            0          2
Formula            0            1          2
Reverse formula    1            0          2

of action that are contrary to what the theory predicts. Several violations of expected utility theory are mentioned in ?.

An alternative theory is proposed by Kahneman and Tversky (??) and referred to as prospect theory. It explains why people violate expected utility theory's predictions and is a descriptive model of choice behaviour. Using prospect theory, we can rewrite our previous presentation of prospects as

NR scoring:
V_guess = v(1) · w(0.25) + v(0) · w(0.75) > v(0) · w(1) = V_skip

FS scoring:
V_guess = v(1) · w(0.25) + v(−0.33) · w(0.75)
V_skip = v(0) · w(1).

Aside from U_guess/skip turning into V_guess/skip, and u(.) turning into v(.), notice also that the probabilities are now weighted by the function w(.). The functions v(.) and w(.) constitute the crux of prospect theory. The former refers to gains and losses – in a test, this would be the amount of credit awarded or penalised. The latter function is a decision weight, which 'reflects the impact of [the probability of an outcome occurring] on the over-all value of the prospect' (?). (V_skip could also have been written as V_omit, but we decided to go with the former for reasons of taste.)

Can we actually observe students who differ qualitatively in their guessing and omitting behaviour? How might we go about gathering evidence of this in practice?

Empirical Evidence: PISA

We sought to answer these questions by analysing publicly-available data from the 2009 edition of the Programme for International Student Assessment, or PISA (?). This is an international assessment of 15-year-old students in member and partner countries of the Organisation for Economic Co-operation and Development (OECD), administered once every three years.

Students are tested on three domains: mathematics, reading and science. Several test forms, or booklets, are available. For PISA 2009, 21 booklets were published. We only used 20 as the last one was meant for special education students and therefore contained items that were much easier than the others. The items in a test booklet were a mixture of both dichotomous and polytomous questions.

In a dichotomous item, a response is either entirely correct or entirely incorrect. Up until now, this has been the type of response we have been considering. A polytomous item, on the other hand, allows for partially correct responses. An easy example is an open-ended mathematics question, where it is possible to obtain partial credit for one's working even if the final answer is incorrect. This should not be confused with FS scoring – a response scored using this rule is still dichotomous. Whether partial credit is given, as in the 'carrot' variant of FS, depends on whether a response is omitted or incorrect and has nothing to do with a response's correctness.

non-zero). If one does not answer, one receives zero points with certainty. If one answers, one may or may not receive credit (but zero credit is no longer a certainty).

We make use of this connection when presenting our empirical example. In the example, we will also present another reason that we use the RF rule.

Misrepresenting the Scoring Rule

We focused only on dichotomous mathematics items (where 1 = correct and 0 = incorrect) and removed all polytomous items from analysis. Maths items were chosen because these were considered to be least likely to be affected by differences in translation (PISA tests are administered to students in their local languages). We then fitted a latent class model to the students' responses.

Such a model is also known as a finite mixture model and is used in the analysis of multivariate and categorical data. It is appropriate when a researcher believes that item response data belong to unobserved groups – termed 'classes' – which are distinct from each other (??, p. 166). After accounting for the dependence introduced by class membership, the responses within each class are independent of each other.

The RF and FS Latent Classes

For our purposes, we believed that students' PISA responses belonged to one of two scoring rule latent classes: RF or FS. Given our earlier points, including the FS rule is an expected choice, but the reader might be wondering why we opted for the RF rule despite talking about misrepresenting the number-right rule.

We used the RF rule for statistical convenience. The RF and FS scoring rules are modelled using partial credit models – this made the latent class analysis less complicated in contrast to using the NR and FS scoring rules. (The NR rule is modelled using a Rasch model, which is statistically different from a partial credit model.) As we pointed out earlier, since the RF scoring rule is conceptually similar to the NR rule, comparing the RF and FS rules is conceptually similar to comparing the NR and FS rules.

We explain – in passing – what the Rasch and partial credit models are before presenting our analysis results for our PISA data.

What Are the Rasch and Partial Credit Models?

These are statistical models that are used to determine the probability of an item response being correct. In our analysis, we were also interested in the probability of an item response being omitted or incorrect. The Rasch model (see ??) is based on the logistic function, which is written as

f(x) = e^x / (1 + e^x). (7)

The original argument x is replaced by the term θ − δ, so that we have the probability function

Pr(x = 1 | θ) = e^(θ−δ) / (1 + e^(θ−δ)), (8)

where x = 1 for a correct response and x = 0 for an incorrect response. θ indicates an individual's ability, while δ indicates an item's difficulty. These are measured on a common latent, or unobserved, scale, which will be further explained in the section on varying probability distributions and answer switching.

The term θ − δ thus tells us about the disparity between a student's ability and an item's difficulty. When the difference is positive (that is, when θ > δ), the greater the
disparity, the greater the probability that an item will be correctly answered. The reverse is true when the difference between θ and δ is negative (θ < δ) – the greater the disparity, the smaller the probability that an item will be correctly answered.

A graphical representation of the Rasch model is found in Figure 3. Just as we have a curve representing item correctness (solid line), we also have one representing item incorrectness (broken line). We would write the probability function for the 'incorrectness curve' as

Pr(x = 0 | θ) = e^(−(θ−δ)) / (1 + e^(−(θ−δ))). (9)

Figure 3. An illustration of a Rasch model, used to represent the NR scoring rule. The curves relate to the probabilities of a correct response (solid line) and incorrect response (broken line) as ability, θ, increases.

One issue that arises is that the above curve does not distinguish between incorrect and omitted responses. All it states is that the probability that an answer is incorrect decreases as a student's ability increases. However, it says nothing about the probability of omitted responses. Although these are marked as incorrect, their probability of occurrence does not necessarily decrease with ability. This is particularly true if a student internalises the FS scoring rule, in which case the probability of occurrence is equal for all levels of ability.

Past studies demonstrate that students who attempt tests under (test-specified) NR and FS rules are qualitatively different from each other (?; ?; see also ?). This then implies that we should not automatically assume that students with different internally-represented scoring rules are equivalent. PISA tests are scored in an NR fashion, which necessitates using the Rasch model. But because of students who do not comply with the test-specified scoring rule and differentiate between incorrect and omitted responses, there is a need to use a model that captures this difference.

Partial credit models, first introduced by ?, allow us to do just that. Unlike a Rasch model, a partial credit model consists of several probability functions that illustrate how likely a given response will occur for some category. Our categories, in this case, refer to the correct, incorrect and omitted response types. For a question with response types indexed by j, we write a partial credit model as

Pr(x = j | θ) = e^(jθ−δ_j) / Σ_{h=0}^{2} e^(hθ−δ_h). (10)

We stress that this is not the same as the polytomous PISA items we removed from analysis. Here, the (partial) credit associated with the response types can be thought of as personal utility. A correct response yields the maximum utility, while the type of internalised scoring rule determines the amount of utility elicited by omitted and incorrect responses.

The probability functions of the partial credit models characterising the RF and FS rules are shown in Figure 4. Two curves replace the 'incorrect curve' earlier seen in the Rasch model (broken and dotted lines). In RF scoring, the dotted line represents the 'omitted curve' while the broken line represents the incorrect curve. The reverse is true for FS scoring – the dotted line represents the incorrect curve and the broken line represents the omitted curve. In both cases, the solid line represents the probability of a correct response.

In layperson's terms, then, what we specified in our partial credit models was that under RF scoring, students of lower ability omit more than students of higher ability. This is because as ability, θ, increases, the probability of making an omission decreases, while the probability of making incorrect and correct answers (that is, answering the question as opposed to skipping it) increases. We specified the reverse in FS scoring, in effect stating that students of higher ability are split into those who omit and those who answer (correctly).

Results of the Latent Class Analysis

An algorithm based on the extended marginal Rasch model (?) was written in R (?, version 3.2.3) for this analysis and applied to all PISA booklets (excluding the one used for special education students). Test booklets fell into one of two broad categories: those with good latent class separation and those with poor separation. For the booklets in the second category, poor separation occurred because of insufficient maths items (the number of maths items differed across booklets), and so the algorithm was unable to decisively sort students into one class or the other.

Naturally, we shall only consider the booklets with good separation and, of these, will report results for only booklets 01 and 21, n = 40,018. These booklets shared identical maths items which were administered in the same order. (The other well-separating booklets were 05, 11, 23 and 25.) We chose these two booklets because the territories represented in booklet 01 did not overlap with those of booklet 21, so we ended up with a larger representation of students around the world.

As seen in Figure 5, most students were sorted into the RF class. This would suggest that the majority of students do not misrepresent the test-specified scoring rule as an FS rule. That said, a small but non-trivial proportion of students (circa 12 per cent) were classified as internalising the FS scoring rule.
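The item response functions above can be sketched numerically as follows. This is a minimal Python illustration of Equations (8) and (10); the difficulty values are made up for the example, and the thesis's actual analyses were run in R:

```python
import math

# Rasch probability of a correct response, Equation (8).
def rasch_p_correct(theta, delta):
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# Partial credit model of Equation (10): probabilities over categories
# j = 0, 1, 2 (e.g. the omitted, incorrect and correct response types),
# with one difficulty-like parameter delta_j per category.
def pcm_probs(theta, deltas):
    weights = [math.exp(j * theta - d) for j, d in enumerate(deltas)]
    return [w / sum(weights) for w in weights]

# When ability matches difficulty, a Rasch item is a 50/50 proposition.
print(rasch_p_correct(0.0, 0.0))            # 0.5

# The three category probabilities sum to one.
print(sum(pcm_probs(1.0, [0.0, 0.5, 1.0])))
```

Unlike the single Rasch curve, the partial credit model yields a separate curve per response category, which is what allows omitted and incorrect responses to be distinguished.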
Figure 4. An illustration of a partial credit model, with probability plotted against ability, θ. The curves relate to incorrect, omitted and correct responses. In the legend, these are abbreviated as I, O and C respectively.
Figure 5. Results of the latent class analysis, showing that the majority of students were sorted into class 0 (the RF scoring rule). The step function shown in the figure is not a single line, but rather is made up of thousands of data points representing the students who completed PISA booklets 01 and 21. That the function progresses almost instantaneously from Class 0 to Class 1 is indicative of good latent class separation.
Figure 6. Global heat map showing the proportion of students per country sorted into the FS latent class. The darker the shade of blue, the larger the proportion (range: 0 to 35 per cent of the sample). Territories with no data (that is, non-participants of PISA 2009) are shaded black.
Varying Probability Distributions and the Answer Guessing/Omitting Strategy

In the previous section, we discussed students' individual probability distributions in a simplistic setting. This allowed us to explain, in a clear and intuitive way, what makes some students guess and others omit. In the real world, students' thought processes are not identical replications of each other – each process is a function of a student's unique background and level of test preparation. Using fixed probability distributions to explain the differential use of test strategies among students thus becomes insufficient.

We now expound on a more realistic case where every student's probability distribution is unique to that person. Adam, Beth and Cate reprise their roles as our hypothetical students. We are still examining their probability distributions with regard to one question of a test. However, we now have it that the students possess varying degrees of preparation for the question. As such, their 'new' probability distributions can be written as follows:

πAdam = (0, 0.1, 0.4, 0.5)
πBeth = (0.01, 0.02, 0.07, 0.9)
πCate = (0.5, 0.4, 0.1, 0)

For this example, option 4 is the correct answer. It is immediately apparent that Adam and Beth will likely answer the question correctly. Cate, again, will not.

Ranks and Order Statistics Revisited

As stated earlier, a student's probability distribution generates information in the form of ranks and order statistics. We want to know more about these pieces of information, and test strategies and internalised scoring rules enable us to do so.

In this section, we will see that internalised scoring rules give us information about ranks and not order statistics, while guessing/omitting behaviour gives us information about order statistics but not ranks. The next section is also in the context of varying probability distributions. There, we will see how another test strategy, answer switching, likewise gives us information about order statistics but not ranks. These sections serve as elaborations of Figure 2.

Guessing/Omitting, Internalised Scoring Rules and Ranks

We see from the above that Adam and Beth's probability distributions have the same ranks but different order statistics. Recall that order statistics are the probabilities reflecting the magnitude of confidence across a question's response options, while ranks are the ordinal positions of these order statistics. Hence, Adam and Beth have the same vector of ranks,

rAdam = rBeth = (1, 2, 3, 4).

Considering Adam and Beth when they internalise the NR scoring rule is not so interesting to us, because of the lack of a penalty were they to answer the question incorrectly. However, if they internalise the FS rule, then we notice differences in their willingness to guess. (We use the credit values that we mentioned earlier: 1 point for a correct response, −0.33 points for an incorrect response and 0 points for an omitted response.) Under a formula scoring framework, the skipping prospect produces the same utility for both students,

VAdam = VBeth = v(0) · w(1).

Guessing, on the other hand, results in different utilities,

VAdam = v(1) · w(0.5) + v(−0.33) · w(0.5) and
VBeth = v(1) · w(0.9) + v(−0.33) · w(0.1).

If, for simplicity's sake, we assume that w(.) is an identity function, then VBeth > VAdam and Beth will be more willing than Adam to guess the answer to the question. (An identity function is a function whose value is the same as that of its argument. For example, in Adam's case we have w(0.5) = 0.5.)

Regardless of whether the internalised rule is NR or FS, the presence of one coerces Adam and Beth to provide some answer. In doing so, we learn about their most-preferred response option, r₄⁻¹. However, knowing this tells us nothing about the value of that rank's order statistic.

If an answer is guessed, we know that the value of the largest order statistic is large enough that it results in a utility value more favourable towards guessing than omitting. If an answer is omitted, then the converse occurs – the largest order statistic is unable to tilt the balance towards guessing, and omitting yields a larger utility. From this test strategy perspective, guessing or omitting tells us something about how sufficiently large an order statistic is, but does not associate it with a specific rank and, by extension, a specific response option.

Looking at Adam and Cate, we find that they have different ranks but the same order statistics. As such,

rAdam = (1, 2, 3, 4) while
rCate = (4, 3, 2, 1).

The utility for skipping is the same for Adam and Cate (see above) and, this time, the utility for guessing is also the same for both students. However, the framing of the students' guessing prospects is now different.

In Adam's case, the guessing prospect reads: select option 4 with a probability of 0.5 and earn 1 point; select one of the others and receive a penalty of −0.33 points. Cate's guessing prospect reads: select option 1 with a probability of 0.5 and earn 1 point; select one of the others and receive a penalty of −0.33 points. Adam and Cate's guessing prospects differ in the response option to which the greatest amount of confidence is assigned.

Added Realism: Answer Elimination as a Strategy

In addition to what we have presented, another strategy that will be familiar to many is that of answer elimination. This is particularly helpful for questions where one is not sure what the correct response option is, but can definitively rule out at least one of the distractor options.
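The quantities used in this discussion can be made concrete in a short sketch. Below, ranks and order statistics are recovered from a probability vector π, and the FS guessing utility is computed with identity v(.) and w(.), as assumed in the text; the helper names are ours:

```python
def order_statistics(pi):
    # ascending probabilities: (o1, o2, o3, o4)
    return tuple(sorted(pi))

def ranks(pi):
    # rank 1 = least-preferred option, rank 4 = most-preferred
    order = sorted(range(len(pi)), key=lambda i: pi[i])
    r = [0] * len(pi)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return tuple(r)

def fs_guess_utility(pi, v=lambda x: x, w=lambda p: p):
    # guess the most-preferred option: +1 with prob max(pi), -0.33 otherwise
    p = max(pi)
    return v(1) * w(p) + v(-0.33) * w(1 - p)

pi_adam = (0, 0.1, 0.4, 0.5)
pi_beth = (0.01, 0.02, 0.07, 0.9)
# skipping is worth v(0)·w(1) = 0 for both; guessing is worth more for Beth
```

With these definitions, Adam and Beth share the rank vector (1, 2, 3, 4) while their guessing utilities differ, mirroring the calculations above.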
Through elimination, one ends up with a reduced set of options, wherein the utility associated with making an incorrect choice becomes smaller and the utility associated with making a correct choice larger. This in turn has implications for the overall utility value, which will help with deciding whether to guess or omit.

More Realistic Prospects, More Complexity

So far, we have added some realism to our example and shown that varying probability distributions make it more complex to determine values of utility, or V. This then makes it less straightforward to identify students who are more inclined to guess versus omit. For example, given a group of students who adopt an FS rule instead of a test-specified NR rule, we know that these students will have a greater proclivity to omit their answers compared to their more compliant peers.

However, these non-compliant students will also vary in their degree of omitting. For some, omitting answers will appreciably hinder their test scores, whereas for others, this will not be much of an issue. It could be that, even though the latter subgroup view an incorrect guess unfavourably, they are sufficiently prepared and so are highly confident of a specific response option being correct. (This is what we saw with Beth.)

What interests us is the former subgroup – those who omit to the point of hindering their scores. These are the ones who would benefit the most from some kind of intervention to correct their way of interpreting a test's scoring rule. Being able to determine the utility that students get from the guessing prospect is of great use in helping us identify this subgroup.

That said, determining the values of V is exactly the problem we face. While our example involving Adam, Beth and Cate conveniently contained the values of their individual probabilities (which are also the order statistics), this is not the case in reality. For argument's sake, we simplified the weighting function, assuming that it was an identity function. Again, we do not normally expect this to be so for actual students (see ?).

This therefore makes it important to develop a method through which we can learn the values of the ranks and order statistics. This was referred to in the introduction as elimination scoring, and we will discuss it at a later point.

Varying Probability Distributions and the Answer Switching Strategy

Having talked about guessing and omitting, we move on to another test strategy often used by students: answer switching. This refers to the act of answering a question, moving on to the rest of the test, going back to the initial question some time later and changing the answer.

Since the late 1920s, several studies have investigated when and why students switch answers on multiple-choice tests. Received wisdom has it that one's initial response to a question, or item, is likeliest to be correct. One should therefore be very cautious about switching one's answers. Time and time again, however, empirical research uncovers evidence to the contrary: for students who change their answers on a test, doing so is generally beneficial to their scores.

Convention establishes three categories of answer switches on a multiple-choice test: wrong to right (WR), wrong to wrong (WW) and right to wrong (RW). Previous studies display these categories as percentages of the total number of switches made. Unsurprisingly, the WR category is of interest, and a ratio of WR to RW switches (WR/RW) is calculated to conveniently express the degree to which answer switching is beneficial.

A review by ? summarised the proportions of WR switches from 28 studies conducted between 1929 and 1983. These proportions ranged from 44.5 to 71.3 per cent, with most studies having a proportion of WR switches greater than 50 per cent (only four did not).

? and ? are more recent examples where the majority of answer switches are WR. For all these studies, the WR/RW ratio was greater than one, indicating that there were more WR switches than RW ones. These results fly in the face of received wisdom, constituting 'empirical wisdom' that answer switching is not a test behaviour to be discouraged.

We now turn to our three hypothetical students to demonstrate the relation between answer switching and order statistics.

Order Statistics and Switching

Recall that, in the case of varying probability distributions, the confidence levels of Adam, Beth and Cate with regard to a four-option MCQ are as follows:

πAdam = (0, 0.1, 0.4, 0.5)
πBeth = (0.01, 0.02, 0.07, 0.9)
πCate = (0.5, 0.4, 0.1, 0)

Between Adam and Beth, both have the highest amount of confidence that option 4 is the correct answer (which we decided earlier that it was). We also mentioned that Beth is much more confident about her choice than Adam is. What are the implications of this?

Adam has assigned similar levels of confidence to options 3 and 4 (0.4 and 0.5 respectively). That the bulk of his confidence has gone to these options (0.4 + 0.5 = 0.9, as compared to 0 + 0.1 = 0.1) indicates that he is uncertain about which is correct. For instance, Adam could decide initially to shade option 3 as the correct answer, but then change his answer to option 4 some time later. This would be an instance of a WR switch. Alternatively, he could select option 4 at first but switch it later to option 3, which would be an RW switch.

Cate's order statistics are similar to Adam's; therefore she would encounter the same uncertainty that he does. However, her indecision is between options 1 and 2, both of which are incorrect. She may select option 1 initially and switch to option 2 or vice versa, but regardless, her course of action is a WW switch.

Through the above example, we contend that answer switching is most likely to occur when the order statistics are similar to each other. But there is a catch – the values of these order statistics are unavailable to us, so how might we go about testing our hypothesis?
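The conventional bookkeeping of switches described above can be sketched as follows; the answer records are invented for illustration:

```python
# Classify each recorded answer change as wrong-to-right (WR),
# wrong-to-wrong (WW) or right-to-wrong (RW), and report the WR/RW
# ratio used in the answer-switching literature.

def classify_switches(switches, key):
    """switches: [(item, first_answer, final_answer)]; key: {item: correct}."""
    counts = {"WR": 0, "WW": 0, "RW": 0}
    for item, first, final in switches:
        correct = key[item]
        if first != correct and final == correct:
            counts["WR"] += 1
        elif first == correct and final != correct:
            counts["RW"] += 1
        elif first != correct and final != correct:
            counts["WW"] += 1
    return counts

key = {1: "D", 2: "B", 3: "A"}
switches = [(1, "C", "D"), (2, "B", "A"), (3, "B", "C"), (1, "A", "D")]
counts = classify_switches(switches, key)
wr_rw_ratio = counts["WR"] / counts["RW"]  # > 1 means switching helped overall
```

A WR/RW ratio above one, as in most of the studies cited above, indicates that switching was beneficial on balance.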
Figure 7. An illustrative latent scale showing Adam's ability (θA) relative to three test items' difficulties (δ₁, δ₂ and δ₃).

L₁ = Pr(Data | H₁),
L₂ = Pr(Data | H₂),
L₃ = Pr(Data | H₃) and
L₄ = Pr(Data | H₄).

…for an item are sufficient statistics for student ability and item difficulty respectively (?). For item difficulty, the smaller the proportion of students correctly answering the item, the more difficult that item is.

A sufficient statistic is one whose estimate contains all available information about a parameter (?), thus providing a means of summarising the given data. In our case, the parameter(s) refer to student ability and item difficulty, while the data consist of students' test responses. Since sum scores and proportions of correct answers are sufficient statistics, we can then use these as our operational definitions of our afore-mentioned parameters.

A Possible Way of Testing Our Hypothesis

A shared feature of the studies in the above-mentioned literature is that the student samples used were relatively small in comparison to the samples we are presently able to work with. For example, when we conducted the latent class analysis earlier, we used a sample of 40,018 students who participated in PISA 2009. Given the size of the data sets available to us today, we are able to count the proportions of WR, WW and RW switches for entire cohorts of students.

For example, in the Netherlands, children end their primary education (basisonderwijs) around the age of 12. During this time, they sit for the Centrale Eindtoets (?), abbreviated the Cito test after its developer, the Central Institute for Educational Measurement (Centraal Instituut voor Toetsontwikkeling). This test is conducted over three days and assesses students' mathematics and language abilities. Their test results, together with personalised reports and recommendations from teachers, determine which educational track they attend for secondary school.

When students complete the Cito test, their scripts are marked by a machine that makes digital scans when instances of answer switching are detected. These scans, combined with knowledge about student ability and item difficulty, make it possible to see if answer switching indeed occurs when one's ability level is close to that of item difficulty.

While the second author of this thesis is an employee of Cito, privacy concerns preclude any analysis of Cito test data at this time. We look forward to embarking on a large-scale analysis of these data in the future, when this hurdle is cleared. For now, we end our discussion on varying probability distributions and move on to a new scoring rule that lets us learn more about certain test characteristics of students.

Elimination Scoring

We propose elimination scoring as a new scoring rule that allows us to learn what we could not from regular scoring rules: ranks and order statistics. In addition, this new rule gives us information about which students are over- or underconfident.

…all options that they consider incorrect. Points are awarded for making a correct choice (eliminating an incorrect option or retaining the correct one). Similarly, points are lost when an incorrect choice is made (eliminating the correct option or retaining an incorrect one).

The ways in which students eliminate and retain response options give us an opportunity to learn about their ranks and order statistics. Since these ways are basically courses of action, we refer to them as prospects as well. In a four-option MCQ, there are four prospects to be considered: eliminating none of the options (prospect A), eliminating one option (prospect B), eliminating two options (prospect C) and eliminating three options (prospect D). We now go through the prospects in sequence and explain the points system associated with each.

Prospect A is rendered below (throughout these illustrations, eliminated options are shown in brackets):

1   2   3   4

Since none of the options has been eliminated, a penalty of −2 points is imposed with certainty.

Prospect B involves the least-preferred option being eliminated, r₁⁻¹. (Recall that ranks – the subscript in r⁻¹ – are expressed in reverse order of preference.)

1   [2]   3   4        (option 2 = r₁⁻¹)

If the eliminated option is the correct one, a penalty of −4 points is levied with probability o₁ (this is the smallest value of all the order statistics). Otherwise, 0 points are awarded with probability o₂ + o₃ + o₄.

Prospect C involves the elimination of the two least-preferred options, r₁⁻¹ and r₂⁻¹.

1   [2]   [3]   4      (options 2 and 3 = r₁⁻¹ and r₂⁻¹)

If either r₁⁻¹ or r₂⁻¹ happens to be the correct option, a penalty of −2 points is imposed with probability o₁ + o₂. Otherwise, 2 points are awarded with probability o₃ + o₄.

Prospect D entails full compliance with the scoring rule's instructions, so that response options r₁⁻¹, r₂⁻¹ and r₃⁻¹ are eliminated.

[1]   [2]   [3]   4    (options 1, 2 and 3 = r₃⁻¹, r₁⁻¹ and r₂⁻¹)

If any of the eliminated options is the correct one, 0 points are awarded with probability o₁ + o₂ + o₃. Otherwise, the correct response has been identified and 4 points are awarded with probability o₄.

We can thus summarise the calculation of one's utility values over the four prospects as…
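The prospect-by-prospect payoffs just described imply an expected number of points for each prospect, which can be sketched as follows (the order statistics are taken in ascending order, o₁ ≤ o₂ ≤ o₃ ≤ o₄; the function name is ours):

```python
def expected_points(o):
    """Expected points for elimination-scoring prospects A-D, given the
    order statistics o in ascending order (o1, o2, o3, o4)."""
    o1, o2, o3, o4 = o
    return {
        "A": -2.0,                         # nothing eliminated: certain -2
        "B": -4 * o1,                      # -4 iff the eliminated option is correct
        "C": -2 * (o1 + o2) + 2 * (o3 + o4),
        "D": 4 * o4,                       # 4 iff the retained option is correct
    }

beth = expected_points((0.01, 0.02, 0.07, 0.90))
# a well-prepared student expects the most points from prospect D
```

For a student as well prepared as Beth, prospect D dominates; for a student with flat order statistics, every elimination is a gamble.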
An Amalgamation of the NR, FS and RF Scoring Rules

From the above, we see that elimination scoring combines the features of the scoring rules we have mentioned. Prospect A is similar to the RF scoring rule in that it discourages guessing by imposing a penalty for it. Prospects B and C are akin to formula scoring, where a penalty is imposed if the correct answer is eliminated. In FS terms, this is the same as selecting an option other than the correct one. Prospect D corresponds to the NR rule being complied with, as points are awarded for making the right decision and none otherwise.

What elimination scoring does, then, is discourage test behaviours like guessing and non-compliance with scoring instructions. It also provides an incentive to eliminate all incorrect options, since the payoff for the correct combination of eliminations is much higher than that for prospects B and C (4 points as opposed to 0 or 2).

…something about how confident they are. For example, if we have a student who always goes with prospect D, but frequently receives 0 points for her efforts, we have reason to believe that she may be overconfident. On the other hand, if another student always goes with prospect C, but frequently receives 2 points because the correct answer is not eliminated, we learn that he may be underconfident.

Interestingly, given that a student has made two initial correct eliminations, the choice between prospects C and D is reminiscent of a problem posed by the French economist Maurice Allais (?). This problem is also detailed in ? as a starting point for developing their prospect theory of behaviour. Allais presents his problem as a choice between two gambles:

Gamble 1A:
Pr(Receive $1 million) = 1.00
Table 3
Example of Adam's Most Optimal Prospects, Given His Order Statistics and Combinations of Weighting and Value Functions (Horizontally- and Vertically-Arranged Figures, Respectively)

Original         D   D   D
Ideal            D   D   D
Overconfident    C   C   C
Underconfident   D   D   D
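The weighting functions w(.) referenced in Table 3 need not be the identity. One common parametric form, taken from Tversky and Kahneman's cumulative prospect theory rather than from this thesis, overweights small probabilities and underweights moderate-to-large ones:

```python
def tk_weight(p, gamma=0.61):
    """Probability weighting function from cumulative prospect theory
    (Tversky & Kahneman, 1992): w(p) = p^g / (p^g + (1-p)^g)^(1/g).
    gamma = 0.61 is their estimate for gains; with gamma < 1, small
    probabilities are overweighted and large ones underweighted,
    while w(0) = 0 and w(1) = 1 are preserved."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

w_small = tk_weight(0.01)  # noticeably larger than 0.01
w_large = tk_weight(0.90)  # smaller than 0.90
```

A function of this shape is one way a student could end up in the 'overconfident' or 'underconfident' rows of a table like Table 3; the exact functional forms used in the thesis are not specified here.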
Table 4
Credit Awarding System in the Revised Version of Elimination Scoring
Prospect   Points   Action on test question   Illustrative order statistics, o
VA = (1 + 0/4) · w(1)
   = 1,

VB = (1 + 1/4) · w(0.02 + 0.07 + 0.90)
   = 1.2,

VC = (1 + 2/4) · w(0.07 + 0.90)
   = 1.5 and

VD = (1 + 3/4) · w(0.90)
   = 1.6.

Given her certainty about one of the response options, it makes sense to choose prospect D, which is what we see from the above.

Figure 8. The utility curve for when prospect C – eliminating two response options – is most preferred; utility (0 to 2) is plotted against the proportion p of eliminated options (0 to 1). The broken vertical lines illustrate the reduction in utility when the suboptimal prospects B and D are chosen. This curve holds for any number of response options, not just four.
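The four calculations above follow a single pattern, which can be sketched as below. The identity weighting function is assumed by default, as in the text; the function name and dictionary layout are ours:

```python
def revised_es_utilities(o, w=lambda p: p):
    """V_k = (1 + k/4) · w(probability that the correct option survives
    k eliminations), with order statistics o in ascending order."""
    labels = ["A", "B", "C", "D"]  # k = 0, 1, 2, 3 eliminated options
    return {labels[k]: (1 + k / 4) * w(sum(o[k:])) for k in range(4)}

beth = revised_es_utilities((0.01, 0.02, 0.07, 0.90))
ignorant = revised_es_utilities((0.25, 0.25, 0.25, 0.25))
# prospect D is optimal for Beth; prospect A for the ignorant student
```

Up to the one-decimal rounding used in the text, this reproduces the values shown, and it makes explicit why a confident student is drawn to prospect D while an ignorant one should skip.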
We shall consider one final example, where a test candidate is completely ignorant. That is, the student (a) is unable to eliminate any of the incorrect response options and (b) has no inkling of what the correct option might be. Such a student has a probability distribution of π = (0.25, 0.25, 0.25, 0.25), and the prospect utility values are

VA = (1 + 0/4) · w(1)
   = 1,

VB = (1 + 1/4) · w(0.25 + 0.25 + 0.25)
   = 0.9,

VC = (1 + 2/4) · w(0.25 + 0.25)
   = 0.8 and

VD = (1 + 3/4) · w(0.25)
   = 0.4.

For such a student, the best course of action is clearly prospect A – skip the question altogether.

It is worth noting that ES scoring is geared to penalise students for choosing a suboptimal prospect, in the form of reduced utility. This helps to dissuade students from guessing on questions to which they do not know the answer, consequently preserving test score validity.

Figure 8 graphically depicts our numerical examples. It shows how a student's utility peaks only when the proportion of correctly-eliminated response options is at its optimum point, p = 0.50 (marked with a solid vertical line), given that student's set of order statistics.

The points p = 0.25 and p = 0.75, marked with broken vertical lines, represent suboptimal proportions of eliminated options. The former occurs when a student attempts to eliminate fewer options than he or she should, and the latter occurs when more options are eliminated than should be. Beyond q₂, utility continues decreasing until it reaches zero at the point where all options are eliminated. This makes sense, as eliminating all possible options is akin to eliminating the correct option in terms of undesirability, and should be treated the same way.

The Interdisciplinarity of Elimination Scoring

The ES rule represents an intersection of psychometrics and behavioural economics. While behavioural economics yields information about what an individual student might do on a test, psychometrics links students' individual performances together and makes them comparable. This also provides information about test items and the students' abilities relative to the items.

We can use ES scoring to gain a better understanding of
test-taking students. Just as we used a latent class model to identify students who internalised an FS or RF scoring rule, we can do the same thing with ES scoring. For example, despite the emphasis given to not guessing, there may still be some students who decide to do so. These non-compliant test-takers might internalise a scoring rule such as NR. A latent class model will then be useful in helping us detect them.

Of course, such a model will be more complex than the one we used for the FS and RF rules. While the NR rule is modelled using a Rasch model, the ES rule is modelled with a partial credit model (with the four prospects as the categories). Nevertheless, the logic behind the latent class analysis is the same, and we can expect to find students in two separate classes if they exist.

Latent class analysis could also be applied to students' weighting and value functions. With reference to Table 3, we might specify different classes of weighting functions, different classes of value functions and even classes representing combinations of weighting and value functions. In the third scenario, this would be a 12-class model (3 w(.) classes × 4 v(.) classes).

By combining information about the scoring rule, weighting and value function classes, it is possible to create an individualised testing profile for each student. This allows educators to understand how every student is likely to behave on a test and to identify areas in which intervention or assistance is indicated. We believe that the ES rule will be a useful tool for both students and teachers, and lead to test scores that are more reflective of students' abilities.

Summary

We now summarise what we have covered in this thesis. First, we introduced test strategies and described why they are able to influence test score validity. We limited our discussion to multiple-choice questions (MCQs), proposing that students who attempt such questions form their own distribution of probability values, or levels of confidence. We then decomposed the information provided by these probability values into ranks and order statistics, two sources of information which are independent of each other.

With regard to students' probability distributions, we first introduced a simplified scenario where the levels of confidence were identical across all response options. Ranks and order statistics were not featured in this section. However, making use of a uniform distribution allowed us to demonstrate how prospect theory (?) explains the differential use of the answer guessing/omitting strategy among students. Deciding between guessing and omitting was seen to be driven by a student's internalised scoring rule, of which we introduced three – number right (NR), formula scoring (FS) and reverse formula scoring (RF).

We provided empirical evidence of students who subscribed to either the RF or FS scoring rule, using a latent class – or finite mixture – model to classify PISA students into one of two classes. Whether or not a student was sorted into the FS latent class depended on whether answers were omitted on the test. We visualised our results as a global heat map that displayed the proportions of students in the FS class across the participating countries. For the most part, students did not misrepresent the test-specified NR rule as an FS one. For those who did misrepresent the scoring rule, the country with the highest proportion of such students was found to be Kyrgyzstan.

Next, we considered the more realistic case of varying probability distributions, and how these affected two test strategies: answer guessing/omitting and answer switching. With the former, we stated that varying distributions added to the complexity of determining which choice of strategy students would use. For the latter, we contended that answer switching is likeliest when a student's order statistics are very close in value to each other.

Finally, we introduced elimination scoring as an extension of the FS rule and a means of learning about students' ranks and order statistics. We explained how it works and gave an example using a hypothetical student's order statistics. We explained why only prospects C and D appeared, and gave justification for why prospects A and B would never be selected. Recognising that the latter two prospects have their place in a testing situation, we revised our ES rule so that all prospects would be likely, depending on students' order statistics. We added examples to demonstrate how our revised scoring rule works and proposed that ES scoring could be a useful addition to the educator's toolbox.

Discussion and Conclusion

Today's educational systems are designed for the masses, resulting in a one-size-fits-all approach when it comes to testing students. The same test scripts and assessment criteria are applied to everyone, either within the whole system or within a certain educational track. Based on a student's test results, potentially substantial and long-lasting decisions are then made with regard to his or her academic future.

Given that students are unique products of an infinite combination of backgrounds, characteristics and experiences, we cannot expect them all to respond equally to the same assessment criteria. We would like students to undergo a bespoke testing process that caters to their relative strengths and weaknesses, but this arrangement is expensive, resource-intensive and ultimately unrealistic.

However, by combining ideas from psychometrics and behavioural economics, we have presented a testing approach that takes us some way towards this ideal. Through our (revised) elimination scoring rule, students are discouraged from using score-maximising strategies – like guessing – that are not based on their existing ability. This rule also benefits students who would normally be disadvantaged in NR- or FS-scored tests, for instance because they refuse to guess or switch their answers. These students always have an optimal prospect for their personal set of order statistics, and they always receive credit as long as they choose one of the four prospects.

By implementing the ES rule, we can learn about how students approach tests and, importantly, can more accurately measure their level of content knowledge. This is particularly helpful for students who are in educational systems where high-stakes testing is the norm.