Professional Documents
Culture Documents
Abstract— We describe a method for assessing the visualization literacy (VL) of a user. Assessing how well people understand
visualizations has great value for research (e. g., to avoid confounds), for design (e. g., to best determine the capabilities of an
audience), for teaching (e. g., to assess the level of new students), and for recruiting (e. g., to assess the level of interviewees). This
paper proposes a method for assessing VL based on Item Response Theory. It describes the design and evaluation of two VL tests
for line graphs, and presents the extension of the method to bar charts and scatterplots. Finally, it discusses the reimplementation of
these tests for fast, effective, and scalable web-based use.
Index Terms—Literacy, Visualization literacy, Rasch Model, Item Response Theory
1 I NTRODUCTION
In April 2012, Jason Oberholtzer posted an article describing two fast and effective assessments of visualization literacy for well
charts that portray Portuguese historical, political, and economic established visualization techniques and tasks; and
data [36]. While acknowledging that he is not an expert on these top- • an implementation of four online tests, based on our method.
ics, Oberholtzer claims that thanks to the charts, he feels like he has “a
well-founded opinion on the country.” He attributes this to the simplic- Our immediate motivation for this work is to design a series of tests
ity and efficacy of the charts. He then concludes by stating: “Here’s that can help Information Visualization (InfoVis) researchers detect
the beauty of charts. We all get it, right?” low-ability participants when conducting online studies, in order to
But do we all really get it? Although the number of people famil- avoid possible confounds in their data. This requires the tests to be
iar with visualization continues to grow, it is still difficult to estimate short, reliable, and easy to administer. However, such tests can also be
anyone’s ability to read graphs and charts. When designing a visual- applied to many other situations, such as:
ization for non-specialists or when conducting an evaluation of a new
• designers who want to know how capable of reading visualiza-
visualization system, it is important to be able to pull apart the po-
tions their targeted audience is;
tential efficiency of the visualization and the actual ability of users to
• teachers who want to make a first or final assessment of the ac-
understand it.
quired knowledge of freshmen;
In this paper, we focus on building a set of visualization liter-
• practitioners who need to hire capable analysts; and
acy (VL) tests for line graphs, bar charts, and scatterplots. At this
• education policy-makers who may want to set a standard for vi-
point, we loosely define visualization literacy as the ability to use well-
sualization literacy.
established data visualizations (e. g., line graphs) to handle informa-
tion in an effective, efficient, and confident manner. This paper is organized in the following way. It begins with a back-
To generate these tests, we develop here a method based on Item ground section that defines the concept of literacy and discusses some
Response Theory (IRT). Traditionally IRT has been used to assess ex- of its best-known forms. Also introduced are the theoretical constructs
aminees’ abilities and other psychological constructs via predefined of information comprehension and graph comprehension, along with
tests and surveys in areas such as education [27], social sciences [15], the concepts behind Item Response Theory. Section 3 presents the ba-
and medicine [32]. Here, we first use IRT in a design phase to evaluate sic elements of our approach. Section 4 shows how these can be used
the relevance of potential test items. We then use it in an assessment to create and administer two VL tests using line graphs. In Section
phase for measuring a user’s level of visualization literacy. Finally, 5, our method is extended to bar charts and scatterplots. Section 6
based on these measures, we develop a series of tests for fast, effec- describes how our method can be used to redesign fast, effective, and
tive, and scalable web-based use. The great benefit of this method scalable web-based tests. Finally, Section 7 provides a set of “take-
is that inherits IRT’s property of making ability assessments that are away” guidelines for the development of future tests.
based not only on raw scores, but on a model that captures the stand-
ing of users on latent traits (e. g., the ability to use various graphical 2 BACKGROUND
representations). Very few studies investigate the ability of a user to extract information
As such, our main contributions are as follows: from a graphical representation such as a line graph or bar chart. And
of those that do, most make only higher-level assessments: they use
• a definition of visualization literacy,
such representations as a way to test mathematical skills, or the ability
• a method for: 1) assessing the relevance of VL test items, 2)
to handle uncertainty [37, 14, 52, 34, 35]. A few attempts do focus
assessing an examinee’s level of visualization literacy, 3) making
more on the interpretation of graphically-represented quantities [23,
21], but they base their assessments only on raw scores and limited
test items. This makes it difficult to create a true measure of VL.
• Jeremy Boy is with Inria, Telecom ParisTech, and EnsadLab. E-mail:
myjyby@gmail.com. 2.1 Literacy
• Ronald A. Rensink is with the University of British Columbia. E-mail:
rensink@psych.ubc.ca. 2.1.1 Definition
• Enrico Bertini is with NYU Polytechnic School of Engineering. E-mail: The online Oxford dictionary defines literacy as “the ability to read
ebertini@poly.edu. and write”. While historically this term has been closely tied to its
• Jean-Daniel Fekete is with Inria. E-mail: jean-daniel.fekete@inria.fr. textual dimension, it has grown to become a broader concept. Taylor
Manuscript received 31 March 2013; accepted 1 August 2013; posted online proposes the following much broader definition: “Literacy is a gate-
13 October 2013; mailed on 4 October 2013. way skill that opens to the potential for new learning and understand-
For information on obtaining reprints of this article, please send ing” [47].
e-mail to: tvcg@computer.org. Given this broader understanding, other forms of literacy can be dis-
tinguished. For example, numeracy was coined to describe the skills
needed for reasoning and applying simple numerical concepts. It was for what we will refer to as visualization literacy (VL): the ability to
intended to “represent the mirror image of [textual] literacy” [46, p. confidently use a given data visualization to translate questions spec-
269]. Like [textual] literacy, numeracy is a gateway skill. ified in the data domain into visual queries in the visual domain, as
With the advent of the Information Age, several new forms of liter- well as interpreting visual patterns in the visual domain as properties
acy have emerged. Computer literacy “refers to basic keyboard skills, in the data domain.
plus a working knowledge of how computer systems operate and of the This definition is related to several others that have been proposed
general ways in which computers can be used” [38]. Information liter- concerning visual messages. For example, a long-standing and often
acy is defined as the ability to “recognize when information is needed neglected form of literacy is visual literacy. This has been defined as
and have the ability to locate, evaluate, and use effectively the needed the “ability to understand, interpret and evaluate visual messages” [8].
information” [25]. Media literacy commonly relates to the “ability to Visual literacy is rooted in semiotics, i. e., the study of signs and sign
access, analyze, evaluate and create media in a variety of forms” [54]. processes, which distinguishes it from visualization literacy. While it
has probably been the most important form of literacy to date, it is
nowadays frowned upon; indeed, present-day literacy tests do not take
2.1.2 Higher-level Comprehension visual literacy into account.
In order to develop a meaningful measure of any form of literacy, it is Taylor [47] has advocated for the study of visual information lit-
necessary to understand the various components involved, starting at eracy, while Wainer has advocated for graphicacy [53]. Depending
the higher levels. Friel et al. [19] suggest that comprehension of in- on the context, these terms refer to the ability to read charts and di-
formation in written form involves three kinds of tasks: locating, inte- agrams, or to qualify the merging of visual and information literacy
grating, and generating information. Locating tasks require the reader teaching [2]. Because of this ambiguity, we prefer the more general
to find a piece of information based on given cues. Integrating tasks term “visualization literacy.”
require the reader to aggregate several pieces of information. Generat-
ing tasks not only require the reader to process given information but 2.2.2 Higher-level Comprehension
also require the reader to make document-based inferences or to draw Bertin [4] proposed three levels on which a graph may be interpreted:
on personal knowledge. elementary, intermediate, and comprehensive. The elementary level
Another important aspect of information comprehension is question concerns the simple extraction of information from the data. The inter-
asking, or question posing. Graesser et al. [20] posit that question mediate level concerns the detection of trends and relationships. The
posing is a major factor in text comprehension. Indeed, the ability to comprehensive level concerns the comparison of whole structures, and
pose low-level questions, i. e., to identify a series of low-level tasks, inferences based on both data and background knowledge. Similarly,
is essential for information retrieval and for achieving higher-level, or Curcio [13] distinguishes three ways of reading from a graph: from
deeper, goals. the data, between the data, and beyond the data. 1
2.1.3 Assessment The higher-level cognitive processes behind the reading of graphs
has been the concern of the area of graph comprehension. This
Several literacy tests are in common use. The two most impor- area studies the specific expectations viewers have for different graph
tant are UNESCO’s Literacy Assessment and Monitoring Programme types [50], and has highlighted many differences in the understanding
(LAMP) [51] and the Organisation for Economic Co-operation and of novices and expert viewers [16, 28, 29, 48].
Development (OECD)’s Programme for International Student Assess- Several influential models of graph comprehension have been pro-
ment (PISA) [37]. Other assessments include the Adult Literacy posed. For example, Pinker [39] describes a three-way interaction be-
and Lifeskills Survey (ALL) [33], the International Adult Literacy tween the visual features of a display, processes of perceptual organi-
Survey (IALS) [26], and the Miller Word Identification Assessment zation, and what he calls the graph schema, which directs the search
(MWIA) [31]. for information in the particular graph. Several other models are sim-
Assessments are also made using more local scales like the US Na- ilar (see Trickett and Trafton [49]). All involve the following steps:
tional Assessment of Adult Literacy (NAAL) [3], the UK’s Depart-
ment for Education Numeracy Skills Tests [14], or the University of
Kent’s Numerical Reasoning Test [52]. 1. the user has a pre-specified goal to extract a specific piece of
Most of these tests, however, take basic literacy skills for granted, information
and focus on higher-level assessments. For example the PISA test 2. the user looks at the graph and the graph schema and gestalt pro-
is designed for 15 year-olds who are finishing compulsory education. cesses are activated
This implies that examinees should have learned–and still remember– 3. the salient features of the graph are encoded, based on these
the basic skills required for reading and counting. As such, testing gestalt principles
basic literacy skills is irrelevant. It is only when examinees clearly fail 4. the user now knows which cognitive/interpretative strategies to
these tests that certain measures are deployed to test the lower-level use, because the graph is familiar
skills. 5. the user extracts the necessary goal-directed visual chunks
NAAL provides a set of 2 complementary tests administered to ex- 6. the user may compare 2 or more visual chunks
aminees who fail the main Textual literacy test [3]: the Fluency Ad- 7. the user extracts the relevant information to satisfy the goal
dition to NAAL (FAN) and The Adult Literacy Supplemental Assess-
ment (ALSA). These tests focus on adults’ ability to read single words
and small passages. Visual “chunking” consists in segmenting a visual display into smaller
Meanwhile, MWIA tests whole-word dyslexia. It has 2 levels, each parts, or chunks [28]. Each chunk represents a set of entities that have
of which contains 2 lists of words, one Holistic and one Phonetic, that been grouped according to gestalt principles. Chunks can in turn be
examinees are asked to read aloud. Evaluation is based on time spent subdivided into smaller chunks.
reading and number of words missed. Proficient readers should find Shah [45] identifies two cognitive processes that occur during
such tests extremely easy, while low ability readers should find them stages 2 through 6 of this model:
more challenging.
1. a top-down process where the viewer’s prior knowledge of se-
2.2 Visualization Literacy mantic content influences data interpretation, and
2.2.1 Definition 2. a bottom-up process where the viewer shifts from perceptual pro-
cesses to interpretation.
The view of literacy as a gateway skill can also be applied to the ex-
traction and manipulation of information from graphical representa- 1 For further reference, refer to Friel et al.’s Taxonomy of Skills Required for
tions such as line graphs or bar charts. In particular, it can be the basis Answering Questions at Each Level [19].
These processes are then interactively applied to different chunks, whether Chris was simply lucky (he guessed the answers), or whether
suggesting that the interpretation process is serial and incremental. he is in fact able to get the simpler items right, even though he didn’t
However Carpenter & Shah [9] have shown that graph comprehen- this time.
sion, and more specifically visual feature encoding, is rather an itera- Imagine now that Rob, Jenny, and Chris take a second VL test. Rob
tive process than a straight-forward serial process. gets a score of 3, Chris gets 4, and Jenny gets 10. We would infer that
Freedman & Shah [17] relate the top-down and bottom-up pro- this test is easier, since the grades are higher. However, if we look at
cesses respectively to a construction and an integration phase. Dur- the intervals between examinees’ scores, we see that Jenny is 7 points
ing the construction phase, the viewer activates prior graphical knowl- ahead of Rob in this test, whereas she was only 5 points ahead in the
edge, i. e., the graph schema, and domain knowledge to construct a co- first one. If we were to truly measure abilities, we would want these
herent conceptual representation of the available information. During intervals to be invariant. In addition, seeing that Chris’ score is once
the integration phase, disparate knowledge is activated by “reading” again similar to Rob’s (knowing that they both got the same items right
the graph and is combined to form a coherent representation. These this time), would lead us to think that they do in fact have similar levels
two phases take place in alternating cycles. This suggests that do- of ability. However, this test actually provides more information on
main knowledge can influence the interpretation of graphs. However, lower abilities, since it was able to differentiate Chris and Rob, while
highly visualization-literate people should suffer less influence of both the first test could not.
the top-down and bottom-up processes [45]. Finally, imagine that all three examinees take a third VL test. This
time, they all get scores of 10. We may conclude that this test is not
2.2.3 Assessment good for detecting the ability to read graphs. However, another possi-
Relatively little has been done on the assessment of literacy involving bility would be that all items in this test share the same (low) level of
graphical representations. However, interesting work has been done difficulty. This would mean that this test does in fact assess ability to
on measuring the perceptual abilities of a user to extract information read graphs, but its items are redundant and not very discriminant.
from several types of graphical representations. For example, various One way of fulfilling all of these requirements is by using Item Re-
studies have demonstrated that users can perceive slope, curvature, di- sponse Theory (IRT) [55]. This is a model-based approach that does
mensionality, and continuity in line graphs (see [12]). In addition, not use response data directly, but transforms them into estimates of
Correll et al. [12] have shown that users can make judgements about latent traits (e. g., abilities), which then serve as the basis of assess-
aggregate properties of data using these graphs. ment. IRT models have been applied to tests in a variety of fields such
Scatterplots have also received some degree of attention. For exam- as health studies, education, psychology, marketing, economics, social
ple, studies have examined the ability of a user to determine Pearson sciences (see [44]), and even in graphicacy [53].
correlation r [5, 10, 30, 40, 42]. Several interesting results have been The core idea is to predict the probability of success of an examinee
obtained, such as general tendency to underestimate correlation, es- on an item. This depends on both the examinee’s ability and the item’s
pecially in the range .2 < |r| < .6, and an almost complete failure to difficulty. An important aspect of IRT is that it projects both of these
perceive correlation when |r| < .2. characteristics onto the same scale–that of the latent trait. Ability (or
Concerning the outright assessment of literacy, the only relevant standing on the latent trait) is derived from a pattern of responses to a
work we know of is Wainer’s study on the difference in graphicacy lev- series of test items. Item difficulty is defined by the 0.5 probability of
els between third-, fourth-, and fifth-grade children [53]. In this study, success of an examinee with the appropriate ability for the item. For
an 8 item test was developed for several visualizations, including line example, an examinee with an ability value of 0 (0 being the value
graphs and bar charts. The test was scored using Item Response The- attributed to an average achiever) will have a 50% chance of giving a
ory [55], which proved to be a very effective way of assessing abilities. correct answer to an item that has a difficulty value of 0.
The study revealed that children reach “adult levels of graphicacy” IRT offers models for dichotomous (e. g., true/false question re-
as soon as the fourth-grade, leaving “little room for further improve- sponses) and for polytomous data (e. g., responses on likert-like
ment.” However, it is unclear what these “adult levels” are. If we look scales). In this paper, we focus on models for dichotomous data. These
at textual literacy, some children are more literate than certain adults. define the probability of a correct response to an item i by the function:
People may also forget these skills if they do not regularly practice.
1 − ci
Therefore, while very useful, we consider Wainer’s work to be lim- pi (θ) = ci + (1)
ited. A way to assess adult levels of visualization literacy is necessary. 1 + e−ai (θ−bi )
where θ is an examinee’s standing on a latent trait (i. e., ability), and
ai , bi , and ci are the characteristics of a test item i. The central char-
2.3 Item Response Theory and the Rasch Model acteristic of this logistic function is b, the difficulty characteristic; if
Consider what we would like in an effective VL test. To begin with, θ = b, the examinee has a 0.5 probability of giving a correct answer
it should cover a certain range of abilities, each of which could be to the item. Meanwhile, a is the discrimination characteristic. An
measured by specific scores. Imagine such a test has 10 items, which item with a very high discrimination value basically sets a threshold
are marked 1 when answered correctly, and 0 otherwise. Rob sits the (at θ = b) for which examinees with θ < b have probability of suc-
test and gets a score of 2. Jenny also sits the test, and gets a score cess of 0, and examinees with θ > b have a probability of success of
of 7. We would hope that this means that Jenny is better than Rob at 1. Inversely, an item with a low discrimination value cannot clearly
reading graphs. If both Rob and Jenny were to take the test again, we separate examinees. Finally, c is the guessing characteristic. It sets
would also expect that both would get approximately the same scores, a lower bound for the extent to which an examinee will guess an an-
or at least that Jenny would still get a higher score than Rob. And we swer. We have found c to be unhelpful, so we have set it to zero for
would expect that whatever VL test Rob and Jenny both take, Jenny our development.
will always be better than Rob. Various IRT models have been proposed. One is the one-parameter
Now imagine that Chris sits the test and also gets a score of 2. If we logistic model (1PL), which sets a to a specific value for all items,
based our judgement on this raw score, we would assume that Chris is sets c to zero, and only considers the variation of b. Another is the
as bad as Rob at reading graphs. However, taking a closer look at the two-parameter logistic model (2PL), which considers the variations of
items that Chris and Rob got right, we realize that they are different: a and b, and sets c to zero. A third is the three parameter logistic
Rob gave correct answers for the two easiest items, while Chris gave model (3PL), which considers the variations of a, b, and c [7]. As
correct answers to two relatively complex items. This would of course such, 1PL and 2PL can be regarded as variants of 3PL, where different
require us to know the level of difficulty of each item, and would mean item characteristics are assigned specific values. A last variant of IRT
that while Chris gave incorrect answers to the easy items, he might models is the Rasch model (RM), which is a specific version of 1PL
show some ability to read graphs. Thus we would want the different where a = 1. For a complete set of references on the Rasch model,
scores to have “meanings,” i. e., it would be nice to be able to predict refer to http://rasch.org.
IRT offers the ability to evaluate the relevance of test items during 3.2 Tasks
a design phase (e. g., how difficult items are, how discriminatory they Most of the tests in our survey were part of numeracy assessments,
are, or how much room they leave for guessing), and for getting pre- and required some form of mathematical operation. However, our fo-
cise ability measures during an assessment phase. It is the backbone of cus was on tasks involving visual intelligence, i. e., that required only
the method we present in this paper. We should stress that this ability visual operations or mental projections on a graphical representation.
holds only if an IRT model fits the empirical data collected. Further- We identified 6 relevant tasks: Maximum (T1), Minimum (T2), Varia-
more, the accuracies of the examinee and item characteristics depend tion (T3), Intersection (T4), Average (T5), and Comparison (T6). All
on how closely an IRT model describes the interaction between ex- are standard benchmark tasks in InfoVis. T1 and T2 consist in finding
aminees’ abilities and their item responses, i. e., how well the model the maximum and minimum data points in the graph, respectively. T3
describes the latent trait. Thus different variants of IRT models should consists in detecting a trend, similarities, or discrepancies in the data.
be tested initially to find the best fit. Finally, it should be mentioned T4 consists in finding the point at which the graph intersected with a
that IRT models cannot be relied upon to “fix” problematic items in a given value. T5 consists in estimating an average value. Finally, T6
test. Proper test design is still required. consists in comparing different values or trends.
3 F OUNDATIONS 3.3 Congruency
In the approach we develop here, test items generally involve a 3 part Based on Friel et al. [18], we identified 3 types of questions: per-
structure: 1) a stimulus (e. g., a text, a figure, a table, or a graph), 2) ception questions, and high- and low-congruency questions. A per-
a task, and 3) a question. The stimuli are the particular graphical rep- ception question refers only to the visual aspect of the display (e. g.,
resentations used. Tasks are defined as the combination of the visual what colour are the dots?). The level of congruence meanwhile, is de-
operations and mental projections that an examinee should perform to fined by the “replaceability” of the data-related terms by perceptual
answer a given question. While tasks and questions are usually linked, terms. A highly congruent question translates into a perceptual query
we emphasize the distinction because early piloting revealed that dif- simply by replacing data terms by perceptual terms (e. g., what is the
ferent “orientations” of a question (e. g., emphasis on particular visual highest value/what is the highest bar?). A low-congruence question,
aspects, or data aspects) could affect performance. in contrast, has no such correspondence (e. g., is A connected to B–
To identify the different factors that may influence the difficulty in a matrix diagram/is the intersection between column A and row B
of a test item, we reviewed all the literacy tests that we could find, highlighted?)
which use graphs and charts as stimuli [23, 21, 34, 35, 14, 37, 52, 53].
It should be noted that the aim of this study is not to test the effect 4 A PPLICATION TO L INE G RAPHS
of these factors on item difficulty; we present them here merely as To illustrate our method, we created two line graph tests–Line
elements to be considered in the design phase. Graphs 1 (LG1) and Line Graphs 2 (LG2)–of slightly different de-
We identified 4 parameters for the stimulus: number of samples, signs, based on the principles described above. Both were adminis-
intrinsic complexity (or variability) of the data, layout, and level of tered on Amazon’s Mechanical Turk (MTurk).
distraction. We also found 6 recurring task types: extrema, trend, inter-
section, average, and comparison. Finally, we distinguished 3 different 4.1 Design Phase
question types: “perception” questions, “highly congruent” questions,
and “lowly congruent” questions. Each of these are described in the 4.1.1 Line Graphs 1: General Design
following subsections. For our first test (LG1), we created a set of 12 items using different
stimulus parameters and tasks. We hand tailored each item based on
3.1 Stimulus parameters an expected range of difficulty. Piloting had revealed that high varia-
tion in the different item dimensions failed to produce coherent tests,
In our survey, we first focused on identifying varying graphical param-
i. e., IRT models did not fit the response data. This implied that when
eters in the stimuli. We found four:
item design dimensions varied too much within a single test, addi-
Number of samples This refers to the number of graphically en- tional abilities beyond those involved in basic visualization literacy
coded elements in the stimulus. The value of this parameter can impact were most likely at play. Thus we kept the number of varying dimen-
tasks that require a degree of visual chunking [28]. sions low: only stimulus distraction and tasks varied. The test used
four samples for the stimuli, and a single layout with single scales. A
Complexity We define this as the local and global variability of summary is given in Table 1.
the data. For example, a dataset of the yearly life expectancy in differ- Each item was repeated five times2 . The test was blocked by items,
ent countries over a 50 year time period shows low local variation (no and all items and their repetitions were randomized to prevent carry-
dramatic “bounces” between two consecutive years), and low global over effects. We added a block of questions using a standard table
variation (a relatively stable, linear, increasing trend). In contrast, at the beginning of each group of repetitions to give examinees the
a dataset of the daily temperature in different countries over a year opportunity to consolidate their understanding of each new question,
shows high local variation (temperatures can vary dramatically from and to separate out the comprehension stage of the question-response
on day to the other) and medium global variation (temperature gener- process believed to occur in cognitive testing [11]. Thus the test was
ally rises an decreases only once during the year). composed of 72 trials.
Layout This corresponds to the structure of a graphical frame- Scenario The PISA 2012 Mathematics Framework [37] empha-
work and its scales. Layouts can be single (e. g., a 2-dimensional sizes the importance of having an understandable context for problem
Euclidian space), superimposed (e. g., three axes for a 2-dimensional solving. The current test focuses on one’s community, with problems
encoding), or multiple (e. g., several frameworks for a same visualiza- set in a community perspective.
tion). Multiple layouts include cutout charts and broken charts [24]. To avoid the potential bias of a priori domain knowledge, the test
Scales can be single (linear or logarithmic), bifocal, or lense-like. was set within a science-fiction scenario. The following task sce-
nario [6] was given: The year is 2813. The Earth is a desolate place.
Distraction Certain tasks require examinees to focus on a single
Most of mankind has migrated throughout the universe. The last hand-
sample or dimension (in the case of multi-dimensional visualizations)
ful of humans remaining on earth are now actively seeking another
while several other samples or dimensions are still present in the stim-
ulus; these latter structures can be considered as distractors. Correll et 2 Early piloting had revealed that examinees would stabilize their search
al. [12] have shown that fine variations in certain attributes of distrac- time and confidence after a few repetitions. In addition, repeated trials usu-
tors can impact perception. However, here we simply use distractors ally provide more robust measures as medians can be extracted (or means in
in a Boolean way, i. e., presence or not. the case of Boolean values).
LG1 LG2 BC SP
Item ID Task Distraction Item ID Task Congruency Item ID Task Samples Item ID Task Distraction
LG1.1 max 0 LG2.1 max high BC.1 max 10 SP.1 max 0
LG1.2 min 0 LG2.2 min high BC.2 min 10 SP.2 min 0
LG1.3 variation 0 LG2.3 variation high BC.3 variation 10 SP.3 variation 0
LG1.4 intersection 0 LG2.4 intersection high BC.4 intersection 10 SP.4 intersection 0
LG1.5 average 0 LG2.5 average high BC.5 average 10 SP.5 average 0
LG1.6 comparison 0 LG2.6 comparison high BC.6 comparison 10 SP.6 comparison 0
LG1.7 max 1 LG2.7 max low BC.7 max 20 SP.7 max 1
LG1.8 min 1 LG2.8 min low BC.8 min 20 SP.8 min 1
LG1.9 variation 1 LG2.9 variation low BC.9 variation 20 SP.9 variation 1
LG1.10 intersection 1 LG2.10 intersection low BC.10 intersection 20 SP.10 intersection 1
LG1.11 average 1 LG2.11 average low BC.11 average 20 SP.11 average 1
LG1.12 comparison 1 LG2.12 comparison low BC.12 comparison 20 SP.12 comparison 1
Table 1: Designs of line graph test 1 (LG1), line graph test 2 (LG2), bar chart test (BC), and scatterplot test (SP). Only varying dimensions are
shown. Each item is repeated 6 times, beginning with a table condition (repetitions are not shown). Pink cells in the Item ID column indicate
duplicate items in tests LG1 and LG2. Tasks with the same color coding are the same. Gray cells in the Distraction, Question Congruency,
and Samples columns indicate difference with white cells. The Distraction column uses a binary encoding: 0 = no distractors, 1 = presence of
distractors.
planet to settle on. Please help these people determine what the most Coding 6 Turkers spent less than 1.5 s on average reading and
hospitable planet is by answering the following series of questions as answering item questions; they were considered as random clickers,
quickly and accurately as possible. and their results removed from further analysis. All Turkers for whom
results were retained were native English speakers.
Data We used a dataset with a low-local and medium-global level The remaining data were sorted according to item and repetition ID
of variability: the monthly evolution of unemployment in different (assigned before randomization). Responses for the first repetitions
countries between the years 2000 and 2008. Country names were (table conditions) were removed. A score dataset (LG1s) was then
changed to science-fiction planet names listed in Wikipedia, and years created in accord with the requirements of IRT modeling: correct an-
were modified to fit the task scenario. swers were scored 1 and incorrect answers 0. Scores for each series
Priming and Pacing Before each new item, examinees were of item repetitions (Boolean values) were compressed by taking the
primed with the upcoming graph type, so that the concepts and op- mean values and rounding these, resulting in 12 sets of dichotomous
erations necessary for information extraction could be set up [41]. To item scores for each examinee.
separate the time required to read questions and answer them, and to
make sure examinees fully comprehended the question/task at hand, 4.1.3 Results
a specific pacing was given to each block of repetitions. First, the The purpose of the design phase is to remove items that are unhelpful
question was displayed (full page). At the bottom of this was a label for distinguishing between low and high levels of VL. To do so, we
“Proceed to graph framework”; this led participants to the table head- need to: 1) check that the simplest variant of IRT models (i. e., the
ers or graphical framework with appropriate titles and labels. Below Rasch model) fits the data, 2) find the best variant of the model to get
was displayed a button labeled “Display data”. When this button was the most accurate characteristic values for each item, and 3) assess the
activated, the graph or table was shown and the answer could be given. usefulness of each item with regard to the testing of low ability levels.
As mentioned, every block of repetitions began with a condition in Checking the Rasch model The Rasch model (RM) was fitted
which the data were shown in table form. This was preferred so as to to the score dataset. A 200 sample parametric Bootstrap goodness-of-
remove potential effects caused by the setup of high-level operations fit test using Pearson’s χ2 statistic revealed a non-significant p-value
for solving a particular kind of problem. for LG1s (p > 0.54), suggesting an acceptable fit3 . The Test Informa-
To make sure ability was being tested (and not capacity), we set tion Curve (TIC) is shown in Fig. 1a. It reveals a near-normal distri-
an 11s timer for each repetition. This was based on the mean time bution of test information across different ability levels, with a slight
required to answer the items of our pilot studies. bump around −2, and a peak around −1. This means that the test pro-
Response format To respond, examinees clicked on one of sev- vides more information about examinees with relatively low abilities
eral possible answers, displayed in the form of buttons at the bottom (0 being the mean ability level, i. e., that of an average achiever) than
of the interface. In some cases, correct answers were not directly dis- about examinees with high abilities.
played. For example, certain values required for answering were not Finding the right model variant Different IRT models, imple-
explicitly shown with labeled ticks on the graph’s axis. This was done mented in the ltm R package [43], were fitted to LG1s. A series of
to test examinees’ ability to make confident estimations (i. e., to han- pairwise likelihood ratio tests showed that the two-parameter logistic
dle uncertainty [37]). And although the graphs used color coding to model (2PL) was most suitable.
represent the different planets, the buttons did not. This ensured that
examinees translated the answer retrieved in the visual domain back Assessing the quality of test items Using 2PL, the TIC was
into the data domain. plotted again (Fig. 1b). The big spike suggests that several items
with difficulty characteristics just above −2 have high discrimination
4.1.2 Setup values (a-values). This is confirmed by the very steep Item Charac-
teristic Curves (ICCs) (Fig. 3a) of items LG1.1, LG1.4, and LG1.9
Participants To our knowledge, no particular number of samples (a > 51). This can also explain the distortion in the normal distribu-
is recommended for IRT modeling. We recruited 40 participants on tion in Fig. 1a.
MTurk. Turkers were required to have a 98% acceptance rate and a to- The probability estimates show that examinees with average abili-
tal of 1000 or more HITS approved. While the validity of calibrating ties have a 100% probability of giving a correct answer to the easiest
tests using MTurk may be debated due to lack of control over certain items (LG1.1, LG1.4, and LG1.9), and a 41% probability of giving a
experimental conditions [22], we wanted to perform our calibration
with results from a wide variety of people. 3 For more information about this statistic, refer to [43].
correct answer to the hardest item (LG1.11). However, LG1.11 has a
relatively low discrimination value (a < 0.7), and so may be consid-
ered not very useful for distinguishing between ability levels.
4.1.4 Discussion
IRT modeling appears to be a solid approach for our design phase.
Our results (Fig. 1) show that LG1 is useful for differentiating between
examinees with relatively low abilities, but not so much for ones with
high abilities.
The slight bump in the distribution of the TIC (Fig. 1a) suggests that
there are several test items that are quite effective for distinguishing
(a) TIC of LG1s under RM (b) TIC of LG1s under 2PL between ability levels lower than −2. This is confirmed by the spike
in Fig. 1b, indicating the presence of highly discriminating items.
Fig. 1: Test Information Curves (TICs) of the score dataset of the first Fig. 3a reveals that several items in the test have identical difficulty
line graph test under the original constrained Rasch model (RM) (a) and discrimination characteristics. Some of these could therefore be
and the two-parameter logistic model (2PL) (b). The ability scale considered for removal, as they only provide redundant information.
shows the θ-values. The slight bump in the normal distribution of Similarly, item LG1.11, which has a low discrimination characteristic,
(a) can be explained by the presence of several highly discriminating could be dropped as it is less efficient than others for differentiating
items, as shown by the big spike in (b). between ability levels.