You are on page 1of 11

A Principled Way of Assessing Visualization Literacy

Jeremy Boy, Ronald A. Rensink, Enrico Bertini, Jean-Daniel Fekete

To cite this version:


Jeremy Boy, Ronald A. Rensink, Enrico Bertini, Jean-Daniel Fekete. A Principled Way of Assessing
Visualization Literacy. IEEE Transactions on Visualization and Computer Graphics, 2014, 20 (12),
pp.10. �10.1109/TVCG.2014.2346984�. �hal-01027582�

HAL Id: hal-01027582


https://inria.hal.science/hal-01027582
Submitted on 26 Mar 2015

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est


archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents
entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non,
lished or not. The documents may come from émanant des établissements d’enseignement et de
teaching and research institutions in France or recherche français ou étrangers, des laboratoires
abroad, or from public or private research centers. publics ou privés.
A Principled Way of Assessing Visualization Literacy
Jeremy Boy, Ronald A. Rensink, Enrico Bertini, and Jean-Daniel Fekete Senior Member, IEEE

Abstract— We describe a method for assessing the visualization literacy (VL) of a user. Assessing how well people understand
visualizations has great value for research (e. g., to avoid confounds), for design (e. g., to best determine the capabilities of an
audience), for teaching (e. g., to assess the level of new students), and for recruiting (e. g., to assess the level of interviewees). This
paper proposes a method for assessing VL based on Item Response Theory. It describes the design and evaluation of two VL tests
for line graphs, and presents the extension of the method to bar charts and scatterplots. Finally, it discusses the reimplementation of
these tests for fast, effective, and scalable web-based use.
Index Terms—Literacy, Visualization literacy, Rasch Model, Item Response Theory

1 I NTRODUCTION
In April 2012, Jason Oberholtzer posted an article describing two fast and effective assessments of visualization literacy for well
charts that portray Portuguese historical, political, and economic established visualization techniques and tasks; and
data [36]. While acknowledging that he is not an expert on these top- • an implementation of four online tests, based on our method.
ics, Oberholtzer claims that thanks to the charts, he feels like he has “a
well-founded opinion on the country.” He attributes this to the simplic- Our immediate motivation for this work is to design a series of tests
ity and efficacy of the charts. He then concludes by stating: “Here’s that can help Information Visualization (InfoVis) researchers detect
the beauty of charts. We all get it, right?” low-ability participants when conducting online studies, in order to
But do we all really get it? Although the number of people famil- avoid possible confounds in their data. This requires the tests to be
iar with visualization continues to grow, it is still difficult to estimate short, reliable, and easy to administer. However, such tests can also be
anyone’s ability to read graphs and charts. When designing a visual- applied to many other situations, such as:
ization for non-specialists or when conducting an evaluation of a new
• designers who want to know how capable of reading visualiza-
visualization system, it is important to be able to pull apart the po-
tions their targeted audience is;
tential efficiency of the visualization and the actual ability of users to
• teachers who want to make a first or final assessment of the ac-
understand it.
quired knowledge of freshmen;
In this paper, we focus on building a set of visualization liter-
• practitioners who need to hire capable analysts; and
acy (VL) tests for line graphs, bar charts, and scatterplots. At this
• education policy-makers who may want to set a standard for vi-
point, we loosely define visualization literacy as the ability to use well-
sualization literacy.
established data visualizations (e. g., line graphs) to handle informa-
tion in an effective, efficient, and confident manner. This paper is organized in the following way. It begins with a back-
To generate these tests, we develop here a method based on Item ground section that defines the concept of literacy and discusses some
Response Theory (IRT). Traditionally IRT has been used to assess ex- of its best-known forms. Also introduced are the theoretical constructs
aminees’ abilities and other psychological constructs via predefined of information comprehension and graph comprehension, along with
tests and surveys in areas such as education [27], social sciences [15], the concepts behind Item Response Theory. Section 3 presents the ba-
and medicine [32]. Here, we first use IRT in a design phase to evaluate sic elements of our approach. Section 4 shows how these can be used
the relevance of potential test items. We then use it in an assessment to create and administer two VL tests using line graphs. In Section
phase for measuring a user’s level of visualization literacy. Finally, 5, our method is extended to bar charts and scatterplots. Section 6
based on these measures, we develop a series of tests for fast, effec- describes how our method can be used to redesign fast, effective, and
tive, and scalable web-based use. The great benefit of this method scalable web-based tests. Finally, Section 7 provides a set of “take-
is that inherits IRT’s property of making ability assessments that are away” guidelines for the development of future tests.
based not only on raw scores, but on a model that captures the stand-
ing of users on latent traits (e. g., the ability to use various graphical 2 BACKGROUND
representations). Very few studies investigate the ability of a user to extract information
As such, our main contributions are as follows: from a graphical representation such as a line graph or bar chart. And
of those that do, most make only higher-level assessments: they use
• a definition of visualization literacy,
such representations as a way to test mathematical skills, or the ability
• a method for: 1) assessing the relevance of VL test items, 2)
to handle uncertainty [37, 14, 52, 34, 35]. A few attempts do focus
assessing an examinee’s level of visualization literacy, 3) making
more on the interpretation of graphically-represented quantities [23,
21], but they base their assessments only on raw scores and limited
test items. This makes it difficult to create a true measure of VL.
• Jeremy Boy is with Inria, Telecom ParisTech, and EnsadLab. E-mail:
myjyby@gmail.com. 2.1 Literacy
• Ronald A. Rensink is with the University of British Columbia. E-mail:
rensink@psych.ubc.ca. 2.1.1 Definition
• Enrico Bertini is with NYU Polytechnic School of Engineering. E-mail: The online Oxford dictionary defines literacy as “the ability to read
ebertini@poly.edu. and write”. While historically this term has been closely tied to its
• Jean-Daniel Fekete is with Inria. E-mail: jean-daniel.fekete@inria.fr. textual dimension, it has grown to become a broader concept. Taylor
Manuscript received 31 March 2013; accepted 1 August 2013; posted online proposes the following much broader definition: “Literacy is a gate-
13 October 2013; mailed on 4 October 2013. way skill that opens to the potential for new learning and understand-
For information on obtaining reprints of this article, please send ing” [47].
e-mail to: tvcg@computer.org. Given this broader understanding, other forms of literacy can be dis-
tinguished. For example, numeracy was coined to describe the skills
needed for reasoning and applying simple numerical concepts. It was for what we will refer to as visualization literacy (VL): the ability to
intended to “represent the mirror image of [textual] literacy” [46, p. confidently use a given data visualization to translate questions spec-
269]. Like [textual] literacy, numeracy is a gateway skill. ified in the data domain into visual queries in the visual domain, as
With the advent of the Information Age, several new forms of liter- well as interpreting visual patterns in the visual domain as properties
acy have emerged. Computer literacy “refers to basic keyboard skills, in the data domain.
plus a working knowledge of how computer systems operate and of the This definition is related to several others that have been proposed
general ways in which computers can be used” [38]. Information liter- concerning visual messages. For example, a long-standing and often
acy is defined as the ability to “recognize when information is needed neglected form of literacy is visual literacy. This has been defined as
and have the ability to locate, evaluate, and use effectively the needed the “ability to understand, interpret and evaluate visual messages” [8].
information” [25]. Media literacy commonly relates to the “ability to Visual literacy is rooted in semiotics, i. e., the study of signs and sign
access, analyze, evaluate and create media in a variety of forms” [54]. processes, which distinguishes it from visualization literacy. While it
has probably been the most important form of literacy to date, it is
nowadays frowned upon; indeed, present-day literacy tests do not take
2.1.2 Higher-level Comprehension visual literacy into account.
In order to develop a meaningful measure of any form of literacy, it is Taylor [47] has advocated for the study of visual information lit-
necessary to understand the various components involved, starting at eracy, while Wainer has advocated for graphicacy [53]. Depending
the higher levels. Friel et al. [19] suggest that comprehension of in- on the context, these terms refer to the ability to read charts and di-
formation in written form involves three kinds of tasks: locating, inte- agrams, or to qualify the merging of visual and information literacy
grating, and generating information. Locating tasks require the reader teaching [2]. Because of this ambiguity, we prefer the more general
to find a piece of information based on given cues. Integrating tasks term “visualization literacy.”
require the reader to aggregate several pieces of information. Generat-
ing tasks not only require the reader to process given information but 2.2.2 Higher-level Comprehension
also require the reader to make document-based inferences or to draw Bertin [4] proposed three levels on which a graph may be interpreted:
on personal knowledge. elementary, intermediate, and comprehensive. The elementary level
Another important aspect of information comprehension is question concerns the simple extraction of information from the data. The inter-
asking, or question posing. Graesser et al. [20] posit that question mediate level concerns the detection of trends and relationships. The
posing is a major factor in text comprehension. Indeed, the ability to comprehensive level concerns the comparison of whole structures, and
pose low-level questions, i. e., to identify a series of low-level tasks, inferences based on both data and background knowledge. Similarly,
is essential for information retrieval and for achieving higher-level, or Curcio [13] distinguishes three ways of reading from a graph: from
deeper, goals. the data, between the data, and beyond the data. 1
2.1.3 Assessment The higher-level cognitive processes behind the reading of graphs
has been the concern of the area of graph comprehension. This
Several literacy tests are in common use. The two most impor- area studies the specific expectations viewers have for different graph
tant are UNESCO’s Literacy Assessment and Monitoring Programme types [50], and has highlighted many differences in the understanding
(LAMP) [51] and the Organisation for Economic Co-operation and of novices and expert viewers [16, 28, 29, 48].
Development (OECD)’s Programme for International Student Assess- Several influential models of graph comprehension have been pro-
ment (PISA) [37]. Other assessments include the Adult Literacy posed. For example, Pinker [39] describes a three-way interaction be-
and Lifeskills Survey (ALL) [33], the International Adult Literacy tween the visual features of a display, processes of perceptual organi-
Survey (IALS) [26], and the Miller Word Identification Assessment zation, and what he calls the graph schema, which directs the search
(MWIA) [31]. for information in the particular graph. Several other models are sim-
Assessments are also made using more local scales like the US Na- ilar (see Trickett and Trafton [49]). All involve the following steps:
tional Assessment of Adult Literacy (NAAL) [3], the UK’s Depart-
ment for Education Numeracy Skills Tests [14], or the University of
Kent’s Numerical Reasoning Test [52]. 1. the user has a pre-specified goal to extract a specific piece of
Most of these tests, however, take basic literacy skills for granted, information
and focus on higher-level assessments. For example the PISA test 2. the user looks at the graph and the graph schema and gestalt pro-
is designed for 15 year-olds who are finishing compulsory education. cesses are activated
This implies that examinees should have learned–and still remember– 3. the salient features of the graph are encoded, based on these
the basic skills required for reading and counting. As such, testing gestalt principles
basic literacy skills is irrelevant. It is only when examinees clearly fail 4. the user now knows which cognitive/interpretative strategies to
these tests that certain measures are deployed to test the lower-level use, because the graph is familiar
skills. 5. the user extracts the necessary goal-directed visual chunks
NAAL provides a set of 2 complementary tests administered to ex- 6. the user may compare 2 or more visual chunks
aminees who fail the main Textual literacy test [3]: the Fluency Ad- 7. the user extracts the relevant information to satisfy the goal
dition to NAAL (FAN) and The Adult Literacy Supplemental Assess-
ment (ALSA). These tests focus on adults’ ability to read single words
and small passages. Visual “chunking” consists in segmenting a visual display into smaller
Meanwhile, MWIA tests whole-word dyslexia. It has 2 levels, each parts, or chunks [28]. Each chunk represents a set of entities that have
of which contains 2 lists of words, one Holistic and one Phonetic, that been grouped according to gestalt principles. Chunks can in turn be
examinees are asked to read aloud. Evaluation is based on time spent subdivided into smaller chunks.
reading and number of words missed. Proficient readers should find Shah [45] identifies two cognitive processes that occur during
such tests extremely easy, while low ability readers should find them stages 2 through 6 of this model:
more challenging.
1. a top-down process where the viewer’s prior knowledge of se-
2.2 Visualization Literacy mantic content influences data interpretation, and
2.2.1 Definition 2. a bottom-up process where the viewer shifts from perceptual pro-
cesses to interpretation.
The view of literacy as a gateway skill can also be applied to the ex-
traction and manipulation of information from graphical representa- 1 For further reference, refer to Friel et al.’s Taxonomy of Skills Required for

tions such as line graphs or bar charts. In particular, it can be the basis Answering Questions at Each Level [19].
These processes are then interactively applied to different chunks, whether Chris was simply lucky (he guessed the answers), or whether
suggesting that the interpretation process is serial and incremental. he is in fact able to get the simpler items right, even though he didn’t
However Carpenter & Shah [9] have shown that graph comprehen- this time.
sion, and more specifically visual feature encoding, is rather an itera- Imagine now that Rob, Jenny, and Chris take a second VL test. Rob
tive process than a straight-forward serial process. gets a score of 3, Chris gets 4, and Jenny gets 10. We would infer that
Freedman & Shah [17] relate the top-down and bottom-up pro- this test is easier, since the grades are higher. However, if we look at
cesses respectively to a construction and an integration phase. Dur- the intervals between examinees’ scores, we see that Jenny is 7 points
ing the construction phase, the viewer activates prior graphical knowl- ahead of Rob in this test, whereas she was only 5 points ahead in the
edge, i. e., the graph schema, and domain knowledge to construct a co- first one. If we were to truly measure abilities, we would want these
herent conceptual representation of the available information. During intervals to be invariant. In addition, seeing that Chris’ score is once
the integration phase, disparate knowledge is activated by “reading” again similar to Rob’s (knowing that they both got the same items right
the graph and is combined to form a coherent representation. These this time), would lead us to think that they do in fact have similar levels
two phases take place in alternating cycles. This suggests that do- of ability. However, this test actually provides more information on
main knowledge can influence the interpretation of graphs. However, lower abilities, since it was able to differentiate Chris and Rob, while
highly visualization-literate people should suffer less influence of both the first test could not.
the top-down and bottom-up processes [45]. Finally, imagine that all three examinees take a third VL test. This
time, they all get scores of 10. We may conclude that this test is not
2.2.3 Assessment good for detecting the ability to read graphs. However, another possi-
Relatively little has been done on the assessment of literacy involving bility would be that all items in this test share the same (low) level of
graphical representations. However, interesting work has been done difficulty. This would mean that this test does in fact assess ability to
on measuring the perceptual abilities of a user to extract information read graphs, but its items are redundant and not very discriminant.
from several types of graphical representations. For example, various One way of fulfilling all of these requirements is by using Item Re-
studies have demonstrated that users can perceive slope, curvature, di- sponse Theory (IRT) [55]. This is a model-based approach that does
mensionality, and continuity in line graphs (see [12]). In addition, not use response data directly, but transforms them into estimates of
Correll et al. [12] have shown that users can make judgements about latent traits (e. g., abilities), which then serve as the basis of assess-
aggregate properties of data using these graphs. ment. IRT models have been applied to tests in a variety of fields such
Scatterplots have also received some degree of attention. For exam- as health studies, education, psychology, marketing, economics, social
ple, studies have examined the ability of a user to determine Pearson sciences (see [44]), and even in graphicacy [53].
correlation r [5, 10, 30, 40, 42]. Several interesting results have been The core idea is to predict the probability of success of an examinee
obtained, such as general tendency to underestimate correlation, es- on an item. This depends on both the examinee’s ability and the item’s
pecially in the range .2 < |r| < .6, and an almost complete failure to difficulty. An important aspect of IRT is that it projects both of these
perceive correlation when |r| < .2. characteristics onto the same scale–that of the latent trait. Ability (or
Concerning the outright assessment of literacy, the only relevant standing on the latent trait) is derived from a pattern of responses to a
work we know of is Wainer’s study on the difference in graphicacy lev- series of test items. Item difficulty is defined by the 0.5 probability of
els between third-, fourth-, and fifth-grade children [53]. In this study, success of an examinee with the appropriate ability for the item. For
an 8 item test was developed for several visualizations, including line example, an examinee with an ability value of 0 (0 being the value
graphs and bar charts. The test was scored using Item Response The- attributed to an average achiever) will have a 50% chance of giving a
ory [55], which proved to be a very effective way of assessing abilities. correct answer to an item that has a difficulty value of 0.
The study revealed that children reach “adult levels of graphicacy” IRT offers models for dichotomous (e. g., true/false question re-
as soon as the fourth-grade, leaving “little room for further improve- sponses) and for polytomous data (e. g., responses on likert-like
ment.” However, it is unclear what these “adult levels” are. If we look scales). In this paper, we focus on models for dichotomous data. These
at textual literacy, some children are more literate than certain adults. define the probability of a correct response to an item i by the function:
People may also forget these skills if they do not regularly practice.
1 − ci
Therefore, while very useful, we consider Wainer’s work to be lim- pi (θ) = ci + (1)
ited. A way to assess adult levels of visualization literacy is necessary. 1 + e−ai (θ−bi )
where θ is an examinee’s standing on a latent trait (i. e., ability), and
ai , bi , and ci are the characteristics of a test item i. The central char-
2.3 Item Response Theory and the Rasch Model acteristic of this logistic function is b, the difficulty characteristic; if
Consider what we would like in an effective VL test. To begin with, θ = b, the examinee has a 0.5 probability of giving a correct answer
it should cover a certain range of abilities, each of which could be to the item. Meanwhile, a is the discrimination characteristic. An
measured by specific scores. Imagine such a test has 10 items, which item with a very high discrimination value basically sets a threshold
are marked 1 when answered correctly, and 0 otherwise. Rob sits the (at θ = b) for which examinees with θ < b have probability of suc-
test and gets a score of 2. Jenny also sits the test, and gets a score cess of 0, and examinees with θ > b have a probability of success of
of 7. We would hope that this means that Jenny is better than Rob at 1. Inversely, an item with a low discrimination value cannot clearly
reading graphs. If both Rob and Jenny were to take the test again, we separate examinees. Finally, c is the guessing characteristic. It sets
would also expect that both would get approximately the same scores, a lower bound for the extent to which an examinee will guess an an-
or at least that Jenny would still get a higher score than Rob. And we swer. We have found c to be unhelpful, so we have set it to zero for
would expect that whatever VL test Rob and Jenny both take, Jenny our development.
will always be better than Rob. Various IRT models have been proposed. One is the one-parameter
Now imagine that Chris sits the test and also gets a score of 2. If we logistic model (1PL), which sets a to a specific value for all items,
based our judgement on this raw score, we would assume that Chris is sets c to zero, and only considers the variation of b. Another is the
as bad as Rob at reading graphs. However, taking a closer look at the two-parameter logistic model (2PL), which considers the variations of
items that Chris and Rob got right, we realize that they are different: a and b, and sets c to zero. A third is the three parameter logistic
Rob gave correct answers for the two easiest items, while Chris gave model (3PL), which considers the variations of a, b, and c [7]. As
correct answers to two relatively complex items. This would of course such, 1PL and 2PL can be regarded as variants of 3PL, where different
require us to know the level of difficulty of each item, and would mean item characteristics are assigned specific values. A last variant of IRT
that while Chris gave incorrect answers to the easy items, he might models is the Rasch model (RM), which is a specific version of 1PL
show some ability to read graphs. Thus we would want the different where a = 1. For a complete set of references on the Rasch model,
scores to have “meanings,” i. e., it would be nice to be able to predict refer to http://rasch.org.
IRT offers the ability to evaluate the relevance of test items during 3.2 Tasks
a design phase (e. g., how difficult items are, how discriminatory they Most of the tests in our survey were part of numeracy assessments,
are, or how much room they leave for guessing), and for getting pre- and required some form of mathematical operation. However, our fo-
cise ability measures during an assessment phase. It is the backbone of cus was on tasks involving visual intelligence, i. e., that required only
the method we present in this paper. We should stress that this ability visual operations or mental projections on a graphical representation.
holds only if an IRT model fits the empirical data collected. Further- We identified 6 relevant tasks: Maximum (T1), Minimum (T2), Varia-
more, the accuracies of the examinee and item characteristics depend tion (T3), Intersection (T4), Average (T5), and Comparison (T6). All
on how closely an IRT model describes the interaction between ex- are standard benchmark tasks in InfoVis. T1 and T2 consist in finding
aminees’ abilities and their item responses, i. e., how well the model the maximum and minimum data points in the graph, respectively. T3
describes the latent trait. Thus different variants of IRT models should consists in detecting a trend, similarities, or discrepancies in the data.
be tested initially to find the best fit. Finally, it should be mentioned T4 consists in finding the point at which the graph intersected with a
that IRT models cannot be relied upon to “fix” problematic items in a given value. T5 consists in estimating an average value. Finally, T6
test. Proper test design is still required. consists in comparing different values or trends.
3 F OUNDATIONS 3.3 Congruency
In the approach we develop here, test items generally involve a 3 part Based on Friel et al. [18], we identified 3 types of questions: per-
structure: 1) a stimulus (e. g., a text, a figure, a table, or a graph), 2) ception questions, and high- and low-congruency questions. A per-
a task, and 3) a question. The stimuli are the particular graphical rep- ception question refers only to the visual aspect of the display (e. g.,
resentations used. Tasks are defined as the combination of the visual what colour are the dots?). The level of congruence meanwhile, is de-
operations and mental projections that an examinee should perform to fined by the “replaceability” of the data-related terms by perceptual
answer a given question. While tasks and questions are usually linked, terms. A highly congruent question translates into a perceptual query
we emphasize the distinction because early piloting revealed that dif- simply by replacing data terms by perceptual terms (e. g., what is the
ferent “orientations” of a question (e. g., emphasis on particular visual highest value/what is the highest bar?). A low-congruence question,
aspects, or data aspects) could affect performance. in contrast, has no such correspondence (e. g., is A connected to B–
To identify the different factors that may influence the difficulty in a matrix diagram/is the intersection between column A and row B
of a test item, we reviewed all the literacy tests that we could find, highlighted?)
which use graphs and charts as stimuli [23, 21, 34, 35, 14, 37, 52, 53].
It should be noted that the aim of this study is not to test the effect 4 A PPLICATION TO L INE G RAPHS
of these factors on item difficulty; we present them here merely as To illustrate our method, we created two line graph tests–Line
elements to be considered in the design phase. Graphs 1 (LG1) and Line Graphs 2 (LG2)–of slightly different de-
We identified 4 parameters for the stimulus: number of samples, signs, based on the principles described above. Both were adminis-
intrinsic complexity (or variability) of the data, layout, and level of tered on Amazon’s Mechanical Turk (MTurk).
distraction. We also found 6 recurring task types: extrema, trend, inter-
section, average, and comparison. Finally, we distinguished 3 different 4.1 Design Phase
question types: “perception” questions, “highly congruent” questions,
and “lowly congruent” questions. Each of these are described in the 4.1.1 Line Graphs 1: General Design
following subsections. For our first test (LG1), we created a set of 12 items using different
stimulus parameters and tasks. We hand tailored each item based on
3.1 Stimulus parameters an expected range of difficulty. Piloting had revealed that high varia-
tion in the different item dimensions failed to produce coherent tests,
In our survey, we first focused on identifying varying graphical param-
i. e., IRT models did not fit the response data. This implied that when
eters in the stimuli. We found four:
item design dimensions varied too much within a single test, addi-
Number of samples This refers to the number of graphically en- tional abilities beyond those involved in basic visualization literacy
coded elements in the stimulus. The value of this parameter can impact were most likely at play. Thus we kept the number of varying dimen-
tasks that require a degree of visual chunking [28]. sions low: only stimulus distraction and tasks varied. The test used
four samples for the stimuli, and a single layout with single scales. A
Complexity We define this as the local and global variability of summary is given in Table 1.
the data. For example, a dataset of the yearly life expectancy in differ- Each item was repeated five times2 . The test was blocked by items,
ent countries over a 50 year time period shows low local variation (no and all items and their repetitions were randomized to prevent carry-
dramatic “bounces” between two consecutive years), and low global over effects. We added a block of questions using a standard table
variation (a relatively stable, linear, increasing trend). In contrast, at the beginning of each group of repetitions to give examinees the
a dataset of the daily temperature in different countries over a year opportunity to consolidate their understanding of each new question,
shows high local variation (temperatures can vary dramatically from and to separate out the comprehension stage of the question-response
on day to the other) and medium global variation (temperature gener- process believed to occur in cognitive testing [11]. Thus the test was
ally rises an decreases only once during the year). composed of 72 trials.
Layout This corresponds to the structure of a graphical frame- Scenario The PISA 2012 Mathematics Framework [37] empha-
work and its scales. Layouts can be single (e. g., a 2-dimensional sizes the importance of having an understandable context for problem
Euclidian space), superimposed (e. g., three axes for a 2-dimensional solving. The current test focuses on one’s community, with problems
encoding), or multiple (e. g., several frameworks for a same visualiza- set in a community perspective.
tion). Multiple layouts include cutout charts and broken charts [24]. To avoid the potential bias of a priori domain knowledge, the test
Scales can be single (linear or logarithmic), bifocal, or lense-like. was set within a science-fiction scenario. The following task sce-
nario [6] was given: The year is 2813. The Earth is a desolate place.
Distraction Certain tasks require examinees to focus on a single
Most of mankind has migrated throughout the universe. The last hand-
sample or dimension (in the case of multi-dimensional visualizations)
ful of humans remaining on earth are now actively seeking another
while several other samples or dimensions are still present in the stim-
ulus; these latter structures can be considered as distractors. Correll et 2 Early piloting had revealed that examinees would stabilize their search
al. [12] have shown that fine variations in certain attributes of distrac- time and confidence after a few repetitions. In addition, repeated trials usu-
tors can impact perception. However, here we simply use distractors ally provide more robust measures as medians can be extracted (or means in
in a Boolean way, i. e., presence or not. the case of Boolean values).
LG1 LG2 BC SP
Item ID Task Distraction Item ID Task Congruency Item ID Task Samples Item ID Task Distraction
LG1.1 max 0 LG2.1 max high BC.1 max 10 SP.1 max 0
LG1.2 min 0 LG2.2 min high BC.2 min 10 SP.2 min 0
LG1.3 variation 0 LG2.3 variation high BC.3 variation 10 SP.3 variation 0
LG1.4 intersection 0 LG2.4 intersection high BC.4 intersection 10 SP.4 intersection 0
LG1.5 average 0 LG2.5 average high BC.5 average 10 SP.5 average 0
LG1.6 comparison 0 LG2.6 comparison high BC.6 comparison 10 SP.6 comparison 0
LG1.7 max 1 LG2.7 max low BC.7 max 20 SP.7 max 1
LG1.8 min 1 LG2.8 min low BC.8 min 20 SP.8 min 1
LG1.9 variation 1 LG2.9 variation low BC.9 variation 20 SP.9 variation 1
LG1.10 intersection 1 LG2.10 intersection low BC.10 intersection 20 SP.10 intersection 1
LG1.11 average 1 LG2.11 average low BC.11 average 20 SP.11 average 1
LG1.12 comparison 1 LG2.12 comparison low BC.12 comparison 20 SP.12 comparison 1

Table 1: Designs of line graph test 1 (LG1), line graph test 2 (LG2), bar chart test (BC), and scatterplot test (SP). Only varying dimensions are
shown. Each item is repeated 6 times, beginning with a table condition (repetitions are not shown). Pink cells in the Item ID column indicate
duplicate items in tests LG1 and LG2. Tasks with the same color coding are the same. Gray cells in the Distraction, Question Congruency,
and Samples columns indicate difference with white cells. The Distraction column uses a binary encoding: 0 = no distractors, 1 = presence of
distractors.

planet to settle on. Please help these people determine what the most Coding 6 Turkers spent less than 1.5 s on average reading and
hospitable planet is by answering the following series of questions as answering item questions; they were considered as random clickers,
quickly and accurately as possible. and their results removed from further analysis. All Turkers for whom
results were retained were native English speakers.
Data We used a dataset with a low-local and medium-global level The remaining data were sorted according to item and repetition ID
of variability: the monthly evolution of unemployment in different (assigned before randomization). Responses for the first repetitions
countries between the years 2000 and 2008. Country names were (table conditions) were removed. A score dataset (LG1s) was then
changed to science-fiction planet names listed in Wikipedia, and years created in accord with the requirements of IRT modeling: correct an-
were modified to fit the task scenario. swers were scored 1 and incorrect answers 0. Scores for each series
Priming and Pacing Before each new item, examinees were of item repetitions (Boolean values) were compressed by taking the
primed with the upcoming graph type, so that the concepts and op- mean values and rounding these, resulting in 12 sets of dichotomous
erations necessary for information extraction could be set up [41]. To item scores for each examinee.
separate the time required to read questions and answer them, and to
make sure examinees fully comprehended the question/task at hand, 4.1.3 Results
a specific pacing was given to each block of repetitions. First, the The purpose of the design phase is to remove items that are unhelpful
question was displayed (full page). At the bottom of this was a label for distinguishing between low and high levels of VL. To do so, we
“Proceed to graph framework”; this led participants to the table head- need to: 1) check that the simplest variant of IRT models (i. e., the
ers or graphical framework with appropriate titles and labels. Below Rasch model) fits the data, 2) find the best variant of the model to get
was displayed a button labeled “Display data”. When this button was the most accurate characteristic values for each item, and 3) assess the
activated, the graph or table was shown and the answer could be given. usefulness of each item with regard to the testing of low ability levels.

As mentioned, every block of repetitions began with a condition in Checking the Rasch model The Rasch model (RM) was fitted
which the data were shown in table form. This was preferred so as to to the score dataset. A 200 sample parametric Bootstrap goodness-of-
remove potential effects caused by the setup of high-level operations fit test using Pearson’s χ2 statistic revealed a non-significant p-value
for solving a particular kind of problem. for LG1s (p > 0.54), suggesting an acceptable fit3 . The Test Informa-
To make sure ability was being tested (and not capacity), we set tion Curve (TIC) is shown in Fig. 1a. It reveals a near-normal distri-
an 11s timer for each repetition. This was based on the mean time bution of test information across different ability levels, with a slight
required to answer the items of our pilot studies. bump around −2, and a peak around −1. This means that the test pro-
Response format To respond, examinees clicked on one of sev- vides more information about examinees with relatively low abilities
eral possible answers, displayed in the form of buttons at the bottom (0 being the mean ability level, i. e., that of an average achiever) than
of the interface. In some cases, correct answers were not directly dis- about examinees with high abilities.
played. For example, certain values required for answering were not Finding the right model variant Different IRT models, imple-
explicitly shown with labeled ticks on the graph’s axis. This was done mented in the ltm R package [43], were fitted to LG1s. A series of
to test examinees’ ability to make confident estimations (i. e., to han- pairwise likelihood ratio tests showed that the two-parameter logistic
dle uncertainty [37]). And although the graphs used color coding to model (2PL) was most suitable.
represent the different planets, the buttons did not. This ensured that
examinees translated the answer retrieved in the visual domain back Assessing the quality of test items Using 2PL, the TIC was
into the data domain. plotted again (Fig. 1b). The big spike suggests that several items
with difficulty characteristics just above −2 have high discrimination
4.1.2 Setup values (a-values). This is confirmed by the very steep Item Charac-
teristic Curves (ICCs) (Fig. 3a) of items LG1.1, LG1.4, and LG1.9
Participants To our knowledge, no particular number of samples (a > 51). This can also explain the distortion in the normal distribu-
is recommended for IRT modeling. We recruited 40 participants on tion in Fig. 1a.
MTurk. Turkers were required to have a 98% acceptance rate and a to- The probability estimates show that examinees with average abili-
tal of 1000 or more HITS approved. While the validity of calibrating ties have a 100% probability of giving a correct answer to the easiest
tests using MTurk may be debated due to lack of control over certain items (LG1.1, LG1.4, and LG1.9), and a 41% probability of giving a
experimental conditions [22], we wanted to perform our calibration
with results from a wide variety of people. 3 For more information about this statistic, refer to [43].
correct answer to the hardest item (LG1.11). However, LG1.11 has a
relatively low discrimination value (a < 0.7), and so may be consid-
ered not very useful for distinguishing between ability levels.

4.1.4 Discussion
IRT modeling appears to be a solid approach for our design phase.
Our results (Fig. 1) show that LG1 is useful for differentiating between
examinees with relatively low abilities, but not so much for ones with
high abilities.
The slight bump in the distribution of the TIC (Fig. 1a) suggests that
there are several test items that are quite effective for distinguishing
(a) TIC of LG1s under RM (b) TIC of LG1s under 2PL between ability levels lower than −2. This is confirmed by the spike
in Fig. 1b, indicating the presence of highly discriminating items.
Fig. 1: Test Information Curves (TICs) of the score dataset of the first Fig. 3a reveals that several items in the test have identical difficulty
line graph test under the original constrained Rasch model (RM) (a) and discrimination characteristics. Some of these could therefore be
and the two-parameter logistic model (2PL) (b). The ability scale considered for removal, as they only provide redundant information.
shows the θ-values. The slight bump in the normal distribution of Similarly, item LG1.11, which has a low discrimination characteristic,
(a) can be explained by the presence of several highly discriminating could be dropped as it is less efficient than others for differentiating
items, as shown by the big spike in (b). between ability levels.

4.1.5 Line Graphs 2: General Design


Our second line graph test (LG2) also used a low number of varying
dimensions, limited to tasks and question congruency. The test used
four samples for the stimuli, and a single layout with single scales.
The same task scenario, dataset, pacing, and response format were
kept, as well as the five repetitions, the initial table condition, and the
11s timer. A summary is given in Table 1. Half of the test’s items
were designed to be identical to items in LG1 (displayed in pink in
Table 1) to ensure that the results provided by the IRT modeling were
consistent.
40 participants were recruited on MTurk; the work of 3 Turkers was
rejected, for the same reason as above.

4.1.6 Results and Discussion


Our analysis was driven by the requirements listed above. Data were
Fig. 2: Test Information Curve of the score dataset of the second line sorted and encoded in the same way as before, and a score dataset for
graph test under the original constrained Rasch model. The test infor- LG2 was obtained (LG2s).
mation is normally distributed. RM was fitted to the score dataset, and the goodness-of-fit test re-
vealed an acceptable fit (p > 0.3). The pairwise likelihood ratio test
showed that RM fitted best. Fig. 2 shows that the Test Information
Curve is normally distributed, with an information peak roughly cen-
tered around −1, indicating that like LG1, this test mostly provides
information about examinees with abilities below average.
The Item Characteristic Curves were plotted (Fig. 3b) and com-
pared to those of our first test (Fig. 3a) to make sure that the identical
items in LG1 and LG2 had similar difficulty distributions. While it
cannot be expected that each identical item has the exact same charac-
teristics, since item characteristics are relative to individual tests, the
order of their difficulty should be consistent. To illustrate this point,
consider a simple numeracy test with two items: 10 + 20 (item 1) and
17 + 86 (item 2). It should be assumed that item 1 is easier than item 2.
In other words, the difficulty characteristic of item 2 should be higher
than that of item 1. Now if we add another item to the test, say 51 × 93
(item 3), the most difficult item in the previous version of the test (item
2) will no longer seem so difficult. However, it should still be more
(a) ICCs of LG1s under 2PL (b) ICCs of LG2s under RM difficult than item 1.
Fig. 3 shows some slight discrepancies in the difficulty of items 1,
Fig. 3: Item Characteristic Curves (ICCs) of the score datasets of the 3, and 6 between the two tests. However, the fact that item LG1.3 is
first line graph test (LG1s) under the original the two-parameter lo- further to the left in Fig. 3a is misleading. It is due to the characteris-
gistic model (a), and of the second line graph test (LG2s) under the tics of items LG1.1 and LG1.4. These have extremely high a-values.
constrained Rasch model (b). The different curve steepnesses in (a) Therefore, while their b-values are slightly higher than that of LG1.3,
are due to the fact that 2PL computes individual discrimination values the probability of success of an average achiever is higher for these
for each item, while RM sets all discrimination values to 1. items than it is for LG1.3 (1 > 0.94). Furthermore, the difficulty char-
acteristics of LG1.3 and LG2.3 are very similar (0.94 ≃ 0.92). Thus
the only exception in the ordering of item difficulties is item 6, which
is estimated to be more difficult than item 2 in LG1s, and not in LG2s.
We stress that each item characteristic value is not absolute: it is
relative to the latent trait that the test is attempting to uncover. Nev-
ertheless, the fact that the difficulty of identical items was generally
preserved across LG1 and LG2 (with the exception of item 6) sug-
gests that these traits may be the same for both tests (i. e., ability to
read line graphs). To examine this, we equated the test scores (i. e., we
joined them) using a common item equation approach. RM was fitted
to the resulting dataset, and the goodness-of-fit test showed an accept-
able fit. 2PL provided best fit. Individual item characteristics were (a) TIC of BCs under RM (b) ICCs of BCs under RM
generally preserved, with the exception of item 6 which, interestingly,
ended up with difficulty and discrimination characteristics very simi- Fig. 4: Test Information Curve (a) and Item Characteristic Curves (b)
lar to those of item 2. This indicates that the two tests cover the same of the score dataset of the bar chart test under the constrained Rasch
latent trait. Thus, although item characteristics are slightly adjusted model. The TIC in (a) is not normally distributed because of several
by the equation (e. g., item 6), and therefore this approach should be very low difficulty items, as shown by the curves to the far left of (b).
preferred when possible, it can be assumed that items in LG1 can be
transposed to LG2 (and vice-versa) without hindering the overall co-
herence of each test.
4.2 Assessment Phase
Having shown that our test items have a sound basis in theory, we now
turn to the assessment of examinees’ levels of visualization literacy.
To do so, we inspect the ability scores, derived from the fitted IRT
models. Essentially, these scores represent examinees’ standings (θ)
on the latent trait. They have great predictive power as they can deter-
(a) TIC of the subset of BCs under RM (b) ICCs of the subset of BCs under RM
mine the probability of success on items that have not been answered,
provided that these items follow the same latent variable scale as other
Fig. 5: Test Information Curve (a) and Item Characteristic Curves (b)
test items that have been answered. Thus, these scores can help predict
of the subset of the score dataset of the bar chart test under the con-
how well an examinee will perform on a given test. As such, they are
strained Rasch model. The subset was obtained by removing the very
perfect indicators for assessing VL.
low difficulty items shown in Fig. 4b.
LG1 revealed 27 different ability levels, ranging from −1.85 to 1.
These 27 levels correspond to unique response (score) patterns. While
a standard testing method would simply sum up the correct responses,
our method considers each response individually, with regard to the The only difference with the factors used in the previous tests (apart
difficulty of the item it was given for, before deriving the ability score. from the stimulus) was the variation task, as this was more of a trend
The distribution of ability scores for LG1 was nearly normal, with a detection task in the line graph tests. Bar charts are sub-optimal for
slight bump around −1.75. 40.7% of participants had ability scores determining trends, so the task was replaced by “global similarity de-
above average (i. e., θ > 0), and the mean ability was −0.27. tection”, as done in [23] (e. g., “Do all the bars have the same value?”).
LG2 revealed 33 different ability levels, ranging from −1.83 to 40 participants were recruited on MTurk. The work of 6 Turkers
1.19. The distribution was also nearly normal, with a bump around was rejected, for the same reason as mentioned above.
−1. 39.4% of participants had ability scores above average, and the
mean ability was −0.17. 5.1.2 Results and Discussion
These results show that most recruited Turkers had somewhat below Our analysis was driven by the same requirements as the ones set for
average levels of visualization literacy for line graphs. However, the the line graph tests. Data were sorted and encoded in the same way as
means are close to zero, and the distributions nearly normal. This before, resulting in a score dataset for BC (BCs).
suggests that most Turkers, while below, are close to average. It is RM was fitted to the data; the goodness-of-fit test revealed an ac-
important to point out that, just like item characteristics, ability scores ceptable fit (p > 0.37), and the pairwise likelihood ratio tests proved
are relative to a latent trait. If ability scores between different tests are that RM fit best. Fig. 4a shows that the Test Information Curve is
to be compared, the latent traits in each must be the same. not normally distributed. This is due to the presence of several ex-
Finally, while it should be interesting to develop broader ranges of tremely low difficulty (i. e., easy) items (BC.3, BC.7, BC.8, and BC.9;
item complexities for the line graph stimulus (by using the common b = −25.6), as shown in Fig. 4b. Inspecting the raw scores for these
item equation approach), thus extending the psychometric quality of items revealed a 100% success rate. Thus, they were considered too
the tests, we consider LG1 and LG2 to be sufficient for our current easy for the assessment we wished to make and were removed. Simi-
line of research. Overall, we believe that these low levels of difficulty larly, items BC.1 and BC.2 (for both, b < −4) were also removed.
reflect the overall simplicity of, and massive exposure to, line graphs. To check the coherence of the subset test, RM was fitted again to
the new set of scores. Goodness-of-fit was maintained (p > 0.33), and
5 E XTENSIONS RM still fitted best. The new TIC (Fig. 5a) is normally distributed, and
To further test the validity of our method, and to see whether it applies shows that this subset test mainly provides information for low ability
to other types of visualizations, we created two additional tests: one levels, with a peak around θ ≃ −1, meaning that like LG1 and LG2,
for bar charts (BC) and one for scatterplots (SP). This section presents the subset of BC is useful for differentiating examinees with relatively
the design and assessment phases of these tests. low abilities from ones with average abilities. Only one item (BC.5,
see Fig. 5b) is above average difficulty.
5.1 Design Phase
5.1.1 Bar Charts: General Design 5.1.3 Scatterplots: General Design
Like LG1 and LG2, the design of our bar chart test (BC) was based For our scatterplot test (SP), all design attributes of the previous tests
on the principles described in Section 3. The varying factors were were once again used. The varying factors were tasks and distraction,
restricted to tasks and samples; a summary is given in Table 1. The as shown in Table 1. Slight changes were made to some of the tasks
same task scenario was used. The dataset presented life expectancy since scatterplots use two spatial dimensions (as opposed to bar charts
in various countries, with country names again changed to names of and line graphs). For example, stimuli with distractors in LG1 only
fictitious planets. The 11 s timer was kept. required examinees to focus on one of several samples; here, stimuli
SP revealed 23 different ability levels, ranging from −1.72 to 0.72.
The distribution here was not normal. 43.5% of participants had ability
scores above average, and the median ability was -0.14.
These results show that the majority of recruited Turkers had some-
what below average levels of visualization literacy for bar charts and
scatterplots. The very low percentage of Turkers above average in BC
led us to reconsider the removal of items BC.1 and BC.2, as they were
not truly problematic. After reintegrating them in the test scores, 21
(a) TIC of SPs under 2PL (b) ICCs of SPs under 2PL ability levels were observed, ranging from −1.67 to 0.99, and 42.8%
of participants had ability scores above average. This seemed more
Fig. 6: Test Information Curve (a) and Item Characteristic Curves (b) convincing. However, this important difference illustrates once again
of the score dataset of the scatterplot test under the two-parameter lo- the relativity of these values, and shows how important it is to properly
gistic model. The TIC (a) shows that there are several highly dis- calibrate the tests during the design phase.
criminating items, which is confirmed by the very steep curves in (b). Finally, we did not attempt to equate these tests, since we ran them
In addition, (b) shows that there are also a two poorly discriminating independently without any overlapping items (unlike LG1 and LG2).
items, represented by the very gradual slopes of items SP.3 and SP.11. To have a fully comprehensive test, i. e., a generic test for visualization
literacy, whatever the type of chart or graph, intermediate tests are re-
quired where the stimulus itself is a varying factor. If such tests prove
to be coherent (i. e., if IRT models fit the results), then it should be
possible to assert that VL is a general trait that allows one to under-
stand any kind of graphical representation. Although we believe that
the ability to extract information from a visualization varies with ex-
posure and habit of use, a study to confirm this is outside of the scope
of this paper.

6 FAST, E FFECTIVE T ESTING


(a) TIC of the subset of SPs under 2PL (b) ICCs of the subset of SPs under 2PL
If these tests are to be used as practical ways of assessing VL, the
Fig. 7: Test Information Curve (a) and Item Characteristic Curves (b) process must be sped up, both in the administration of the tests and in
of the subset of the score dataset of the scatterplot test under the two- the analysis of the results. While IRT provides essential information
parameter logistic model. The subset was obtained by removing the on the quality of tests and on the ability of those who take them, it is
poorly discriminating items shown in Fig. 6b. quite costly, both in time and in computation. This must be changed.
In this section, we present a way in which the tests we have devel-
oped in the previous sections can be optimized for future use.
with distractors could either require examinees to focus on a single 6.1 Test Administration Time
datapoint or on a single dimension. A percentage of adult literacy by
expenditure per student in primary school dataset was used. As we have seen, several items can be removed from the tests, while
We had expected that SP might be slightly more difficult and items keeping good psychometric quality. However, this should be done
would therefore require more time to complete. However, a pilot test carefully, as some of these items may provide useful information (like
on 10 Turkers showed the average response time per item was once in the case of BC.1 and BC.2, Sect. 5.2).
again roughly 11 s. Therefore the 11 s timer was kept. We first removed LG1.11 from LG1, as its discrimination value was
< 0.8 (see Sect. 5.1.4). We then inspected items with identical diffi-
40 participants were recruited on MTurk. The work of 1 Turker
culty and discrimination characteristics, which are represented by the
could not be kept due to technical (logging) issues.
overlapping curves in the ICCs (see Fig. 3). Such items were consid-
5.1.4 Results and Discussion ered prime candidates for removal, since probability of success can
be inferred from the ability scores. There was one group of overlap-
Our analysis was once again driven by the same requirements as be- ping items in LG1 ([LG1.1, LG1.4, LG1.9]), and two in LG2 ([LG2.1,
fore. The same sorting and coding was applied to the data, resulting LG2.3], [LG2.2, LG2.7]). For each group, we kept only one item.
in the score dataset SPs. The same fitting procedure was then applied, Thus LG1.1, LG1.4, LG2.3, and LG2.7 were dropped.
revealing a good fit for RM (p = 0.6), and a best fit for 2PL. We were forced to reintegrate items BC.1 and BC.2 to BC, as they
The Test Information Curve (Fig. 6a) shows the presence of several proved to have a big impact on ability scores (Sect. 5.2). The subset of
highly discriminating items around θ ≃ −1 and θ ≃ 0. The Item Char- SP created for the analysis was kept, and no extra items were removed.
acteristic Curves (Fig. 6b) confirm that there are three (SP.6, SP.8, and RM was fitted once again to the newly created subsets of LG1, LG2,
SP.10; a > 31). However, they also show that two items (SP.3, and and BC; the goodness-of-fit test showed acceptable fits for all (p >
SP.11) have quite low discrimination values (a < 0.6). 0.69 for LG1, and p > 0.3 for both LG2 and BC). 2PL fitted best for
Here, we set a threshold for a > 0.8. Items SP.3 and SP.11 were LG1, and RM fitted best for LG2 and BC.
thus removed. The resulting subset of 10 items’ scores was fitted once A post-hoc analysis was conducted for each test to see whether the
again. RM fitted well (p = 0.69), and 2PL fitted best. The different number of repetitions could be reduced (first to three, then to one). Re-
curves of the subset are plotted in Fig. 7. They show a good amount sults showed that RM fitted all score datasets using three repetitions.
of information for abilities that are slightly below average (Fig. 7a), However, several examinees had lower scores. In addition, while BC
which indicates that the subset of SP is also more adequate for testing and SP showed similar amounts of information for the same ability
examinees with relatively low abilities. levels, the three very easy items in BC (i. e., BC.3, BC.7, BC.8, and
BC.9) were no longer problematic. This suggests that several par-
5.2 Assessment phase ticipants did not get a score of 1 for these items, and confirms that,
Here again, we inspected the Turkers’ ability scores. Only the items for some examinees, more repetitions are needed. Results for one-
retained during the analysis were used. repetition-tests showed that RM no longer fitted the scores of BC, sug-
BC revealed 21 different ability levels, ranging from −1.75 to 0.99. gesting that first repetitions (with the graph/chart stimulus) are noisy.
The distribution of ability scores was nearly normal, with a slight Therefore, at this point, we decided to keep the five repetitions.
bump around −1.5. The mean ability was −0.39. However, only In the end, the redesign of LG1 contained 9 items (with a ≃ 10 min
14.3% of participants had ability scores above average. completion time), the redesigns of LG2 and SP contained 10 items (11
min), and the redesign of BC contained 8 items (≃ 9 min). nee’s ability since it erases one-time errors which may be due to
lack of attention or to clicking on the wrong answer by mistake.
6.2 Analysis Time and Computational Power 8. Fit the original constrained version of the Rasch model to the
data. RM is the simplest variant of IRT models. If it does not fit
To speed up the analysis, we first considered setting up the procedure
the data, other variants will not either.
we had used in R on a server. However, this solution would have
9. Check the fit of the model. Here we used a 200 sample para-
required a lot of computational power, so we dropped it.
metric Bootstrap goodness-of-fit test using Pearsons χ2 statistic.
Instead, we chose to tabulate all possible ability scores for each test, To reveal an acceptable fit, the returned p-value should not be
based on all possible response patterns. An interesting feature of IRT statistically significant (p > 0.05). In some cases (like in our pi-
modeling is that it accepts unobserved response patterns (i. e., patterns lot studies), the model may not fit. Option here are to inspect the
that do not exist in the empirical data), and partial response patterns χ2 p-values for pairwise associations, or the two- and three-way
(i. e., patterns with missing values). Consequently, we generated every χ2 residuals, to find potentially problematic items5 .
possible pattern for each test. There are 2ni − 1 possibilities for a test 10. Determine which IRT model variant best fits the data. To
containing ni items. Thus 511 patterns were created for LG1, 1023 test this, series of pairwise likelihood ratio tests can be used. If
for both LG2 and SP, and 255 for BC. We then derived the different several models fit, it boils down to how precise the examinee and
ability scores that could be obtained in each test. item characteristics need to be. Generally speaking, it is better
To make sure that shortening the tests (i. e., removing certain items) to go with the model that fits best. Our experience showed that
did not greatly affect the ability scores, we derived all ability scores such models were most often RM and 2PL.
for the full LG1 and LG2 tests. We found some small differences, spe- 11. Identify potentially useless items. In our examples of LG1
cially in the upper and lower bounds of ability, but we consider these and SP, certain items had very low discrimination characteris-
to be negligible, since our tests were not designed for fine distinction tics. These are not very efficient for differentiating ability levels.
between very low abilities or high abilities. We also tested the im- They can be removed. In cases like the one for BC, items may
pact of refitting the models, once the items were removed. For this, also simply be too easy. These items can also be removed, but
we repeated the procedure, using the full LG1 and LG2 tests, but we special care must be taken with regard to the impact on ability
replaced the Boolean response values for the items considered for re- scores. Finally, it is important the model be refitted at this stage,
moval by not available (NA) values. The derived scores were exactly (reproducing steps 8 through 10) as removing these items may
the same as the ones derived from our already shortened (and refitted) affect examinee and item characteristics.
tests, which proves that they can be trusted.
Finally, we integrated these ability scores and their correspond- 7.2 Final Design
ing response patterns into the web-based, shortened versions of the
tests. This way, the scores are immediately available. Consider- 12. Identify potentially overlapping items and remove them. If
ing our original motivation, by administering these online tests, re- the goal is to design a short test, such items can safely be re-
searchers can have direct access to participants’ levels of visualiza- moved, as seen in Sect. 6.1. IRT models should then be refitted
tion literacy. Informed decisions can then be made as whether to to the subsets of item scores, repeating steps 8 through 10.
keep participants for further studies or not. All four tests are ac- 13. Generate all 2ni − 1 possible score patterns, where ni is the
cessible at http://peopleviz.gforge.inria.fr/trunk/ number of (retained) items in the test. These patterns are series
vLiteracy/home/ of possible Boolean responses to the test items.
14. Derive the ability scores from the model, using the patterns of
responses generated in step 13. These scores represent the range
7 M ETHODOLOGY G UIDELINES
of visualization literacy levels that the test can assess.
As the preceding sections have shown, we have developed and vali- 15. Integrate the ability scores into the test to make fast, effective,
dated a fast and effective method for assessing visualization literacy. and scalable estimates of people’s visualization literacy.
This section summarizes the major steps of our method. It is written
in the form of easy “take-away” guidelines. 8 C ONCLUSIONS AND F UTURE W ORK
In this paper, we have presented a method based on Item Response
7.1 Original Design Theory for assessing visualization literacy, through the design and
1. Pay careful attention to the design of all 3 components of a evaluation of a series of tests that each cover different aspects of vi-
test item, i. e., stimulus, task, and question. Each of these can sualization literacy. Our original motivation was to make a series of
influence item difficulty, and too much variation may fail to pro- fast, effective, and reliable tests which researchers could use to detect
duce a good test (as was seen in our pilot studies). participants with low VL levels using online studies. In addition, we
2. Repeat each item several times. We did 5 repetitions + 1 “ques- have proposed a way in which these tests can be redesigned and im-
tion comprehension” condition for each item. This is important plemented in order to get immediate estimates of examinee’s levels of
as repeated trials provide more robust measures. Ultimately, it visualization literacy.
may be feasible to reduce the number of repetitions to 3. 4 We intend to continue developing these tests, as well as examine
3. Use a different, and ideally, non-graphical, representation their suitability for other kinds of representation (e. g., parallel coor-
for the question comprehension condition. We chose a table dinates, node link diagrams, starplots, etc.), and possibly for other
condition. While our present analysis did not focus on proving purposes. In other contexts like that of a classroom evaluation, tests
its essentialness, we believe that this attribute is important. could be made longer, and broader assessments of visualization liter-
4. Randomize order of items and of repetitions. This is common acy could be made. This would imply further exploration of the design
practice in experiment design, having the benefit of preventing parameters we have proposed in section 3. Evaluating these parame-
carryover effects. ters to find their impact on item difficulty should also be interesting.
5. Once the results are in, sort the data according to item and Finally, we acknowledge that this work is but a small step into the
repetition ID, and remove the data for the initial repetitions. realm of visualization literacy, which is why we have made our tests
6. Encode examinees’ scores in a dichotomous way, i. e., 1 for available on GitHub for versioning [1]. Ultimately, we hope that this
correct answers and 0 for incorrect answers. will serve as a foundation for further research into VL.
7. Calculate the mean score for all repetitions of an item and
round the result. This will give a finer estimate of the exami- R EFERENCES
4 The [1] https://github.com/INRIA/Visualization-Literacy-101.
number of repetitions should be odd, so as to not end up with a mean
score of 0.5 for an item. 5 For more information, refer to [43].
[2] D. Abilock. Visual information literacy: Reading a documentary photo- 2010.
graph. Knowledge Quest, 36(3), January–February 2008. [28] R. Lowe. “Reading” scientific diagrams: Characterising components
[3] J. Baer, M. Kutner, and J. Sabatini. Basic reading skills and the literacy of skilled performance. Research in Science Education, 18(1):112–122,
of the america’s least literate adults: Results from the 2003 national as- 1988.
sessment of adult literacy (naal) supplemental studies. Technical report, [29] R. K. Lowe, P. A. N. K. C. f. S. Curtin Univ. of Tech., and Mathematics.
National Center for Education Statistics (NCES), February 2009. Scientific Diagrams: How Well Can Students Read Them? [microform]:
[4] J. Bertin and M. Barbut. Semiologie graphique. Mouton, 1973. What Research Says to the Science and Mathematics Teacher. Number
[5] P. BOBKO and R. KARREN. The perception of pearson product mo- 3 / Richard K. Lowe. Distributed by ERIC Clearinghouse [Washington,
ment correlations from bivariate scatterplots. Personnel Psychology, D.C.], 1989.
32(2):313–325, 1979. [30] J. Meyer, M. Taieb, and I. Flascher. Correlation estimates as perceptual
[6] P. Borlund. Experimental components for the evaluation of interactive in- judgments. Journal of Experimental Psychology: Applied, 3(1):3, 1997.
formation retrieval systems. Journal of Documentation, 56:71–90, 2000. [31] E. Miller. The miller word-identification assessment. http://www.
[7] M. T. Brannick. Item response theory. http://luna.cas.usf. donpotter.net/pdf/mwia.pdf, 1991.
edu/˜mbrannic/files/pmet/irt.htm. [32] National Cancer Institute. Item response theory modeling. http://
[8] V. J. Bristor and S. V. Drake. Linking the language arts and content areas tiny.cc/cxfjdx.
through visual technology. T.H.E. Journal, 22(2):74–77, 1994. [33] National Center for Education Statistics. Adult literacy and lifeskills sur-
[9] P. Carpenter and P. Shah. A model of the perceptual and conceptual pro- vey. http://nces.ed.gov/surveys/all/.
cesses in graph comprehension. Journal of Experimental Psychology: [34] Numerical reasonin - table/graph. http://tiny.cc/djfjdx.
Applied, 4(2):75–100, 1998. [35] Numerical reasoning online. http://
[10] W. S. CLEVELAND, P. DIACONIS, and R. MCGILL. Variables on scat- numericalreasoningtest.org/.
terplots look more highly correlated when the scales are increased. Sci- [36] J. Oberholtzer. Why two charts make me feel like an expert on portugal.
ence, 216(4550):1138–1141, 1982. http://tiny.cc/ujgjdx, 2012.
[11] Cognitive testing interview guide. http://www.cdc.gov/nchs/ [37] OECD. Pisa 2012 assessment and analytical framework. Technical re-
data/washington_group/meeting5/WG5_Appendix4.pdf. port, OECD, 2012.
[12] M. Correll, D. Albers, S. Franconeri, and M. Gleicher. Comparing aver- [38] OTA. Computerized Manufacturing Automation: Employment, Educa-
ages in time series data. In Proceedings of the 2012 ACM annual confer- tion, and the Workplace. United States Office of technology Assessment,
ence on Human Factors in Computing Systems, pages 1095–1104. ACM, 1984.
2012. [39] S. Pinker. A theory of graph comprehension, pages 73–126. Lawrence
[13] F. R. Curcio. Comprehension of mathematical relationships expressed in Erlbaum Associates, Hillsdale, NJ, 1990.
graphs. Journal for research in mathematics education, pages 382–393, [40] I. Pollack. Identification of visual correlational scatterplots. Journal of
1987. experimental psychology, 59(6):351, 1960.
[14] Department for Education. Numeracy skills tests: Bar charts. [41] R. Ratwani and J. Gregory Trafton. Shedding light on the graph schema:
http://www.education.gov.uk/schools/careers/ Perceptual features versus invariant structure. Psychonomic Bulletin and
traininganddevelopment/professional/b00211213/ Review, 15(4):757–762, 2008.
numeracy/areas/barcharts, 2012. [42] R. A. Rensink and G. Baldridge. The perception of correlation in scatter-
[15] R. C. Fraley, N. G. Waller, and K. A. Brennan. An item response theory plots. In Proceedings of the 12th Eurographics / IEEE - VGTC Con-
analysis of self-report measures of adult attachment. Journal of person- ference on Visualization, EuroVis’10, pages 1203–1210, Aire-la-Ville,
ality and social psychology, 78(2):350, 2000. Switzerland, Switzerland, 2010. Eurographics Association.
[16] E. Freedman and P. Shah. Toward a model of knowledge-based graph [43] D. Rizopoulos. ltm: An r package for latent variable modeling and item
comprehension. In M. Hegarty, B. Meyer, and N. Narayanan, editors, Di- response analysis. Journal of Statistical Software, 17(5):1–25, 11 2006.
agrammatic Representation and Inference, volume 2317 of Lecture Notes [44] RUMMLaboratory. Rasch analysis. http://www.
in Computer Science, pages 18–30. Springer Berlin Heidelberg, 2002. rasch-analysis.com/rasch-models.htm.
[17] E. Freedman and P. Shah. Toward a model of knowledge-based graph [45] P. Shah. A model of the cognitive and perceptual processes in graphical
comprehension. In M. Hegarty, B. Meyer, and N. Narayanan, editors, Di- display comprehension. Reasoning with diagrammatic representations,
agrammatic Representation and Inference, volume 2317 of Lecture Notes pages 94–101, 1997.
in Computer Science, pages 18–30. Springer Berlin Heidelberg, 2002. [46] G. Sir Crowther. The Crowther Report, volume 1. Her Majesty’s Sta-
[18] S. Friel, F. Curcio, and G. Bright. Making sense of graphs: Critical factors tionery Office, 1959.
influencing comprehension and instructional implications. Journal for [47] C. Taylor. New kinds of literacy, and the world of visual information.
Research in Mathematics Education, 32(2), 2001. Literacy, 2003.
[19] S. N. Friel, F. R. Curcio, and G. W. Bright. Making sense of graphs: [48] J. G. Trafton, S. P. Marshall, F. Mintz, and S. B. Trickett. Extracting
Critical factors influencing comprehension and instructional implications. explicit and implicit information from complex visualizations. Diagrams,
Journal for Research in mathematics Education, 32(2):124–158, 2001. pages 206–220, 2002.
[20] A. C. Graesser, S. S. Swamer, W. B. Baggett, and M. A. Sell. New models [49] S. Trickett and J. Trafton. Toward a comprehensive model of graph com-
of deep comprehension. Models of understanding text, pages 1–32, 1996. prehension: Making the case for spatial cognition. In D. Barker-Plummer,
[21] Graph design i.q. test. http://perceptualedge.com/files/ R. Cox, and N. Swoboda, editors, Diagrammatic Representation and In-
GraphDesignIQ.html. ference, volume 4045 of Lecture Notes in Computer Science, pages 286–
[22] J. Heer and M. Bostock. Crowdsourcing graphical perception: using 300. Springer Berlin Heidelberg, 2006.
mechanical turk to assess visualization design. In Proceedings of the [50] B. Tversky. Semantics, syntax, and pragmatics of graphics., pages 141–
SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, 158. Lund University Press, Lund, 2004.
pages 203–212, New York, NY, USA, 2010. ACM. [51] UNESCO. Literacy assessment and monitoring programme.
[23] How to read a bar chart. http://www.quizrevolution.com/ http://www.uis.unesco.org/Literacy/Pages/
act101820/mini/go/. lamp-literacy-assessment.aspx.
[24] P. Isenberg, A. Bezerianos, P. Dragicevic, and J. Fekete. A study on dual- [52] University of Kent. Numerical reasoning test. http://www.kent.
scale data charts. Visualization and Computer Graphics, IEEE Transac- ac.uk/careers/tests/mathstest.htm.
tions on, 17(12):2469–2478, Dec 2011. [53] H. Wainer. A test of graphicacy in children. Applied Psychological Mea-
[25] Join, ACRL and Council, Chapters. Presidential committee on informa- surement, 4(3):331–340, 1980.
tion literacy: Final report. Online publication, 1989. [54] What is media literacy? a definition... and more.
[26] I. Kirsch. The international adult literacy survey (ials): Understanding http://www.medialit.org/reading-room/
what was measured. Technical report, Educational Testing Service, De- what-media-literacy-definitionand-more.
cember 2001. [55] M. Wu, R. Adams, and E. M. Solutions. Applying the Rasch model to
[27] N. Knutson, K. S. Akers, and K. D. Bradley. Applying the rasch model to psycho-social measurement [electronic resource] : a practical approach
measure first-year students perceptions of college academic readiness. In / Margaret Wu and Ray Adams. Educational Measurement Solutions Mel-
Paper presented at the annual meeting of the MWERA Annual Meeting, bourne, Vic, 2007.

You might also like