Watson 2003

Publisher: Taylor & Francis
International Journal of Mathematical Education in Science and Technology
Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/tmes20
The measurement of school students' understanding of statistical variation
Jane M. Watson (a), Ben A. Kelly (a), Rosemary A. Callingham (a) & J. Michael Shaughnessy (b)
(a) Faculty of Education, University of Tasmania, GPO Box 252-66, Hobart 7001, Australia. E-mail: Jane.Watson@utas.edu.au
(b) Department of Mathematical Sciences, Portland State University, Portland, Oregon 97207, USA
Published online: 11 Nov 2010.
To cite this article: Jane M. Watson , Ben A. Kelly , Rosemary A. Callingham & J. Michael
Shaughnessy (2003) The measurement of school students' understanding of statistical
variation, International Journal of Mathematical Education in Science and Technology, 34:1,
1-29, DOI: 10.1080/0020739021000018791
To link to this article: http://dx.doi.org/10.1080/0020739021000018791
int. j. math. educ. sci. technol., 2003, vol. 34, no. 1, 1-29

The measurement of school students' understanding of statistical variation

JANE M. WATSON,*† BEN A. KELLY,† ROSEMARY A. CALLINGHAM† and J. MICHAEL SHAUGHNESSY‡
†Faculty of Education, University of Tasmania, GPO Box 252-66, Hobart 7001, Australia
‡Department of Mathematical Sciences, Portland State University, Portland, Oregon 97207, USA
E-mail: Jane.Watson@utas.edu.au
(Received 13 November 2001)

The paper presents a questionnaire devised to assess school students'
understanding of statistical variation. The questionnaire is based on earlier
research into students' understanding of the chance and data curriculum and
recent work more specifically related to variation. It was devised, piloted,
revised, and administered to 746 students in grades 3, 5, 7, and 9 in ten
Tasmanian schools. The analysis of outcomes was carried out in three stages: a
hierarchical coding scheme was developed based on a structural model of
cognitive development; a Rasch analysis was carried out to produce a variable
map of student performance and item difficulty on a single scale; and a holistic
model of development was suggested for the questionnaire. Outcomes for
individual items are presented to illustrate the range of student responses and
possible rubrics for use by teachers. For some items, comparisons are made
with results of other researchers.
1. Introduction
The case for the need to measure school students' understanding of statistical
variation would appear easy to make: statistics requires variation for its existence.
If understanding of statistics is to be measured then account must be taken of
understanding of variation. This does not mean understanding of standard
deviation but of something more fundamental: the underlying change from
expectation that occurs when measurements are made or events occur. This implies
that variation occurs in a context and thus it must be measured in context. For school
students the context is the chance and data part of the mathematics curriculum.
Hence it is within this context that measurement of variation itself must take place.
It was the purpose of this project to devise a survey that would use the context of
the chance and data curriculum, which only implicitly acknowledges the variation
on which it stands, to gain an appreciation of that elusive foundation of the subject.
Until very recently there has been little research into school students'
understanding of variation. Shaughnessy [1] suggested that this might reflect the
emphasis of the school curriculum, which has traditionally been on measures of
the middles of data sets rather than on measures of spread. This in turn may reflect
a view of teachers and curriculum planners that variation is messy and leads to a
complex codification in the standard deviation. This view is not reflected, however,
by statisticians such as Moore [2], who included variation in four of his five core
elements of statistical thinking, and Cobb and Moore, who stated explicitly that
the need for statistics arises from the omnipresence of variability [3, p. 801].
Further, the work of Wild and Pfannkuch [4] in developing a framework for
statistical thinking in empirical enquiry places variation at the heart of all
investigation, reinforcing the need to address students' understanding and devise
ways to assist the development of appropriate appreciation before they leave
school.
* The author to whom correspondence should be addressed.
International Journal of Mathematical Education in Science and Technology, ISSN 0020-739X print/ISSN 1464-5211 online, © 2003 Taylor & Francis Ltd, http://www.tandf.co.uk/journals
Shaughnessy et al. [5] began the antecedent work to that presented in this study
in response to the analysis of data and chance items from the 1996 National
Assessment of Educational Progress (NAEP) in the USA [6]. In particular, an item
in the NAEP asked students to make a best prediction of how many red gumballs
would appear in a group of 10 obtained from a (well-mixed) gumball machine
containing 20 yellow, 30 blue, and 50 red gumballs. The lack of encouragement to
think of the possibility of a range of outcomes led Shaughnessy et al. [5] to revise
the question and trial it in various forms with students in the USA and Australia.
Additionally, the stem was broadened in these trials to describe a series of six
removals of 10 gumballs (renamed as lollies) with replacement after each trial.
The options included asking students for a range of values, asking them to choose
from a multiple choice list of five possibilities, and asking them to produce a list of
the six results. The outcomes from these trials were encouraging in providing
richer information on students' thinking about variation in this setting.
Subsequently, small groups of students were interviewed in two Australian
settings to explore students' understanding in more detail.
One study, by Torok and Watson [7], reported on interviews with 16 students
in grades 4, 6, 8, and 10 using a protocol on variation that included in-depth
questions on the gumball/lollies problem. They suggested a four-level model of
development of understanding of variation that appeared to be influenced by
students' understanding of proportional reasoning. It was proposed that at level A,
appreciation of variation was weak; at level B, isolated appreciation of variation and
clustering was shown; at level C, appreciation was stronger but still inconsistent;
whereas at level D, good consistent appreciation was shown. Subsequently,
Reading and Shaughnessy [8] interviewed 12 students using a similar protocol
and reported in detail on the responses of one student from each of grades 4, 6, 9,
and 12. These students reflected, respectively, many of the characteristics of the
four levels observed by Torok and Watson [7]. The grade 12 student also showed
conflict in deciding between multiple choice answers of (5,5,5,5,5,5) and
(3,7,5,8,5,4) for the number of red lollies obtained from six draws of 10 lollies
from a container with half red. The authors suggested that the influence of senior
secondary instruction on theoretical probability was the likely cause of this
conflict.
As part of a larger study to evaluate the effect of teaching the chance and data
curriculum in such a way as to enhance appreciation of variation, the survey
presented in this paper was devised in the light of the experiences described by
Shaughnessy et al. [5], Torok and Watson [7], and Reading and Shaughnessy [8].
The overriding objective was to include items that would allow the freedom to
demonstrate an understanding of variation that occurred in those studies, as well as
cover the chance and data contexts that allowed for variation to occur.
The overall objectives of this paper are three-fold. First, the questionnaire will
be presented with a rationale for the choice of items and their sources. Second, the
results of administering the questionnaire to students in grades 3 to 9 will be
presented. This will include a discussion of the hierarchical coding scheme for
items, with examples of students' responses for particular items that illustrate the
levels of development observed, the outcomes of applying the Rasch partial credit
model (PCM) [9] to the data, and an interpretation in terms of general develop-
mental levels. Third, comparisons will be made with other studies that have used
similar items and with other frameworks for student thinking relevant to the
questionnaire. Some suggestions will also be made for future research.
2. Survey development
The search for items appropriate to a paper-and-pencil survey of about 45
minutes duration began with existing material used in previous research with
chance and data (see [10] for many sources), with the work of Shaughnessy et al.
[5] as described above, and with activities carried out in classrooms [11]. The
framework for the chance and data curriculum conceived by Holmes [12] was used
to ensure that aspects of variation across the curriculum were covered. The
following areas were hence specifically catered for with the items in the
Appendix: sampling variation, displaying variation, chance variation,
describing/measuring variation, and sources of variation (explanations, inferences).
An initial survey was trialled in a small private school with 58 students in
grades 4 to 10. A few items were reworded to aid interpretation and understanding,
and decisions were made about a core set of items to be asked of grade 3 students,
with additions made at each of grades 5, 7, and 9. Initial analyses based on Rasch
methods [9, 13] and the Structure of Observed Learning Outcomes (SOLO)
developmental model of Biggs and Collis [14] were promising and with amend-
ments, the surveys, as presented in the Appendix, were used for the main study
(the spacing of the items in the Appendix does not reflect the spacing used on the
actual surveys). The rationale for the choice of items and the addition of parts and
items for higher grades will be discussed in the following paragraphs.
To set the context for questions dealing with variation it was necessary to
include some items focusing more directly on basic chance measurement and table
or graph reading. Items 1 and 14 used by Watson et al. [15] and Item 3(a) used by
Torok [11] were of this nature for chance. Parts (a) to (d) of Item 15, parts (a) and
(b) of Item 6, and parts (a) to (d) of Item 4, served a similar purpose for reading
tables, pictographs, and stacked dot plots, respectively (based on items reported in
[11], [16], [17], and [18]). Also, Item 11 asked students to note anything unusual in
a bar graph presented from a newspaper article [19].
To explore students' understanding of variation in chance settings, Item 2
(following on from Item 1 for all grades), asked students to imagine throwing a die
60 times with the possible results being recorded in a table. The purpose was to
assess the degree of variation they would attribute to outcomes associated with
many trials but having an underlying uniform expectation. The motivation for the
item arose from the work of Reading and Shaughnessy [8] and Torok and Watson
[7], although these researchers did not work with dice. The aims of Item 3, parts
(b) to (e) were similar but set in the context of a spinner task [11]. Students in
grades 3 and 5 were asked to imagine the outcomes of 10 spins performed on six
different occasions, whereas students in grades 7 and 9 were asked to imagine the
outcomes for 50 spins performed on six different occasions. Finally, the words
'random' and 'variation' were included in the words whose meanings were
requested in Item 12 [20, 21].
Consideration of variation in other data handling settings, including graphs,
was covered in the following: Item 4(e) and Item 5, answered by students in grades
7 and 9 for stacked dot plots related to the spinner task [11]; parts (c) to (f) of Item
6 for the pictographs, which was answered by students in all grades [17]; Item 7 for
a comparison of stacked dot plots, answered by students in grades 5, 7, and 9 [22];
Item 11 for bar graphs and Item 13 for finding an average in a set with an outlier,
both answered by students in grades 7 and 9 only [23, 24].
To consider sampling, Items 8, 9, and 10 were adapted from the work of Jacobs
[25] based on a school class doing a survey before selling raffle tickets. All grades
were asked Item 8, involving how the students would conduct the survey
themselves. For Item 9, suggesting how other students had conducted the survey,
grade 3 students were only presented with three alternatives: (a) Shannon, (b) Jake,
and (c) Adam. Grade 5 students had these three alternatives, as well as (d) Raffi. In
grades 7 and 9, students had these four alternatives, plus (e) Claire. All students
were asked to rate each survey method as 'Good', 'Bad', or 'Not sure', to justify
the rating, and to choose which of the survey alternatives was the best. Item 10,
asking for a final prediction on school support for the survey, was only given to
grade 9 students. Item 15(e) asked about fair selection of a sample and Item 16
explored bias in sampling [26, 27], whereas a definition of 'sample' was solicited
in Item 12.
3. Method
3.1. Sample
Data were collected from students in ten government schools in the Australian
state of Tasmania. A total of 746 students in grades 3 (n = 177), 5 (n = 183), 7
(n = 189), and 9 (n = 197) were administered the questionnaire described above
and shown in the Appendix. Each class was allowed approximately 45 minutes to
complete the questionnaire, and students were encouraged to ask questions to help
clarify the reading of items. Teachers and researchers, however, did not assist
students in such a way as to answer the question or influence responses. Students
who finished early were given the opportunity to go back and answer the items left
blank. This helped to reduce missing data.
3.2. Coding
The SOLO taxonomy of Biggs and Collis [14], and previous analyses of some
items by Watson et al. [15] and Watson and Moritz [26-28] provided starting
points for the analysis of the items in the present study. A coding scheme was
devised based on this background and the desire to provide structure to students'
displayed understanding of variation in the contexts provided by the items.
Overall, 44 sub-parts were coded for the 16 items. The coding of Items 1 and
14 was based on previous work by Watson et al. [15], whereas the coding of Item
16 was based on a scheme devised by Watson and Moritz [26]. All other items were
coded based on their degree of mathematical correctness and/or the appropriateness
of variation expressed (if relevant to the response). A number of different
categories of response were assigned for each of the 44 parts. Although distinct,
there were often some common characteristics or themes among categories, which
allowed for the grouping of categories into more general groupings that were
hierarchical.
In Item 3(e), for example, nine distinct categories of responses were found;
these were collapsed into three more global categories of response. The three
global categories were (with examples for six sets of 10 spins): (i) inappropriate
responses, including no response, misinterpretation of question (e.g. L, S, L, S, L,
S), and out-of-range responses (e.g. 7, 5, 20, 18, 12, 8); (ii) lop-sided and centred
responses, including those with no variation (e.g. 10, 10, 10, 10, 10, 10 or 5, 5, 5, 5,
5, 5), or too small or too wide variation (e.g. 0, 1, 1, 0, 0, 1; or 0, 1, 9, 1, 2, 5); and
(iii) centred responses with appropriate variation (e.g. 5, 4, 3, 6, 8, 7). This type of
coding scheme distinguished between the finer categories of response for each
sub-part, as well as catering for the global categories combined from these smaller
gradations.
The criterion for determining the appropriateness of the variation displayed in
responses to Item 3(e) was based on a simulation of 1000 outcomes for the response
using an EXCEL spreadsheet. The standard deviation for each simulation was
calculated and then plotted. Appropriate variation was determined by the standard
deviations falling within the middle 90% of a normal curve (0.6-2.3 for 10 spins;
1.3-5.0 for 50 spins). A similar simulation was carried out for Item 2 with the 60
tosses of the die, with the standard deviations within the middle 90% being the
indicator of appropriate variability (1.2-4.7). For each item, the standard deviation
of the student's response was calculated and compared with one of these criteria.
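The simulation criterion can be sketched in code. The following is not the authors' EXCEL spreadsheet but a minimal illustration under stated assumptions: a fair two-outcome spinner (probability 0.5, suggested by the centred examples for Item 3(e)), the sample (n-1) standard deviation, and a normal approximation (mean plus or minus 1.645 standard deviations) for the middle 90%. The function names and the seed are hypothetical.

```python
import random
import statistics

def simulate_sd_band(n_spins, n_sets=6, p=0.5, n_sims=1000, seed=0):
    """Return (low, high): the middle 90% of simulated standard deviations.

    Each simulation draws n_sets counts of 'successes' in n_spins spins and
    records the sample standard deviation of those counts.
    """
    rng = random.Random(seed)
    sds = []
    for _ in range(n_sims):
        counts = [sum(rng.random() < p for _ in range(n_spins))
                  for _ in range(n_sets)]
        sds.append(statistics.stdev(counts))  # assumption: n-1 denominator
    centre = statistics.mean(sds)
    spread = statistics.stdev(sds)
    z = 1.645  # two-sided 90% point of the standard normal curve
    return centre - z * spread, centre + z * spread

def classify(response, low, high):
    """Label a list of counts by comparing its SD with the band."""
    sd = statistics.stdev(response)
    if sd < low:
        return "too little variation"
    if sd > high:
        return "too much variation"
    return "appropriate variation"

# Simulated band for six sets of 10 spins (varies slightly with the seed).
low, high = simulate_sd_band(n_spins=10)
print(round(low, 1), round(high, 1))

# Example responses from the text, judged against the reported band 0.6-2.3.
for response in ([5, 4, 3, 6, 8, 7], [5, 5, 5, 5, 5, 5], [0, 1, 9, 1, 2, 5]):
    print(response, "->", classify(response, 0.6, 2.3))
```

The band produced by such a sketch should be close to, but need not equal, the reported 0.6-2.3, since the original spreadsheet's random draws and exact formula are not given.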
For items similar in content, such as Items 9(a) to 9(e), a parallel classification
was used relative to the quality of the suggestions made in the stem of the item.
The four categories for these items reflected no appreciation for the task, an
inappropriate decision on the adequacy of the survey, a non-central criticism or
approval of the techniques, and an appropriate statistical response.
3.3. Analysis
Use of Item Response Models, and more specically Rasch models, with data
coded using SOLO and other hierarchical systems has been described by a number
of authors (see, for example [29, 30]). These models use the interaction between
persons and items to estimate person abilities and item difficulties on the same
scale. The unit of measurement is the logit, the natural logarithm of the odds of
success of a person on an item. For further discussion of the theoretical aspects of
the PCM used in this study, see Masters [9].
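To make the logit scale concrete, the following sketch computes response probabilities under the dichotomous Rasch model and under the partial credit model in its standard form. It is an illustration only: the ability and step-difficulty values are invented, not estimates from this study, and the function names are hypothetical.

```python
import math

def rasch_probability(theta, delta):
    """P(success) for ability theta and item difficulty delta (in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def pcm_probabilities(theta, deltas):
    """Category probabilities for scores 0..m under the partial credit model.

    deltas[k-1] is the step difficulty for moving from score k-1 to score k.
    """
    # Cumulative sums of (theta - delta_k); the empty sum for score 0 is 0.
    exponents = [0.0]
    for delta in deltas:
        exponents.append(exponents[-1] + (theta - delta))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# When ability equals difficulty, the log-odds of success are 0 logits.
p = rasch_probability(theta=0.5, delta=0.5)
print(p, math.log(p / (1 - p)))

# Hypothetical four-category item (scores 0-3) with illustrative step values.
print(pcm_probabilities(theta=0.0, deltas=[-1.0, 0.0, 1.5]))
```

With a single step, the partial credit model reduces to the dichotomous Rasch model, which is why right/wrong items and multi-category items can be calibrated together.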
The PCM is underpinned by three assumptions: first, the variable is
unidimensional; second, the variable is hierarchical or has direction; and third, the
items are independent of each other. The focused nature of the content in this
study suggested that the requirement of unidimensionality could be met. The
hierarchical coding scheme allowed the requirement of direction to be realized,
and finally, although some items, such as the set about the raffle scenario (Items 8,
9, and 10), had a common context, each sub-part was stand-alone and was not
dependent on a correct answer to a previous part. The PCM [9] was hence fitted to
the data using the Quest computer program [31]. This analysis allowed all the
items to be calibrated and placed on a single scale in one operation.

Component                       Question number in survey
Basic chance                    1, 3(a), 14
Basic graphs/tables             4(a), 4(b), 4(c), 4(d), 6(a), 6(b), 11*, 15(a), 15(b), 15(c), 15(d)
Variation in chance (VC)        2, 3(b), 3(c), 3(d), 3(e), 12(b), 12(c)**
Variation in data/graphs (VD)   4(e), 5, 6(c), 6(d), 6(e), 6(f), 7, 11*, 12(c)**, 13
Sampling (VS)                   8, 9, 10, 12(a), 15(e), 16
* This item was coded in two different ways.
** This item was included in both components.
Table 1. Survey questions contributing to five components by subject matter.

In this analysis, 13 items were marked right/wrong; 8 were given a three-point
hierarchical scale; 19, a four-point scale; 3, a five-point scale; and 1, a six-point
scale. The coding was developed by collapsing previously identified codes into
fewer categories, as described previously. This improved the initial fit of the data
to the model. A variable map of person ability and item difficulty for the 746
students for 44 items was produced (see figure 1 in the Results).
Table 1 summarizes the items and parts of items that measure aspects of
pattern (in basic chance, and graphs and tables) or variation (in chance, data/
graphs, and sampling), so that each response in the survey can be allocated to one
recognizable component contributing to the overall scale. Item 12(c) refers to the
word 'variation', and has been included within both components for chance and
data/graphs. Question 11 was analysed in two ways to tap into different sources of
chance and data measurement (finding an unusual feature and showing an
appreciation of variation). For ease of identification, labels VC, VD, and VS
have been used to identify items associated with chance variation, data variation,
and sampling variation in the Appendix and elsewhere as appropriate.
4. Results
The results will be discussed from two perspectives: a descriptive analysis of
selected items that elucidates aspects of understanding and the outcomes of the
Rasch analysis including levels of understandings arising from interpretation of the
variable map.
4.1. Descriptive analysis of selected items

In this section, selected items that illustrate student understanding of aspects of
variation will be described in more detail.
Question 2. Of particular interest in this study is the relationship of the
variation associated with chance outcomes from a random device and the expected
pattern of outcomes related to the theory behind the device. Hence for Item 2 there
is tension between the theoretical expected outcome of 10, 10, 10, 10, 10, 10 for
the six outcomes of 60 tosses of a die and the recognition that there is likely to be
variation in these outcomes. The highest-level responses reflect this tension in the
reasoning accompanying the suggested values.
A response was considered inappropriate and coded 0 (20.5% of responses) if
the numbers summed to more than 60 and/or included a single value greater than
21 (e.g. 31, 5, 10, 29, 10, 10: 'One comes up a lot of times, six comes up a few
times and the rest come up all over the place'); if the sum was less than 40; or if
there appeared to be a misinterpretation of the question (e.g. 1, 60, 0, 0, 0, 0: [no
reasoning]). Responses with a coding of 1 (30.2%) summed to 60 but provided
idiosyncratic reasoning for the variation displayed (e.g. 10, 20, 10, 10, 0, 10:
'Because you usually get lower numbers than higher numbers'). Coding 2 (31.8%)
was given to a variety of numerical responses reflecting strict probability or
unusual variation but with reasoning that reflected some understanding of the
context (e.g. 5, 15, 10, 15, 10, 5: 'Because when you throw a die, you get some
numbers a lot and never at other times'). At code 3 (8.4%), responses showed
variation that was either too narrow or too wide, but expressed appropriate
reasoning about variation in the context (e.g. 11, 9, 11, 9, 11, 9: 'Because they
are close to 10 which would be the average number rolled'). At coding 4 (9.2%), the
reasoning and the displayed variation were appropriate (e.g. 9, 11, 8, 12, 10, 10:
'They are all around the same amount').
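The numerical side of this coding can be checked mechanically. The sketch below applies only the numeric rules stated for code 0 and the simulated band of 1.2-4.7 reported in the Coding section; the written reasoning, which the coders judged by hand, is not modelled. The function name and labels are hypothetical, and the use of the sample (n-1) standard deviation is an assumption.

```python
import statistics

def screen_item2(counts, band=(1.2, 4.7)):
    """Describe the numeric features of six predicted die counts."""
    total = sum(counts)
    # Numeric grounds for code 0: sum above 60, a value above 21, or sum below 40.
    if total > 60 or max(counts) > 21 or total < 40:
        return "inappropriate (code 0 on numeric grounds)"
    sd = statistics.stdev(counts)  # assumption: n-1 denominator
    low, high = band
    if sd < low:
        return "variation too narrow"
    if sd > high:
        return "variation too wide"
    return "variation within band"

# Example responses quoted in the text for Item 2.
for response in ([31, 5, 10, 29, 10, 10],
                 [11, 9, 11, 9, 11, 9],
                 [9, 11, 8, 12, 10, 10],
                 [10, 20, 10, 10, 0, 10]):
    print(response, "->", screen_item2(response))
```

A response whose variation falls within the band still required appropriate reasoning to receive code 4, so a screen of this kind could at most support, not replace, the coding described above.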
Question 4(e). Question 4 asked students a series of simple graph reading
questions based on a stacked dot plot. Item 4(e) explored students' appreciation of
variation shown in the shape of the data presented. Two codes emerged from this
item, with answers focusing on the type of graph (e.g. 'Straight line with crosses on
it') or unclear descriptions or idiosyncratic comments (e.g. 'Different') being coded
0 (46.5%). Responses that either acknowledged variation (e.g. 'Up and down') or
gave a description based on similarities between the graph and real physical objects
(e.g. 'City') or geometric shapes (e.g. 'Triangle') were coded 1 (53.5%). Nearly half
of the students were unable to give a reasonable response to describe the shape of a
stacked dot plot. This may be due to the lack of familiarity with this type of graph,
or to the attempts by many students to describe the actual graph, and not its shape.
Question 5. The last of the spinner items for grades 7 and 9 asked students to
determine which of three stacked dot plots appeared to be genuine data and which
appeared to be fake. The issue of recognizing a strict pattern and a display with too
much variation was at the heart of the question. The coding categories reflected the
number of correct decisions, and whether the associated reasoning was appropriate.
Since responses required justification, those with no reasoning were coded 0
(14.2%). Other inappropriate responses were those with no correct decisions or
with illogical reasoning. Reasoning for the partially correct category, coding 1
(33.2%), typically included the following:
(a) Made up: 'Cause it is so in shape.'
(b) Real from experiment: 'Cause they're wonky.'
(c) Real from experiment: 'Cause they're wonky.'
Reasoning for the correct category, coding 2 (52.6%), was similar to the following:
(a) Made up: 'You would be very lucky to get an exact triangle.'
(b) Made up: 'The range is too big.'
(c) Real from the experiment: 'The range isn't too big, but it isn't an exact
triangle.'
Question 7. Item 7, given to students in grades 5, 7, and 9, had three parts. The
first two parts to the item asked students to interpret two stacked dot plots showing
how many years students from a fictitious class had lived in their town. The third
question asked students to evaluate which of the two stacked dot plots told the
story better. Although including identical data values in both plots, the scale of the
first plot included only occurring data points, whereas the second plot used all
possible data points in the range (see Appendix). All three parts of this item were
coded in four categories. In all cases a code of 0 was given for responses with no
discernible logical reasoning (e.g. 'The town must not be well known' or 'Neither
[graph is better] because neither of the graphs tell a story'). In coding 1, for parts
(a) and (b), responses included data reading comments (e.g. 'Column 3 has four
crosses'), sometimes combined with an inappropriate comment (e.g. '0 people
lived there'), and if a summary statement was made, it was combined with an
inappropriate statement (e.g. 'They have lived there a long time. They have 22
people in their family.'). At coding 2, responses included one summary statement
and one data reading statement, or a single summary statement with no other
comment (e.g. 'The years vary as to how long they have lived in the town.
Someone's lived there 37 years.'). At coding 3, responses provided two distinct
appropriate summary statements (e.g. 'Most people have lived in town 3 years.
Second most people have lived in town for 12 years.'). Table 2 shows the
percentages of students in the four coding categories for each part of Item 7. As
can be seen, the percentages of responses are relatively stable over and within the
three sub-items.

Code   Item 7(a) %   Item 7(b) %   Item 7(c) %
0      26.91         30.90         26.91
1      16.94         15.61         27.57
2      26.91         30.90         23.59
3      29.24         22.59         21.93
Table 2. Percentage in each coding category for Question 7.

Questions 8, 9, and 10. Items 8, 9, and 10, concerning students carrying out a
survey before selling raffle tickets to have a trip to Movieworld, were the longest
and most complex questions for most grades. Additional parts were included for
higher grades. The coding of categories for the various parts is given in table 3
with examples of student responses. The consistency of types of response across
questions is shown in table 3; however, some questions had higher difficulty levels
(see figure 1 [VS8-VS10]).
Question 15. Item 15(e), the last part, which followed on from four parts about
reading tables, asked about choosing students in a fair way for the closing parade at
a sports day. Categories were determined based on the quality of responses.
Responses that did not address the question of how to choose the students (e.g.
They play the girls games rst. Then play the boys games.) were classied as
inappropriate and were coded 0 (21%). Students who gave responses using either a
behavioural or personality characteristic for selection (e.g. Watch them march and
see whos the best), or responded with methods containing at least one descriptor
for selection (e.g. 2 girls and 2 boys) were given a code of 1 (32.6%). Students who
used representative methods to make a selection (e.g. One child from each sport,
but only 2 boys from their sports and 2 girls from their sports) were given a code
of 2 (6%). Students who used one random method of selection but not combined
with stratication were given a code of 3 (30.7%)(e.g. Pick out of a hat), and
students that combined both representativeness and random methods in the one
response or gave two distinct purely randomized methods were given a code of 4
Q8 How many/how
to survey Q9(a) Shannon Q9(b) Jake Q9(c) Adam Q9(d) Ra Q9(e) Claire
Code [all grades] [all grades] [all grades] [all grades] [grades 5-9] [grades 7-9]
3 Appropriate Representative and Random methods: Detecting bias and Detecting bias: Lack of range and/or Appropriate
statistical response random: 10 from Good, because its a small sample size: Bad, not enough variation: criticism:
each grade, 5 boys good random way to Bad, not enough dierent age groups Bad, they would Bad, some kids
and 5 girls picked at survey people and probably say the might go twice
random selectively picked same thing
Random only: Put
all 600 student
names in a hat and
draw out 65
2 Non-central Representative Adequate sample Detecting bias only: Large sample size: Adequate sample Adequate sample
ideas and student methods only: You size: Bad, its not broad Bad, too many size: size:
uncertainty would survey 60 Good, theres a lot enough people Good, you get a lot Good, you just have
children, 10 from of people Student uncertainty: Student uncertainty: of answers enough
each grade so you Not sure, because Not sure, because Student uncertainty: Student uncertainty:
could see an average not many dierent thats only one class Not sure, it depends Not sure, because
for each grade people would go but he surveyed the how many of his people who thought
there most people friends have it was a bad idea
dierent opinions wouldnt bother
1 Inappropriate Non-representative: Method too random: Creating bias: Fairness: Friendship: Free choice:
analysis 50 students that I Bad, he could pick Good, to give them Good, because it is Good, because they Good, it is their
meet the wrong people a hint to buy one fair are his friends own choice
Entire population:
You would survey
them all
Students understanding of statistical variation
0 Inappropriate Misinterpret Misinterpret Misinterpret Misinterpret Misinterpret Misinterpret

logic question: Choose question: question: question: question: question:
them all because the Bad, too many Good, so you could Bad, none might not Good, more money Good, rst in best
more rae tickets people play it buy any for them served
they sell the more
money they get
Table 3. Coding categories of response for Items 8 and 9(a) to 9(e).

9
Coding 0
  Sample (12(a)): The number of something (28.9%)
  Random (12(b)): Very quickly (40.3%)
  Variation (12(c)): Jayden varies from place to place (41.7%)
Coding 1
  Sample (12(a)): Try something. Getting a sample of water. (34.4%)
  Random (12(b)): Choosing something. Random breath test. (19.9%)
  Variation (12(c)): You get a choice. A car varies from sizes and colours.
    (25.6%)
Coding 2
  Sample (12(a)): You have a small piece of something. I had a sample to eat
    at the supermarket. (23.6%)
  Random (12(b)): It means in any order. The songs on the CD came out
    randomly. (36.5%)
  Variation (12(c)): How something changes. The weather. (25.1%)
Coding 3
  Sample (12(a)): Take a small amount of one thing away to test. Blood
    sample. (13.1%)
  Random (12(b)): Random means something that does not happen in a pattern.
    In a Tatts lotto draw . . . (3.3%)
  Variation (12(c)): Slight change or difference. People vary in size . . .
    (7.6%)
Table 4. Coding categories of response with examples for Item 12.
(9.7%) (e.g. 'Put all the boys' names in a hat and pull out two and do the same
for the girls'). Not many responses were given a code of 2, indicating a
preference for chance over stratification methods.
Question 12. Item 12 asked for definitions of the three terms 'sample',
'random', and 'variation'. In each case three coding categories reflected
increased structure and understanding in the response. Coding 0 responses were
inappropriate or tautological. Coding 1 reflected single ideas or examples of
the term. Coding 2 reflected straightforward but clear explanations of the term,
whereas coding 3 responses were considered to integrate ideas with examples.
Examples of responses for each coding category of each of the terms are given in
table 4, with the percentage in each coding category given in parentheses.
4.2. Rasch analysis and levels of understanding

The variable map produced by the Quest program [31] is shown in figure 1. It
shows the relationship between the person parameter (ability) on the left-hand
side of the figure and the item parameter (difficulty) on the right-hand side,
on a logit scale. In figure 1, a stem followed by the step level (e.g. Q1.1)
describes the item. The step level corresponds to the threshold at which the
higher response score becomes more likely. Thus Q1.1 corresponds to the point
where a score of 1, 2, 3, or 4 is more likely than a score of 0, and Q1.3
represents the point on the scale where a score of 3 or 4 becomes more likely
than a score of 0, 1, or 2 [30]. The fit of the data to the model using the
infit mean square [32] is shown in figure 2. Acceptable levels of fit are shown
by the dotted lines at 0.7 and 1.3. All items on the scale fell within the
acceptable range, which confirms the unidimensionality of the underlying
construct, statistical variation. The reliability of the estimate was high at
0.90 using the item separation index of Wright and Masters [32].
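The threshold reading and the infit statistic described in this paragraph can be illustrated with a minimal Python sketch. It assumes the standard Rasch partial credit model and treats a threshold as the ability at which a score of k or more has probability 0.5, following the description in the text. The function names and the step values in `deltas` are hypothetical, chosen only for illustration; they are not estimates from this study and do not reproduce the Quest program.

```python
import math

def pcm_probabilities(theta, deltas):
    """Category probabilities [P(X=0), ..., P(X=m)] for one item under the
    partial credit model. theta is ability and deltas are step parameters,
    both in logits (hypothetical values)."""
    # Exponent for score k is the sum of (theta - delta_j) over the first k steps.
    exponents = [0.0]
    for delta in deltas:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def threshold(deltas, k, lo=-10.0, hi=10.0, tol=1e-9):
    """Ability at which P(X >= k) = 0.5, located by bisection."""
    def p_at_least(theta):
        return sum(pcm_probabilities(theta, deltas)[k:])
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_at_least(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def infit_mean_square(scores, thetas, deltas):
    """Information-weighted fit for one item: the sum of squared residuals
    divided by the sum of model variances across persons."""
    squared_residuals = 0.0
    variances = 0.0
    for x, theta in zip(scores, thetas):
        probs = pcm_probabilities(theta, deltas)
        expected = sum(k * p for k, p in enumerate(probs))
        variances += sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals += (x - expected) ** 2
    return squared_residuals / variances

deltas = [-1.0, 0.0, 1.0]  # hypothetical steps for an item scored 0 to 3
for k in range(1, len(deltas) + 1):
    print(f"threshold for a score of {k} or more: {threshold(deltas, k):.2f} logits")
```

Values of the infit mean square near 1 indicate responses about as variable as the model expects, which is why the acceptable band in figure 2 is centred on 1.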
Since the analysis suggested a unidimensional variable, it is of interest to
observe how the five components of the questionnaire as devised by the authors
are distributed along the variable. The items associated with the first two
components in table 1, related to basic chance, and to graph and table reading,
are each printed
-------------------------------------------------------------------------------------------------
4.0 |
|
| VS9b.3
|
| VC12b.3 LEVEL 4 Critical
| Q11a.2 aspects of variation
3.0 | VD6e.2 Employing complex
| VD6f.5 justification or critical
| reasoning
| VS15e.4
| VD6d.3 VD7b.3
| Q4d VD13.3
2.0 | VD7a.3
|
| VC3bA.3 VS9a.3 VS9d.3 VS9e.3 VD11b.3 VD12c.3
| VC2.4 VS8.3 VS12a.3 VD11b.2 VS16a
X | Q1.4 VC3bB.3
X | VS9e.2 VS16b.2 VS10.3
XX | VC2.3 VC3eB.2 VD7c.3 LEVEL 3 Applications of
1.0 XX | VD6f.4 VS9c.3 Q14.3 VD4e variation
XXXXX | VC3cA.3 VC3cB.3 VS8.2 Q4c VS16b.1 Consolidating and using ideas in
XXXXXX | VD6f.3 VS9a.2 VS12a.2 VD13.2 VS10.2 context, inconsistent in picking
XXXXXXXXXXX | VD6d.2 VS9f.2 Q4b VD5.2 VC12b.2 VD12c.2 most salient features
XXXXXXXXXXXXX | VD11b.1
XXXXXXXXXXXXXXXX |VC2.2 VC3eA.2 VS9b.2 VS9c.2 VS15e.2 VS15e.3 VS9d.2 VD7b.2 VD7c.2 Q11a.1 VD13.1
.0 XXXXXXXXXXXXXXX | VC3bA.2 VC3cA.2 VS9a.1 VD7a.2 VD12c.1
XXXXXXXXXXXXXXX | Q1.3 VS9e.1 VC12b.1 LEVEL 2 Partial recognition of
XXXXXXXXXXXXXXXXXXX | VC3dA VS9b.1 VS9d.1 VD7b.1 variation

XXXXXXXXXXXXXXXXX | Q1.2 VC3cB.2 VS9c.1 Putting ideas in context, tendency to
XXXXXXXXXXXXXXX | VD6d.1 VS9f.1 VS12a.1 Q14.2 VD7a.1 VD7c.1 focus on single aspects and neglect
XXXXXXXX | Q3a.2 VC3cA.1 VC3eA.1 Q14.1 others
-1.0 XXXXXXXXXX | VC2.1 VD6f.2 VS8.1 Q15c.2 VS15e.1
XXXXXX | VC3eB.1
XXXXXXXX | VC3cB.1 VC3dB Q15d Q4a VS10.1
XXXX | VC3bB.2 Q15c.1
XX | VD5.1 LEVEL 1 Prerequisites
XX | Q6b Q15a for variation
-2.0 XX | VD6f.1 Q15b Working out the
X | Q1.1 environment, table/simple
X | VDQ6c VD6e .1 graph reading, intuitive
X | Q3a.1 VC3bA.1 reasoning for chance
XX |
X |
|
-3.0 | VC3bB.1
|
X | Q6a
|
|
| CODE:
-4.0 | VC = Variation in chance
|
| VD = Variation in data and graphs
|
| VS = Variation in sampling
|
-5.0 X |
-------------------------------------------------------------------------------------------------
Figure 1. Variable map for underlying construct.
with their original item numbers (e.g. Q1) in figure 1, whereas the remaining
three components have been relabelled as VC, VD, and VS, for Variation in
Chance, Variation in Data/Graphs, and Variation in Sampling, to aid
identification of the components. The distribution of each component along the
variable was satisfactory, showing that the scale can indicate students'
achievement on specific sub-components as well as the underlying overall
understanding of variation in chance and data. Further, the item difficulty
distribution matched the distribution of student ability, indicating that the
scale allowed all students to demonstrate what they knew. Four levels of
increasing understanding were identified within the variable. The thresholds for
each level were determined by considering the increasing understanding of
variation displayed and the sophistication and structure of responses in the
categories of response. These levels are identified in figure 1.
At Level 1, Prerequisites for Variation, students are likely to use stories or
personal experience to justify responses. They recognise variation only in the
simple context of the travel graph not looking the same every day (Item 6(c)) or in
-------------------------------------------------------------------------------------------------------------
Item Fit
(N = 746 L = 48 Probability Level= .50)
-------------------------------------------------------------------------------------------------------------
INFIT
MNSQ .63 .67 .71 .77 .83 .91 1.00 1.10 1.20 1.30 1.40 1.50 1.60
----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-
Q1 . | *
VC2 . | * .
Q3a . * | .
VC3bA .* | .
VC3bB . | * .
VC3cA . * | .
VC3cB . | * .
VC3dA . * | .
VC3dB . | * .
VC3eA . | * .
VC3eB . | * .
Q6a . * | .
Q6b . * | .
VD6c . * | .
VD6d . | * .
VD6e . * | .
VD6f . | * .
VS8 . * .
VS9a . * | .
VS9b . * | .
VS9c . * | .
VS9f . * .
VS12a . * | .
Q14 . | * .
Q15a . * | .
Q15b . * | .
Q15c . | * .
Q15d . *| .
VS15e . | * .
VS9d . * | .
VD7a . * .
VD7b . * .
VD7c . | * .
Q4a . | * .
Q4b . | * .
Q4c . * | .
Q4d . * | .
VD4e . |* .
VD5 . * | .
VS9e . * | .
VD11a . * | .
VD11b . * | .
VC12b . * | .
VC/VD12c . * | .
VD13 . * | .
VS16a . * | .
VS16b . * | .
VS10 . | * .
====================================================================================================================================
Figure 2. Fit map for variation items.
describing a surprising outcome from the spinner (Item 3(d)). Interpretations of
graphs and tables are limited to basic reading skills. Responses are provided to
chance questions but these are numerically inappropriate.
At Level 2, Partial Recognition of Variation, students are likely to use
unquantified chance statements to describe outcomes, except in the case of a
50-50 chance. This may involve comments such as 'anything can happen'. Although
recognizing the need for outcomes to sum to 60 in Item 2, students' reasons do
not reflect understanding of chance or variation. Patterns appear to override
variation in the interpretation of graphs. In interpreting stacked dot plots,
students are more likely to make flawed interpretations or report simple data
values. The terminology of interest (sample, random, variation) is likely to be
familiar but students have difficulty in expressing the concepts in words.
At Level 3, Applications of Variation, except for improvement in detailed
graph reading, the focus is on improved outcomes related to variation and
sampling. In many cases students focus on some appropriate aspects of the
concepts while ignoring or being misled by others. Students, for example, are
likely to be able to find the mean of a data set but not appreciate the
importance of the variation in the values averaged, to provide partial summaries
of stacked dot plots or bar graphs while missing overall trends, to suggest
outcomes for 60 die throws that show variation but not of the appropriate
degree, to suggest non-central aspects in criticizing sampling procedures, and
to focus on either representative or chance methods of sample selection but not
the two simultaneously. Definitions of the three basic terms are more structured
and/or combine more aspects, but do not achieve a high level of sophistication.
Students are likely to be successful on some questions requiring critical
analysis (e.g. choosing the appropriate stacked dot plot in Item 7), indicating
some transition to Level 4 thinking. This is not consistent, however, across
contexts. Students are also likely to complete successfully items based on the
school curriculum in chance and graph work.
Level 4, Critical Aspects of Variation, is where consolidation of concepts
occurs. In terms of variation students are likely to summarize graphical
information in statistically appropriate ways and acknowledge varying values in
a data set in calculating the mean. Appropriate variation is demonstrated in
suggesting outcomes for 60 die tosses and justifications for spinner results.
For sampling items students are likely to find the critical aspects of bias such
as non-representativeness as well as make appropriate suggestions on their own.
For the terms 'sample', 'variation', and 'random', students are likely to
display sophisticated understandings, often without relying on examples. They
include aspects of uncertainty to explain variation, are aware of the
possibility of errors in graphs, and provide explanations with several
integrated components. For curriculum-based probability and graph reading tasks,
students are likely to be successful.
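What 'appropriate variation' in 60 die tosses might look like can be explored by simulation. The Python sketch below is an illustration only and is not the coding rule used in this study: it compares three hypothetical answer tables with simulated tosses of a fair die, using the chi-square statistic as an assumed measure of how far a table departs from 10 per face.

```python
import random

EXPECTED = 10.0  # 60 tosses spread evenly over 6 faces

def chi_square(counts):
    """Departure of a table of six counts from 10 per face."""
    return sum((c - EXPECTED) ** 2 / EXPECTED for c in counts)

def simulate_counts(rng, n_tosses=60):
    """Tally one simulated run of fair-die tosses."""
    counts = [0] * 6
    for _ in range(n_tosses):
        counts[rng.randrange(6)] += 1
    return counts

def simulated_statistics(n_sims=5000, seed=1):
    """Chi-square statistics for many simulated runs of 60 tosses."""
    rng = random.Random(seed)
    return [chi_square(simulate_counts(rng)) for _ in range(n_sims)]

def share_at_most(stat, sims):
    """Proportion of simulated runs whose statistic is at or below stat."""
    return sum(s <= stat for s in sims) / len(sims)

sims = simulated_statistics()

# Hypothetical answer tables, each summing to 60.
proposals = {
    "no variation": [10, 10, 10, 10, 10, 10],
    "moderate variation": [8, 12, 9, 11, 13, 7],
    "extreme variation": [2, 25, 3, 20, 5, 5],
}
for label, counts in proposals.items():
    stat = chi_square(counts)
    print(f"{label}: chi-square = {stat:.1f}, "
          f"share of simulated runs at or below = {share_at_most(stat, sims):.3f}")
```

A share close to 0 marks a table that is more even than chance usually produces, and a share close to 1 marks a table that is more uneven; tables between these extremes resemble what a fair die typically gives.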
As noted, the decisions for the threshold scores, which determined the levels
defined in figure 1, were based on overall judgements about the content,
sophistication, and structure of the responses as shown in the coding
categories. There were, however, some items whose codings showed similar
difficulty levels despite a descriptive difference observed when coding
categories were assigned. Given the continuous nature of the process and the
number of codings and items, this is not surprising. The meaning associated with
the coding level was deemed sufficiently important in a descriptive sense to
warrant retaining, rather than collapsing, codings. Hence for some items, item
difficulty thresholds of codings were relatively close together, showing that
the increase in ability required to move from one threshold to the next was
relatively small, and within one of the levels defined in figure 1.
The fact that some items have adjacent pairs of coding categories within the
levels identified by the quantitative analysis emphasizes that this describes a
continuum of understanding. Students may demonstrate qualitatively higher
levels of understanding at lower levels of ability and this may depend on context.
For Item 3(e), Rasch analysis indicated some apparent anomalies in difficulty.
Obtaining a code of 1 (inappropriate variation) was more difficult for 10 spins
(3eA.1) than for 50 spins (3eB.1), whereas obtaining a code of 2 (appropriate
variation; 3eA.2 and 3eB.2) was considerably more difficult for 50 spins. This
can be attributed to the tighter requirements for a judgement of appropriate
variation for 50 spins and consequently looser requirements for a judgement of
inappropriate variation on 50 spins. Parts 3(b), 3(c), and 3(d) indicated
consistently that considering 10 spins was more difficult than, or had the same
difficulty level as, 50 spins. This suggests that having a smaller number of
spins, rather than making the question easier, in fact raised the difficulty
level. This has implications for teaching younger children.
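Why a judgement of appropriate variation is tighter for 50 spins than for 10 can be seen from the binomial model. The Python sketch below is a worked illustration under an assumption not stated in this section, namely that the spinner has two equal halves (probability 0.5 for the shaded half); the names `P_SHADED` and `prob_between` are hypothetical.

```python
import math

P_SHADED = 0.5  # assumption: half of the spinner is shaded

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_between(n, p, low, high):
    """P(low <= X <= high) for X ~ Binomial(n, p), both ends inclusive."""
    return sum(binomial_pmf(k, n, p) for k in range(low, high + 1))

# Counts that correspond to a shaded share between 40% and 60% of the spins.
ranges = {10: (4, 6), 50: (20, 30)}

for n, (low, high) in ranges.items():
    sd_count = math.sqrt(n * P_SHADED * (1 - P_SHADED))
    sd_share = sd_count / n  # standard deviation of the proportion shaded
    inside = prob_between(n, P_SHADED, low, high)
    print(f"{n} spins: SD of count = {sd_count:.2f}, "
          f"SD of share = {sd_share:.3f}, "
          f"P(share between 40% and 60%) = {inside:.3f}")
```

The standard deviation of the count grows with the number of spins, but the standard deviation of the share shrinks, so a result that is plausible as a proportion for 10 spins would be an unusually large departure for 50 spins.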
Table 5 shows how the coding categories for the questions were distributed
along the overall hierarchical levels identified for the five components of the
scale. There are a number of features that should be noted. First, any
particular coding category does not appear at the same level for each item. This
suggests that
Basic Chance Items
Levels 1 3(a) 14
4 4
3 3
2 2/3 2 1/2
1 1 1
0 0 0 0
Basic Tables/Graphs Items
Levels 4(a) 4(b) 4(c) 4(d) 6(a) 6(b) 11(a) 15(a) 15(b) 15(c) 15(d)
4 1 2
3 1 1 1
2
1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
Variation in Chance Items
Levels 2 3(bA) 3(bB) 3(cA) 3(cB) 3(dA) 3(dB) 3(eA) 3(eB) 12(b) 12(c)
4 4 3 3 3 3
3 2/3 3 3 2 2 2 2
2 1 2 1/2 2 1 1 1 1
1 1 1/2 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
Variation in Data/Graphs Items
Levels 4(e) 5 6(c) 6(d) 6(e) 6(f) 7(a) 7(b) 7(c) 11(b) 12(c) 13
4 3 2 5 3 3 2/3 3 3
3 1 2 2 3/4 2 2/3 1 2 1/2
2 1 2 1/2 1 1 1
1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0
Variation in Sampling Items
Levels 8 9(a) 9(b) 9(c) 9(d) 9(e) 9(f) 10 12(a) 15(e) 16(a) 16(b)
4 3 3 3 3 3 3 4 1
3 2 2 2 2/3 2 2 2 2/3 2 2/3 1/2
2 1 1 1 1 1 1 1 1 1
1 1
0 0 0 0 0 0 0 0 0 0 0 0 0
Table 5. Coding categories for five components of the Variation scale.
although the codings indicate a hierarchy of complexity, the contexts of
different questions may place greater demands on students, thus raising the
difficulty levels for some codings. Second, of the five components, Basic Tables
and Graphs had the lowest difficulty overall. The questions posed here may not
have allowed the most able students to demonstrate fully their understanding of
this area. Third, only one question in the Variation in Sampling component,
VS10, had a coding that appeared in Level 1. In this case the questions on
sampling did not allow less able students to display their understanding. The
coding categories in table 5 could provide a useful scoring rubric for teachers
with the addition of suitable descriptors.
5. Discussion
The discussion will consider four aspects of the current study: (i) the
objective to develop a scale to measure students' understanding of variation in
the context of the chance and data curriculum; (ii) responses to some items in
comparison with outcomes reported by other researchers; (iii) the place of the
scale in the larger contexts of statistical literacy and literacy; and (iv)
directions for future research.
5.1. Overall objectives

The objective to develop a scale to measure students' understanding of
variation within the chance and data curriculum has been met from three
perspectives: the coverage of components of the curriculum, the fit of items to
a unidimensional scale, and the qualitative description of levels of
appreciation of variation.
As summarized in table 1 and discussed descriptively in section 4, the items
selected covered basic aspects of chance and data as well as aspects of
variation. The coding generally reflected increased sophistication in responses.
The distribution of items in figure 1 from the Rasch analysis shows that the
five areas of the chance and data curriculum are represented across the ability
levels. Ideas of sampling do not occur often at Level 1, nor basic chance at
Level 4. This reflects a lack of opportunity provided by the items set to show
intuitive understanding of sampling or complex understanding of probability.
Otherwise the coverage of content across the ability range appears adequate.
In general terms, the performance of the items in the variable map compares
favourably to that of SOLO-scaled items on problem solving analysed by Wilson
[29] and to contextually-based numeracy items analysed by Callingham and
Griffin [33], both in terms of fit to the Rasch PCM shown in figure 2 and in
spread of difficulty. The position of a few of the items on the variable map in
figure 1 suggests that the kind of thinking conveyed by the coding may be more
indicative of a transition to a higher level grouping. Both wording and context
may make a difference to the difficulty level of a particular item [34, 35]; a
familiar context, for example, may make it easier for a student to reach a level
of critical thinking. In broad terms, however, the kind of thinking required
increases in complexity along the variable. The unidimensional model confirmed
by the Rasch analysis suggests that considering variation as the underlying
feature linking all aspects of the chance and data curriculum is useful from a
measurement point of view. As noted earlier, many statistics educators [14] have
suggested an emphasis on variation from a teaching perspective. The results of
this study appear to confirm their belief that variation is at the heart of
statistical investigation and can be measured within the context of chance and
data. Also, the development of the survey described in this study responds to
the call of Garfield and Chance [36] for more instruments to assess student
understanding in statistics.
The qualitative description of four levels of progressively sophisticated
understanding within the variable map for the underlying construct in figure 1
leads to optimism that the scale will be useful in tracking student improvement
over time and in relation to planned learning activities. The hierarchical
arrangement of the categorical codings, based on statistical criteria and the
SOLO model, reinforces the developmental aspect of understanding in this
potentially difficult and neglected area of the mathematics curriculum. The
descriptions of the levels may also assist curriculum planners in devising
activities to influence increased understanding, particularly with respect to
understanding variation in context (Level 3) and developing critical thinking
skills (Level 4).
5.2. Comparison with other research

Stacked dot plots. As noted earlier, Item 7 was developed following the
research of Konold and Higgins [22]. The plot in 7(a) (see Appendix) was based
on a student-generated plot using only actually occurring data for the number of
years that the families have lived in a town. Konold and Higgins [22] found that
students quite commonly made the assumption that it is only necessary to include
occurring data when plotting values; it was not until prompted that students in
fact realized it was better to include non-occurring data values as well, in
order to gain a better representation of the data. Within the current study this
is reflected in the number of students who, in response to Item 7(c), chose
Graph 1 (27.6%) as their preferred graph over the more statistically appropriate
Graph 2. Students who preferred Graph 1, and believed it to tell the story
better, often cited such reasons as it was easy to understand and read, and that
it was better as it only showed information relevant to the question. Students
who preferred Graph 2 (21.9%), on the other hand, cited such reasons as it was
more spaced out and thus easier to understand. Although there is no one correct
way of scaling, including non-occurring values helps to display information in
such a way as to raise interesting questions about the factors that might have
affected town growth [22], and this is a desirable outcome for educators.
Konold and Higgins [22] also recognized that relating data back to the real
situation is an important feature in statistical understanding. Quite often
students in this study talked about the numbers only, and neglected what these
numbers represent. In relation to Items 7(a) and 7(b), it was common to see a
response such as 'Four Xs on 3'. Such a response, although correct, is not
optimal since it does not reflect an understanding of what the data represent in
the context.
Another item, 4(e), asked students to describe the shape of the stacked dot plot
provided. This question was designed to tap into students' abilities to describe
informally a data set in terms of variation. Cobb [37], as a result of a teaching
experiment with grade 7 students, stated that talking informally about the shape
of a data set as 'hills' and 'clusters' is quite often good enough for the task
at hand, and can even give students some experience with working on ideas that
can help construct some meaningful interpretations of measures of spread. Within
the present study students who provided a reasonable description of the shape of
the data did so in three ways. Students described the stacked dot plot in terms
of geometric shapes (e.g. triangle, circular, two peaks), physical objects (e.g.
pyramid, Melbourne city, mountains, stairs, like on a stereo player), or by
acknowledging the existence of variation (e.g. up and down, uneven, lumpy,
spread out, not in a pattern). All three description types provide useful,
non-threatening ways of interacting with data and generating discussions about
variability and spread.
Bias in sampling. Items 8 to 10 were a set of sampling questions based largely
on the work by Jacobs [25]. Like Jacobs' study, the present findings show that
for Item 8 (asking students how many people they would survey and how they would
choose them) many struggled with the concept of sample as a representation of a
whole, and responded with such answers as 'ask everyone' or 'all 600'. Metz [38]
found a similar result, with 12 students out of 37 opposing sampling as a means
of inference, stating a similar claim.
Roughly 7% of students in the present study initially suggested appropriate
randomization and stratified randomization techniques in Item 8. Jacobs [25],
however, found that one-third of all fifth graders used such techniques.
Schwartz et al. [39] similarly found that 40% of sixth graders, although
sceptical of truly random samples, proposed methods that were stratified and
avoided obvious bias. In the present study, only 15% of grade 5 and 19% of grade
7 students provided random and/or stratified techniques. This may reflect a lack
of opportunity to learn about these concepts.
When asked to evaluate different sampling techniques only 22% of students
positively evaluated the preferred technique (Shannon's survey). One of the main
criticisms was that drawing 60 names out of a hat from a population of 600 could
create a biased result. Students in the Schwartz et al. [39] study focused on
similar issues with such reasons as 'She might pull out [out of a hat] all first
grade names' (p. 255). Other reasons for negative evaluations of Item 9(a) in
the present study were based on the ideas of inaccuracy (e.g. 'Because it is way
off', and 'Because it limits the range'). Other students were more explicit and,
like those in the Schwartz et al. [39] and Jacobs [25] studies, went as far as
to say the method was too random (e.g. 'Because it could be anyone', 'You don't
know whose name will come out . . .'). These findings also reflect results of
Metz [38], who stated one reason (among others) that students did not like to
generalize from a random sample was because of the inherent variability within
the population. Students, therefore, when able, prefer to purposely select
individuals to represent the characteristics in the population through
stratification [22, 26, 39].
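The concern that 60 names drawn from a hat might all come from one grade can be checked numerically. The Python sketch below assumes a hypothetical school of 600 students in six grades of 100 each; the grade structure and the function names are illustrative assumptions and are not taken from the survey. It contrasts a simple random sample of 60 with a stratified sample of 10 per grade.

```python
import math
import random
from collections import Counter

GRADES = 6          # assumption: six grades
PER_GRADE = 100     # assumption: 100 students in each grade
SAMPLE_SIZE = 60

# Each student is a (grade, index) pair, 600 students in total.
population = [(g, s) for g in range(1, GRADES + 1) for s in range(PER_GRADE)]

def simple_random_sample(rng):
    """Draw 60 names from the hat without replacement."""
    return rng.sample(population, SAMPLE_SIZE)

def stratified_sample(rng):
    """Draw the same number of names at random within each grade."""
    per_stratum = SAMPLE_SIZE // GRADES
    sample = []
    for grade in range(1, GRADES + 1):
        members = [student for student in population if student[0] == grade]
        sample.extend(rng.sample(members, per_stratum))
    return sample

def grade_counts(sample):
    """Number of sampled students in each grade."""
    return Counter(grade for grade, _ in sample)

# Exact chance that a simple random sample comes entirely from one grade.
p_single_grade = (GRADES * math.comb(PER_GRADE, SAMPLE_SIZE)
                  / math.comb(len(population), SAMPLE_SIZE))

rng = random.Random(2003)
print("simple random:", sorted(grade_counts(simple_random_sample(rng)).items()))
print("stratified:   ", sorted(grade_counts(stratified_sample(rng)).items()))
print(f"P(all 60 from one grade) = {p_single_grade:.2e}")
```

Under these assumptions the random draw gives grade counts that vary around 10 rather than matching it exactly, while a sample drawn wholly from one grade is so improbable as to be negligible.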
Quite often students cited non-statistical reasons in their appraisals of the
methods presented; one such reason was the 'fairness' rationale [25]. In
response to the randomized method in Item 9(a), some students evaluated the
method as unfair due to the fact that some children may have been selected who
did not want to participate, whereas others who were not selected probably did
want to participate. This rationale relies on emotive and personal beliefs of
what constitutes a fair survey. This reasoning also applied to the positive
evaluations of Claire's self-selected survey technique: 40% of reasons for
appraisal were freedom of choice, fairness, methodological implications (easy to
conduct), and inappropriate assumptions of range and natural variation.
Similarly, Jacobs [25] found that students in her study assumed that, although
self-selected, Claire's method would in fact produce a good mixture of
respondents because of the absence of other sample restrictions such as age and
sex. Although the appreciation of the need for range and variation was present
in both studies, often there was difficulty in balancing this with the idea that
self-selection methods are likely to produce biased outcomes in samples.
The remaining three survey methods (Jake, Adam, and Raffi) were all methods
that lacked representativeness. Schwartz et al. [39] found that some students,
in response to a similar question, considered surveying their friends and others
they thought likely to behave in the desired fashion to be acceptable in
generalizing about the wider population. In the present survey, students who
evaluated Jake, Adam, or Raffi positively often cited similar reasons (see table
3), suggesting a limited understanding of the notion of sample.
Item 10 presented grade 9 students with a list of conflicting results from the
different sampling techniques. In response, the majority of students chose a
survey method and results that were biased and unrepresentative of the
population, but nevertheless congruent with the method they thought was best
when asked in the previous item (Item 9(f)). There were a few students, however,
who changed their choice from Shannon's survey as being the best method, to
choosing a result from one of the other four biased surveys, perhaps because of
the tendency to favour methods with more decisive results (e.g. Claire) than
methods that yielded quite indecisive results (e.g. Shannon). These students
were apparently influenced by the additional information presented. Relatively
few students chose the correct survey method, and even fewer were influenced in
this direction from other methods. Many students (16%) refused to choose an
outcome based on the methods and results supplied and instead circled the
response 'Average them', despite having evaluated some of the surveys as bad,
and correctly evaluating Shannon's as good in the previous section. Jacobs [25]
suggests that although many students can successfully evaluate different survey
methods as either positive or negative, many are unable to confidently draw
conclusions from multiple surveys effectively, and try to aggregate the
information despite already identifying biases within the sampling techniques.
It is therefore important that teachers guide instruction so that students are
given opportunities to reason and draw conclusions from the outcomes of multiple
survey techniques.
5.3. Links to statistical literacy and literacy

The placement of the four levels defined from the item mapping in figure 1
within the larger milieu of statistical literacy and literacy generally is
important if the scale is to prove useful across the curriculum in schools. Two
related frameworks are useful here. Watson [40] developed a statistical literacy
hierarchy with three tiers of desirable achievement.
• Tier 1 constitutes the basic understandings of the language required to get
  started with chance and data. Basic definitions and procedures (e.g. graph
  and table reading) would be expected of students operating in Tier 1.
• Tier 2 relates to the appreciation of the basic definitions and understandings
  in context in order to make sense of information provided in social settings.
  Knowledge of the social context may also be necessary.
• Tier 3 requires the analytical and critical thinking skills that will enable
  questioning of claims that are made without proper statistical justification.
This hierarchy would appear to fit well with the levels suggested in figure 1,
providing that Levels 2 and 3 are encompassed in Tier 2 as part of the
application in context.
Further to this hierarchy, Luke and Freebody [41] offer a framework for
reading as a social practice that fits both with the levels in figure 1 and with
the statistical literacy hierarchy (e.g. as noted in [27]). Their ideas for
literacy are easily transferable to the context of this study. Four practices
are suggested as elements of reading for social practice: coding practices,
text-meaning practices, pragmatic practices, and critical practices.
• Coding practices relate to developing the learner's resources as a
  code-breaker: working out text, finding patterns, figuring out how text works.
• Text-meaning practices relate to developing resources as a text-participant:
  how texts string together, what cultural resources can be brought to the text
  and what meanings are possible.
• Pragmatic practices relate to developing resources as a text-user: how uses of
  the text shape meaning, what can be done with text and by whom and what
  are the alternatives for interpretation.
• Critical practices relate to developing resources as a text-analyst and critic:
  what interests lie behind the text, what action is intended to flow from the
  text, what is not said in the text.
All of the information provided in the questionnaire in this study is in some
form of text. Level 1 performance reflects the code-breaking aspects of getting
into the field of chance and data. Level 2 reflects the making of meaning in
contexts, which in this case are social as well as cultural; in this study the
participation was at a relatively straightforward level, often focusing on
single ideas. Level 3 performance is closer to becoming a user of the
information presented in appreciating what can be done with data, samples,
outcomes, and graphs. Level 4 reflects the critical aspects that lead to a
questioning attitude in relation to all claims emanating from chance and data.
Although Luke and Freebody [41] did not see their four elements as
hierarchical, they fit well within the statistical literacy hierarchy of Watson
[40], with text-meaning and pragmatic practices falling within Tier 2 of
applying understanding in context. Given the often-repeated perception of
mathematics as a school subject divorced from the real world, it is important to
emphasize within the chance and data component of the curriculum the necessity
to move beyond definition and process into a context where these can be applied
to make meaning and sometimes to question claims. Table 6 shows the relationship
of the levels defined from the variable map in this study with the models
suggested by Watson [40] and Luke and Freebody [41].
5.4. Future research

Several avenues for future research present themselves following the outcomes
of this study. One is the possibility of further refining the survey with new
items to fill in the gaps noted in the content with respect to the five
components of the chance and data curriculum and the elimination of those that
provide repetitive information. The usefulness of combining Rasch methods with
more traditional cognitive models has been demonstrated in the confirming of
increased structural sophistication and critical thinking associated with the
coding of items and their appearance in the variable map. It will be of interest
to follow up with further work to confirm the unidimensionality of the variable
that is the foundation for the understanding of chance and data based on an
appreciation of the part played by variation. Having measurement confirmation of
what many statistics educators believe may assist in getting the message of the
importance of variation across to the wider teaching community.
Trialling the questionnaire with other students, for example in other cultural
settings, and with older students, would add to the confidence in the usefulness
of the scale for educational measurement. It will also add to the increasing
store of assessment instruments available in the area of statistics education [36].
Level 4
  Variable mapping (figure 1): Critical aspects of variation: Employing complex
    justification or critical reasoning
  Luke and Freebody [41]: Text analyst; Critical practice
  Statistical literacy hierarchy [40]: Tier 3, Ability to question claims
Level 3
  Variable mapping (figure 1): Applications of variation: Consolidating and
    using ideas in context, inconsistent in picking salient features
  Luke and Freebody [41]: Text user; Pragmatic practice
  Statistical literacy hierarchy [40]: Tier 2, Application in context
Level 2
  Variable mapping (figure 1): Partial recognition of variation: Putting ideas
    in context, tendency to focus on single aspects and neglect others
  Luke and Freebody [41]: Text participant; Text-meaning practice
  Statistical literacy hierarchy [40]: Tier 2, Application in context (shared
    with Level 3)
Level 1
  Variable mapping (figure 1): Prerequisites for variation: Working out the
    environment, table/simple graph reading, intuitive reasoning for chance
  Luke and Freebody [41]: Code-breaker; Coding practice
  Statistical literacy hierarchy [40]: Tier 1, Language, definitions/processes
Table 6. Relationships of models for developing student understanding of
variation as a foundation for chance and data.
Developing teacher-friendly coding rubrics for the items and providing scores
indicative of levels of student thinking would allow the scale to be used by
teachers, without recourse to sophisticated statistical analysis. Examples of
descriptive rubrics are given in the first part of section 4 and in tables 3 and 4. They
currently exist for all codings in table 5 but would need to be prepared in booklets
for teachers to use.
The use of the scale in measuring change after instruction and learning
experiences aimed at increasing appreciation of variation is the next step in an
extended research programme of which this study is part. It is hoped that other
researchers may find the questionnaire useful in similar pre-post studies of
learning interventions in this area.
Acknowledgments
This project was funded by an Australian Research Council grant
(No. 00000716).
Appendix: Survey Items (item tags fit figure 1)
Q1. Consider rolling a normal six-sided die.

Which is easier to get?
(a) A one
(b) A six
(c) Both a one and a six are equally easy
Please explain your answer.
VC2. Imagine you threw the dice 60 times. Fill in the table below to show how
many times each number might come up.
Number on Dice    How many times it might come up
1
2
3
4
5
6
TOTAL             60
Why do you think these numbers are reasonable?
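As an illustration of the kind of variation item VC2 is probing, the following is a minimal simulation sketch and not part of the survey. It assumes a fair six-sided die; the function name `simulate_die_throws` and the seeds are hypothetical choices made only for this example. Each run of 60 throws gives counts that sum to 60 but rarely equal exactly 10 for every face.

```python
import random
from collections import Counter


def simulate_die_throws(n_throws=60, seed=None):
    """Return how often each face (1 to 6) appears in n_throws of a fair die."""
    rng = random.Random(seed)  # seeded generator so a run can be repeated
    counts = Counter(rng.randint(1, 6) for _ in range(n_throws))
    # Report faces in order 1..6, using 0 for any face that never appeared.
    return [counts.get(face, 0) for face in range(1, 7)]


# Three illustrative runs of 60 throws; the seeds are arbitrary.
for trial_seed in range(3):
    counts = simulate_die_throws(60, seed=trial_seed)
    print(counts, "total =", sum(counts))
```

Comparing several runs side by side shows frequencies scattered around the expected value of 10, which is the contrast between a strictly uniform table and a plausibly varied one.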

3. A class used this spinner.
Q3(a) If you were to spin it once, what is the chance that it will land on the
shaded part?
VC3(b) Out of 50 [10] spins, how many times do you think the spinner will
land on the shaded part? Why do you think this?
VC3(c) If you were to spin it 50 [10] times again, would you expect to get
the same number out of 50 [10] to land on the shaded part next
time? Why do you think this?
VC3(d) How many times out of 50 [10] spins, landing on the shaded part
would surprise you?
VC3(e) Suppose that you were to do 6 sets of 50 [10] spins. Write a list that
would describe what might happen for the number of times the
spinner would land on the shaded part?
______, ______, ______, ______, ______, ______
4. A class did 50 spins of the above spinner many times and the results for the
number of times it landed on the shaded part are recorded below.
[Graph not reproduced; axis label: Number of times on the shaded part]

Q4(a) What is the lowest value?

Q4(b) What is the highest value?
Q4(c) What is the range?
Q4(d) What is the mode?
VD4(e) How would you describe the shape of the graph?
VD5. Imagine that three other classes produced graphs for the spinner. In some
cases, the results were just made up without actually doing the experiment.
(a) Do you think class A's results are made up or really from the experiment?
[ ] Made up
[ ] Real from experiment
Explain why you think this.
(b) Do you think class B's results are made up or really from the experiment?
[ ] Made up
[ ] Real from experiment
(c) Do you think class C's results are made up or really from the experiment?
[ ] Made up
[ ] Real from experiment
6. How children get to school one day
Q6(a) How many children walk to school?

Q6(b) How many more children come by bus than by car?
VD6(c) Would the graph look the same every day? Why or why not?
VD6(d) A new student came to school by car. Is the new student a boy or a
girl? How do you know?
VD6(e) What does the row with the Train tell about how the children get to
school?
VD6(f) Tom is not at school today. How do you think he will get to school
tomorrow? Why?
7. A class of students recorded the number of years their families had lived in their
town. Here are two graphs that students drew to tell the story.
Graph 1
X
X X
X X X X
X X X X X X X X X X X X X X X
| | | | | | | | | | | | | | | |
0 1 2 3 4 5 6 10 11 12 13 14 17 25 37
YEARS IN TOWN
VD7(a) What can you tell by looking at Graph 1? [2 spaces provided]

Graph 2
X
X X
X X X X
X X X X X X X X X X X X X X X
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
0 5 10 15 20 25 30 35
YEARS IN TOWN
VD7(b) What can you tell by looking at Graph 2? [2 spaces provided]

VD7(c) Which of these graphs tells the story better? Why?
VS8. MOVIEWORLD
A class wanted to raise money for their school trip to Movieworld on the Gold
Coast. They could raise money by selling raffle tickets for a Nintendo Game
system. But before they decided to have a raffle they wanted to estimate how
many students in their whole school would buy a ticket. So they decided to
do a survey to find out first. The school has 600 students in grades 1-6 with 100
students in each grade. How many students would you survey and how
would you choose them? Why?
VS9. Five students in the school conducted surveys.

(a) Shannon got the names of all 600 children in the school and put them in a
hat, and then pulled out 60 of them. What do you think of Shannon's
survey?
[ ] GOOD [ ] BAD [ ] NOT SURE
Why ____________________________________________________________
(b) Jake asked 10 children at an after-school meeting of the computer games
club. What do you think of Jake's survey?
Why ____________________________________________________________
(c) Adam asked all of the 100 children in Grade 1. What do you think of
Adam's survey?
Why ____________________________________________________________
(d) Raffi surveyed 60 of his friends. What do you think of Raffi's survey?
Why ____________________________________________________________
(e) Claire set up a booth outside of the tuck shop. Anyone who wanted to
stop and fill out a survey could. She stopped collecting surveys when she
got 60 kids to complete them. What do you think of Claire's survey?
Why ____________________________________________________________
(f) Who do you think has the best survey method? Why?
VS10. Here are the results of the surveys.
SURVEY METHOD                                            THE PERCENTAGE OF STUDENTS WHO SAY THEY WILL BUY A TICKET
Shannon put all the names in a hat and pulled out 60.    35% said they would buy tickets.
Jake asked 10 kids at the computer games club.           90% said they would buy tickets.
Adam asked all the children in Grade 1.                  50% said they would buy tickets.
Raffi surveyed 60 of his friends.                        75% said they would buy tickets.
Claire set up a booth outside the tuckshop.              95% said they would buy tickets.
What percentage of students in the whole school will buy a raffle ticket?
(Circle one)
(a) 35% (Shannon's result) because _____________________________________
(b) 90% (Jake's results) because ________________________________________
(c) 50% (Adam's results) because _______________________________________
(d) 75% (Raffi's result) because _________________________________________
(e) 95% (Claire's result) because ________________________________________
(f) I think it is best to average the 5 surveys. The average of the kids that said
they would buy a raffle ticket is 69%.
(g) I don't know because Raffi, Shannon, Claire, Jake and Adam all got
different results.
(h) I think that ______ per cent of the kids in the whole school are willing to buy
a raffle ticket because ______________________________________________
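The 69% quoted in option (f) can be checked with a short calculation. The sketch below is an editorial illustration, not part of the survey; the dictionary name `survey_results` is hypothetical, and the percentages are those listed in the results table for item VS10.

```python
from statistics import mean

# Percentages reported by the five student surveys in item VS10.
survey_results = {
    "Shannon": 35,
    "Jake": 90,
    "Adam": 50,
    "Raffi": 75,
    "Claire": 95,
}

# Unweighted average, as in option (f): every survey counts equally,
# whatever its sample size or sampling method.
average = mean(survey_results.values())
print(average)
```

The calculation treats Jake's 10 respondents and Adam's 100 as equally informative, which is one reason option (f) is a distractor rather than a sound estimate.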
VD11.
BOATIES SAFETY FAILURE
These graphs were part of a newspaper story reporting on boating deaths in

Tasmania.
Comment on any unusual features of the graphs. [2 spaces provided]
VS12(a) What does sample mean? Give an example of a sample.

VS12(b) What does random mean? Give an example of something that
happens in a random way.
VC/VD 12(c) What does variation mean? Use the word variation in a sentence.
Give an example of something that varies.
VD13. A small object was weighed on the same scales separately by nine students
in a science class. The weights (in grams) recorded by each student are
shown below.
6.3 6.0 6.0 15.3 6.1 6.3 6.2 6.15 6.3
The average value could be calculated in several ways.
(a) How would you find the average? The average weight is ______ grams.
[Show your working in the box provided]
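To make concrete why the average in item VD13 "could be calculated in several ways", the following sketch compares three candidate summaries of the nine recorded weights. It is an editorial illustration only. Treating the largest reading as a recording error to be set aside is an assumption made here for the example, not a rule stated in the survey.

```python
from statistics import mean, median

# Weights in grams as recorded by the nine students (item VD13).
weights = [6.3, 6.0, 6.0, 15.3, 6.1, 6.3, 6.2, 6.15, 6.3]

# Summary 1: arithmetic mean of all nine readings.
mean_all = mean(weights)

# Summary 2: median, which is little affected by one extreme reading.
median_all = median(weights)

# Summary 3 (assumption): set aside the largest reading as a likely
# recording error, then take the mean of the remaining eight.
outlier = max(weights)
remaining = [w for w in weights if w != outlier]
mean_without_outlier = mean(remaining)

print(round(mean_all, 2), median_all, round(mean_without_outlier, 2))
```

The mean of all nine values is pulled well above every reading except 15.3, whereas the median and the mean of the remaining eight both sit close to 6.2 grams.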
Q14. Box A and Box B are filled with red and blue marbles as follows:
BOX A BOX B
6 RED 60 RED
4 BLUE 40 BLUE
Each box is shaken. You want to get a blue marble, but you are only allowed to pick
out one marble without looking.
Which box should you choose?
(a) Box A (with 6 red and 4 blue)
(b) Box B (with 60 red and 40 blue)
(c) It doesn't matter
Please explain your answer.
15. A primary school had a sports day where every child could choose a sport to
play. Here is what they chose.
Netball Soccer Tennis Swimming Total
BOYS 0 20 20 10 50
GIRLS 40 10 15 10 75
Q15(a) How many girls chose Tennis?

Q15(b) What was the most popular sport for girls?
Q15(c) What was the most popular sport for boys?
Q15(d) How many children were at the sports day?
VS15(e) The teacher wanted to choose four children to lead the closing
parade. Suggest two fair ways she could have chosen them.
VS16. The following article appeared in the Hobart Mercury.
Decriminalise drug use: poll
Some 96 percent of callers to youth radio station Triple J have said marijuana use
should be decriminalised in Australia. The phone-in listener poll, which closed
yesterday, showed 9924 out of the 10,000-plus callers favoured decriminalisation,
the station said.
Only 389 believed possession of the drug should remain a criminal offence. Many
callers stressed they did not smoke marijuana but still believed in
decriminalising its use, a Triple J statement said.
(a) What was the sample size in this article?

(b) Is the sample reported here a reliable way of finding out public support for
the decriminalisation of marijuana? Why or why not?
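The figures quoted in the article for item VS16 can be cross-checked with a short calculation. This is an editorial sketch only; it assumes that the two reported groups (9924 in favour, 389 against) together account for all callers, which the article implies with "10,000-plus" but does not state outright.

```python
# Figures quoted in the newspaper article (item VS16).
in_favour = 9924
against = 389

# Assumption: every caller falls into one of the two reported groups.
total_callers = in_favour + against
percent_in_favour = 100 * in_favour / total_callers

print(total_callers, round(percent_in_favour, 1))
```

Under that assumption the total is consistent with "10,000-plus callers" and the share in favour with "some 96 percent". The arithmetic says nothing about whether a self-selected phone-in audience represents the public, which is the point of part (b).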
References
[1] Shaughnessy, J. M., 1997, Missed opportunities in research on the teaching and
learning of data and chance. In Proceedings of the 20th annual conference of the
Mathematics Education Research Group of Australasia: People in mathematics
education, edited by F. Biddulph and K. Carr (Waikato, NZ: MERGA), pp. 6-12.
[2] Moore, D. S., 1990, Uncertainty. In On the Shoulders of Giants: New Approaches to
Numeracy, edited by L. A. Steen (Washington, DC: National Academy Press),
pp. 95-137.
[3] Cobb, G. W., and Moore, D. S., 1997, Am. Math. Monthly, 104, 801-823.
[4] Wild, C. J., and Pfannkuch, M., 1999, Int. Statist. Rev., 67, 223-265.
[5] Shaughnessy, J. M., Watson, J., Moritz, J., and Reading, C., 1999, School
mathematics students' acknowledgment of statistical variation. Paper presented at
the Presession Research Symposium: There's more to life than centers, 77th
Annual National Council of Teachers of Mathematics Conference, San Francisco,
CA.
[6] Zawojewski, J. S., and Shaughnessy, J. M., 2000, Data and chance. In Results from the
Seventh Mathematics Assessment of the National Assessment of Educational Progress,
edited by E. A. Silver and P. A. Kenney (Reston, VA: NCTM), pp. 235-268.
[7] Torok, R., and Watson, J., 2000, Math. Educ. Res. J., 12, 147-169.
[8] Reading, C., and Shaughnessy, M., 2000, Student perceptions of variation in a
sampling situation. In Proceedings of the 24th Conference of the International Group
for the Psychology of Mathematics Education, edited by T. Nakahara and M. Koyama
(Hiroshima: Hiroshima University), pp. 89-96.
[9] Masters, G. N., 1982, Psychometrika, 47, 149-174.
[10] Watson, J. M., 1994, Instruments to assess statistical concepts in the school
curriculum. In Proceedings of the Fourth International Conference on Teaching
Statistics: Volume 1, edited by National Organizing Committee (Rabat, Morocco:
National Institute of Statistics and Applied Economics), pp. 73-80.
[11] Torok, R., 2000, Australian Math. Teacher, 56(2), 25-31.
[12] Holmes, P., 1980, Teaching Statistics 11-16. (Berkshire: Schools Council and
Foulsham Education).
[13] Rasch, G., 1960, Probabilistic Models for Some Intelligence and Attainment Tests.
(Copenhagen: Denmarks Paedagogiske Institute; Reprinted by University of
Chicago Press, 1980).
[14] Biggs, J. B., and Collis, K. F., 1982, Evaluating the Quality of Learning: The SOLO
Taxonomy (New York: Academic Press).
[15] Watson, J. M., Collis, K. F., and Moritz, J. B., 1997, Math. Educ. Res. J., 9(1),
60-82.
[16] Watson, J. M., 1998, Numeracy benchmarks for years 3 and 5: What about chance and
data? In Teaching mathematics in new times: Volume 2, edited by C. Kanes, M. Goos,
and E. Warren (Brisbane: Mathematics Education Research Group of Australasia),
pp. 669-676.
[17] Watson, J. M., and Moritz, J. B., 1999, Australian J. Early Childhood, 24(2), 22-27.
[18] Watson, J., and Pereira-Mendoza, L., 1996, Australian J. Lang. Literacy, 19,
244-258.
[19] Haley, M., 2000, Boaties safety failure. The Mercury, 30 March, p.7.
[20] Watson, J. M., Collis, K. F., and Moritz, J. B., 1993, Assessment of statistical
understanding in Australian schools. Paper presented at the Statistics 93 conference,
Wollongong, NSW.
[21] Moritz, J. B., Watson, J. M., and Pereira-Mendoza, L., 1996, The language of
statistical understanding: an investigation in two countries. Paper presented at the
Joint ERA/AARE Conference (Singapore). Available [on-line]:
http://www.swin.edu.au/aare/96pap/morij96.280
[22] Konold, C., and Higgins, T. L., in press, Working with data: Highlights related to
research. In Developing Mathematical Ideas: Collecting, Representing, and Analyzing
Data, edited by D. Schifter, V. Bastable and S. J. Russell (Parsippany, NJ: Dale
Seymour Publications).
[23] Garfield, J. B., and Gal, I., 1999, Teaching and assessing statistical reasoning. In
Developing Mathematical Reasoning in Grades K-12: 1999 Yearbook, edited by L. V.
Stiff and F. R. Curcio (Reston, VA: National Council of Teachers of Mathematics),
pp. 207-220.
[24] Watson, J. M., and Moritz, J. B., 1999, Focus Learning Prob. Math., 21(4), 15-39.
[25] Jacobs, V. R., 1999, Math. Middle School, 5, 240-263.
[26] Watson, J. M., and Moritz, J. B., 2000, J. Res. Math. Educ., 31, 44-70.
[27] Watson, J. M., and Moritz, J. B., 2000, Math. Behav., 19, 109-136.
[28] Watson, J. M., and Moritz, J. B., 1998, Math. Educ. Res. J., 10, 103-127.
[29] Wilson, M., 1990, Investigation of structured problem-solving items. In Assessing
Higher Order Thinking in Mathematics, edited by G. Kulm (Washington, DC:
American Association for the Advancement of Science), pp. 187-203.
[30] Wilson, M., 1992, Measuring levels of mathematical understanding. In Mathematics
Assessment and Evaluation: Imperatives for Mathematics Educators, edited by
T. A. Romberg (Albany: State University of NY Press), pp. 213-241.
[31] Adams, R. J., and Khoo, S. T., 1996, Quest: The Interactive test analysis system Version
2.1 [Computer software]. (Melbourne, VIC: Australian Council for Educational
Research.)
[32] Wright, B. D., and Masters, G. N., 1982, Rating Scale Analysis: Rasch Measurement.
(Chicago: MESA Press).
[33] Callingham, R., and Griffin, P., 2000, Towards a framework for numeracy
assessment. In Proceedings of the 23rd annual conference of the Mathematics Education
Research Group of Australasia: Mathematics education beyond 2000, edited by J.
Bana and A. Chapman (Perth, WA: MERGA), pp. 134-141.
[34] Ellerton, N. F., and Clements, M. A., 1991, Mathematics in Language: A Review of
Language Factors in Mathematics Learning. (Geelong, VIC: Deakin University).
[35] Menon, R., 1995, Focus Learning Prob. Math., 17(1), 25-33.
[36] Garfield, J., and Chance, B., 2000, Math. Thinking Learning, 2, 99-125.
[37] Cobb, P., 1999, Math. Thinking Learning, 1(1), 5-43.
[38] Metz, K. E., 1999, Why sampling works or why it can't: Ideas of young children
engaged in research of their own design. In Proceedings of the 21st Annual Meeting:
North American Chapter of the International Group for the Psychology of Mathematics
Education: Vol. 2, edited by F. Hitt and M. Santos (Columbus, OH: Eric),
pp. 492-499.
[39] Schwartz, D. L., Goldman, S. R., Vye, N. J., and Barron, B. J., 1998, Aligning
everyday and mathematical reasoning: The case of sampling assumptions. In
Reflections on Statistics: Learning, Teaching, and Assessment in Grades K-12, edited
by S. P. Lajoie (Mahwah, NJ: Lawrence Erlbaum), pp. 233-273.
[40] Watson, J. M., 1997, Assessing statistical literacy using the media. In The Assessment
Challenge in Statistics Education, edited by I. Gal and J. B. Garfield (Amsterdam:
IOS Press and The International Statistical Institute), pp. 107-121.
[41] Luke, A., and Freebody, P., 1997, Shaping the social practices of reading. In
Constructing Critical Literacies: Teaching and Learning Textual Practice, edited by
S. Muspratt, A. Luke and P. Freebody (St. Leonards, Australia: Allen and Unwin),
pp. 185-225.
