DOI: 10.2304/eerj.2008.7.1.1

European Educational Research Journal
Volume 7 Number 1 2008
www.wwwords.eu/EERJ

ECER GHENT KEYNOTE

Effects of International Comparative Studies on Educational Quality on the Quality of Educational Research[1]

JAN-ERIC GUSTAFSSON
Department of Education, Gothenburg University, Sweden

ABSTRACT Large-scale survey studies of educational achievement are becoming increasingly
frequent, and they are visibly present both in educational policy debates and within the educational
research community. One main aim of these studies is to provide descriptions of inputs, processes and
outcomes, and another aim is to provide explanations of how different factors interrelate to produce
educational outcomes. These aims are difficult to reach, which, in combination with the fact that the
comparative studies are typically more policy driven than theory driven, is one reason why these studies
are contested on quality grounds. In this article, a set of fundamental methodological challenges related
to the validity of the measurement instruments and to the possibility of making inferences about
causality are identified and discussed in relation to examples of different studies. Strengths and
weaknesses of different research approaches are discussed, and it is proposed that the dichotomy
between qualitative and quantitative approaches should be replaced with distinctions between low-
and high-level inference approaches with respect to data, generalization and explanation. It is
concluded that while the international studies easily invite misuse and misinterpretation, they also offer
possibilities for improving the quality of educational research, because the high-quality data generated
by these studies can be taken advantage of in research on causal effects of factors in and out of
educational systems.

Introduction
The international comparative studies of educational achievement receive much attention, both
within the field of educational research and outside the field. They figure prominently in
discussions about the quality of education in different countries and how quality can be improved.
It is essential, therefore, to try to understand how these studies fit into current educational
research, and how they affect the perceived and actual quality of educational research. These are
the main purposes of the current article.
Let me start by making a few notes about the background and development of the
international studies. The International Association for the Evaluation of Educational Achievement
(IEA) was founded in 1959 by a small group of educational and social science researchers, with the
purpose of conducting international comparative research studies focused on educational
achievement and its determinants. We can discern two phases in the development of IEA. During
the first phase, researchers pushed the development forward. Their aim was to understand the
great complexity of factors influencing student achievement in different subject fields. They used
the popular metaphor that they wanted to use the world as an educational laboratory to investigate
effects of school, home, student and societal factors, arguing that an international comparative


approach was necessary to investigate effects of many of these factors. The researchers also had the
responsibility to raise funding and conduct the entire research process, from the theoretical
conceptions and design, to analysis and reporting.
The first phase comprised about 30 years, from the late 1950s to 1990. The first study, which
investigated mathematics achievement in 12 countries, was conducted in 1964. It was followed by the
very large and complex Six Subject Survey, conducted in 1970-71, which comprised reading
comprehension, literature, civic education, French as a foreign language, English as a foreign
language, and science. During the 1980s, the level of activity in IEA was lower, but among other
things, the studies in mathematics and science were repeated.
In 1990, a new organization of IEA was set up, with a permanent secretariat in the
Netherlands, and a data-processing centre in Hamburg. The TIMSS (Third International
Mathematics and Science Study) 1995 study (Beaton et al, 1996) most clearly marks the beginning
of the second phase. During this phase, which still continues, the researcher presence is less marked
and the aim of the studies has shifted away from explanatory purposes towards descriptive
purposes. The involvement of national administrative and policy institutions has become stronger,
and even though researchers are still involved in the design, analysis and reporting of the studies,
the level of ambition of the international reporting is limited. The international reports mainly
describe outcomes, along with background and process factors, but there is no attempt to explain
the variation in outcomes between school systems, or to make inferences about causes and effects.
The task of analysing the factors behind the outcome for the different countries is left to each
participating country, and the databases are made available to the research community for
secondary analysis. There thus has been a drift from explanation to description, mainly serving the
purpose of evaluation of educational quality as a basis for national discussions about educational
policy.
When the Organization for Economic Cooperation and Development launched its
Programme for International Student Assessment (PISA) in 2000 (OECD, 2001), the emphasis of the
international comparative studies on evaluation of educational quality in the service of
educational policy became even stronger. This was the expressed purpose of the PISA
programme, and even though the OECD makes deeper analyses of the PISA database, caution is
expressed as to the possibility of making causal inferences.
The transition of the international studies from phase one to phase two also has been
associated with a dramatic increase in the volume and frequency of the studies. The number of
countries participating in a particular study has increased dramatically and now often amounts to
more than 50 countries or school systems. The frequency of repetition also has increased: the IEA
studies of mathematics and science (i.e. TIMSS), and reading (i.e. PIRLS [Progress in International
Reading Literacy Study]) are now designed to capture within-country achievement trends and are
therefore repeated every fourth or fifth year. The PISA study, which covers mathematics, science
and reading, includes all the OECD countries, along with a large number of associate countries,
and it is repeated every third year. The amount of data generated in the international studies
during phase two thus far outweighs the amount of data generated during phase one.
In summary, then, in phase one the goal was to generate knowledge about determinants and
mechanisms behind educational achievement, while in phase two the goal is to describe the
outcomes of different educational systems, leaving it to the different participating countries to find
the explanations. The IEA and the OECD also immediately make the databases available to the
international research community for secondary analyses, and they are used in a very large number
of studies. Thus, in a sense, the international studies are more to be seen as providing an
infrastructure of data that researchers can take advantage of, than research studies in their own
right.
Why did this change of orientation happen? One main reason was that the researchers
involved in the first phase of IEA were exhausted and frustrated. As has been described by Husén
(1979), the funding of the extremely large, complex and ambitious Six Subject Survey came from
many different sources, and yet the study was severely under-funded. There were many difficult
design issues about which decisions had to be made in a short time, and in retrospect, it was


realized that not all these decisions were optimal. Furthermore, the fieldwork and the logistics of
data preparation and analysis required major efforts. However, the main source of frustration was
that the researchers did not quite feel that they had accomplished what they intended to do. The
original reporting of the Six Subject Survey comprised nine volumes written by several of the
leading researchers in the field of education. While these volumes in many ways are excellent, it
was obvious that the Six Subject Survey failed to answer the questions about determinants of
educational achievement, and the causal mechanisms involved (Husén, 1979).
The primary reason for this frustration was that the IEA design was a cross-sectional survey,
and such designs do not easily allow causal inference (Husén, 1979, pp. 383-384). Indeed, Allardt
(1990) argued that there is little evidence that comparative surveys in any field of social science
have been able to generate knowledge about causal relations. He pointed to the great complexity
of the phenomena investigated, and to the uniqueness of different countries, as the reasons for this.
However, even though this line of reasoning explains why the researchers more or less
abdicated from the international comparative studies, it cannot explain the upsurge of activity
within the field in the 1990s. There were several reasons for this upsurge. One was that the goal of
the international comparative studies was reformulated to focus on the outcomes of education,
thus essentially limiting the task to being one of describing outcomes, along with some background
and process variables. Thus, the international studies were transformed to serve purposes of
educational evaluation, rather than to serve as a scientific approach. Such a transformation fitted
well with the increased emphasis on outcomes of education, partly as a consequence of the changes
in educational governance through processes of decentralization and deregulation. Another reason
for the increased activity in the field was that great methodological advances had been made in the
technology for large-scale assessment of knowledge and skills. The international studies adopted
the methodology developed in the National Assessment of Educational Progress (NAEP) in the
United States in the 1980s, based on complex item-response theory and matrix-sampling designs
(Jones & Olkin, 2004). This methodology was well suited for efficient and unbiased estimation of
system-level performance, and through collaboration between researchers at Educational Testing
Service and Boston College it was skilfully implemented to support the international studies. The
TIMSS 1995 study was the first study to take full advantage of this technology, and when PISA
started a few years later similar techniques were adopted in that study.
This development raises two main questions regarding the quality of the international
comparative studies. The first is whether we can trust the descriptive results generated from these
studies, which is thus a question of reliability and validity in the international assessments. The
other question is to what extent it is desirable to try to establish causal explanations based on these
studies, and to what extent this is possible. These questions are discussed below.

Reliability and Validity of International Assessments


Scepticism is frequently expressed concerning the reliability and validity of international
assessments. It is asked if it is at all possible to measure such complex and multifaceted phenomena
as knowledge and skills, and particularly when doing this in large-scale assessments, which involve
many languages and cultures. These questions are very reasonable indeed, and I will discuss them
at a general level below. However, there is reason to start by presenting a concrete example.

An Example: limitations of paper and pencil assessments


Schoultz et al (2001) argued that instead of seeing differences in performance as a consequence
of students' abilities and knowledge, performance should be seen as produced through concrete
communicative practice. They thus took a sociocultural perspective as a starting-point, and they
challenged the assumption that conceptual knowledge is something that is reflected in performance
in different situations. In particular, they focused on the consequences of the fact that test items
often are presented in written form, and that the difficulties associated with this particular


communicative format are seldom recognized. They thus claimed that reading and writing in
solitude cannot be taken as an unbiased indicator of what students know and understand.
Schoultz et al (2001) selected two items from the TIMSS 1995 study for scrutiny in an
interview study comprising 25 Swedish grade 7 students. One was an optics item. It presented an
illustration showing two flashlights, one with and one without a reflector, and the question was
which of the two flashlights shines more light on a wall 5 metres away. An open response was
required, and to be scored correct the response had to include an explanation that argued that the
reflector focused the light on the wall.
According to the TIMSS results, this item was quite difficult. In the Swedish grade 7 sample,
only 39% of the students answered the item correctly, which figure was somewhat below the
international mean. In the interview study, 66% of the students gave correct answers. Even though
the small and possibly unrepresentative sample makes it difficult to compare this result with that
from the TIMSS study, it nevertheless indicates that the interview situation makes the item easier.
One reason for this was that in the interview situation the students did not have to write the
answer. Furthermore, many students did not understand the word reflector and they had initial
difficulties connecting what was written in the question with the illustration, but in the dialogue
with the interviewer, these things were clarified. Thus, the higher performance in the interview
study was to a large extent due to the scaffolding provided by the interviewer in a Socratic
dialogue.
For the other item, which was a multiple-choice chemistry item, the results were even more
dramatic. According to the TIMSS data, only 26% of the Swedish grade 7 students chose the correct
response alternative, but in the interview study, no less than 80% of the students responded
correctly. This was due to the interaction between the interviewer and the interviewee, which
helped the students to interpret the text and the meaning of the response alternatives.
From this study, the authors concluded, among other things, that the low performance
demonstrated in the TIMSS study was because the students were limited to operating on their
own, and in a "world on paper". They concluded that "Knowing is in context and relative to
circumstance. This would seem an important premise to keep in mind when discussing the
outcomes of psychometric exercises" (p. 234).
This may seem to be a serious criticism not only of the TIMSS study, but also of results from
paper and pencil tests generally. However, the results of this study have little to do with quality
aspects of the TIMSS assessment, or of the validity of paper and pencil tests. While the Schoultz et
al (2001) study appears to deal with the validity of TIMSS items, this study is based on different
assumptions from those made in TIMSS, which makes it impossible to make any inference about
the phenomena studied in TIMSS from the results obtained in the Schoultz et al study, and vice
versa.
The most fundamental difference concerns the assumptions made about the nature of
performance differences over different contexts. Schoultz et al view the performance differences
between the paper-and-pencil and interview situations as absolute, while in TIMSS performance
differences between two situations are seen as relative. Thus, Schoultz et al interpret the higher
level of performance when the item is administered in an interview situation rather than in a paper
and pencil situation as evidence of a higher level of knowledge and conceptual insight, i.e. as
evidence of higher student ability. This interpretation also implies that if TIMSS were to use
interviews to a larger extent than is currently done, this would result in a more positive picture of
student knowledge. But this is not so, because in TIMSS the observed performance level is seen as
being determined not only by student ability but also by the difficulty of the item. Thus, a TIMSS
person who is presented with the finding that the level of performance is higher when an item is
presented in a highly supportive interview context than in a paper and pencil context, would not
necessarily think that the level of ability becomes higher when students are interviewed than when
they sit alone and read and write. Instead, another possible interpretation is that the level of ability
of the person is more or less constant in the two situations, while the interview situation is easier
than the paper and pencil situation. This mode of thinking is based on the assumption that students


have a constant level of ability over situations, which would seem difficult to accept in a world-
view that emphasizes the uniqueness of contexts.
Another difference between the assumptions underlying the Schoultz et al (2001) study and
the TIMSS study concerns the notions of reliability and validity. Schoultz et al argue that it is
possible to subject the TIMSS items, which already had been tested for validity and reliability, to a
further test, which in a truer sense would reveal the actual validity and reliability of the items.
According to this view, the items have immanent and absolute characteristics, which can be
revealed through a careful and detailed analysis of the context in which the student interacts with
the item. This view is, of course, related to the absolute view of student performance discussed
above. A TIMSS person, in contrast, would find such a view to be incomprehensible, because
according to this world-view the constructs of validity and reliability do not primarily refer to
characteristics of single items, but to collections of items. Thus, the most commonly used form of
reliability refers to the internal consistency of a scale. Similarly, the most fundamental concept of
validity, namely construct validity (Messick, 1989), is not applicable to an item in isolation.
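For instance, Cronbach's alpha, the internal-consistency coefficient most commonly reported, is defined over a set of k items rather than over any single item (a standard formula, reproduced here only for illustration):

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where \sigma^{2}_{Y_i} is the variance of item i and \sigma^{2}_{X} is the variance of the total score. The coefficient is undefined for a single item, which is precisely the point made above.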
Thus, according to this analysis Schoultz et al (2001) have made the mistake of starting from
one set of assumptions, which emphasize the context-bound nature of human action and
interaction, and have applied them to an activity which is based on the assumption that it is
possible to generalize across contexts. This generates more confusion than clarification, because
concepts and observations that seem to refer to the same phenomena do in fact refer to different
phenomena.

Assumptions and Metaphors


It could be asked which of these two sets of assumptions is correct. The obvious answer to this
question is that both are equally correct, or incorrect. Complex phenomena cannot be described
unless we see them from a perspective, which entails certain fundamental assumptions. One way
to capture different perspectives is to describe them in terms of metaphors. For example, Sfard
(1998) observed that learning may be described in terms of either an acquisition metaphor or a
participation metaphor. The acquisition metaphor views knowledge as an acquired commodity, so
learning is a process of acquisition with individual ownership of knowledge as a result. In contrast,
the participation metaphor views learning as taking part in a collective and communicative process.
Sfard also observed that while individual researchers tend to favour one of these two metaphors,
both present a limited view on learning and knowledge, so there are dangers in just choosing one
of them. In terms of these metaphors, the Schoultz et al (2001) study is best captured by the
participation metaphor, while the measurement approach in TIMSS is best captured by the
acquisition metaphor.
Let me introduce another metaphor. We are almost always concerned with weather, because
it profoundly affects our daily life, such as decisions about what clothes we should wear, if we
should go to the golf course or to the museum, if it would be advisable to take the car or not, just
to mention a few. Weather also affects our mood, and it supplies us with conversation material in
almost all social contexts. However, we cannot do much about the weather, except adapting to the
conditions it creates for us. Fortunately enough, meteorologists can predict what the weather will
be like within the next couple of days. However, there is a margin of error in these predictions, and
beyond a week or so, the predictions are useless. This is because of the great complexity of weather
phenomena, and because the weather is chaotic, it is not even theoretically possible to predict
weather over longer periods of time.
If we do not like the weather there is thus not much we can do, except, of course, to move to a
place with a better climate. Simple indicators, like average temperature, average rainfall, and
number of days with sunshine, give us much information with which to compare the climates of
different places. However, even though such information tells us much about the climate, it does
not tell us much about what weather we are likely to experience on a particular visit, because these
numbers are averages with a lot of variation. Thus, the link between climate and weather is a
weak, probabilistic, one.


However, while weather is unpredictable and chaotic, climate and climate changes are stable
phenomena, which we can understand theoretically and for which empirically based models may
be constructed, that predict long-term development. It could be argued that climate does not exist,
in the sense that we can experience it directly. We experience weather, and through aggregating
these experiences, we get a sense of climate. In a more precise manner, scientists define climate as
aggregate weather, using indicators such as mean temperature. Thus, climate is an abstraction,
which in a sense only exists in theoretical models. Nevertheless, this is a powerful abstraction,
which has very concrete and important implications for how we could and should live our lives.
In terms of this metaphor large-scale survey studies are concerned with climate, while
research which focuses on context-bound phenomena is concerned with weather. Thus, the
assessment in TIMSS is based on aggregation of a very large number of item responses, and little or
no interest is focused on the particular items. In contrast, the Schoultz et al (2001) study is focused
on particular contexts.
Many object to aggregation in educational and psychological research. For example, Yanchar
& Williams (2006) argued that:
it has become increasingly apparent that data aggregation and accompanying statistical tests
often hide qualitative patterns and lead to excessively abstract or artificial conclusions;
operational definitions force meaningful human phenomena into a reductive framework that
distances researchers from what is being studied and leads to trivial and distorted results;
statistical indices are often used as facile substitutes for careful interpretation and human
judgment ...; the logic of experimentation, including Humean causality, cannot adequately deal
with complex human meaning and purpose in context; and patterns in aggregate data are
erroneously used to make inferences about the structure of psychological processes in
individuals. (p. 6)
But the argument can also be turned around, and it can be argued that in order to see the general
aspects (e.g. the climate) it is necessary to get rid of the specifics (e.g. the weather). Seen from this
perspective methods which conceal context-dependent variation have strengths, rather than
disadvantages, when the purpose is to investigate general patterns and relations.

Alternatives to the Qualitative/Quantitative Dichotomy


Much of the methodological debate, as well as textbooks in research methodology, starts from a
dichotomy between quantitative and qualitative methods, and associated distinctions between
objective/subjective, positivistic/hermeneutic, nomothetic/idiographic, and bad/good (e.g.
Cohen et al, 2000). However, Ercikan & Roth (2006) challenged the meaningfulness of this
distinction, arguing that the quantitative and qualitative dichotomy is fallacious. One of their
arguments was that all phenomena involve both quantitative and qualitative aspects at the same
time.
We can illustrate this with the two studies discussed above. While the Schoultz et al (2001)
study is primarily qualitative with its focus on analyses of interview transcripts, the researchers also
applied quantification. Similarly, while the TIMSS study is primarily quantitative with its focus on
establishing a scale expressing quantity of knowledge and skill, there is also the question about the
meaning of the scale, which is a qualitative issue. Thus, saying that one of them is qualitative and
the other quantitative does not capture the differences between the studies particularly well.
As an alternative to the quantitative/qualitative distinction Ercikan & Roth (2006) proposed
that different forms of research should be put on a continuous scale that goes from the lived
experience of people on one end (low-level inference) to idealized patterns of human experience on
the other (high-level inference). According to Ercikan & Roth (2006):
Knowledge derived through lower-level inference processes ... is characterized by contingency,
particularity, being affected by the context, and concretization. Knowledge derived through
higher-level inferences is characterized by standardization, universality, distance, and abstraction
... The more contingent, particular, and concrete knowledge is, the more it involves inexpressible
biographical experiences and ways in which human beings are affected by dramas of everyday
life. The more standardized, universal, distanced and abstract knowledge is, the more it


summarizes situations and relevant situated knowledge in terms of big pictures and general
ideas. (p. 20)
This level-of-inference approach to characterizing different forms of research is much more useful
than the qualitative/quantitative dichotomy. Thus, while research on weather and climate cannot
easily be characterized with the quantitative/qualitative distinction, research on weather may be
meaningfully described as low-level inference and research on climate as high-level inference.
Similarly, the Schoultz et al (2001) study is an example of low-level-inference research, while the
TIMSS study is an example of high-level-inference research.
However, while the Ercikan & Roth (2006) proposal is much more useful than the
qualitative/quantitative dichotomy, it too is limited in its reliance on a single bipolar continuum.
Thus, while certain aspects of a study may be high-level inference, other aspects may be low-level
inference. In particular, I think that it is useful to distinguish between level of inference with respect
to data, generalization, and explanation.
The international studies use a high-level inference approach to generate data through
abstracting information over contexts and items. But it is not necessary for large-scale survey
studies to take such a high-level inference approach to constructing data. In the original design of
the US National Assessment of Educational Progress (NAEP) all analysis and reporting of results
was done at the item level (Jones & Olkin, 2004). While this low-level inference data approach
served many good purposes, it failed to give a comprehensible description at the system level, and
these data could not be used to describe and analyse group differences in level of achievement.
When the item-response theory (IRT) techniques became available in the early 1980s NAEP took
advantage of and adapted these techniques to develop scales based on matrix sampling designs, in
which different groups of students respond to different sets of items. This high-level inference data
approach proved to be very useful for the purposes of NAEP, and it has since been adopted in the
international studies.
In some research, the aim is to generalize results to a population, while in other research
there is no intention to generalize empirical results to a population. In the international studies, the
aim is to generalize to the population level, and the studies take advantage of sophisticated
sampling designs to enable generalizations with known margins of error at minimum cost.
However, not all studies which use high-level inference data aim at high-level generalizations. It is,
in fact, only rarely the case that an explicit sampling model is used in experimental and
correlational studies, even though these studies often aim at high-level generalization. Studies that
rely on low-level inference data typically are oriented towards low-level generalization even
though there are many exceptions. The original NAEP design referred to above is one example.
The Schoultz et al (2001) study involved a comparison with the nationally representative
TIMSS results, so it too had a high-level generalization aim.
Yet another issue concerns whether the research aims at going beyond description to
generate or test explanations. As has already been emphasized, the international comparative
studies aim at high-level generalization based on high-level inference data, but they are not
primarily designed with causal inferences and explanations in mind. In contrast, as was observed by
Ercikan & Roth (2006), there are many examples of studies based on low-level inference data that
aim at explanations. The Schoultz et al (2001) study obviously had the aim of finding explanations
based on the low-level inference data. However, the phenomena to be explained are quite different
when a low-level inference data approach is used than when a high-level inference data approach is
used. The theoretical constructs employed in the explanatory models must correspond to the
constructs used in the data construction. Thus, in the high-level inference data approach, the
explanatory models will focus upon causal relations among abstract constructs.
It thus seems more useful to characterize research in terms of three different aspects, namely,
level of inference with respect to data, generalization and explanation. Even though these aspects
are not completely independent there is sufficient independence for this approach to offer a better
basis for discussing characteristics of different research approaches than does a single level-of-
inference dimension, or, of course, the simple dichotomy between qualitative and quantitative.


Quality Aspects of High-Level Inference Data


In the present context, I cannot go much further into a general discussion about different research
approaches, but there is reason to go into a somewhat more detailed discussion about the level-of-
inference issue with respect to data. While the low-level inference approach to data can be
grounded in interpretations generated from observations in specific contexts, this is not possible in
the high-level inference approach to data. In this approach the intention is to capture abstractions,
which span specific contexts and contents. The question then is if this is possible and meaningful,
and what criteria we can use to decide whether this is meaningful.
Ocular inspection of the items obviously cannot be used, and the answer cannot be found in
detailed analyses of the contents and contexts of specific items. The solution to this problem is,
instead, to take advantage of the concepts and techniques within the field of measurement. The
technology of measurement has evolved over more than 100 years, and thousands of researchers
have contributed to its development. It is still under development, and the field of large-scale
assessments is a driving force in this development.
The technology of measurement is full of complex and esoteric constructs such as reliability,
validity, item characteristic curve, item difficulty parameter, just to mention a few. In the
international studies, the items are developed in a laborious process of invention, creation,
preliminary try-outs, and field trials. In this process, different statistical techniques are used to
generate information about the characteristics of the items, along with qualitative techniques. In
the final step, the scaling is done, which implies that the results on different items are put onto the
same scale, taking the difficulties of the items into account.
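The core idea of such scaling can be illustrated with the simplest item response model, the Rasch model (shown here only as an illustration; the operational models used in the international studies are more elaborate). The probability of a correct response depends only on the difference between a person's ability \theta and the item's difficulty b:

    P(\mathrm{correct} \mid \theta, b) = \frac{\exp(\theta - b)}{1 + \exp(\theta - b)}

Because person ability and item difficulty are expressed on the same scale, responses to different sets of items can be placed on a common scale once the item difficulties have been estimated.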
This is an extremely complicated process, and at every step, things may go wrong, causing
threats to the usefulness of the derived scale. In order to ensure the quality of the final scale every
step of the process of development and implementation of the large-scale assessments involves
quality controls against explicitly defined criteria (Martin et al, 1999). However, while there are
numerous quality criteria, the technology of measurement does not offer a single technique or
number which may be used to characterize the meaningfulness and quality of the resulting scale.
We thus have to depend upon a technology which cannot guarantee that the outcome is the
one that we aspire to achieve. However, this is a general characteristic of technologies. For
example, when we go on an aeroplane, there is no proof that we will reach our ultimate
destination, but based on experience we know that the chances are good that everything will work
out fine, and so we are willing to place our lives in the hands of technology.
One interesting characteristic of technology is that it evolves over a long time, taking
advantage of scientific progress in many different fields. This is true for aviation technology and it is
true for measurement technology. However, while we can easily see the cumulative development
of aviation technology from the Wright brothers' biplane constructed in 1903 to a jumbo jet, it is
not so easy to actually see with ocular inspection the development of measurement technology
from the first contributions in the first years of the twentieth century to the large-scale
international assessments. On the surface, the tests of the early days of measurement look quite
similar to the tests of today, even though there has been very much of a cumulative development
in every aspect of measurement technology.
However, even though technology has its advantages, there are disadvantages as well.
Aeroplanes are major contributors to global warming, and the international comparative studies of
educational achievement have global effects, which by many are seen to be highly negative. For
example, Nóvoa & Yariv-Mashal (2003) argued that these studies are becoming a means of
educational governance that reduces the importance and influence of national policy makers (see
also Simola, 2005).
Technology also excludes those who do not have access to the technology. Such exclusion
effects are a great problem when it comes to the international studies. They rely on sophisticated
sampling designs in order to increase the precision of the statistical estimates, matrix sampling
designs in order to increase the validity of the tests, and very complex statistical techniques to
construct scales and estimate the country results. This sophistication brings strength to the primary
purpose of making high-level generalizations from high-level inference data. However, the


complexities make it impossible for anyone but a few experts to understand how the process
works. These complexities also have implications for who can use the data, and for the quality with
which that can be done.
These problematic aspects of the large-scale assessments of knowledge and skills can also be
regarded as threats to validity. According to Messick's (1989) reformulation and extension of the
construct of construct validity, the consequences of interpretation and use of measurement devices
form important aspects of validity. These consequential validity aspects would seem to require
more attention in further research than so far has been devoted to them.
The main conclusion from this discussion is that the technological character of the high-level
inference data generation that is the basis of the international comparative studies supports, but
does not guarantee, the reliability and validity of the data, and that the meaningfulness of the
results is the basis for judging the reliability and validity of the descriptions. As far as I can see there
is little in the pattern of outcomes from the studies conducted since 1995 that throws any serious
doubt on the quality of the descriptions of the achievement outcomes. On the contrary, it could be
claimed that the dramatic increase in frequency and importance of the international comparative
studies indicates that they satisfy sufficient standards of measurement quality. Furthermore,
productive use in research, such as, for example, the Hanushek & Wößmann (2006) study on
effects of educational tracking, which is based on a combination of several of the international
studies, attests to the quality of the data.

Explanations
It has already been noted that the primary aim of the international comparative studies is to make
high-level inference descriptions of levels and patterns of achievement in a large number of
educational systems. However, the original aim of the international comparative studies was to use
the world as an educational laboratory for explanatory purposes. Even though it is less conspicuous
now, this aim is still present, and the OECD, in particular, publishes reports that make policy
recommendations based on analyses couched in causal language (e.g. OECD, 2006). Furthermore,
the data form an infrastructure for research, of which literally thousands of researchers around the
world take advantage in order to study, among other things, causal relations.
It also would be naïve to think that description and explanation are separate activities. One
reason for this is that the widely publicized descriptive results call for explanations, and if no
explanations are offered by the researchers, explanations will be put forward by other stakeholders,
such as the media and politicians. Analyses of the reception of results of large-scale evaluations
of educational quality, such as the US NAEP results, show that the media and those involved in
educational policy discussion go beyond the data to bolster commentators' preconceived notions
or already established political agendas (Pellegrino et al, 1999, p. 26). Often,
far-reaching and concrete suggestions for how to improve results are proposed, such as improved
teacher education, more resources, or earlier diagnosis of learning difficulties.
Thus, the explanatory aim is important indeed, so we need to take a closer look at what is
involved in using data from the international comparative studies to generate explanations. Again I
will take a concrete example as the starting point.

An Example: information technology, wealth and achievement


Barber (2006) used IEA data to test hypotheses about the nature of the commonly observed
relation between economic wealth and educational achievement of countries. His main hypothesis
was that this relation is caused by the greater availability of information technology, such as mass
media and computers, in economically wealthy countries. This hypothesis may be broken down
into three more specific hypotheses: (1) there is a relation between economic wealth and
availability of information technology; (2) there is a relation between availability of information
technology and student achievement; and (3) when we control for availability of information
technology there is no effect of wealth on achievement.


Barber (2006) reports two studies intended to investigate these hypotheses. In the first study
Barber analysed mathematics and science scores for eighth grade students in 36 countries
participating in TIMSS 1999. In addition, country-level information about wealth (gross national
product [GNP]) and exposure to information technologies (daily newspapers per 1000 people, and
television sets per 1000 people), among other variables, were assembled from different databases.
The results showed that GNP was related to achievement, as were the mass media variables.
However, when Barber entered these variables together in a multiple regression, the partial
regression coefficient for GNP was insignificant. Barber thus concluded that the advantages in
information technology entirely mediate the educational advantages of economically developed
countries.
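To make the logic of this analysis concrete, the following is a minimal sketch, in Python, of how such a country-level mediation check can be run. It is not Barber's actual code or data; the file name and the variable names (achievement, gnp, newspapers, tv_sets) are hypothetical stand-ins for the indicators described above.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical country-level data: one row per country.
    df = pd.read_csv("countries.csv")  # columns: achievement, gnp, newspapers, tv_sets

    # (1) Wealth is related to achievement.
    print(smf.ols("achievement ~ gnp", data=df).fit().summary())

    # (2) The information-technology indicators are related to achievement.
    print(smf.ols("achievement ~ newspapers + tv_sets", data=df).fit().summary())

    # (3) Entered together: in Barber's analysis the partial coefficient for GNP
    # was no longer significant, which he interpreted as full mediation.
    print(smf.ols("achievement ~ gnp + newspapers + tv_sets", data=df).fit().summary())

The criticisms discussed below concern what such a regression can and cannot show, not the mechanics of running it.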
The second study took advantage of the fact that the PIRLS 2001 study asked the fourth-grade
children about their exposure to newspapers, television, and other information technologies. The
study was designed to measure exposure to information technology in a similar manner as in Study
1. Barber had originally planned to use the GNP measure in this study as well, but this variable was
too highly correlated with the other predictors to permit meaningful regression analyses. He
therefore replaced it with a variable expressing the proportion of the labour force employed in
agriculture, as a negative index of the level of economic development.
In this study too there were generally positive correlations between the information
technology variables and reading achievement at the country level, and in particular self-reported
daily use of computers and viewing of television/video had positive relations to reading scores.
When Barber analysed the indicator of economic development along with the other variables, the
partial regression coefficient was not significant. Thus, this study showed that the effects of wealth
on reading achievement were completely mediated by the students' actual use of technology.
Barber (2006) concluded that the data provide compelling support for the prediction that
students' use of information technology, particularly television and computers, sharpens their
intellectual skills so that they perform better in school (p. 148) and that access to information
technology outside school can have major benefits for children's educational achievement (p. 148).
Barber also concluded: "One practical implication is that even poor countries would be well advised
to commit more of their budget to technological enrichment" (p. 149).

Threats to Causal Inference in Cross-sectional Research


These are strong conclusions and policy implications. However, there are many threats to the
correctness of the causal inferences made in this study. Thus, these inferences rely on the
assumption that economic wealth has been adequately measured and properly analysed. Another
fundamental assumption is that there is no other variable that is correlated with wealth and with
educational achievement which has been omitted from the analysis. Both these assumptions can be
challenged. In Study 2, Barber had to replace the GNP measure with another measure of economic
development, because GNP was too highly correlated with the other independent variables for the
statistical analysis to work. However, if the proportion of the labour force employed in agriculture
is a weaker measure of economic development than GNP, it will underestimate the effect of
economic development on educational achievement. This will make it easier for the information
technology variables to account for the relation between economic development and educational
achievement, and therefore also for Barber to find support for his main hypothesis. Even the GNP
measure is a less than perfect measure of economic development because it does not cover all the
important aspects of the economy, and because it is influenced by errors of measurement.
It also is easy to conceive of other variables that should have been included in the statistical
model. For example, quality of education, as expressed by teacher competence and resources,
correlates with economic development, and these factors may have effects on educational
achievement. However, because quality of education correlates with use of information
technology, omission of variables representing educational quality will cause the erroneous
inference that information technology influences educational achievement causally. In summary,
Barber may have used too simplified a model for his conclusions to be correct.


Another study on effects of computers on educational achievement supports this criticism.
Fuchs & Wößmann (2004) analysed data from PISA 2000 in order to determine effects of
availability and use of computers at home and at school on achievement in mathematics and
reading. They used data from 31 countries and more than 170,000 15-year-old students. In addition
to the variables included in the PISA study, Fuchs & Wößmann added country-level variables such
as GNP, and information about educational expenditures.
Fuchs & Wößmann carried out a series of regression analyses, in which they successively added
new categories of variables. In the first step, they investigated the bivariate relationship between
availability of computers and educational achievement, and found that students who had access to
one or more computers at home had a higher level of achievement than those who did not.
However, when they added groups of variables representing the family background of the students
(e.g. parental education, migration status, work status, parental occupation, and the country GNP),
the relationship between home availability of computers and achievement vanished. This was, of
course, because students whose parents had a higher socio-economic status more often had access
to computers at home, and this category of students had higher achievement anyway.
Thus, controlling for family background by keeping it constant in the statistical analysis
showed that the bivariate relationship between home availability of computers and achievement in
mathematics and reading was spurious.
When Fuchs & Wößmann added another group of variables representing the schools'
resources (e.g. class size, country-level educational expenditure per student, instructional material,
teacher education, and instruction time), the relationship between home availability of computers
and achievement turned negative. When yet another group of variables representing institutional
factors (e.g. external exit examinations, standardized tests, and school autonomy) was added, the
relationship between home availability of computers and achievement turned even more strongly
negative. Fuchs & Wößmann interpreted these results as showing that there are true negative
effects of availability of computers at home, presumably because home computers distract from
schoolwork rather than support it. Using information from the student questionnaire about use
of the home computers, Fuchs & Wößmann found a positive effect of computer availability for the
students who regularly used the computers for email and web access, and they argued that this
indicates that the purposes for which the students use the computers determine their actual
effects.
The results obtained by Fuchs & Wößmann were thus diametrically opposed to those
obtained by Barber (2006). This is because Fuchs & Wößmann used a much more elaborate
statistical model, with more powerful control both of student family background and of other
resource factors. Another difference is that Fuchs & Wößmann used the individual student as the
unit of analysis, while Barber performed the analysis at country level.
The contradictory results obtained by Barber on the one hand and by Fuchs & Wößmann on
the other are a striking demonstration that the way the researcher conducts the analyses can have a
decisive effect on conclusions about causal effects. This is, of course, far from satisfactory, and it is a
serious threat to the quality of educational research. Indeed, the validity of the Fuchs & Wößmann
results may be questioned as well. Even their most elaborate statistical model does not control for
every possible variable, and there still may be omitted variables that cause the obtained results not
to reflect the true causal effect of computer availability.
Thus, their model did not include any measure of previous educational achievement, which is
the most influential factor in determining achievement. It is easy to imagine that student ability
may have biased the results. Suppose that students with a low level of initial achievement are the
ones most eager to persuade their parents to make available, or give access to, a computer at home.
If that is the case, there will be a negative correlation between student initial achievement and
availability of computers at home. If computers do not have any effect on achievement, there will
still be a negative correlation between educational achievement and availability of computers,
because of the strong correlation between initial and final achievement. Even if there is a positive
effect of computers, there may remain a negative correlation between availability of computers and
final educational achievement, if the causal effect is weaker than the selection effect.
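This verbal argument can be checked with a small simulation. The sketch below is purely illustrative, with invented parameter values: computer access is made more likely for students with low initial achievement and is given a modest positive causal effect, yet the observed correlation between access and final achievement comes out negative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    initial = rng.normal(0.0, 1.0, n)                 # initial achievement
    # Selection: low initial achievers are more likely to get a home computer.
    p_access = 1.0 / (1.0 + np.exp(2.0 * initial))
    access = (rng.random(n) < p_access).astype(float)
    # Final achievement: strongly driven by initial achievement,
    # plus a modest positive causal effect (0.2) of computer access.
    final = 0.8 * initial + 0.2 * access + rng.normal(0.0, 0.5, n)

    print(np.corrcoef(initial, access)[0, 1])   # negative: the selection effect
    print(np.corrcoef(access, final)[0, 1])     # also negative, despite the positive causal effect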


The scenario of a negative correlation between initial achievement and home availability of
computers is, perhaps, not a likely one, even though it certainly is possible. It does seem quite
likely, though, that student level of achievement relates to choice of use of the computer. For
example, students high in reading ability would seem more likely users of the computer for regular
email use than would students with a lower reading ability. Conversely, a negative correlation
between using the computer for gaming purposes and reading ability seems more likely than a
positive one. This implies that the Fuchs & Wößmann (2004) finding that the pattern of use of the
home computer seemed to determine the effects on educational achievement may well reflect a
spurious relation, because the choice of use is correlated with initial ability.
The phenomenon that the treatment obtained by students correlates with initial level of
ability is a very common one indeed. One of the reasons for this is that resources are often
distributed in a compensatory fashion, and self-selection also often determines which treatments
different students actually get. This has serious consequences for the validity of causal inferences.
These consequences are referred to as problems of "endogeneity" or as "selection bias" or as
"reversed causality", and they are very difficult to deal with when only cross-sectional data is
available. Typically, statistical controls are used, through relying on whatever indicators are
available of student ability. The problem is that the indicators often only weakly and indirectly
relate to ability, making it unclear to what extent the bias remains. Thus, even though the Fuchs &
Wößmann (2004) results seem more reasonable and valid than those of Barber (2006), there is
lingering doubt about their validity too.
The OECD (e.g. 2006) also has given special attention to the issue of effects of computers, and
has conducted analyses of the PISA 2003 data to investigate patterns of use, and effects of computer
use on student learning. It was concluded that even taking account of socio-economic factors there
is a sizable positive effect from regular computer use. However, given the results in the Fuchs &
Wößmann (2004) study, this conclusion is more likely to be incorrect than correct. It is, of course,
highly unsatisfactory that scientific conclusions and far-reaching policy recommendations based on
the international comparative studies may point in entirely different directions.

Approaches to Valid Causal Inference


Not only within the field of international comparative research were the research aims
reformulated in the early 1980s. Angrist (2004) observed that much of educational research at this
time turned away from high-level inference explanations towards low-level inference descriptions.
As an example, he describes the field of evaluation, which under the influence of the Cronbach et al
(1980) volume on programme evaluation came to emphasize the context-dependence of
approaches and results, rather than seeking general principles and effects. Many questions in the
field of education "are of a causal nature that ... ask us to imagine alternative states of the world
where groups of communities, schools or students are differentially exposed to a particular
intervention or reorganization" (Angrist, 2004, p. 198). Still, educational researchers tended to avoid
investigation of such questions in the 1980s and 1990s. Cronbach (1975) himself was overwhelmed
by the complexities of educational phenomena, and the seeming impossibility of amassing
empirical evidence in support of generalizations in the presence of multiple higher-order
interactions between factors.
However, even though education researchers tended to turn away from research problems
involving causality in the early 1980s, Angrist (2004) observed that this did not happen in all fields
of social science. In particular, researchers within the field of economics have kept this interest,
they have picked up many of those issues put aside by the education researchers. Thus, research on
many classical education issues, such as effects of class size and organizational differentiation, now
often is published in economics journals rather than in journals in the field of education.
It also is interesting to observe that the economics researchers often rely on data generated in
the international studies. While they too have to confront the problems of bias in causal inference
due to selection bias and omitted variables, they have taken advantage of the fact that great
progress has been made in techniques for causal inference from non-experimental data. Thus, even


though there certainly are problems that need very careful attention when making causal inference
from observational data, we are now in a much better position than those researchers who
conducted the first international comparative studies in the 1960s and 1970s.
This is not the proper time and place to go into details about techniques for causal inference from observational data, but there may be reason to mention briefly some recent approaches. One statistical technique, which can be applied to cross-sectional data, is so-called instrumental variable estimation (Angrist, 2004). This method can be used to overcome problems of omitted variables, and it is based on the idea of finding a variable (the instrument) which is correlated with the independent variable whose effect we want to determine, but which is uncorrelated with the omitted variables. If these assumptions are fulfilled, instrumental variable estimation yields causal effects which are unaffected by the omitted variables. There are numerous applications of instrumental variable estimation in the economic literature.
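To make the logic concrete, a minimal sketch of two-stage least squares on simulated data might look as follows; the variable names (instrument, computer_use, achievement) and all coefficients are invented for illustration and are not taken from any study discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated setting: an unobserved variable drives both computer use and achievement,
# so ordinary regression of achievement on computer use is biased.
omitted = rng.normal(size=n)
instrument = rng.normal(size=n)                       # correlated with use, unrelated to 'omitted'
computer_use = 0.8 * instrument + 0.6 * omitted + rng.normal(size=n)
achievement = 0.3 * computer_use - 0.9 * omitted + rng.normal(size=n)   # true effect = 0.3

X = np.column_stack([np.ones(n), computer_use])
Z = np.column_stack([np.ones(n), instrument])

# Naive OLS: biased because 'omitted' is left out of the model.
ols = np.linalg.lstsq(X, achievement, rcond=None)[0]

# Two-stage least squares: regress the endogenous variable on the instrument,
# then regress achievement on the fitted (exogenous) part of computer use.
first_stage = np.linalg.lstsq(Z, computer_use, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), Z @ first_stage])
iv = np.linalg.lstsq(X_hat, achievement, rcond=None)[0]

print(f"OLS estimate: {ols[1]:.2f} (biased by the omitted variable)")
print(f"IV estimate:  {iv[1]:.2f} (close to the true value 0.3)")
```

The crucial, and often contestable, assumption in such analyses is that the instrument affects the outcome only through the independent variable of interest.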
An increasingly popular technique for dealing with problems of sample selection bias is to use so-called propensity scores (Rubin, 2006). This method uses background information to compute the probability that each individual receives the treatment under investigation. Thus, information about a large number of characteristics is aggregated into a propensity score, and on the basis of this score students can be matched. If the comparison between different treatments is restricted to those with the same propensity scores, the biasing effects of observed differences in background variables are controlled for. This method has been applied in several recent studies (e.g. Hong & Raudenbush, 2005).
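A minimal sketch of this logic, again on simulated data, might look as follows; the background variables, treatment assignment mechanism and effect size are hypothetical and do not reproduce the Hong & Raudenbush (2005) analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Simulated background characteristics (e.g. parental education, prior achievement).
background = rng.normal(size=(n, 3))
# Treatment (e.g. grade retention) is more likely for some backgrounds,
# so a raw treated/untreated comparison is confounded.
p_treat = 1 / (1 + np.exp(-(background @ np.array([1.0, 0.5, 0.5]))))
treated = rng.binomial(1, p_treat)
# Outcome with a true treatment effect of 2.0 plus background effects.
y = 2.0 * treated + background @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

# Step 1: estimate the propensity score, P(treatment | background).
ps = LogisticRegression().fit(background, treated).predict_proba(background)[:, 1]

# Step 2: match each treated unit to the untreated unit with the closest score.
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
matches = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]

# Step 3: compare outcomes within the matched pairs.
naive = y[treated == 1].mean() - y[treated == 0].mean()
matched = (y[t_idx] - y[matches]).mean()
print(f"Naive difference in means: {naive:.2f}")
print(f"Matched estimate:          {matched:.2f} (true effect 2.0)")
```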
Another method used to estimate causal effects is the so-called regression discontinuity method. This method is useful when there is a variable which is related to achievement (e.g. age) and when there is a cut-off point (e.g. school-starting age) which determines assignment to treatment (e.g. starting school or not). The regression discontinuity design compares the level of performance of individuals just above the cut-off point with that of individuals just below it. Particularly when the cut-off point is sharply defined, this method gives the same result as a randomized experiment would do. This method has, among many other things, been applied to estimate the effects of an additional year of schooling (e.g. Cahan & Cohen, 1989), and it has been extended to deal with continuous variation (Cliffordson & Gustafsson, 2008).
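The following sketch illustrates the basic regression discontinuity logic with simulated data; the six-month bandwidth and the five-point schooling effect are arbitrary assumptions for illustration, not estimates from Cahan & Cohen (1989).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Running variable: birth date in months relative to the school-entry cut-off.
# Children born before the cut-off started school a year earlier.
months_from_cutoff = rng.uniform(-12, 12, n)
extra_year = (months_from_cutoff < 0).astype(float)
# Smooth age effect (children born later are younger when tested)
# plus a true effect of the extra year of schooling of 5 points.
score = -0.4 * months_from_cutoff + 5.0 * extra_year + rng.normal(scale=3.0, size=n)

def fit_at_cutoff(mask):
    """Linear fit of score on the running variable; returns the predicted score at the cut-off."""
    A = np.column_stack([np.ones(mask.sum()), months_from_cutoff[mask]])
    intercept, _slope = np.linalg.lstsq(A, score[mask], rcond=None)[0]
    return intercept

bandwidth = 6.0   # use only observations within six months of the cut-off
left = (months_from_cutoff < 0) & (months_from_cutoff > -bandwidth)
right = (months_from_cutoff >= 0) & (months_from_cutoff < bandwidth)

# The jump in predicted score at the cut-off estimates the schooling effect.
effect = fit_at_cutoff(left) - fit_at_cutoff(right)
print(f"Estimated effect of one additional year of schooling: {effect:.1f} points (true value 5.0)")
```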
Longitudinal data have much more power to disclose causal effects than cross-sectional data. The reason for this is that in a longitudinal design we make repeated observations of the same units (e.g. individuals, schools or school systems), and when we investigate change over time, most of the characteristics of the units that we observe remain constant. Thus, if we can assume that we are observing fixed units, this is a powerful approach to control for the effects of selection bias and omitted variables.
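A minimal simulated sketch of this idea, with two observation waves per unit and invented coefficients, is given below; the point is simply that when the analysis is done on change scores, everything that is constant within units drops out.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units = 2000

# Each unit (e.g. a school) has a time-constant characteristic that raises both the
# independent variable and the outcome -- an omitted variable in cross-sectional analysis.
fixed = rng.normal(size=n_units)
x1 = 0.8 * fixed + rng.normal(size=n_units)               # wave 1
x2 = x1 + rng.normal(scale=0.5, size=n_units)             # wave 2
y1 = 1.0 * x1 + 2.0 * fixed + rng.normal(size=n_units)    # true effect of x is 1.0
y2 = 1.0 * x2 + 2.0 * fixed + rng.normal(size=n_units)

# Cross-sectional regression at wave 2 is biased because 'fixed' is omitted.
cross_slope = np.polyfit(x2, y2, 1)[0]

# First-differencing removes everything constant within units, including 'fixed'.
change_slope = np.polyfit(x2 - x1, y2 - y1, 1)[0]

print(f"Cross-sectional slope: {cross_slope:.2f} (biased upward)")
print(f"Change-score slope:    {change_slope:.2f} (close to the true 1.0)")
```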
The international comparative studies are typically not designed to be longitudinal. However,
it is possible to extend the basic international design nationally by adding follow-up data
collections. This was done in the German PISA 2003 design, where a follow-up data collection of
student achievement was conducted after a year, and additional information was collected about, among other things, teacher competence. With this design, it was possible to investigate causal
effects of teacher content knowledge on the one hand and pedagogical content knowledge on the
other (Baumert, 2007). Another way to do analyses of the same units over time is to take advantage
of the fact that the international studies (i.e. PISA, PIRLS and TIMSS) now have a trend design and
thus are longitudinal at the country level. This design allows analysis of change over time within
countries, and such changes may be related to changes in independent variables. In this way, the
advantages of analyses of fixed units over time are relied upon. Let me illustrate an application of
this approach with a concrete example.

An Example of a Country-Level Longitudinal Analysis


By relating change in the level of achievement at country level to change in one or more independent variables, many of the problems related to omitted variables can be avoided. This is because most country-level characteristics remain constant over time, which causes them to disappear, as it were, when we investigate change over time. The problem of selection bias also
disappears when data are analysed at country level. This is because there are no mechanisms of
compensatory resource allocation or self-selection that operate across country borders. Gustafsson
(2007) presented two examples of analyses using this approach. One investigated effects of student
age on achievement, and the other investigated effects of class size on achievement. Both these
examples yielded encouraging results, so it may be interesting to see if this analytical approach can
shed light on the issue of effects of computer availability.
In 2001, a repeat of the 1991 reading literacy study was conducted (Mullis et al, 2003a). This
so-called Ten-Year Trend Study (10YTS) gave estimates of change in reading scores during the 10-
year period 1991 to 2001 for nine countries (Greece, Hungary, Iceland, Italy, New Zealand,
Singapore, Slovenia, Sweden and the USA). This is an interesting study because it covers a period during which some of the countries experienced major changes in access to and use of information technology. Thus, this study may shed light on the contradictory results obtained by Barber (2006) and the OECD (2006) on the one hand, and Fuchs & Wößmann (2004) on the other.
The 10YTS student questionnaire did not include any questions about computer use, because
in 1991 such a question was not relevant. However, for purposes of country-level analyses we can
take advantage of the information in the PIRLS study (Exhibit 6.32 in Mullis et al, 2003b, p. 212),
which was conducted at the same time. The country-level correlation between frequency of
computer use at home in 2001 and total achievement was only 0.08. However, the country-level
correlation between change in achievement between 1991 and 2001 and frequency of computer use
at home in 2001 was -0.73. Even with this small sample of nine countries this correlation is
significant (p < .025).
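As a check on the reported significance, the standard t-test for a Pearson correlation can be applied to the values given above (n = 9, r = -0.73); the following sketch is only a verification of the computation, not a reanalysis of the 10YTS data.

```python
from math import sqrt
from scipy import stats

n, r = 9, -0.73   # number of countries and the reported change-score correlation

# Standard t-test for a Pearson correlation coefficient.
t = r * sqrt(n - 2) / sqrt(1 - r**2)
p_one_sided = stats.t.cdf(t, df=n - 2)   # P(T <= t), for a directional (negative) hypothesis
p_two_sided = 2 * p_one_sided

print(f"t = {t:.2f}, one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```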

[Figure 1 scatter plot: x-axis = percentage of students who use a computer at home every week; y-axis = change in overall reading score; points for Greece, Slovenia, Iceland, Hungary, Italy, Singapore, New Zealand, the United States and Sweden.]

Figure 1. Relations between frequency of computer use at home and change in level of reading achievement for nine countries between 1991 and 2001.

Figure 1 presents a scatter plot of the change in overall reading score versus the proportion of
students in the country that uses a computer at home at least once a week. The high negative
correlation is evident from the scatter plot, and most of the countries have a position close to the
diagonal. Thus, for the country where the largest proportion of students uses a computer at home every week (Sweden) there is a drop in performance of 15 points, while in the country where the
lowest proportion of students uses a computer (Greece) there is an improvement of 40 points.
This analysis certainly does not prove that computer use causes poor reading performance.
However, the fact that the strong association appears in the analysis of change over countries,
rather than in the cross-sectional analysis, supports a causal interpretation. In the cross-sectional
analysis, a large number of variables are correlated with level of home computer use, such as
economic wealth and technological development, and many of these variables are correlated with
student achievement. They thus are omitted variables and they make the causal interpretation of
the relation (or rather lack of relation) between computer use at home and achievement invalid.
When we instead investigate change over time for the fixed set of countries, most of these factors
are more or less constant for the different countries so they cannot be omitted variables.
Given that we do not expect that use of computers itself causes deterioration of reading
comprehension, we would expect that other variables correlate with this change as well. Change in
computer use at home correlates -0.59 with change in frequency of borrowing books at the library,
and the latter variable correlates 0.76 with increase in level of reading achievement. This pattern of
correlations suggests that the reason for the negative relation between reading comprehension and
computer use is that computers consume time, which otherwise would have been spent reading.
This also suggests that there is a trade-off between the development of different skills. While students
who have access to computers at home may tend to develop poorer reading literacy, their
computer literacy may improve. Thus, it seems likely that the explanation is to be found in a
pattern of change in activities in and out of schools, and it would seem worthwhile to investigate
this more closely.

Conclusions
The international comparative studies on student achievement have a somewhat deceptive
appearance. They involve students who work on tasks that are similar to those used in the
classrooms in everyday schoolwork. Yet, the primary purpose is not to provide knowledge about
everyday classroom activities, but to make generalized descriptions of achievement outcomes at
the school system level.
These studies also have the appearance of research studies, involving large and representative
samples of students, teachers and schools, and a large number of instruments designed to capture
not only student outcomes, but also many categories of background and explanatory variables. Yet,
they are not designed to test theories or provide explanations, but rather to provide an
infrastructure for research through generating data that may be used to investigate a wide range of
issues. Being based on cross-sectional survey designs, these data are, however, not easy to use for
what should be the primary purpose of high-level inference research, namely, to make inferences
about cause and effect. One consequence of this is that many research reports use causal language
when discussing the relations found, even when there is no basis for causal inference. This is the
single most important negative consequence for the quality of educational research of the
international studies.
However, at the same time as the international studies form threats to the quality of
educational research in this respect, they also offer opportunities to influence the development of
educational research in a positive direction. Questions about cause and effect have been neglected
in educational research for a long time, and the international studies provide a strong impetus to
put such questions back on the research agenda, because the descriptive patterns of results they
generate call for explanations. Furthermore, in spite of the fact that the studies are cross-sectional,
the data infrastructure generated by these studies provides a very useful basis for causal research,
given that the recent developments in methods for making causal inference from observational
data are taken advantage of. It also must be emphasized that compared to other methods for causal
inference, such as randomized experiments, the data from the international studies offer great
advantages in many respects. They involve large samples collected with sophisticated sampling
designs from a large number of school systems, and they are generated with careful attention to
quality criteria in every step of the data generation process. They furthermore avoid many of the ethical problems involved in randomized experiments, and compared to large-scale experiments the cost is low. There also are great possibilities for increasing the explanatory power of the
international studies through national extensions, such as adding longitudinal or observational
components (Schneider et al, 2007). If we take advantage of these possibilities, the international
studies may prove to be exceptionally beneficial for the quality of educational research.

Note
[1] Invited keynote address presented at the European Conference for Educational Research, Ghent,
19-22 September 2007.

References
Allardt, E. (1990) Challenges for Comparative Social Research, Acta Sociologica, 33(3), 183-193.
http://dx.doi.org/10.1177/000169939003300302
Angrist, J.D. (2004) American Education Research Changes Tack, Oxford Review of Economic Policy, 20(2),
198-212. http://dx.doi.org/10.1093/oxrep/grh011
Barber, N. (2006) Is the Effect of National Wealth on Academic Achievement Mediated by Mass Media and
Computers? Cross-Cultural Research, 40(2), 130-151. http://dx.doi.org/10.1177/1069397105277602
Baumert, J. (2007) On the Way to Causal Inferences: teacher knowledge, teaching and student progress
within the framework of PISA. Paper presented at the Biannual Meeting of the European Association for
Research on Learning and Instruction, Budapest, 28 August-1 September 2007.
Beaton, A.E., Mullis, I.V.S., Martin, M.O., et al (1996) Mathematics Achievement in the Middle School Years: IEA's
Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.
Cahan, S. & Cohen, N. (1989) Age versus Schooling Effects on Intelligence Development, Child Development,
60, 1239-1249. http://dx.doi.org/10.2307/1130797
Cliffordson, C. & Gustafsson, J.-E. (2008) Effects of Age and Schooling on Intellectual Performance:
estimates obtained from analysis of continuous variation in age and length of schooling, Intelligence,
36(1), 143-152. http://dx.doi.org/10.1016/j.intell.2007.03.006
Cohen, L., Manion, L. & Morrison, K. (2000) Research Methods in Education, 5th edn. London:
RoutledgeFalmer.
Cronbach, L. (1975) Beyond the Two Disciplines of Scientific Psychology, American Psychologist, 30, 116-127.
http://dx.doi.org/10.1037/h0076829
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., et al (1980) Toward Reform of Program Evaluation: aims, methods
and institutional arrangements. San Francisco, CA: Jossey-Bass.
Ercikan, K. & Roth, W.-M. (2006) What Good is Polarizing Research into Qualitative and Quantitative?
Educational Researcher, 35(5), 14-23. http://dx.doi.org/10.3102/0013189X035005014
Fuchs, T. & Wößmann, L. (2004) Computers and Student Learning: bivariate and multivariate evidence on the
availability and use of computers at home and at school. Munich: University of Munich, Ifo Institute for
Economic Research.
Gustafsson, J.-E. (2007) Understanding Causal Influences on Educational Achievement through Analysis of
Differences over Time within Countries, in T. Loveless (Ed.) Lessons Learned: what international
assessments tell us about math achievement, 37-63. Washington, DC: The Brookings Institution.
Hanushek, E.A. & Wößmann, L. (2006) Does Educational Tracking Affect Performance and Inequality:
differences-in-differences evidence across countries, Economic Journal, 116(510), C63-C76.
http://dx.doi.org/10.1111/j.1468-0297.2006.01076.x
Hong, G. & Raudenbush, S.W. (2005) Effects of Kindergarten Retention Policy on Children's Cognitive
Growth in Reading and Mathematics, Educational Evaluation and Policy Analysis, 27(3), 205-224.
http://dx.doi.org/10.3102/01623737027003205
Husén, T. (1979) An International Research Venture in Retrospect: the IEA surveys, Comparative Education
Review, 23, 371-385. http://dx.doi.org/10.1086/446067
Jones, L.V. & Olkin, I. (Eds) (2004) The Nation's Report Card: evolution and perspectives. Bloomington: Phi Delta Kappan.
Martin, M.O., Rust, K. & Adams, R.J. (Eds) (1999) Technical Standards for IEA Studies. Amsterdam: IEA.
Messick, S. (1989) Validity, in Linn, R. (Ed.) Educational Measurement (3rd ed). Washington: National Council
of Measurement in Education.
Mullis, I.V.S., Martin, M.O., Gonzalez, E.J. & Kennedy, A.M. (2003a) Trends in Children's Literacy Achievement
1991-2001: IEA's repeat in nine countries of the 1991 Reading Literacy Study. Chestnut Hill, MA: Boston
College.
Mullis, I.V.S., Martin, M.O., Gonzalez, E.J. & Kennedy, A.M. (2003b) PIRLS 2001 International Report: IEA's
study of reading literacy achievement in primary schools in 35 countries. Chestnut Hill, MA: Boston College.
Nóvoa, A. & Yariv-Mashal, T. (2003) Comparative Research in Education: a mode of governance or a
historical journey? Comparative Education, 39(4), 423-438.
http://dx.doi.org/10.1080/0305006032000162002
Organization for Economic Cooperation and Development (OECD) (2001) Knowledge and Skills for Life: first
results from PISA 2000. Paris: OECD.
Organization for Economic Cooperation and Development (OECD) (2006) Are Students Ready for a
Technology-Rich World? What PISA Studies Tell Us. Paris: OECD.
Pellegrino, J.W., Jones, L.R. & Mitchell, K.J. (Eds) (1999) Grading the Nation's Report Card: evaluating NAEP and
transforming the assessment of educational progress. Washington, DC: National Academy Press.
Rubin, D.B. (2006) Matched Sampling for Causal Effects. New York: Cambridge University Press.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W.H. & Shavelson, R.J. (2007) Estimating Causal Effects
Using Experimental and Observational Designs. Washington, DC: American Educational Research
Association.
Schoultz, J., Säljö, R. & Wyndhamn, J. (2001) Conceptual Knowledge in Talk and Text: what does it take to
understand a science question, Instructional Science, 29, 213-236.
http://dx.doi.org/10.1023/A:1017586614763
Sfard, A. (1998) On Two Metaphors for Learning and the Danger of Choosing Just One, Educational
Researcher, 27(2), 4-13.
Simola, H. (2005) The Finnish Miracle of PISA: historical and sociological remarks on teaching and teacher
education, Comparative Education, 41(4), 455-470. http://dx.doi.org/10.1080/03050060500317810
Yanchar, S.C. & Williams, D.D. (2006) Reconsidering the Compatibility Thesis and Eclecticism: five
proposed guidelines for method use, Educational Researcher, 35(9), 3-12.

JAN-ERIC GUSTAFSSON has been a professor of Education at Gothenburg University since 1986.
One of his research interests concerns individual prerequisites for education, in which field he has
been working on models for the structure of cognitive abilities, and on instruments for selection to
higher education. Another field of research concerns effects of education on knowledge and skills,
which he has studied in international studies of educational achievement, among other approaches.
Issues concerning organization of education, and the importance of resources, such as teacher
competence, have also increasingly come into focus. Another line of research, which is running in
parallel with the substantively oriented research, concerns development of quantitative methods
with a focus on measurement and statistical analysis. Correspondence: Jan-Eric Gustafsson,
Department of Education, Gothenburg University, PO Box 300, S-405 30 Gothenburg, Sweden
(jan-eric.gustafsson@ped.gu.se).
