Studies in Educational Evaluation 31 (2005) 247-266

MEASURING VALUE ADDED EFFECTS ACROSS SCHOOLS:


SHOULD SCHOOLS BE COMPARED IN PERFORMANCE?

John P. Keeves, Njora Hungi, and Tilahun Afrassa

Flinders University, South Australia

Abstract

This article traces the evolution of a quest for solving the problems involved in the
analysis of multilevel data and the estimation of the value added effects of schools in
influencing educational outcomes. The authors report the findings of two studies that
followed several cohorts of students who were tested at two grade levels (Grade 3 and
Grade 5) and at three grade levels (Grade 3, 5, and 7) respectively with basic skills
tests of literacy and numeracy in a single school system. The tests were calibrated
using Rasch scaling and equated using concurrent equating procedures for
approximately 8000 students in 440 schools. Hierarchical linear modelling was
employed in the analysis with different multilevel models used in the two studies to
assess relative and absolute change in performance respectively. The findings of
these studies show that with different regression models and different variables taken
into consideration there are very different estimates of the variance between schools
and under some circumstances the residual variance between schools is very small.
Research is clearly needed into the procedures of analysis and the different value
added effects that could and should be employed.

Forty years ago, in 1964, three major studies in the field of education were undertaken.
These three studies changed the nature of educational research in several different ways in
so far as they opened up new lines of inquiry, both raising new issues and advancing new
strategies of investigation. Two of the most challenging problems in education today are
those of identifying the characteristics of effective and ineffective schools and the
characteristics of effective and ineffective teachers who work within them. This article
addresses the first of these problems and directs attention to key aspects of the second
problem.
The three studies referred to above had important features in common, but also
differed in highly significant ways. They all sought to measure and analyse educational
outcomes as well as background information about the homes, schools and communities
from which students were drawn to attend the schools under survey. Moreover, they all
sought to analyse the data collected to identify factors and to estimate the sizes of the
effects of these factors on student learning through new approaches to statistical analysis
using procedures that involved the testing and estimation of the parameters of regression
models.
The first study was the National Survey commissioned in England and Wales that
was reported in Children and Their Primary Schools (The Plowden Report) by G. F. Peaker
(1967). The children studied were drawn equally from three age cohorts and were followed
up four years later, and the findings were reported with brief but highly innovative accounts
that employed, for the first time in the field of education, the concepts and procedures of
path analysis (Peaker, 1971). In the first report the analyses of data were undertaken
between schools and between students within schools.
The second study was reported in Equality of Educational Opportunity (Coleman,
Campbell, Hobson, McPartland, Mood, Weinfeld, & York, 1966) and involved a large
national sample of students in the United States. The unit of analysis in this report was the
students. Subsequent reanalyses of the data used the school as the unit of analysis in order
to identify both within school and student group effects (Mayeske, Wisler, Beaton,
Weinfeld, Cohen, Okada, Proshek, & Tabler, 1969). These reports raised important policy
issues and gave rise to considerable controversy and several further analyses.
The third study was a comparison of achievement in mathematics across 12
countries at the 13-year-old, Grade 8, and terminal secondary school levels. The report of
the study, which was directed by T. N. Postlethwaite, with the data analysis being carried
out by R. M. Wolf, was prepared by a distinguished team of seven scholars including B. S.
Bloom, G. F. Peaker, R. L. Thorndike, D. A. Walker and with T. Husén (1967) as editor.
This study, although cross-sectional in nature, was able to estimate the extent of learning
that occurred across the grades, as well as the factors influencing student learning in
different countries with students as the unit of analysis.
The key features of these studies were (a) the measurement and analysis of
educational outcomes; (b) the seminal identification of student and school factors that
influenced student learning; (c) the importance of the home circumstances, attitudes and
practices for learning in schools; and (d) the power of computers for the storing and
analysis of data concerned with teaching and learning within schools. Educational research
in the years that have followed these three major studies has been able to investigate many
educational issues and problems that had previously been intractable. However, the
success of these research studies gave rise to a body of opinion that challenged statistical
analysis, forced a dichotomy between the quantitative and qualitative aspects of
educational research, and sought to divide the field of inquiry into educational problems.
Fortunately, over the past 40 years there have been marked developments in the power of
computers for the analysis of complex bodies of data, and advances in statistical procedures
for the rigorous and systematic testing of models that served to explain how data relating to
student learning were generated. The issues facing educational research workers at the
present time involve (a) the examination of the learning and development of students as
they advance through school grades; (b) the identification of the factors that influence the
effectiveness of schools in facilitating student learning and development, and (c) the
similarities and differences between countries and cultures in the factors that influence
educational outcomes.

Some Problems of Analysis in School Effectiveness Studies

Several challenging problems facing educational research workers engaged in the
study of school effectiveness are discussed in the sections of this article that follow, while
this article is primarily concerned with the solution of the second issue listed above,
namely, the assessment of school effectiveness.

The Level of Analysis Problem

One of the major problems in the analysis of data in educational research involves
the appropriate level of analysis that should be employed, namely between students,
between students within schools, or between schools. In 1939, E. L. Thorndike (1939)
warned of the dangers of using correlation and regression estimates recorded at the group
level when making inferences relating to individuals and subgroups. Subsequently,
Robinson (1950) referred to this problem as the "ecological fallacy" which involved the
introduction of bias when data were aggregated to the group level. The resolution of the
problem lay not in choosing an appropriate unit of analysis, but in the development of
analytical procedures for a hierarchically structured regression model. Burstein, Linn and
Capell (1978) proposed a procedure to investigate the modelling of the factors influencing
the within-group slopes in educational data, a procedure that they referred to as the "slopes-
as-outcomes" approach to multilevel analysis. The problem of providing efficient
estimates of error was solved by Mason, Wong and Entwisle (1983) and developed into an
effective computer program by Raudenbush and Bryk (1986) using an empirical Bayes
estimation procedure. In this approach the variance is partitioned between levels, and
variables that account significantly for variance at the within-group and the between-group
levels, as well as cross-level interaction effects, are readily estimated. Other similar
analytical strategies have also been advanced by Aitkin and Longford (1986) and Goldstein
(1987).
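The partitioning of variance between levels that these procedures provide can be illustrated with a minimal two-level random-intercept model. The sketch below is not the analysis used in the studies reported here; it assumes a hypothetical student file with columns score, ses and school, and uses the MixedLM routine of the Python statsmodels library simply to show how the between-school and within-school variance components, and hence the intraclass correlation, are obtained.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level records: one row per student, with an outcome
# score, a home background measure (ses) and a school identifier.
df = pd.read_csv("students.csv")

# Two-level model: students (Level 1) nested within schools (Level 2),
# with a random intercept for each school.
fit = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()

school_var = float(fit.cov_re.iloc[0, 0])   # between-school variance component
student_var = fit.scale                     # residual within-school variance
icc = school_var / (school_var + student_var)

print(fit.summary())
print(f"Share of variance lying between schools (ICC): {icc:.3f}")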

The Analysis of Change Data

Studies of student learning and development necessarily involve the analysis of
change data in which student performance is measured on two or more occasions. If
student performance is obtained on only two occasions, the low level of reliability of
difference scores between occasions generally precludes the effective analysis of these
difference scores in much educational research. However, the use of the principle of
relative change in which the later performance scores are regressed on the earlier
performance scores using multilevel analysis procedures provides an effective analytical
approach (Larkin & Keeves, 1984). Moreover, if performance scores are obtained on at
least three occasions, absolute change can generally be estimated using the slopes of the
within-group regression lines in a hierarchical linear modelling analysis (Willett, 1988).
With performance measured on more than three occasions, a non-linear model of change
can be used if necessary and the effects of factors influencing change can be estimated.
Raudenbush and Bryk's (1986) computer program permitted change data to be effectively
analysed at the same time as the hierarchical or nested structure of educational data that
involved students working within schools was modelled. Moreover, this computer program
calculated values of the reliability of key relationships included in the model and used these
values in the empirical Bayes estimation procedure that is involved in the calculation of the
parameters of the change model.
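In outline, and with notation assumed here rather than taken from the original reports, the two approaches can be written as Level 1 specifications for student i in school j:

Relative change (two occasions):
    Y_{ij}^{(2)} = \beta_{0j} + \beta_{1j} Y_{ij}^{(1)} + e_{ij}

Absolute change (three or more occasions t):
    Y_{tij} = \pi_{0ij} + \pi_{1ij}\,(\mathrm{Time})_{tij} + e_{tij}

In the second specification the within-student slope \pi_{1ij} provides the measure of absolute change, and its reliability improves as the number of occasions on which performance is measured increases.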

The Effectiveness of Schools and the Concept of "Value Added"

The early studies of educational achievement were unable to employ multilevel
analysis procedures, as these were not at that time in existence. However, it was possible to
identify home characteristics that accounted for the variability in the outcome measures,
both at the between-student and between-school levels. Once these student level factors
had been identified, it became possible to remove their effects by regressing the criterion
on the variables involving the home characteristics and to obtain residual measures for each
school. These residual effects for each case could also be estimated accurately and tested
for statistical significance, with some cases recorded with residual effects well below an
expected value, and other cases recorded with residual effects well above an expected
value. Peaker (1975) proposed that schools performing well above, or well below,
expectation should be identified and carefully examined using a case study approach to
identify reasons why such schools might be performing above or below expectation.
Exploratory studies were undertaken by Owen (1975) and Wilson (1975), but were only
partially able to identify why some schools were teaching science more effectively and
other schools were less effective than might have been expected.
These ideas have formed the basis of studies of school effectiveness, and this simple
term has been widely employed to refer to the extent to which a school might be
performing above or below expectation. However, while it was recognized that schools
could only be expected to perform at a level that was set by the home circumstances of the
students attending the school, because of aggregation bias and other factors, the values of
home background effects calculated at the school level did not provide sound estimates of
how well the students within a school as a group might be expected to perform.
Consequently, as opportunities arose to measure student performance on two or more
occasions it was seen to be more appropriate to examine change in performance over time.
McPherson (1993, p. 1) used the term "value added" to refer to the extent to which schools
performed above expectation after allowance had been made for both the prior achievement
of the students and their background characteristics. He defined the term value added as "a
school's 'added value' in the boost it gives to a child's previous level of attainment", with
the term attainment employed where normally achievement would be used. The term value
added is now quite widely adopted and involves the estimated residual term after the
effects of home background characteristics and prior achievement have been removed.
The concern for the effectiveness of schools that arises both from the greater interest
of parents in the progress made by their children as they move through the grades of
schooling and from the concern for the rising costs of education noted by politicians
and taxpayers, has led to the suggestion that schools should be in some way compared and
ranked in terms of how well the individual schools are found to perform. Merely to list
schools in terms of the mean level of achievement of the students within a catchment area
served by the school, without any allowance being made for the prior performance of the
students within the school or for the home background characteristics of the students, is
highly unsatisfactory. Clearly, it would be misleading to compare schools without
adjusting the scores for the differences between schools in student intake. Moreover, it
would be inadequate simply to collect cross-sectional data and not to examine change in
performance over a specified period of time.

The Problem of Transience

The examination of data involving change in performance over time raises two
substantial problems. First, many students change schools between the occasions of
testing. Under these circumstances, should only those students who have remained in the
same school over the duration of the study be considered to represent the school when
calculating the value added by the school? Alternatively, should the school that the student
attended on the first occasion of testing be the school to which the student is assigned, thus
avoiding dropping from the calculation those students who had changed school during the
duration of the study, or should the student be allocated to the school attended at the final
time of testing? Second, there is the major problem of actually tracking down students
between the times of testing and the accurate identification of the schools to which a
student belongs. This could lead to all students being assigned a number that would allow
each student to be identified even if the student moved to or from another school system at
some time during the duration of the study. Experience has shown that even in a relatively
small school system the tracking down of students over time is a sizeable and complex task,
as is the specification of the school to which a student should belong.
The examination of data sets indicates that the students who changed schools and
the students who could not be traced between occasions would tend to be those students
who were performing at a lower level than average. Consequently, the mean value of a
school would be raised as a result of losses due to transience, since it would be the lower
performing students who would be more likely to be transient and therefore be lost from
the study because of a breakdown in tracking. Moreover, they would most likely be the
students in greatest need of remedial help, the provision of which would be a particular
objective of the testing program.

Types of Adjustment Before Value Added Estimates Are Made

There are four types of adjustment that can be carried out before estimates of value
added effects are made for each school. First, allowance can be made for the initial
performance of each student by using either a relative or an absolute change procedure of
analysis. This issue is dealt with in greater detail in a later section of this article. In
addition, it is possible to calculate three types of value added school effects.

Estimation of School Effects

Willms and Raudenbush (1989, pp. 212-214) and Raudenbush and Willms (1995)
argue that student performance (Y) is influenced by three general factors: the
student background characteristics (S), school context (C) and identified school policies
and practices (P), as well as each school's unique contribution.
Thus Y = grand mean + C + P + S + school contribution + student error.
This model is derived from Carroll's (1963) model of school learning and the many
extensions developed from this model. They also argue that there are two types of school
effects: Type A effects and Type B effects.
Raudenbush and Willms (1995, p.310) define a Type A effect as "the effect parents
generally consider when choosing one of the many schools for their child."
The Type A effects are specified as A = C + P + school contribution.
The Type B effect "is the effect school officials consider when evaluating the
performance of those who work in the schools".
The Type B effects are specified as B = P + school contribution.
The student background characteristics (S), are estimated at the between student
level while the Type A effects are estimated at the school level with the student background
characteristics removed. Type B effects are estimated by removing the school context
effects (C), such as the average socio-economic characteristics of the students in the school.
Normally, it is the Type B effects that are of interest since they include policies and
practices that are associated with the school, and for each school they specify that school's
unique contribution. However, there are some policy factors that should be removed, such
as whether the school is urban or rural, or the size of the school in situations where the
school has no control over its size, as well as other stratifying variables such as State or
School Type.
Thus Type X effects can be defined as X = P* + school contribution
where P has been reduced to P* by removal of the effects (p) of school size, urban or rural,
or stratifying factors (F). It may be argued that the Type X estimate is the most appropriate
estimate of value added, with student effects (S), school context effects (C), and some
policy effects (P - P*) removed from the value added estimates.
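Gathered into a single statement, with notation assumed for this article rather than reproduced from Willms and Raudenbush, the decomposition and the three types of effect for school j can be sketched as:

    Y_{ij} = \mu + S_{ij} + C_j + P_j + u_j + e_{ij}
    A_j = C_j + P_j + u_j
    B_j = P_j + u_j
    X_j = P_j^{*} + u_j

where u_j is school j's unique contribution, e_{ij} is the student error, and P_j^{*} denotes the policy and practice effects P_j after the externally determined policy effects (p) and the stratifying factors (F) have been removed.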

Reductions in Variance and Deviance

It should be noted that the differences between schools would be successively
reduced as Type A, Type B and Type X effects were estimated. It would be expected that
as S factors, C factors and some p and F factors were added to the regression model, there
would also be successive reductions in the deviance, with the models providing better fit to
the data, as well as reductions in the residual variance. If the residual variance reached a
low level, the estimated differences between schools might have become so small that the
variance remaining at the school level of analysis would be negligible and there would be
little case for arguing that real differences existed between schools in their level of
performance, when other things were considered equal.
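A minimal sketch of this sequence, using the same hypothetical student file as in the earlier sketch (the column names gender and school_mean_ses are illustrative only), fits successively richer random-intercept models by full maximum likelihood and prints the deviance and the between-school variance at each stage:

import statsmodels.formula.api as smf

# Successively richer models: unconditional, then with student (S) variables,
# then with a school context (C) variable, all fitted by full ML so that
# the deviances (-2 log-likelihood) are comparable across models.
m0 = smf.mixedlm("score ~ 1", df, groups=df["school"]).fit(reml=False)
m1 = smf.mixedlm("score ~ ses + gender", df, groups=df["school"]).fit(reml=False)
m2 = smf.mixedlm("score ~ ses + gender + school_mean_ses", df,
                 groups=df["school"]).fit(reml=False)

for name, m in [("unconditional", m0), ("+ S", m1), ("+ S + C", m2)]:
    print(f"{name:14s} deviance {-2 * m.llf:10.1f}   "
          f"between-school variance {float(m.cov_re.iloc[0, 0]):6.3f}")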

Background to the Study

In order to illustrate the many issues that were involved in the examination and
estimation of the concept of 'value added' by schools operating within a school system, data
and findings were drawn from the government school system of South Australia and the
analyses undertaken and reported by Njora Hungi (2003) and Tilahun Afrassa (2004). The
simple comparison and ranking of schools across a school system would provide little
evidence about the effectiveness of each school unless allowance were made both for the
home background characteristics of the students and the quality of the teaching and
learning received by those students during their very early school years. The starting point
for the examination of the effectiveness of schools at the primary school level would
necessarily be the responses of the students to a pencil and paper test that required the
students to be able to read a simple text, and to respond by blackening a small box or
writing simple words. The individual testing of large numbers of students in a school
system by an experienced teacher would clearly be too costly to form part of a regular
testing program that would have to be carried out at the time that the students commenced
school. From the current starting point at the Grade 3 level or eight year-old level it is
possible to measure the amount of learning achieved by individual students through the
administration of pencil and paper tests at regular intervals across the years of primary
schooling. Nevertheless, in order to measure change over time it is necessary for the tests
to be equated and student achievement to be recorded on an interval scale that satisfies the
requirement of unidimensionality. This can be done using the procedures of Rasch scaling
with computer programs specially developed for this purpose, provided the tests contain so-
called bridge items to link across grade levels and occasions. This article discusses two
approaches to the multilevel analysis of student performance over time to assess growth in
learning, as well as appropriate procedures for the removal of the effects on student
performance associated with the home circumstances of students, the context in which each
school is set, and policy decisions over which the school has no control or those factors that
are taken into consideration in drawing a stratified sample. While this discussion employs
tests of statistical significance, it must be recognized that the data under examination
represent sub-populations or cohorts of students who progress through the years of
schooling.
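The Rasch model that underlies this scaling gives, in its standard form for a dichotomously scored item, the probability that student n answers item i correctly as a function of the student's ability \theta_n and the item's difficulty \delta_i, both expressed in logits:

    P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

Concurrent equating then calibrates all grade levels and occasions in a single analysis, with each bridge item constrained to a common difficulty wherever it appears, so that achievement at Grades 3, 5 and 7 is reported on the one logit scale.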
While there were losses in the data due to student absences on the day o f testing, or
a student's inability to read and write, as well as losses that arose from an inability to match
the computer records of some individual students over time, the data under examination
represented sub-populations associated with the larger student population that extended
over a decade and more of primary schooling in a centrally administered school system in
which all schools participated in the testing program.

The Basic Skills Testing Program and the Data

The Basic Skills Testing Program commenced operation in South Australia in 1995,
testing at the Grade 3 and Grade 5 levels. In 2001 the testing program was extended to the
Grade 7 level, which in South Australia remained within the stage of primary schooling.
During the initial years the tests used were developed by the New South Wales Department
of School Education in collaboration and consultation with the staff of the Curriculum
Division of the South Australian system. In the most recent years the New South Wales
Department of School Education has not been involved in the development of the tests or
the processing and the examination of the data. The scaling and the analyses, as well as the
merging of the data files were carried out by the two co-authors of this article (Hungi
Njora, 2003; Tilahun Afrassa, 2004). For more detailed accounts of the procedures used,
their reports should be consulted.
Tests of literacy and numeracy were administered at the three grade levels of Grade
3, Grade 5 and Grade 7. In addition, students responded to a questionnaire that provided
information on the students' home background. Additional information about the students
and the schools they attended was obtained from the computer files held by the South
Australian school system.
Two different sets of data were made available for examination, scaling and analysis
that have given rise to two separate treatments of the estimation of value added effects.
The first data set comprised students in Grade 3 and Grade 5 in the years 1995 to 2000. There
were four cohorts of students who were tested on two occasions: 1995 and 1997; 1996 and
1998; 1997 and 1999; and 1998 and 2000. The second data set comprised students who
were tested at Grade 3, Grade 5 and Grade 7. There were three cohorts of students who
were tested on three occasions, 1997, 1999 and 2001; 1998, 2000 and 2002; and 1999,
2001 and 2003. Table 1 records the numbers of students and schools considered in the
analyses presented here. It should be noted that the analyses discussed included only those
students and schools where the student data could be matched on the two or three occasions
of testing and where the students remained in the same school over the period of time
involved.

Table 1: Numbers of Students and Schools Participating in Study 1 and Study 2

Study 1 (two tests taken)
                          1995-1997        1996-1998        1997-1999        1998-2000
                          Cohort 1         Cohort 2         Cohort 3         Cohort 4
                          Students Schools Students Schools Students Schools Students Schools
Same schools, matched       6898     426     7788     426     8926     426     9129     426
Cohort size                10283     489    11095     485    12437     482    12794     474

Study 2 (three tests taken)
                          1997-2001        1998-2002        1999-2003
                          Cohort 1         Cohort 2         Cohort 3
                          Students Schools Students Schools Students Schools
Same schools, matched       6553     393     6690     393     6638     393
Cohort size                12437     482    12794     474    12550     473

Sources: Hungi Njora (2003); Tilahun Afrassa (2004).

In general terms the sizes of the grade cohorts over the nine years during which this study
was carried out were estimated to be about 14,000 students drawn from approximately 500
Government schools. The cohort size for each group in this study was determined by the
number of students and schools taking part on the first testing occasion for each cohort, and
it is evident that not all schools participated in the program on all occasions. Exemptions
were readily granted to special schools, small schools and schools with exceptional
circumstances. For the analyses reported in this article, only students who could be
matched, who did not repeat a grade, and who remained in the same school for both testing
occasions for Study 1, and for all three testing occasions for Study 2, were included in the
cohorts under survey. Consequently, some degree of systematic bias is inevitable in the
estimates derived from the analyses carried out, as a consequence not only of non-
participation and grade repeating, but also of transience between schools, and the inability to
match a student between occasions arising from changes in name or errors in the recording
of names, age or student gender. In most cases it would seem that the less able students
would be the ones who would drop out from the two studies. Nevertheless, it would seem
fair to judge a school for its effects on students by examining the performance of only those
students who remained in the same school for the time intervals of the two studies.
Clearly, further research should be undertaken to examine the effects of transience
on the performance of a school, and Hungi Njora (2003) has carried out some analyses of
this kind. In addition, there would be a need to keep track of students more effectively as
they passed through the school system, and the assigning of an identification number to all
students in Australia would facilitate maintaining better contact with all students. The issue
of grade repetition would also be of interest, in so far as some students were known to have
commenced formal schooling at an early age, because a position was available in the
Reception grade, and such students tended to be advanced through the years of schooling
prematurely, encountering learning problems as a consequence of a failure to build firm
foundations for formal schooling. These students were sometimes required to repeat a
grade during their primary school years, and the consequences of grade repetition could be
examined.

Analysis for Study 1 Data

Separate analyses were carried out for two outcomes, namely literacy and numeracy,
to provide replication and to assess the effectiveness of a school on more than a single
outcome measure. The data collected in Study 1 at Grade 3 and Grade 5 required that at the
micro level, or Level 1, in the multilevel analysis using the HLM computer program
(Raudenbush, Bryk, Cheong, & Congdon, 2000) the effects of relative change were
estimated by regressing the Grade 5 scores on the Grade 3 scores. At the same time the
effects of the home background variables were estimated. It should be noted that each
school had a different mean score value for each cohort, each occasion and for each
outcome.
At the meso level, or Level 2, variables to estimate the effects of each cohort were
included in the regression equation. In this way the level of school performance that had
been adjusted for the characteristics of the student intake, including performance at the
Grade 3 level, was formed into a stable component and a component that varied across
occasions or cohorts.
At the macro level, or Level 3, the mean performance of a school was considered to
vary randomly about the grand mean and the average contextual effect of the students in
the school was able to explain some of the variation around the school means. This step
produced the Type B effect. At the macro level the effects of stratification factors (F) as
well as externally determined policy factors (p), such as the urban or rural nature of the
school were modelled to provide the Type X value added effects. Figure 1 shows in a
diagrammatic form the three levels of analysis of the multilevel HLM model. The Type A
effects were given by the residuals after the inclusion of the Level 1 or micro level factors
in the model. The Type B effects were given by the residuals after the context variables (C)
were included in the model. The Type X effects were given after the effects of the
stratifying factors (F) and non-school based policy (p) variables were included in the
model.
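The structure shown in Figure 1 can be summarised, with notation assumed here (student i, cohort or occasion c, school j) rather than reproduced from the original analysis, as:

    Level 1 (students):  Y_{icj}^{(5)} = \pi_{0cj} + \pi_{1cj} Y_{icj}^{(3)} + \sum_k \pi_{kcj} S_{kicj} + e_{icj}
    Level 2 (cohorts):   \pi_{0cj} = \beta_{00j} + \sum_m \beta_{0mj} (\mathrm{Cohort}_m)_{cj} + r_{0cj}
    Level 3 (schools):   \beta_{00j} = \gamma_{000} + \gamma_{001} C_j + \gamma_{002} p_j + \gamma_{003} F_j + u_{00j}

The residuals remaining at the school level after each successive stage of this model supply the Type A, Type B and Type X value added estimates for school j.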

Figure 1: Multilevel Model Showing Type A, Type B and Type X Effects Provided by
Residuals at Successive Stages of the HLM Analysis for Study 1 Involving Relative Change

Analysis of Study 2 Data

The data collected in Study 2 at Grade 3, Grade 5 and Grade 7 required that at Level
1, or the micro level, the effects of learning across the grades should be modelled in the
regression equation. This estimation of absolute change could be done in two ways. Either
a straight line could be fitted for each individual to model growth (the slope providing a
measure of absolute change) or a variable for a Grade 5 effect and a separate variable for a
Grade 7 effect could be modelled and the size of absolute change effects that were
averaged across all students could be estimated. The differences between cohorts or
between occasions could be modelled at the meso level, or Level 2, together with the
variables involving the student background characteristics. However, cross-level
interaction effects could also be used to estimate the effect of a cohort on the Grade 5 and
Grade 7 slopes or on the straight line slopes. These cross-level interactions provided for
any shortcomings in the development and equating of the tests that might have occurred, if
the Grade 5 and Grade 7 slopes were estimated.
At the macro level, or Level 3 (the school level of analysis), the effects of the
stratifying variables (F) as well as the system level policy (p) variables were modelled in
the regression equations.
The residuals after completion of modelling at Level 2 provided the Type A effects,
while the residuals after completion of modelling at Level 3 provided the Type X effects
directly. The Type B effects were obtained by using a sub-stage at Level 3 in which only
the effects of school level context variables (C) were estimated. The effects of the
stratifying variables (F) and the system level policy variables (p) were added subsequently
and the residuals so formed provided the Type X effects.
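Where separate Grade 5 and Grade 7 variables are used to model absolute change, the Level 1 specification for Study 2 can be sketched (notation again assumed) as:

    Y_{tij} = \pi_{0ij} + \pi_{1ij} (\mathrm{Grade5})_{tij} + \pi_{2ij} (\mathrm{Grade7})_{tij} + e_{tij}

The coefficients \pi_{1ij} and \pi_{2ij} estimate absolute change from Grade 3 to Grade 5 and to Grade 7 respectively; cohort indicators and student background characteristics enter at Level 2, and the context (C), stratifying (F) and system policy (p) variables enter at Level 3, as described above.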

Figure 2: Multilevel Model Showing Type A, Type B and Type X Effects Provided by
Residuals at Successive Stages of the HLM Analysis for Study 2 Involving Absolute Change

It should be noted that the two models developed for Study 1 and Study 2 provided
different information. The estimates from Model 1 provided Stable and Change estimates
of Type A and Type B effects. The estimates from Model 2 provided corresponding
estimates from the cross level interactions between the cohort or the occasion variable and
the Grade 5 and Grade 7 or Time slopes. Moreover, in Model 1 the residuals at the macro
or school level of analysis gave the Type A, Type B and Type X effects directly, while in
Model 2 a sub-stage in the Level 3 estimation was required to distinguish between the Type
B and Type X effects. In the Model 2 analysis, the fitting of a straight line for growth
across the three time points at Level 1 proved to be unstable (with very low reliability) and
the estimation of Grade effects was found to be more stable, but also gave rise to some
anomalous values for the variance estimates.

Factors Influencing Learning Outcomes

Table 2 summarises the factors found in the two studies to influence the
achievement outcomes in Literacy and Numeracy over time. The factors are grouped in
terms of student characteristics (S), school context (C), stratifying variables (F), and policy
factors (p) over which the school has no control. There are clearly differences of
measurement and estimation between the analyses of relative change and absolute change
that lead to some factors making different contributions to explaining the outcomes.
However, as might be expected, there is a high degree of similarity between the two
approaches to analyzing outcomes. Moreover, slightly different sets of explanatory
variables were made available by the South Australian school system for the analyses
undertaken in the two studies.

Table 2: Significant Factors Influencing Achievement Outcomes in Study 1 and Study 2

                                                Numeracy              Literacy
                                          Study 1    Study 2    Study 1    Study 2
Measures of change                        relative   absolute   relative   absolute
S   Student characteristics
    Student gender
    Student age
    Aboriginal and Torres Strait Island race
    Born in Australia - migrancy             * *
    Language of home - English
    School card - Disadvantage benefit
    Negotiated curriculum package
    Grade 3 score
C   School context
    Age of cohort
    Language of home
    Absenteeism
    School card - Disadvantage benefit
    Aboriginal and Torres Strait Island race * +
    Education Index of Community
p, F  Policy and stratification
    Urban - Rural                            *+  *  *+
    School size                              *+  *+
    Isolation
* Significant effect; + Cross-level interaction
Sources: Hungi Njora (2003, pp. 184-5, 212-3); Tilahun Afrassa (2004, Appendix 1)

In addition to the variables recorded in Table 2, Study 2 found highly significant growth
effects between Grades 3 and 5, and between Grades 3 and 7, as well as significant cohort
effects. In Study 2, the variable Urban-Rural only had significant cross-level interaction
effects with gender of student for Literacy, and with holding a School Card (Disadvantage
benefit) for Numeracy. However, it had significant effects for Literacy in Study 1.

Estimation of Variance Explained

The amount of variance explained in the analysis of the two different models is of
considerable interest, in so far as the variance left unexplained at the school level provides
an indication of the extent of differences between schools from which the value added by
the schools can be estimated. As explained above, the value added effects are estimated in
terms of Type A, Type B or Type X effects. However, while significant policy and
stratification effects are recorded in Table 2, these effects are not of sufficient size to
influence the amount of variance explained to a recognizable extent, although they do
improve the model fit with a significant reduction in deviance.

Study 1

Table 3 gives the variance estimates of the variability between schools for the Type
A and Type B effects, there being no recognizable change in variance associated with the
Type X effects for the analysis of the relative change model.

Table 3: Estimates of Variance Explained in Study 1 for Relative Change Model

                                            Numeracy                             Literacy
                               Level 1  Level 2  Level 3   Total    Level 1  Level 2  Level 3   Total
Type A Effects
Number of cases               (32,741)  (1823)    (479)            (32,741)  (1823)    (479)
Variance for simplest model      0.99     0.23     0.22    1.43       0.98     0.21     0.19    1.37
Variance for Type A model        0.54     0.04     0.04               0.43     0.04     0.02
Variance available (%)           68.8     16.0     15.2   100.0       71.3     15.2     13.5   100.0
Variance explained (%)           45.4     82.1     81.1               55.9     82.3     88.0
Total variance explained (%)     31.2     13.1     12.3    56.7       39.8     12.5     11.9    64.2
Variance left unexplained (%)    37.6      2.9      2.9    43.3       31.4      2.7      1.6    35.8
Type B Effects
Variance for simplest model      0.99     0.23     0.22    1.43       0.98     0.21     0.19    1.37
Variance for Type B model        0.54     0.04     0.02               0.43     0.04     0.01
Variance available (%)           68.8     16.0     15.2   100.0       71.3     15.2     13.5   100.0
Variance explained (%)           45.3     81.7     90.7               55.7     82.2     92.7
Total variance explained (%)     31.2     13.1     13.8    58.1       39.7     12.5     12.5    64.8
Variance left unexplained (%)    37.6      2.9      1.4    41.9       31.5      2.7      1.0    35.2

Source: Hungi Njora (2003, pp. 190-191)

The analyses for numeracy and literacy are very similar, with approximately two-thirds of
the total variance explained for literacy and just under 60% of the variance explained for
numeracy. The estimation of the Type B effects makes very little reduction in the variance
unexplained with only a small increase in the variance explained at Level 3, the school
level. However, it is noteworthy that while 15.2% of the total variance in numeracy and
13.5% of the total variance in literacy exists initially at the school level, only 1.4% for
numeracy and 1.0% for literacy remains unexplained after the removal of the effects of
significant factors over which the schools have no control. Thus, the value added by the
schools over the two year period is only concerned with a very small proportion of the total
variance associated with the test scores.
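The percentages in Table 3 follow directly from the variance components. A worked check for the Numeracy, Type A panel is sketched below; small differences from the tabled values reflect rounding of the variance estimates in the source.

# Variance components for Numeracy, Type A effects, as reported in Table 3.
simplest = {"Level 1": 0.99, "Level 2": 0.23, "Level 3": 0.22}   # unconditional model
fitted   = {"Level 1": 0.54, "Level 2": 0.04, "Level 3": 0.04}   # Type A effects model

total = sum(simplest.values())
for level in simplest:
    available   = 100 * simplest[level] / total                        # % of total variance at this level
    explained   = 100 * (simplest[level] - fitted[level]) / simplest[level]
    of_total    = available * explained / 100                          # % of total variance explained here
    unexplained = 100 * fitted[level] / total                          # % of total variance left over
    print(f"{level}: available {available:.1f}%, explained {explained:.1f}%, "
          f"of total {of_total:.1f}%, left unexplained {unexplained:.1f}%")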

Study 2

Table 4 presents the variance estimates for the variability between schools for the
Type A and Type B effects for the analysis of the absolute change model (Model 2). Since
the estimation of the Type X effects makes no recognizable change in the estimates of
variance, they are not presented.

Table 4: Estimates of Variance Explained in Study 2 for Absolute Change Model

                                             Numeracy                       Literacy
                               Levels 1 & 2  Level 3   Total   Levels 1 & 2  Level 3   Total
Type A Effects
Number of cases                   (19,881)    (393)               (19,880)    (393)
Variance for simplest model          2.091    0.197    2.288         1.842    0.151    1.993
Variance for Type A model            1.061    0.196                  0.977    0.147
Variance available (%)                91.4      8.6                   92.4      7.6
Variance explained (%)                49.2      1.0                   53.1      2.6
Total variance explained (%)          45.0      0.0     45.0          49.0      0.2     49.2
Variance left unexplained (%)         46.4      8.6     55.0          43.3      7.4     50.7
Type B Effects
Variance for simplest model          2.091    0.197    2.288         1.842    0.151    1.993
Variance for Type B model            0.630    0.086                  0.978    0.057
Variance available (%)                91.4      8.6                   92.4      7.6
Variance explained (%)                49.2     43.7                   53.1     62.3
Total variance explained (%)          45.0      3.7     48.7          49.0      4.7     53.7
Variance left unexplained (%)         46.4      4.9     51.3          43.3      2.9     46.3

Source: Tilahun Afrassa (2004, Appendix 7)

In the absolute change model the unreliability of the gain scores provided by the
Grade 5 and Grade 7 gain variables, both for Numeracy and Literacy, serves to increase
slightly the variance to be explained at Level 3. This increase is offset by the reduction in
overall variance through the variance explained by the student characteristics in the estimation
of Type A effects, but the model produces only a very small explanation of variance at the school level or
Level 3. Only by the introduction of the Context variables does a substantial reduction of
Level 3 variance occur. However, this still leaves sizable proportions of the school level
variance to be explained, with 4.9% of the total variance for Numeracy and 2.9% of the
total variance for Literacy remaining unexplained after the Context variables are included
in the model. Consequently, with the use of the absolute change model the Type B and Type X
effects show rather greater differences between schools than are recorded from the use of
the relative change model.

Comment on the Student Level Variance

While only relatively small amounts of school level variance remain after the
estimation of the Type B and Type X effects in the value added estimation procedure, large
amounts of student level variance remain unexplained in the models of both studies. The
unexplained variance is rather larger for the absolute change model than for the relative
change model. Thus, it is clear that powerful factors are in operation to produce the
unexplained variability between students within schools. A clear indication of these
powerful factors came from two seminal studies conducted in 1964. Peaker (1967) showed
in the findings of the Plowden study the importance of the effects of parental attitudes
towards their children's education on performance at school. In addition, Coleman et al.
(1966) reported the significant effects of the language skills of teachers on student
achievement. At the same time Richard Wolf and his colleague Dave provided evidence
for the marked effects of the educative processes and practices of the home on educational
outcomes (see Bloom, 1964, p. 78 and 124). In addition, recent unpublished analyses by
Hungi Njora in two developing countries, namely Vietnam and Kenya, have produced
evidence of the influence of teachers' knowledge of the subject content they were teaching
on the performance of the students in their classrooms.
These studies indicate that not only the effects of the attitudes and practices of the
home but also of teachers in classrooms need to be taken into consideration in research
studies if a greater understanding of the learning of students in schools is to be obtained.
The major component of unexplained variance in student achievement lies not in the value
added contribution of the schools, but rather in the processes and practices of the home and
in the differential effects of teachers within schools on the learning of the students in their
classrooms. The importance of studies of school effectiveness lies not so much in
identifying effective organizational policies and practices of schools, but rather in focusing
attention on the monitoring of student learning within schools and within classrooms where
students are being taught by teachers with different language skills, different knowledge of
their subject matter, and different capacities to guide and enthuse students in the tasks of
learning.

Reporting School Effectiveness

It is with the view of increasing the effectiveness of monitoring student learning in
schools that the closing section of this article raises questions about the ways in which
estimates of school effectiveness derived from major testing programs such as the Basic
Skills Testing Program should be provided to schools.

Identifying School Improvement Over Time

Testing programs are now well established in many parts of the world that permit
the examination of both student and school performance as the students progress upwards
across the grades of schooling, and as the schools take in successive cohorts of students in
response to the changing policies and practices of the schools. Thus, both the gains made
by individual students and groups of students as they move through the school grades and
the gains made by successive cohorts of students, can be systematically examined.
From the HLM (Raudenbush & Bryk, 1986) and the MLwiN (Goldstein, 1987)
computer programs it is possible to obtain data files of the residuals that remain after the
effects of student background characteristics, school context, system determined policy
factors, and stratification variables, such as school type and urban-rural location have been
taken into consideration. From these residual files the value added effects of schools can
be estimated after agreement has been reached as to what variables should be included in
the analyses that provide adjustment for factors outside the schools' control. Consideration
must also be given to the model of change that should be employed in the multilevel
analyses, and whether student transience should be provided for and in what way. It should
be recognized in the discussion and the diagram that is presented in Figure 3 that the data
have been adjusted for the influence of significant effects that have been identified in Study
2. The examination of these so-called value added effects should replace discussion of
change in raw scores that are currently provided to schools, or any comparisons, such as a
change in ranking of schools in terms of their raw scores, that might be presented.

Presentation of Results of Value-added Estimations

Issues arise with respect to how the results of the estimation of value added effects
should be presented to schools and to a wider public who might be interested in the
effectiveness of particular schools. The question arises whether the value added estimates
should be compared among schools by ranking schools in the order of the size of value
added effects or whether schools should only be concerned with comparing their value
added estimates over time, across grades and between cohorts within each particular
school. For each school, since the size of each tested group is known, confidence intervals
can be calculated for the value added estimates using the student error estimates for the
school. However, since the data being analysed are for populations and since the sizes of
the school groups tested are likely to vary considerably, significance testing on a large scale
is likely to be a dubious undertaking.
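One hedged way of constructing such an interval, assuming that the value added estimate \hat{v}_j for school j is accompanied by a within-school student error variance \hat{\sigma}_j^2 and a tested group of size n_j, is the familiar

    \hat{v}_j \pm 1.96 \sqrt{\hat{\sigma}_j^2 / n_j}

so that the interval widens sharply for small schools, which is one reason why significance testing on a large scale across schools of very different sizes is problematic.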

Identifying Effective and Ineffective Schools

The residual estimates produced by the HLM program can be used to arrange
schools in the order of their value added scores. Since all scores can be calculated on a logit
scale, it is possible to provide a meaningful interpretation of the magnitude o f a value
added score. Evidence from an earlier study by Hungi Njora (1997) has shown that for both
Literacy and Numeracy the extent of learning during a school year at the Grade 3 to Grade
5 levels in this school system is estimated to be approximately 0.5 logits. Consequently, in
order to identify effective and ineffective schools, a tolerance band of 0.5 on either side of
zero on the value added score scale would serve to show schools that were advanced above
expectation or retarded below expectation by more than one year of school learning. It
should be noted that some schools, because of losses of students through transience or
failure to match students across occasions, might have their value added scores augmented
by small amounts as a consequence of the transient and not-matched students being likely
to perform at a lower level than the non-transient and matched students. Nevertheless, the
setting of a tolerance band associated with one year of learning in schools would appear to
be an effective way of identifying schools that had value added effects of sufficient size to
warrant further critical examination.
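Applying the tolerance band is then a matter of flagging schools whose value added estimates lie more than 0.5 logits above or below zero; the short sketch below uses purely illustrative values.

# Value added estimates in logits for a handful of hypothetical schools.
value_added = {"School A": 0.72, "School B": 0.10, "School C": -0.04, "School D": -0.61}

BAND = 0.5   # roughly one year of learning at these grade levels

for school, v in sorted(value_added.items(), key=lambda kv: kv[1], reverse=True):
    if v > BAND:
        label = "above expectation by more than one year of learning"
    elif v < -BAND:
        label = "below expectation by more than one year of learning"
    else:
        label = "within the tolerance band"
    print(f"{school}: {v:+.2f} logits - {label}")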

Figure 3: Empirical Value Added Estimates for Schools in Study 2 in Numeracy
(Source: Tilahun Afrassa, 2004)
[Schools are ranked by their value added scores in logits, with bands at +0.5 and -0.5
marking performance one year above or below average.]

Figure 3 shows the value added estimations for schools in Numeracy that are ranked
on a graph from right to left in terms of the sizes of their value added effects. Effective
schools can be considered to be those that are 12 months ahead of other schools because of
their estimated value added effects. There is also a small group of schools that are 12
months or more behind, because of their low value added effects. Research needs to be
undertaken to investigate why those schools that are located in the tails of the value added
score distribution are effective or ineffective in their provision for student learning. It
should be noted that the correlation between the Literacy and Numeracy value added
effects for the approximately 400 schools for which such effects were estimated was 0.64.
This indicates that the ordering of schools in Literacy is noticeably different from that in
Numeracy and that schools differ in their effectiveness with respect to different educational
outcomes.

Conclusion

Programs for the monitoring of educational outcomes have been introduced in many
parts of the world. In these programs populations of students are tested at intervals on a
regular basis. The information obtained is reported back (a) to students and their parents;
(b) to school principals, teachers and school communities, and (c) to school systems,
politicians and the wider public. Different purposes are served by reporting at different
levels to the different stakeholders. However, parents, school principals and the
administrators of school systems are all interested in the effectiveness of schools and the
concept of the value added by a school is now widely accepted. However, the value added
by a school can only be estimated from an analysis of the variability that exists between
schools. Consequently, the performance of schools must be subjected to comparative
procedures and systematic analysis, for the effectiveness of the schools to be investigated
and assessed in meaningful ways. Nevertheless, there are many problems involved in the
estimation of the value added effects of individual schools. These problems include (a) the
type of regression model to be employed for the examination of change; (b) the type of
value added effects to be reported; (c) the variables to be considered for the adjustment of
the outcomes to allow for factors over which the schools have no control; (d) the
educational outcomes to be investigated and measured, and (e) the adjustments made for
transience of students, since testing occurs on more than one occasion.
Of some considerable importance are the methods to be employed in reporting to the
different stakeholders in education. Some comparisons can be made within schools
between outcomes, between student cohorts, between grades, and over time and there is
little doubt that these comparisons are of interest to parents, school principals and to the
administrators of school systems. A major problem arises with respect to the making of
simplistic and invidious comparisons between schools by ranking them in order of value
added effects and in other ways. The findings of this study show that with different
regression models and different variables taken into consideration there are very different
estimates of variance between schools. Moreover, under some circumstances the residual
variance between schools is very small. Furthermore, the errors associated with the value
added effects depend on the size of each school group tested, so that comparisons between
schools are likely to be complex with errors depending on the numbers of students tested in
each particular school. Research is clearly needed into the procedures of analysis and the
estimation of different value added effects that could and should be employed. This article
does not make recommendations except to argue for continuing research in the field and to
thank the institutions that made these two research studies possible.

References

Afrassa, T.M. (2004). Using student achievement data to identify school improvement over time.
Ninth Round Table on Assessment, November 7-9, 2004, Sydney.

Aitkin, M., & Longford, N. (1986). Statistical modelling issues in school effectiveness studies.
Journal of the Royal Statistical Society, Series A, 149 (1), 1-26.

Bloom, B.S. (1964). Stability and change in human characteristics. New York: Wiley.

Burstein, L., Linn, R.L., & Capell, F.J. (1978). Analysing multilevel data in the presence of
heterogeneous within class regressions. Journal of Educational Statistics, 3 (4), 347-383.

Carroll, J.B. (1963). A model of school learning. Teachers College Record, 64, 723-733.

Coleman, J.S., Campbell, E.Q., Hobson, C.J., McPartland, J., Mood, A.M., Weinfeld, F.D., &
York, R.L. (1966). Equality of educational opportunity. Washington, DC: NCES, US Government Printing
Office.

Goldstein, H. (1987). Multilevel models in educational and social research. London: Griffin.

Hungi, N. (1997). Measuring basic skills across primary school years. Unpublished Master of
Education thesis: Flinders University, Adelaide.

Hungi, N. (2003). Measuring school effects across grades. Flinders University Institute of
International Education Research Collection, No. 6. Adelaide: Shannon Press.

Husén, T. (Ed.) (1967). International study of achievement in mathematics (2 vols). Stockholm:
Almqvist and Wiksell.

Larkin, A.I., & Keeves, J.P. (1984). The class size question: A study at different levels of analysis.
Hawthorn, Victoria: ACER.

Mason, W.M., Wong, G.Y., & Entwisle, B. (1983). Contextual analysis through the multilevel
model. In S. Leinhardt (Ed.), Sociological methodology 1983-84. San Francisco: Jossey-Bass.

Mayeske, G.W., Wisler, C.E., Beaton, A.E., Weinfeld, F.D., Cohen, W.M., Okada, T., Proshek,
J.M., & Tabler, K.A. (1969). A study of our nation's schools. Washington, DC: US Government Printing
Office.

McPherson, A. (1993). Measuring added value in schools. London: National Commission on
Education.

Owen, J.M. (1975). The effect of schools on achievement in science. IEA (Australia) Reports, 1975:
1. Hawthorn, Victoria: ACER.

Peaker, G.F. (1967). The regression analysis of the National Survey. In Central Advisory Council for
Education (England), Children and their primary schools (Plowden Report), Volume 2, Appendix IV.
London: HMSO.

Peaker, G.F. (1971). The Plowden children four years later. Slough, Bucks: NFER.

Peaker, G.F. (1975). An empirical study of education in twenty-one countries: A technical report.
International Studies in Evaluation VIII. Stockholm: Almqvist and Wiksell.

Raudenbush, S.W., & Bryk, A.S. (1986). A hierarchical model for studying school effects. Sociology
of Education, 59, 1-17.

Raudenbush, S.W., & Willms, J.D. (1995). The estimation of school effects. Journal of Educational
and Behavioral Statistics, 20 (4), 307-335.

Raudenbush, S.W., Bryk, A.S., Cheong, Y.F., & Congdon, R.T. (2000). HLM 5: Hierarchical
linear and non-linear modeling (Computer software). Lincolnwood, IL: Scientific Software International.

Robinson, W.S. (1950). Ecological correlations and the behavior of individuals. American
Sociological Review, 15, 351-357.

Thorndike, E.L. (1939). On the fallacy of imputing the correlations found for groups to the individuals
or smaller groups composing them. American Journal of Psychology, 52, 122-124.

Willett, J.B. (1988). Questions and answers in the measurement of change. In E. Rothkopf (Ed.),
Review of research in education (1988-89) (pp. 345-422). Washington, DC: AERA.

Willms, J.D., & Raudenbush, S.W. (1989). A longitudinal hierarchical linear model for estimating
school effects and their stability. Journal of Educational Measurement, 26 (3), 209-232.

Wilson, A.F. (1975). The effect of schools in Victoria on the science achievement of junior
secondary students. IEA (Australia) Reports 1975: 2. Hawthorn, Victoria: ACER.

The Authors

JOHN KEEVES taught mathematics and science in secondary schools in Australia and
England before moving to research and curriculum development at the Australian Council
for Educational Research where he became associate director and subsequently director
following research training and work at the Australian National University and the
Universities of Melbourne and Stockholm. In retirement, he is currently carrying out
research and teaching at Flinders University in South Australia.

NJORA HUNGI completed an honours degree in Agricultural Science in Kenya and taught
in rural high schools before undertaking masters and doctoral studies at Flinders University
into the estimation of value added effects in South Australian schools. He has recently
moved to work on large scale testing programs at the Australian Council for Educational
Research in the Sydney Office.

TILAHUN MANGESHA AFRASSA undertook training in educational research at Flinders
University after working in the schools of Ethiopia and as acting director of the Ethiopian
National Examination office. Following the completion of masters and doctoral degrees he
worked at the Senior Secondary Assessment Board of South Australia, before moving to a
position at the South Australian Department of Education and Children's Services with
responsibility for the conduct of major testing programs.

Correspondence: <john.keeves@flinders.edu.au>
