Source: American Educational Research Journal, Vol. 9, No. 1 (Winter, 1972), pp. 13-27. Published by the American Educational Research Association.

Classroom Observation Schemes: Where are the Errors?

BARRY McGAW and JAMES L. WARDROP


University of Illinois at Urbana-Champaign

MARY ANNE BUNDA


The Ohio State University

Much of the confusion about the notion of reliability was cleared with the publication of Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association, 1954) and Technical Recommendations for Achievement Tests (American Educational Research Association, 1955). The subsequent publication of Standards for Educational and Psychological Tests and Manuals
(American Psychological Association, 1966), while it introduced
considerable changes in the conception of validity, redefined reliability
in essentially the same terms.
The term "reliability" was clarified through the recognition of its
several meanings and, consequently, that " 'reliability coefficient' is a
generic term referring to various types of evidence; each type of
evidence suggesting a different meaning" (APA, 1966, p. 25). Although
the explicit definition of a series of alternative conceptions of reliability clarified the issue, it did not resolve a number of important questions of interpretation. For example, there is no clear guide for the selection of an appropriate reliability coefficient, nor is there any resolution of the problem of interpreting differences in the values of the various coefficients for the same test.
Further problems arise when consideration is given to the
reliability of observation schedules. Here a further variable is introduced into the measurement situation, and inter-observer disagreement
becomes an additional source of errors of measurement. This type of
error is somewhat akin to that between alternate forms of a test, which
is typically estimated with an equivalent forms correlation coefficient.
It is a more important source of error with an observation schedule, however, since it is usually unavoidable. Whereas, with a test, it is possible in many situations to use only a single form and so avoid this source of unreliability, with an observation schedule the obvious physical limitations necessitate the use of more than one observer in all but the smallest studies.1

1 This is not intended to imply that the use of only one observer is desirable, for in such a study no estimate of the effect of observer bias can be obtained.
A major problem with a series of reliability indices, each of which
measures the effect of only one or two sources of unreliability, is that
there is no way of obtaining a "total" picture of the combined effects
of all relevant sources of error. The range of sources of error with
observational techniques is discussed by Medley and Mitzel (1963). At
the descriptive level, with respect to classroom observation techniques, a
measure can be claimed to be reliable "to the extent that the average
difference between two measurements independently obtained in the
same classroom is smaller than the average difference between two
measurements obtained in different classrooms" (p. 250). Unreliability
occurs when two measures of the same object (person, classroom) tend
to differ too much either because the behaviors being observed are too
variable or because independent observers cannot agree on what is
happening.
In addition to these sources, unreliability may also be due to the
smallness of differences among the objects of observation on the
dimensions observed. This is a less important point for test construction, where preliminary item analyses are used to reject items which do
not discriminate, than it is for the development of observation
schedules. The point seems seldom to have been recognized, particularly by those who have developed classroom observation schemes. It is
clearly possible that an instrument for which the level of inter-judge
agreement is very high may be quite unreliable in spite of the judges'
consistency.
Coefficients of Observer Agreement
The inadequacy of measures of inter-observer agreement as
reliability indices has been discussed (see e.g., Medley and Mitzel, 1963;
Westbury, 1967), yet these discussions appear to have had little impact.
Most authors of classroom observation schemes deal with reliability
exclusively in terms of observer agreement. For example, Bellack, et al.
(1966), describing their technique to measure observer agreement,


claim that the results "indicate a consistently high degree of reliability
for all major categories of analysis: agreement ranged from 84 to 96
percent" (p. 35). They established these figures using two pairs of
independent coders.
Smith, et al. (1964, 1967), used the same technique used
previously by Smith and Meux (1962) to calculate the consistency with
which independent judges identified and classified their units of
behavior (ventures). These two estimates are essentially percent
agreement measures. Flanders (1967) also specified reliability only in
terms of observer agreement. The π coefficient he used is superior to
simple percentage agreement measures since it estimates, instead, the
extent to which chance agreement has been exceeded. A notable
exception to these studies is that of Brown, Mendenhall, and Beaver
(1968), in which the reliability estimate was a measure of the
consistency with which teachers could be discriminated from one
another.
Medley and Mitzel (1963) argued strongly against using observer
agreement indices as measures of reliability, in fact referring to them as
coefficients of observer agreement to distinguish them from coefficients
of reliability.
Weick (1968), referring to observation schemes in general,
observed that "the most common reliability measure in observational
studies is observer agreement" but, unfortunately, supported this state
of affairs by arguing that inter-observer agreement is more important
than replicability, i.e., intra-observer agreement over occasions. While
this is undoubtedly true, it misses the main point. The establishment
and maintenance of a suitably high level of inter-judge agreement is
important only insofar as it operates as a limiting factor on reliability.
However great the problems apparently introduced by a failure to
obtain suitable inter-judge agreement, this is a situation under fairly
direct control of the experimenter. In the case of a category system
clearer definition of the categories, minimization of overlap, and more
extensive training of observers should raise the level of agreement, while
in the case of sign analysis more precise definitions of the behaviors to
be recorded should achieve the same results.
If the objects do not differ sufficiently in the behavior observed,
in comparison with their own stability over time, no level of inter-judge
agreement can render the observation scheme acceptably reliable (if one
insists, as is done in this discussion, on restricting the use of
"reliability" to its measurement-theoretic sense).
The confusion introduced into the literature through failure
clearly to distinguish the different sources of unreliability, and through
over-emphasis on inter-judge agreement has resulted from a confusion
of the importance of primacy with prime importance. Inter-judge
agreement is the first but not the most important issue to be faced.
Stability of Behavior
Variability of the object of observation is the most important
source of error variance. Unless stable estimates of behavior can be
obtained, inter-object variability will inevitably be swamped by
intra-object variability. Thus sampling of behavior is of critical importance. Medley and Mitzel (1963, p. 268) contended that it is better to
increase the number of observations than the number of observers.
They cite an example of a study of teacher behavior in which increasing
the number of observers on a single occasion from 1 to 12 increased
reliability from .47 to .50 whereas using the 12 observers individually
on 12 separate occasions increased reliability from .47 to .92. Errors
due to instability of behavior were clearly greater than errors due to
inter-observer disagreement in this example.
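The arithmetic behind such comparisons can be sketched with a simple variance-components calculation. The fragment below is only an illustration and is not drawn from Medley and Mitzel's data; the component values are invented, chosen so that the resulting coefficients roughly echo the pattern of their example (a large instability component and a much smaller observer-disagreement component).

```python
# Illustrative sketch only: reliability of the mean of all observations on one
# classroom under an assumed variance-components model. The component values
# are invented for illustration, not taken from the article.

def reliability(n_occasions: int, n_observers_per_occasion: int,
                var_persons: float = 1.00,
                var_occasions: float = 1.00,     # instability of behavior
                var_observers: float = 0.13) -> float:
    """Reliability of the mean score for one object (classroom/teacher)."""
    # Occasion-to-occasion error averages out over occasions; observer
    # disagreement averages out over every individual observation made.
    error = (var_occasions / n_occasions
             + var_observers / (n_occasions * n_observers_per_occasion))
    return var_persons / (var_persons + error)

print(round(reliability(1, 1), 2))    # one observer, one occasion            -> 0.47
print(round(reliability(1, 12), 2))   # 12 observers on a single occasion     -> 0.5
print(round(reliability(12, 1), 2))   # one observer on each of 12 occasions  -> 0.91
```

With these invented components, spreading the same twelve observations over twelve occasions raises the coefficient far more than concentrating them on a single occasion, which is the point of the example.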
A critical point in the conception of reliability expounded by
Medley and Mitzel is that instability of behavior over occasions (i.e.,
time) is due to random error in one or both of the environment or the
person (object). This implies that the characteristic being measured is
stable in a sense that does not allow of lawful change. While this may be
a reasonable assumption in relation to relatively enduring aspects of
personality, it is unreasonable when other types of behavior patterns
are being observed.
In the case of observations of teacher behavior, efforts to develop
indices to characterize particular teachers appear to be misplaced unless
there is some allowance made for lawful adaptations of behavior to
different situations.
One study (Flanders, et al., 1969) has attempted to examine the
effect of variability in teacher behavior. In this study a series of
different situations, over which teacher behavior might be expected to
vary systematically, was chosen. Ten situations were included and
treated as independent, though it appears that there were two
dimensions, at least, underlying the situations. One dimension (six
situations) involved variations in subject matter (language arts, science,
etc.). The other (four situations) involved differences in activities
(introducing new material, extending old material, planning, etc.).
Assuming that there are two dimensions involved then, it is clear that
there are actually 24 (6 X 4) possible situations rather than ten. The
authors collected data only in the broad categories without indicating
that they had balanced (or controlled) one dimension while observing
the other.


Twenty teachers were observed in each of the ten situations (there
were some empty cells, in fact) and their teaching behavior in each
situation was categorized using Flanders' Interaction Analysis Schedule.
Thus ten i/d ratios were obtained for most of the teachers, fewer for
those teachers not observed in all situations. The standard deviations of
i/d ratios for each teacher were calculated and taken to be measures of
variability, or even "flexibility" as the authors chose to call it. This index of flexibility was then used as a dependent variable in comparing
groups of teachers.
The view, implicit in the Flanders study, that changes in behavior
over situations are lawful, is premature without some empirical
evidence. An appropriate way to demonstrate such lawfulness, or at
least that the changes are systematic, would be to assess the reliability
of the observation schedule for measuring behavior within situations.
From this standpoint, stability would be expected only over separate
occasions within each situation. Variability among situations, within
teachers, would no longer be attributed to error variance but to true
variance, i.e., variation in the measure reflecting real, systematic
variation in the person.
To attempt to assess each of the sources of error variance in a
complex model such as this via intraclass correlation coefficients would
be impossible. Apart from the work involved, the difficulties discussed
earlier with respect to interpretations of coefficients of equivalence,
stability and observer agreement would be multiplied since, for
example, there would not be a separate coefficient of stability for each
situation involved. The most appropriate approach would be to use a
variance components analysis such as that proposed by Medley and
Mitzel (1963) or by Cronbach and his associates (Gleser, et al., 1965;
Cronbach, et al., in press). In the following pages, the basic concepts of
generalizability theory, developed by Cronbach, et al., will be described
and then a solution to the above problem of estimating reliability will
be suggested.
Generalizability Theory
The conditions under which observations are made can be
classified in various ways. For example, in the previous section, in the
study of "variability," observations could be classified with respect to
situations, occasions (within situations), and observers. Each of these
aspects of the observations is termed a facet in the terminology of
generalizability theory. Each facet is, of course, a source of variability
in the observations in addition to the between person variability. The
variation attributable to each facet can be estimated by analysis of
variance procedures.

The procedure will first be illustrated with respect to a single facet
design. If observations are made on P persons in I situations the ratings of
all persons in all situations can be presented by a P X I matrix X, in
which Xpi is the rating of person p in situation i. In this simple example
only one observer is used for all observations. If more than one observer
were used, observers would constitute a second facet.
If the I situations are exactly the same for all P persons, the
example satisfies the definition of "matched data" (Cronbach, et al.,
1963), and in terms of analysis of variance, constitutes a two-way
crossed design. If, as is more likely to be the case, the situations in
which observations are made for different persons are different, the
data would be "unmatched." In terms of analysis of variance, this
would be a nested design.
The P persons and the I situations are considered to be random
samples from their respective universes of persons and situations. The
only assumptions or requirements made are:
1. The universe is described unambiguously, so that it is clear what
situations fall within the universe.
2. Situations are experimentally independent; a person's score in
situation i does not depend on whether he has or has not been
previously observed under other conditions.
3. Scores Xpi are on an interval scale.2
(See Cronbach, et al., 1963, p. 145.)

2 It is unlikely that data derived from classroom observation schedules meet this assumption. On the other hand, while the assumption is necessary for the formal application of the variance components approach, its violation would seem to be of little consequence in practice.
For each person p, the average of the Xpi over all situations i in the universe is his universe score. That is,

$$\mu_p = E(X_{pi} \mid p) = E_i(X_{pi})$$

Similarly the universe score for situation i is

$$\mu_i = E(X_{pi} \mid i) = E_p(X_{pi})$$

and the universe score for all persons and conditions is

$$\mu = E_p E_i(X_{pi})$$

From the additive model for two-way analysis of variance, with the
restriction of only one observation per cell and consequent confounding of interaction and error, each person's score can be expressed as:

[1]  $X_{pi} = \mu + (\mu_p - \mu) + (\mu_i - \mu) + e_{pi}$

where $(\mu_p - \mu)$ is the person effect, $(\mu_i - \mu)$ the situation effect, and $e_{pi}$ the residual.

The population variances due to each of these effects are:


$\sigma^2(\mu_p) = E(\mu_p - \mu)^2$

$\sigma^2(\mu_i) = E(\mu_i - \mu)^2$

$\sigma^2(e_{pi}) = E\, e_{pi}^2$
The total observed score variance for the population would be
$\sigma^2(X_{pi}) = E(X_{pi} - \mu)^2$
and since there are no covariances among the effects in the additive
model, the population observed score variance may be expressed in
terms of its components as:
[2]  $\sigma^2(X_{pi}) = \sigma^2(\mu_p) + \sigma^2(\mu_i) + \sigma^2(e_{pi})$

If a nested design (unmatched data) were used there would be no


identifiable situation effects since each person would be observed in
different situations. In this case the model would reduce to
[3]  $X_{pi} = \mu + (\mu_p - \mu) + e_{pi}$

and the identifiable components of variance would be


[4]  $\sigma^2(X_{pi}) = \sigma^2(\mu_p) + \sigma^2(i, e_{pi})$

(see Travers, 1969)
Equation [1] may be rewritten, with the interaction $\pi_{pi}$ and specific error $\epsilon_{pi}$ effects separated, as:

$$X_{pi} = \mu_p + (\mu_i - \mu) + \pi_{pi} + \epsilon_{pi}$$

or:

$$X_{pi} = \mu_p + e_{pi}$$

where $\mu_p$ is the universe score for person p, over all situations i (equivalent to generic true score in Lord and Novick's, 1968, treatment) and where $e_{pi}$, the generic error score, contains both the residual and situation effects of equation [1]. For one person p, over all situations i, the generic error variance is:

$$\sigma^2(e_{p\cdot}) = E_i\, e_{pi}^2$$


and over all persons, the error variance becomes

[5]  $\sigma^2(e) = E_p\, \sigma^2(e_{p\cdot}) = \sigma^2(\mu_i) + E_p E_i\, \sigma^2(e_{pi})$

This expression for error variance can be seen to contain the between
situations variance component and the residual variance.
From the two way analysis of variance design it can be seen that
the expected mean squares may be expressed in terms of variance
components as:
$EMS_p = \sigma^2(e_{pi}) + I\,\sigma^2(\mu_p)$

$EMS_i = \sigma^2(e_{pi}) + P\,\sigma^2(\mu_i)$

$EMS_{res} = \sigma^2(e_{pi})$

and, therefore, unbiased estimators of the variance components may be obtained as:

[6]  $\hat{\sigma}^2(\mu_p) = (1/I)(MS_p - MS_{res})$

[7]  $\hat{\sigma}^2(\mu_i) = (1/P)(MS_i - MS_{res})$

[8]  $\hat{\sigma}^2(e_{pi}) = MS_{res}$

For the nested design the expected mean squares would be


$EMS_p = \sigma^2(i, e_{pi}) + I\,\sigma^2(\mu_p)$

$EMS_{wp} = \sigma^2(i, e_{pi})$

where MSwp represents mean squares within persons since, in the case
of nesting, this is the residual mean square. In this design the estimates
of the components of variance may be found as:
[9]  $\hat{\sigma}^2(\mu_p) = (1/I)(MS_p - MS_{wp})$

[10]  $\hat{\sigma}^2(i, e_{pi}) = MS_{wp}$
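For matched data, the computation can be illustrated directly from a score matrix. The sketch below is not part of the original treatment; the function and the fabricated ratings are assumptions introduced only to show equations [6] through [8] applied to a complete P × I matrix with one observation per cell.

```python
# A minimal sketch (assuming a complete P x I matrix, a single observer, and
# one observation per cell) of the variance-component estimates in [6]-[8].
# The function name and the example data are illustrative, not from the article.
import numpy as np

def variance_components(X: np.ndarray) -> dict:
    """Estimate person, situation, and residual components from a P x I matrix."""
    P, I = X.shape
    grand = X.mean()
    person_means = X.mean(axis=1)        # one mean per person
    situation_means = X.mean(axis=0)     # one mean per situation

    ms_p = I * np.sum((person_means - grand) ** 2) / (P - 1)
    ms_i = P * np.sum((situation_means - grand) ** 2) / (I - 1)
    residuals = X - person_means[:, None] - situation_means[None, :] + grand
    ms_res = np.sum(residuals ** 2) / ((P - 1) * (I - 1))

    return {
        "persons":    (ms_p - ms_res) / I,   # equation [6]
        "situations": (ms_i - ms_res) / P,   # equation [7]
        "residual":   ms_res,                # equation [8]
        "MS_p": ms_p, "MS_i": ms_i, "MS_res": ms_res,
    }

# Fabricated ratings: 6 persons observed in 4 situations, with a person effect
# built in so that the person component should dominate.
rng = np.random.default_rng(0)
X = rng.normal(scale=0.5, size=(6, 4)) + rng.normal(size=(6, 1))
print(variance_components(X))
```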

Coefficients of Generalizability
In classical theory only a single estimate of reliability is obtained
as an estimate of the correlation between scores obtained on two
administrations of parallel tests.

In generalizability theory, the range of different reliability


coefficients is made explicit. The universe for which the universe score
is estimated may be variously defined and, hence, for each different
universe of scores a different coefficient of correlation (generalizability)
between universe and observed scores may be obtained. Thus a clear
definition of the universe of generalization for any particular study is
important. Furthermore, reliability, defined in this sense and estimated
as a coefficient of generalizability, is clearly dependent on the design of
the study in which the instrument is to be used since the relevant
universe is defined in terms of the facets included in the study.
Still considering the simple case of a single observer rating P
persons in I situations, the most likely event in an experimental study
would be that the situations would differ from person to person. The
coefficient of generalizability required is one for the population of
situations. Cronbach et al. (1963) show how this coefficient, for
unmatched data, may be obtained from a generalizability study with
matched data (i.e., in which all P persons are observed in the same I
situations).
The coefficient of generalizability may be defined as the ratio of
the variance of the universe scores for persons $\mu_p$ to the variance, over
persons and situations, of the observed scores Xpi. Thus,

p2 (X,pp) =

[11

02(p)

a2(X)

From equations [5] through [8] we see that the variance of the
observed scores in the population may be estimated as:
[12]

62 (X)=

2(p +)

=
A2(Ap)+ A (e)
=A2(p)
=

(1/I)(MSp

2(i)

+ 2 (epi)

- MSres) + (1/P)(MSi - MSres) + MSres

= (1/I)
MSp + (I/P)MSi + [(IX P-P-1)/P]

MSrest

An estimate for the variance of universe scores is given as [6] .3


3 Cronbach, et al. (1963), and Lord and Novick (1968) provide methods for
estimating the reliability of observations in a single situation (analogous to the
reliability of a single test form, or a single observer) from data obtained in more
than one situation. These formulae are not relevant in the present context where
the concern is with reliability across the universe of situations.
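Continuing the illustrative sketch given earlier, equations [11], [12], and [6] can be applied to the same three mean squares. The function below is an assumption introduced only for illustration; it is not part of the original analysis.

```python
# Sketch of equations [11] and [12]: the coefficient of generalizability for
# the population of situations, computed from the mean squares of a matched
# (fully crossed) generalizability study. Illustrative only.

def generalizability_coefficient(ms_p: float, ms_i: float, ms_res: float,
                                 P: int, I: int) -> float:
    var_persons = (ms_p - ms_res) / I                     # equation [6]
    var_situations = (ms_i - ms_res) / P                  # equation [7]
    var_observed = var_persons + var_situations + ms_res  # equation [12]
    return var_persons / var_observed                     # equation [11]
```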



Generalizability and Variations in Behavior
Earlier in this paper a distinction was made between instability of
behavior and lawful variations in behavior in response to varying
conditions. A data layout, indicating the type of data to be collected in
a generalizability study which allows for such variations, is shown in
Table 1.4
The model for this analysis is:

[13]  $X_{ikmn} = \mu + a^{T}_{i} + a^{S}_{k} + a^{TS}_{ik} + a^{O}_{ikm} + a^{J}_{n} + a^{TJ}_{in} + a^{SJ}_{kn} + a^{TSJ}_{ikn} + a^{OJ}_{ikmn} + \epsilon_{ikmn}$
where $\mu$ = general mean
$\epsilon_{ikmn}$ = specific error (as in true score theory)
and, in terms such as $a^{TS}$, the superscripts indicate the effect and the subscripts have the usual meaning. Where the number of subscripts is greater than the number of superscripts, the extraneous subscripts refer to the factors within which the effect is nested. Thus, $a^{O}_{ikm}$ represents the effect of the mth level of Factor O (occasions) nested in the ith level of Factor T (teacher) and the kth level of Factor S (situations).
In the usual generalizability theory analysis all but the teacher effect would be considered as part of the generic error, $e_{ikmn}$, contributed by the various facets in the design over which generalizations are to be made. From this standpoint [13] could be reduced to

[14]  $X_{ikmn} = \mu + a^{T}_{i} + e_{ikmn}$

In the present analysis, however, systematic (non-random) changes


in behavior over situations are not considered as contributing to error.
This is also true for systematic differences among teachers in their
changes in behavior (the teachers x situations interaction component).
Changes in behavior over occasions (within T x S) are considered to be
random fluctuations and thus contribute to error. Differences among
judges are clearly in the same category.
4 The design portrayed in Table 1 is an elaboration of Design 5 in Gleser, et
al. (1965). It is the introduction of a "situations" facet and the consideration of a
teachers-by-situations interaction which represents the major departure from the
treatment given by Cronbach, et al. (in press). The proposed design may also be
seen as merging the Medley and Mitzel (1963) conception with the reformulation
presented by Cronbach, et al. (in press).


[Table 1, the data layout for the proposed generalizability study, is not legible in this scan.]


Thus, the model for the analysis may be written as:

[15]  $X_{ikmn} = \mu + a^{T}_{i} + a^{S}_{k} + a^{TS}_{ik} + e_{ikmn}$

In terms of the partitioning of variance provided in the analysis of


variance the observed score variance may be expressed as:
[16]  $\sigma^2(X_{ikmn}) = \sigma^2(a^{T}) + \sigma^2(a^{S}) + \sigma^2(a^{TS}) + \sigma^2(e_{ikmn})$

Converting to a simpler notation this may be written as:

[17]  $\sigma^2(X) = \sigma^2_{t} + \sigma^2_{s} + \sigma^2_{ts} + \sigma^2_{e}$
where

[18]  $\sigma^2_{e} = \sigma^2_{o(ts)} + \sigma^2_{j} + \sigma^2_{tj} + \sigma^2_{sj} + \sigma^2_{tsj} + \sigma^2_{o(ts)j}$

and where, since the analysis has only one observation per cell, $\sigma^2_{o(ts)j}$ must be estimated as $\sigma^2_{res}$.
All factors in the design are considered to be random, with levels
chosen at random from an infinite universe. The expected mean square
for each of the sources of variation is shown in Table 2.
Unbiased estimators of each of the components of variance are given in
Table 3, in terms of the observed mean squares.
In this analysis, three generalizability coefficients may be determined. The first, $\rho^2_{t}$, provides a measure of the reliability with which teachers' behavior (within situations) may be observed. This coefficient may be estimated as:

$$\hat{\rho}^2_{t} = \hat{\sigma}^2_{t}\,/\,(\hat{\sigma}^2_{t} + \hat{\sigma}^2_{e})$$

where $\hat{\sigma}^2_{t}$ can be estimated as indicated in Table 3 and $\hat{\sigma}^2_{e}$ by substituting in [18] the appropriate estimators from Table 3.
A second coefficient of generalizability, $\rho^2_{s}$, provides an index of the reliability with which situations may be distinguished. This coefficient may be estimated as:

$$\hat{\rho}^2_{s} = \hat{\sigma}^2_{s}\,/\,(\hat{\sigma}^2_{s} + \hat{\sigma}^2_{e})$$
If this coefficient is small then the implication is that behavior changes


from situation to situation are either random (and, therefore, not
lawful) or vary systematically from teacher to teacher in such a way
that no overall differences among situations may be detected. If this
[Table 2, showing the expected mean square for each source of variation in terms of the variance components, is not legible in this scan.]


TABLE 3
Components of variance

$\hat{\sigma}^2_{t} = (1/KMN)(MS_t - MS_{ts} - MS_{tj} + MS_{tsj})$

$\hat{\sigma}^2_{s} = (1/IMN)(MS_s - MS_{ts} - MS_{sj} + MS_{tsj})$

$\hat{\sigma}^2_{ts} = (1/MN)(MS_{ts} - MS_{o(ts)} - MS_{tsj} + MS_{res})$

$\hat{\sigma}^2_{o(ts)} = (1/N)(MS_{o(ts)} - MS_{res})$

$\hat{\sigma}^2_{j} = (1/IKM)(MS_j - MS_{tj} - MS_{sj} + MS_{tsj})$

$\hat{\sigma}^2_{tj} = (1/KM)(MS_{tj} - MS_{tsj})$

$\hat{\sigma}^2_{sj} = (1/IM)(MS_{sj} - MS_{tsj})$

$\hat{\sigma}^2_{tsj} = (1/M)(MS_{tsj} - MS_{res})$

$\hat{\sigma}^2_{res} = MS_{res}$

latter is the case it will be detected by the third coefficient of generalizability proposed, viz.,

$$\hat{\rho}^2_{ts} = \hat{\sigma}^2_{ts}\,/\,(\hat{\sigma}^2_{ts} + \hat{\sigma}^2_{e})$$

This coefficient provides an index of the reliability with which


observers can detect systematic variations among teachers in their
changes in behavior from one situation to another.
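How the three coefficients might be computed is sketched below. The sketch assumes that the mean squares from the teachers × situations × occasions (within cells) × judges analysis are already available, and it uses the unbiased component estimators of Table 3 and the pooled error term of equation [18]; the function and its argument names are illustrative assumptions, not part of the article.

```python
# Illustrative sketch: the three proposed coefficients of generalizability,
# computed from the mean squares of the four-facet design. I teachers, K
# situations, M occasions within each teacher-situation cell, N judges.

def g_coefficients(ms: dict, I: int, K: int, M: int, N: int) -> dict:
    # Component estimates as in Table 3 (negative estimates are often
    # truncated at zero in practice).
    c = {
        "t":     (ms["t"] - ms["ts"] - ms["tj"] + ms["tsj"]) / (K * M * N),
        "s":     (ms["s"] - ms["ts"] - ms["sj"] + ms["tsj"]) / (I * M * N),
        "ts":    (ms["ts"] - ms["o(ts)"] - ms["tsj"] + ms["res"]) / (M * N),
        "o(ts)": (ms["o(ts)"] - ms["res"]) / N,
        "j":     (ms["j"] - ms["tj"] - ms["sj"] + ms["tsj"]) / (I * K * M),
        "tj":    (ms["tj"] - ms["tsj"]) / (K * M),
        "sj":    (ms["sj"] - ms["tsj"]) / (I * M),
        "tsj":   (ms["tsj"] - ms["res"]) / M,
        "res":   ms["res"],
    }
    # Equation [18]: the generic error pools every component except t, s, ts.
    error = c["o(ts)"] + c["j"] + c["tj"] + c["sj"] + c["tsj"] + c["res"]
    return {
        "rho2_t":  c["t"] / (c["t"] + error),
        "rho2_s":  c["s"] / (c["s"] + error),
        "rho2_ts": c["ts"] / (c["ts"] + error),
    }
```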
Obviously considerable amounts of data need to be collected to
obtain these estimates, but without them it seems futile, on the one
hand, to treat all changes in behavior as though they were lawful or, on
the other, to treat all variations in behavior as errors of measurement.
REFERENCES
American Educational Research Association. Technical recommendations for
achievement tests. Washington, D. C.: NEA, 1955.
American Psychological Association. Technical recommendations for psychological
tests and diagnostic techniques. Washington, D. C.: APA, 1954.
American Psychological Association. Standards for educational and psychological
tests and manuals. Washington, D. C.: APA, 1966.
BELLACK, A. A., KLIEBARD, H. M., HYMAN, R. T., & SMITH, F. L., Jr. The
language of the classroom. New York: Teachers College Press, Columbia
University, 1966.



BROWN, B. B., MENDENHALL, W., & BEAVER, R. The reliability of observations
of teachers' classroom behavior. Journal of Experimental Education, 1968,
36, 1-10.
CRONBACH, L. J., RAJARATNAM, N., & GLESER, G. Theory of generalizability: a liberalization of reliability theory. British Journal of Statistical
Psychology, 1963, 16, 137-163.
CRONBACH, L. J., GLESER, G., & RAJARATNAM, N. Dependability of behavioral measurements. New York: Wiley, in press.
FLANDERS, N. A. The problems of observer training and reliability. In Amidon,
E., and Hough J. (Eds.) Interaction analysis: theory, research, and application. Massachusetts: Addison-Wesley, 1967.
FLANDERS, N. A., et al. Teacher influence patterns and pupil achievement in
second, fourth and sixth grade levels. Vols. I-II. Cooperative Research Project
No. 5-1055, U.S. Dept. of Health, Education, and Welfare: Office of
Education. University of Michigan, 1969.
GLESER, G., CRONBACH, L. J., & RAJARATNAM, N. Generalizability of scores
influenced by multiple sources of variance. Psychometrika, 1965, 30,
395-418.
LORD, F. M., & NOVICK, M. R. Statistical theories of mental test scores.
Massachusetts: Addison-Wesley, 1968.
MEDLEY, D. M., & MITZEL, H. E. Measuring classroom behavior by systematic
observation. In N. L. Gage (Ed.) Handbook of research on teaching. Chicago:
Rand McNally, 1963.
SMITH, B. O., & MEUX, M. A study of the logic of teaching. Cooperative Research
Project No. 258, U.S. Dept. of Health, Education, and Welfare: Office of
Education. University of Illinois, 1962.
TRAVERS, K. J. Correction for attenuation: a generalizability approach using
components of covariance. College of Education, University of Illinois, 1969
(mimeo).
WEICK, K. E. Systematic observational methods. In G. Lindzey and E. Aronson
(Eds.) The Handbook of Social Psychology Vol. II (2nd Ed.) Massachusetts:
Addison-Wesley, 1968.
WESTBURY, I. The reliability of measures of classroom behavior. Ontario Journal
of Educational Research, 1967, 10, 125-138.
(Received June, 1971)
(Revised September, 1971)
AUTHORS
McGAW, BARRY Address: Center for Instructional Research and Curriculum
Evaluation, College of Education, University of Illinois, Urbana, Illinois
61801. Title: University of Illinois Fellow. Age: 30. Degrees: B.Sc., B.Ed.
(Hons.) University of Queensland (Australia); M.Ed., University of Illinois.
Specialization: Measurement, learning, evaluation.
WARDROP, JAMES L. Address: Center for Instructional Research and Curriculum
Evaluation, College of Education, University of Illinois, Urbana, Illinois
61801. Title: Associate Professor. Age: 30. Degrees: B.A., Ph.D., Washington
University. Specialization: Measurement, evaluation, statistics.
BUNDA, MARY ANNE Address: The Ohio State University, Evaluation Center,
1712 Neil Avenue, Columbus, Ohio 43210. Title: Assistant Professor. Age:
27. Degrees: B.S., M.Ed., Loyola University of Chicago; Ph.D., University of
Illinois. Specialization: Measurement, statistics, evaluation.
