
LEARNING EVALUATION PAPER

(TYPES, ACCURACY, AND STANDARDIZATION TEST ERRORS)

Lecturer: Dr. Atan Pramana, M.Pd

Compiled by:
- Sandy Fajrin (185)
- Yasmin Zalivia Adindafa (181)

FACULTY OF EDUCATION
CURRICULUM AND EDUCATIONAL TECHNOLOGY
SURABAYA STATE UNIVERSITY
2024
PREFACE

Praise be to Allah SWT, who has given His grace and guidance so that we
could complete this paper, entitled "Types, Accuracy and
Standardization Test Errors," on time. The purpose of this paper is to
fulfill an assignment in the Learning Evaluation course. In addition, the paper
aims to add insight into determining the objectives of learning analysis for
both readers and writers. We would also like to thank all those who shared
their knowledge so that we could complete this paper. We realize that this
paper is far from perfect. Therefore, constructive criticism and suggestions
are welcome for its improvement.

Table of Contents

PREFACE
Table of Contents
Theoretical Basis
The History Of Academic Aptitude Testing
a. Individually Administered Academic Aptitude Tests
A. Projective Personality Tests
B. ACCURACY OF STANDARDIZED TEST PRECISION
C. STANDARD ERROR OF MEASUREMENT
CONCLUSION
BIBLIOGRAPHY

Theoretical Basis

No test is perfectly reliable, and therefore no test is perfectly
accurate. All tests are subject to error. Error is any factor that leads an
individual to perform better or worse on a test than the individual's
true level of performance. In the context of standardized testing,
accuracy refers to how close the scores obtained from a standardized
test are to the true or generally accepted values.

Tom Kubiszyn and Gary D. Borich, in their book Educational Testing and
Measurement, note that "I've made a mistake" is something we
have all said aloud or to ourselves at one time or another. In other
words, we have made an error: we have failed to do something
perfectly, or as well as we would have liked. In the same way, error in a
test cannot be avoided entirely, but it can be minimized.

The History Of Academic Aptitude Testing
The development of tests to predict school achievement began
in France at the beginning of the twentieth century. France had
embarked on a program of compulsory education, and the minister of
public instruction realized that not all French children had the
cognitive or mental potential to be able to benefit from instruction in
regular classes. “Special” classes were to be established for the
instruction of such children. Admission to these classes was to be
dependent on the results of a medical and psychological evaluation.
In 1905 Alfred Binet and his assistant, Theo Simon, were
commissioned to develop such a test. The aim of their test was to
measure a trait that would predict school achievement.
By the time of his death in 1911, Binet’s scale was widely used
and heralded as an “intelligence” test. English translations of the 1908
and 1911 revisions were made, and in 1916 a Stanford University
psychologist, Louis Terman, standardized the Binet test on American
children and adults. That version of the test became known as the
Stanford–Binet Intelligence Scale or IQ test, and it was revised and/or
restandardized again in 1937, 1960, 1972, 1985, and 2003. What had
begun as a test designed to predict school achievement evolved into a
test of intelligence. Since Binet’s seminal work, several other
intelligence or IQ tests have been developed. Some, like the
Stanford–Binet, are designed to be administered individually (e.g.,
Wechsler Intelligence Scale for Children–IV, Wechsler Adult
Intelligence Scale–IV, Kaufman Assessment Battery for Children–II,
Slosson Intelligence Test); others are designed for group
administration (e.g., Cognitive Abilities Test, Otis–Lennon Mental
Ability Test, Kuhlmann–Anderson Intelligence Tests). Although each
of these tests is different from the Binet, each also has similarities and
correlates strongly with the Binet. Recall that tests that correlate
strongly measure much the same thing.

One of the recent conceptions of intelligence (Sternberg, 1989,
2007) is that it can be defined by its underlying components and
altered through instruction. Thus, traits previously thought inherited
and unalterable could be taught.

a. Individually Administered Academic Aptitude Tests


*Stanford–Binet Intelligence Scale, Fifth Edition. The Fifth
Edition appeared in 2003 and represents a significant revision and
renorming effort. It is a substantial improvement over previous
editions. Stimuli are now more contemporary, and the test now has 15
subtests organized into five areas: Fluid Reasoning, Knowledge,
Quantitative Reasoning, Visual–Spatial Processing, and Working
Memory, which provides a more complete assessment of individual
intelligence.
*Wechsler Intelligence Scale for Children–IV (WISC–IV). The
WISC–IV is published by the Psychological Corporation. It is
appropriate for students between 6 and 16 years of age. Along with its
companion tests, the Wechsler Preschool and Primary Scale of
Intelligence–III (WPPSI–III), appropriate for ages 2–7, and the
Wechsler Adult Intelligence Scale–IV (WAIS–IV), appropriate for
ages 16–adult, the Wechsler scales are the most popular individually
administered IQ tests. The WISC–IV, published in 2003, contains a
number of changes from previous versions of the WISC.

A. Projective Personality Tests
*Rorschach Inkblot Technique. The Rorschach test is
published by Hans Huber Medical Publisher, Berne, Switzerland. It
consists of 10 cards or plates. Each card contains an inkblot, some
black and white, some colored. Examinees are asked to describe what
they “see” in the ambiguous blot. Responses are scored according to
location, content, and a variety of other factors, including whether
form or color was used to construct the image, whether movement is
suggested, and whether shading or texture was considered. Scoring is
complex, but acceptable validity is evident when properly trained
scorers are used. The Rorschach has been one of the most popular
projective tests used by clinical psychologists. In the hands of a
skilled clinician, it can yield a surprising variety of information.
*Thematic Apperception Test (TAT). The TAT is published by
the Harvard University Press and is designed for individuals aged 10
through adult. Adaptations of the test for younger children (Children’s
Apperception Test) and senior citizens (Senior Apperception Test) are
also available. The subject is presented with a series of pictures
(usually 10 or 12 of the total of 30) and asked to make up a story to fit
each picture. The pictures vary in degree of structure and ambiguity.
The record of stories is then examined to determine the projection of
the subject’s personality, as indicated by recurrent behavioral themes,
needs, perceived pressures, and so on. Interpretation is often complex
and requires an appropriately trained psychologist or psychiatrist.

B. ACCURACY OF STANDARDIZED TEST PRECISION

We have learned (or have been programmed) to put a great deal
of faith in test scores. In fact, to some individuals test results represent
the ultimate truth. Our position is that tests and the scores they yield
are useful but also fallible. They come with varying degrees of
“goodness,” but no test or score is completely valid or reliable.
This is the extent of the precision we can realistically arrive at in
interpreting test scores. In other words, a score from any test is our
best guess about an individual’s true level of knowledge, ability,
achievement, and so forth, and, like all guesses, the guesses we make
can be wrong. All tests are subject to various sources of error that
impair the reliability of their scores and, consequently, their accuracy.
However, most test scores are not perfectly reliable—in fact,
most test scores are a long way from being perfectly reliable.
Therefore, when Student A scores 75 on a test, we only hope that his
or her true score—his or her actual level of ability—is somewhere
around 75. The closer the score reliability of a test is to perfect, the
more likely it is that the true score is very close to 75.

C. STANDARD ERROR OF MEASUREMENT

*Error Accuracy. Can you think of some concrete examples of
the types of errors that lowered your score? Remember when you
couldn't sleep the night before a test, when you were sick but took the
test anyway, when the essay test you took was so poorly constructed
that it was hard to figure out what was being tested, when the test had
a 45-minute time limit but you were only allowed 38 minutes, or
when you took a test that had more than one justifiable answer? Each
of these examples illustrates a different type of error. These and
possibly other sources of error prevent your true score from being
equal to the score you earn. Another way of saying this is that your
earned score is equal to your true score minus any error.
Now, what about some examples of situations where error
played a role in raising the score you earned above your actual
level of knowledge, skill, or ability? In short, what about the times you
got a higher score than you should have? Never happens, you say!
Well, what about the times you happened to see an answer on your
neighbor's paper, the times you made a lucky guess, the times you had
52 minutes for a 45-minute test, or the times the test was full of
accidental clues so that you could answer some questions based on
information given in other questions? Remember the principal we
described in Chapter 2 who "helped" students take important tests?
Each of these examples illustrates an error. Again, because of these
errors, your true score is not reflected in the score you get. You
received a higher score than you should have! In these cases, your
earned score is equal to your true score plus any error.
We actually never know the true score or individual error
scores. Why care about them then? They are important concepts
because they allow us to illustrate some important points about test
score reliability and test score accuracy. For now, just keep the
following in mind:
Score obtained = true score ± error score

*Error of Measurement. The standard error of measurement of
a test (abbreviated as Sm) is the standard deviation of the error scores
of a test. In the following calculations, Sm is the standard deviation of
the error score column. It is determined in the same way as
the standard deviation of any score distribution (see
Chapter 14). Review the following calculation to confirm this. We
will use the error scores from Table 18.1: 3, -7, -2, 5, 4, -3.
Step 1: Determine the average.
M = ΣX / N = (3 - 7 - 2 + 5 + 4 - 3) / 6 = 0 / 6 = 0
Step 2: Subtract the average from each error score to get the deviation
score. Square each deviation score and sum the squared deviations.
  X  -  M  =   x     x^2
 +3  -  0  =   3       9
 -7  -  0  =  -7      49
 -2  -  0  =  -2       4
 +5  -  0  =   5      25
 +4  -  0  =   4      16
 -3  -  0  =  -3       9
              Σx^2 = 112
Step 3: Put the sum of x^2 into the formula and solve for the standard
deviation.
Error Score SD = √(Σx^2 / N) = √(112 / 6) = √18.67 = 4.32 = Sm
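The three steps above can be checked with a short Python sketch. It uses only the error scores quoted from Table 18.1 and divides by N, as the calculation in the text does; the function name population_sd is an illustrative choice, not something taken from the source.

```python
import math

# Error scores quoted in the text from Table 18.1.
error_scores = [3, -7, -2, 5, 4, -3]


def population_sd(scores):
    """Standard deviation with N in the denominator, following the three steps."""
    n = len(scores)
    mean = sum(scores) / n                             # Step 1: the average
    squared_devs = [(x - mean) ** 2 for x in scores]   # Step 2: squared deviations
    return math.sqrt(sum(squared_devs) / n)            # Step 3: square root of the mean square


sm = population_sd(error_scores)
print(round(sm, 2))  # 4.32
```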
The standard deviation of the error score distribution, also
known as the standard error of measurement, is 4.32. If we could
know what the error score is for each test we take, we could calculate
Sm in this way. But, of course, we never know these error scores. If
you followed along here, your next question should be, "But how can
you determine the standard deviation of the error score if you never
know the error score?"

Fortunately, a fairly simple statistical formula can be used to estimate
this standard deviation (Sm) without having to actually know the error
score:
Sm = SD × √(1 - r)
where r is the reliability of the test and SD is the standard deviation of
the test.
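As a minimal sketch, the estimation formula can be written as a small Python function. The function name and the example values (SD = 10, r = 0.91) are hypothetical and chosen only for illustration; they do not come from the source.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Estimate Sm = SD * sqrt(1 - r) from a test's SD and its reliability r."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical test: SD = 10 and reliability r = 0.91 (illustrative values only).
sm_example = standard_error_of_measurement(sd=10.0, reliability=0.91)
print(round(sm_example, 2))  # 3.0
```

With these assumed values, a perfectly reliable test (r = 1) would give Sm = 0, and a test with r = 0 would give Sm equal to the full standard deviation, which matches the idea that higher reliability means less error.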
*Using the Standard Error of Measurement. Error scores are
assumed to be random. As such, they cancel each other out. That is,
obtained scores are inflated by random error to the same extent that
they are deflated by it. Another way of saying this is that the
average error score for a test is zero. The distribution of error scores is
also important, because it approximates the normal distribution closely
enough for us to use the normal distribution to represent it. In summary, we know
that error scores (1) are normally distributed, (2) have a mean of zero,
and (3) have a standard deviation referred to as the standard error of
measurement (Sm).
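The three properties just summarized can be illustrated with a small simulation. This is only a sketch under stated assumptions: the error scores are drawn from an exactly normal distribution with mean 0 and standard deviation 4.32 (the Sm from the worked example), and the sample size and random seed are arbitrary choices. In a normal distribution roughly 68% of values lie within one standard deviation of the mean, so the last printed share should come out near 0.68.

```python
import random
import statistics

SM = 4.32            # Sm from the worked example in the text
N_ERRORS = 100_000   # number of simulated error scores (arbitrary choice)

rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers
errors = [rng.gauss(0.0, SM) for _ in range(N_ERRORS)]

mean_error = statistics.fmean(errors)    # should be close to 0
sd_error = statistics.pstdev(errors)     # should be close to SM
share_within_1sm = sum(abs(e) <= SM for e in errors) / N_ERRORS

print(f"mean error:         {mean_error:.2f}")
print(f"SD of errors:       {sd_error:.2f}")
print(f"share within 1 Sm:  {share_within_1sm:.3f}")
```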
The distribution of error scores is a normal distribution. This is
important because, as you learned in Chapter 14, the normal
distribution has characteristics that allow us to make decisions about
scores that fall between, above, or below various points in the
distribution. We can do so because a fixed percentage of scores fall
between various score values in a normal distribution. Figure 18.3
should refresh your memory.
[FIGURE 18.1 Distribution of error scores. Horizontal axis: -3Sm, -2Sm,
-1Sm, 0, +1Sm, +2Sm, +3Sm.]

[FIGURE 18.2 Distribution of error scores for the test described in
Table 18.1. Horizontal axis: -12.96, -8.64, -4.32, 0, +4.32, +8.64,
+12.96.]

Therefore we can see that 68% of the error scores for the test will be
no more than 4.32 points higher or 4.32 points lower than the true score.
That is, if 100 scores were obtained on this test, about 68 of them would
not be off from the true score by more than 4.32 points in either
direction. Sm, then, tells us about the distribution
of the scores obtained around the true score. By knowing a person's
true score, we can predict what his or her earned score is likely to be.
Careful readers might think, "That's not very useful information. We
can never know what someone's true score is, only their earned
score." This is true. As test users, we work only with the scores
obtained.
However, we can follow our logic in reverse. If 68% of
earned scores fall within 1 Sm of their true scores, then 68% of true
scores must fall within 1 Sm of their earned scores. Strictly speaking,
this reverse logic is somewhat inaccurate when we consider individual
test scores. However, across all test scores, this will be true 99% of
the time (Gulliksen, 1987).
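The reverse logic can be sketched as a band around an earned score. The example below is an illustration under assumptions: it pairs the earned score of 75 used earlier for Student A with the Sm of 4.32 from the worked example, although the source does not say that the two belong to the same test. The function name score_band is hypothetical.

```python
def score_band(earned_score, sm, n_sm=1):
    """Return (low, high): the earned score plus or minus n_sm standard errors."""
    if sm < 0 or n_sm < 0:
        raise ValueError("sm and n_sm must be non-negative")
    margin = n_sm * sm
    return earned_score - margin, earned_score + margin


# Illustrative pairing: earned score 75 with Sm = 4.32.
low, high = score_band(75, 4.32)
print(f"{low:.2f} to {high:.2f}")  # 70.68 to 79.32
```

Under the normal-distribution assumption, a band of 1 Sm corresponds to about 68% of cases. Passing n_sm=2 widens the band to about 95% of cases, at the cost of a less precise statement about the true score.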

CONCLUSION

This learning evaluation paper discusses the types, accuracy, and
standardization test errors in the context of academic aptitude testing.
It covers the history of academic aptitude testing, including the
development of the Stanford–Binet Intelligence Scale and other
individually administered tests such as the Wechsler Intelligence Scale
for Children. The paper also explains the concept of error in testing
and the standard error of measurement, which is a measure of the
variability of error scores around the true score. The paper
emphasizes that no test is perfectly reliable or accurate, and that all tests
are subject to various sources of error that can affect the precision and
accuracy of test scores.

BIBLIOGRAPHY

Kubiszyn, T., & Borich, G. D. (2013). Educational testing and
measurement: Classroom application and practice. Wiley.
