
Qualities of a Good Measuring Instrument

Whether a test is standardized or teacher-made, it should possess the qualities of a good measuring instrument. This module discusses the qualities of a good test, which are validity, reliability, and usability.

After reading this module, students are expected to:

1. define and explain the characteristics of a good measuring instrument;


2. identify the types of validity;
3. describe what conditions can affect the validity of test items;
4. discuss the factors that affect the reliability of a test;
5. estimate test reliability using different methods;
6. enumerate and discuss the factors that determine the usability of a test; and
7. point out which is the most important characteristic of a good test.

Validity

Validity is the most important characteristic of a good test.

Validity refers to the extent to which the test serves its purpose, or the efficiency with which it measures what it intends to measure.

The validity of a test concerns what the test measures and how well it does so. For example, in order to judge the validity of a test, it is necessary to consider what behavior the test is supposed to measure.

A test may yield consistent scores, but if it does not serve its purpose, it is not valid. For example, a test intended for Grade V students is not valid when given to Grade IV students.

Validity is classified into four types: content validity, concurrent validity, predictive validity, and
construct validity.

Content validity – means the extent to which the content of the test is truly representative of the content of the course. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Three domains of behavior are included: cognitive, affective, and psychomotor.

Concurrent validity – is the degree to which the test agrees with, or correlates with, a criterion that is set up as an acceptable measure. The criterion is always available at the time of testing.

Concurrent validity, or criterion-related validity, uses statistical tools to interpret and correlate test results.

For example, a teacher wants to validate an achievement test in Science (X) that he constructed. He administers this test to his students. The results of this test can be compared with those of another Science test (Y) that has already been proven valid. If the relationship between X and Y is high, the achievement test in Science is valid. According to Garrett, a highly reliable test is always a valid measure of some function.
Predictive validity – is evaluated by relating the test to some later actual achievement of the students that the test is supposed to predict. The criterion measure for this type is important because the future outcome of the testee is being predicted. The criterion measures against which the test scores are validated become available only after a long period.

Construct validity – is the extent to which the test measures a theoretical trait. Test items must include factors that make up a psychological construct such as intelligence, critical thinking, reading comprehension, or mathematical aptitude.

Factors that influence validity are:

1. Inappropriateness of test items – items that measure knowledge cannot measure skill.
2. Directions – unclear directions reduce validity. Directions that do not clearly indicate how pupils should answer and record their answers affect the validity of the test items.
3. Reading vocabulary and sentence structure – vocabulary and sentence structures that are too difficult or complicated will not measure what the test intends to measure.
4. Level of difficulty of items – items that are too difficult or too easy cannot discriminate between bright and slow pupils, and this lowers validity.

5. Poorly constructed test items – items that provide clues, or that are ambiguous, confuse students and will not reveal a true measure.

6. Length of the test – a test should be of sufficient length to measure what it is supposed to measure. A test that is too short cannot adequately sample the performance we want to measure.

7. Arrangement of items – test items should be arranged according to difficulty, from the easiest to the most difficult. Difficult items encountered early in the test may cause a mental block and may also cause students to spend too much time on those items.

8. Patterns of answers – when students can detect a pattern in the correct answers, they are liable to guess, and this lowers validity.

Reliability

Reliability means consistency and accuracy. It refers to the extent to which a test is dependable, self-consistent, and stable. In other words, the test agrees with itself. It is concerned with the consistency of responses from moment to moment; even if a person takes the same test twice, the test yields the same results.

For example, if a student got a score of 90 in an English achievement test this Monday and gets 30 on the same test given on Friday, then neither score can be relied upon.

Inconsistency of individual scores may be caused by the person scoring the test, by limited sampling of certain areas of the subject matter, and particularly by the examinee himself. If the examinee's mood is unstable, this may affect his score.

Factors that affect reliability are:

1. Length of the test. As a general rule, the longer the test, the higher the reliability. A longer test provides a more adequate sample of the behavior being measured and is less distorted by chance factors such as guessing.
2. Difficulty of the test. When a test is too easy or too difficult, it cannot show the differences among individuals; thus it is unreliable. Ideally, achievement tests should be constructed so that the average score is 50 percent correct and the scores range from near zero to near perfect.

3. Objectivity. Objectivity eliminates the bias, opinions, or judgments of the person who checks the test. Reliability is greater when tests can be scored objectively.

4. Heterogeneity of the student group. Reliability is higher when test scores are spread over a wide range of abilities. In a heterogeneous group the spread of true scores is large relative to the measurement errors, so the reliability coefficient is higher than it would be for a more homogeneous group.

5. Limited time. A test in which speed is a factor is more reliable than one administered over a longer time.

A reliable test, however, is not always valid.

Methods of Estimating the Reliability of a Test:

1. Test-retest method. The same instrument is administered twice to the same group of subjects. The agreement between the scores of the first and second administrations of the test is determined by the Spearman rank correlation coefficient (Spearman rho) or the Pearson Product-Moment Correlation Coefficient.

The formula using Spearman rho is:

rs = 1 - (6ΣD²) / (N³ - N)

where: ΣD² = the sum of the squared differences between ranks
       N = the total number of cases

For example, 10 students were used as samples to test the reliability of an achievement test in Biology. After two administrations of the test, the data and the computation of Spearman rho are presented in the table below:

The table shows each student's scores on the two administrations (S1, S2), the corresponding ranks (R1, R2), the difference between ranks (D), and the squared difference (D²).

Student   S1    S2    R1    R2    D     D²
1         89    90    2     1.5   0.5   0.25
2         85    85    4.5   4     0.5   0.25
3         77    76    9     9     0     0
4         80    81    7.5   8     0.5   0.25
5         83    83    6     6.5   0.5   0.25
6         87    85    3     4     1.0   1.00
7         90    90    1     1.5   0.5   0.25
8         73    72    10    10    0     0
9         85    85    4.5   4     0.5   0.25
10        80    83    7.5   6.5   1.0   1.00

Total ΣD² = 3.5

rs = 1 - (6ΣD²) / (N³ - N)
   = 1 - 6(3.5) / (10³ - 10)
   = 1 - 21 / 990
   = 1 - 0.0212
   = 0.98 (very high relationship)

The rs value obtained is 0.98, which indicates a very high relationship; hence, the achievement test in Biology is reliable.
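For readers who want to verify the computation, the following Python sketch (not part of the original module) reproduces ΣD² and rs from the scores above, using average ranks so that tied scores are handled the same way as in the table.

```python
# Minimal sketch (assumes scipy is available); reproduces the Spearman rho
# computation for the Biology test-retest example above.
from scipy.stats import rankdata

s1 = [89, 85, 77, 80, 83, 87, 90, 73, 85, 80]   # first administration
s2 = [90, 85, 76, 81, 83, 85, 90, 72, 85, 83]   # second administration

r1, r2 = rankdata(s1), rankdata(s2)             # tied scores receive the mean rank
d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
n = len(s1)

rs = 1 - (6 * d_squared) / (n ** 3 - n)
print(round(d_squared, 2), round(rs, 2))        # 3.5 and 0.98, as in the table
```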

The Pearson Product-Moment Correlation Coefficient can also be used for the test-retest method of estimating the reliability of a test. The formula is:

r = [NΣXY - (ΣX)(ΣY)] / √{[NΣX² - (ΣX)²][NΣY² - (ΣY)²]}

Using the same data as for Spearman rho, the scores for the first and second administrations may be presented in this way:

X (S1)   Y (S2)   X²      Y²      XY

89       90       7921    8100    8010
85       85       7225    7225    7225
77       76       5929    5776    5852
80       81       6400    6561    6480
83       83       6889    6889    6889
87       85       7569    7225    7395
90       90       8100    8100    8100
73       72       5329    5184    5256
85       85       7225    7225    7225
80       83       6400    6889    6640

ΣX = 829   ΣY = 830   ΣX² = 68987   ΣY² = 69174   ΣXY = 69072

Could you now compute r by using the formula above? Illustrate below:
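As one way to check your work, here is a hedged Python sketch (mine, not the module's) that plugs the same data into the raw-score Pearson formula above.

```python
# Minimal sketch applying r = (NΣXY - ΣXΣY) / √[(NΣX² - (ΣX)²)(NΣY² - (ΣY)²)].
import math

x = [89, 85, 77, 80, 83, 87, 90, 73, 85, 80]    # first administration (S1)
y = [90, 85, 76, 81, 83, 85, 90, 72, 85, 83]    # second administration (S2)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 2))   # about 0.97, again a very high relationship
```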

2. Alternate-forms method. This is the second method of establishing the reliability of test results. In this method, we give two forms of a test, similar in content, type of items, difficulty, and so on, in close succession to the same group of students. To test the reliability, the correlation technique is used (refer to the formula for the Pearson Product-Moment Correlation Coefficient).

3. Split-half method. The test is administered only once, but the test items are divided into two halves. The most common procedure is to divide the test into odd-numbered and even-numbered items. The two half-test scores are correlated, and the r obtained is the reliability coefficient of a half test. The Spearman-Brown formula is then used to estimate the reliability of the whole test:

rt = 2rht / (1 + rht)

where: rt = reliability of the whole test
       rht = reliability of half of the test

For example, if rht is 0.69, what is rt?

rt = 2rht / (1 + rht)
   = 2(0.69) / (1 + 0.69)
   = 1.38 / 1.69
   = 0.82

This indicates a very high relationship, so the test is reliable.

The split-half method is applicable only to measuring instruments that are not highly speeded. If the measuring instrument includes easy items and the subjects are able to answer correctly all or nearly all items within the time limit of the test, the scores on the two halves would be about the same and the correlation would be close to +1.00.
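The sketch below illustrates the whole procedure under stated assumptions: the score matrix and the odd/even split are hypothetical, and the last two lines simply re-check the module's 0.69 to 0.82 example.

```python
# Illustrative sketch: split-half reliability with the Spearman-Brown step-up.
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(item_scores):
    """item_scores: examinees x items matrix of item scores (hypothetical input)."""
    items = np.asarray(item_scores)
    odd_total = items[:, 0::2].sum(axis=1)      # items 1, 3, 5, ...
    even_total = items[:, 1::2].sum(axis=1)     # items 2, 4, 6, ...
    r_half, _ = pearsonr(odd_total, even_total) # reliability of a half test
    return (2 * r_half) / (1 + r_half)          # Spearman-Brown correction

# Re-check of the worked example: a half-test reliability of 0.69 steps up to
# 2(0.69) / (1 + 0.69), or about 0.82, for the whole test.
r_half = 0.69
print(round((2 * r_half) / (1 + r_half), 2))    # 0.82
```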

4. Kuder-Richardson Formula 21 is the last method of establishing the reliability of a test. Like the split-half method, the test is administered only once. This method assumes that all items are of equal difficulty. The formula is:

KR-21 = [k / (k - 1)] × [1 - X̄(k - X̄) / (kS²)]

where:

X̄ = the mean of the obtained scores
S² = the variance of the obtained scores
k = the total number of items

Example: Mr. Marvin administered a 50-item test to 10 of his Grade 5 pupils. The scores of his pupils are presented in the table below:

Pupil   Score (X)   X - X̄    (X - X̄)²
A       32          3.2      10.24
B       36          7.2      51.84
C       36          7.2      51.84
D       22          -6.8     46.24
E       38          9.2      84.64
F       15          -13.8    190.44
G       43          14.2     201.64
H       25          -3.8     14.44
I       18          -10.8    116.64
J       23          -5.8     33.64

Total   288                  801.60

X̄ = 28.8   S² = 801.60 / (10 - 1) = 89.07   k = 50

Show how the mean and the variance were obtained in the box below:

Could you now compute the reliability of the test by applying Kuder-Richardson Formula 21? Please try!
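If you would like to check your answer, the minimal sketch below (assumptions mine) applies KR-21 to the score column above. Note that it uses the sample variance with n - 1 in the denominator, which reproduces the S² = 89.07 shown.

```python
# Minimal KR-21 sketch for Mr. Marvin's 50-item test.
import statistics

scores = [32, 36, 36, 22, 38, 15, 43, 25, 18, 23]   # pupils A through J
k = 50                                              # number of items

mean = statistics.mean(scores)                      # 28.8
variance = statistics.variance(scores)              # 89.07 (sample variance, n - 1)

kr21 = (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))
print(round(kr21, 2))                               # about 0.88
```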

Usability
Usability means the degree to which a test can be used without much expenditure of time, money, and effort. It also means practicability. The factors that determine usability are administrability, scorability, interpretability, economy, and the proper mechanical make-up of the test.

Administrability means that the test can be administered with ease, clarity and uniformity. Directions
must be made simple, clear and concise. Time limits, oral instructions and sample questions are
specified. Provisions for preparation, distribution, and collection of test materials must be definite.

Scorability is concerned with the scoring of the test. A good test is easy to score: the scoring directions are clear, the scoring key is simple, an answer key is available, and machine scoring is made possible as much as possible.

Test results are useful only if they are interpreted after evaluation. Correct interpretation and application of test results is essential for sound educational decisions.

An economical test is of low cost. One way to economize is to use answer sheets and reusable test booklets. However, test validity and reliability should not be sacrificed for economy.

Proper mechanical make-up of the test concerns how the test is printed, what font size is used, and whether the illustrations fit the level of the pupils/students.

Summary

A good measuring instrument possesses three qualities: validity, reliability, and usability.


Validity is the extent to which a test measures what it intends to measure. It has four types: content, construct, concurrent, and predictive. Test validity can be affected by inappropriateness of the test items, directions, vocabulary and sentence structure, level of difficulty, poor construction, length of the test, arrangement of items, and patterns of answers.

Reliability is the consistency of scores obtained by an individual given the same test at different times. It can be estimated using the test-retest method, alternate forms, the split-half method, and Kuder-Richardson Formula 21. The reliability of a test may, however, be affected by the length of the test, the difficulty of the test items, the objectivity of scoring, the heterogeneity of the student group, and limited time.

Usability of a test means its practicability. This quality is determined by ease of administration (administrability), ease of scoring (scorability), ease of interpretation and application (interpretability), economy of materials, and the proper mechanical make-up of the test.

To be effective, a test must be valid, for a valid test is always reliable, but not every reliable test is valid.

Learning Exercises

I. Multiple Choice: Encircle the correct answer.

1. Which statement concerning validity and reliability is most accurate?

a. A test cannot be reliable unless it is valid.
b. A test cannot be valid unless it is reliable.
c. A test cannot be valid and reliable unless it is objective.
d. A test cannot be valid and reliable unless it is standardized.

2. Which type of validity is appropriate for a criterion-referenced measure?

a. content validity c. construct validity


b. concurrent validity d. predictive validity

3. Which is directly affected by objectivity in scoring?

a. The validity of test c. The reliability of test


b. The usability of test d. The administrability of test

4. A teacher-made test overemphasized facts and underemphasized the other objectives of the course for which it was designed. What can be said about the test?

a. It lacks content validity.


b. It lacks construct validity.
c. It lacks predictive validity.
d. It lacks criterion-related validity.

5. When an achievement test for Grade V pupils was administered to Grade VI, what is most affected?

a. reliability of the test c. usability of the test


b. validity of the test d. reliability and validity of the test

6. Which factor of usability is described by the wise use of testing materials?


a. scorability c. economy
b. administrability d. proper mechanical make-up

7. Clarity and uniformity in giving directions affect:


a. scorability of the test c. interpretability of the test
b. administrability of the test d. proper mechanical make-up
8. Which best describes validity?
a. consistency in test result
b. practicability of the test
c. homogeneity in the content of the test
d. objectivity in administration and scoring of test

II. Setting and Option Multiple Choice: Table 1 presents the scores of 10 students who were tested twice (test-retest) to estimate the reliability of the test. Complete the table and answer the questions below. Choose the correct answer and show the computation where it is needed.

(S1 and S2 are the scores on the first and second administrations; complete the rank columns R1 and R2, the differences between ranks D, and the squared differences D².)

Student   S1    S2    R1    R2    D     D²
1         68    71
2         65    65
3         70    69
4         65    68
5         70    72
6         65    63
7         62    62
8         64    66
9         58    60
10        60    60

Total ΣD² =

1. What is the sum of the squared differences between the ranks of the students?


a. 13 b. 14 c. 15 d. 16

2. Who got the highest score in the second administration of test?


a. student 1 c. student 5
b. student 3 d. student 7

3. What is the calculated rs?


a. 0.86 b. 0.88 c. 0.90 d. 0.92

4. What is Garrett’s interpretation of the obtained rs (refer to question no. 3)?


a. negligible correlation c. marked relationship
b. low correlation d. high or very high relationship

5. Based on Garrett’s interpretation of the calculated rs, what can you say about the test that was constructed?

III. Simple Recall:

1. Mr. Gwen administered a 40-item Mathematics test to his 10 students. Their scores in the first half and in the second half are shown below. Find the reliability of the whole test using the split-half method. Is the test reliable? Justify.

1st half – 17 18 20 11 10 13 20 19 19 15
2nd half – 15 13 18 10 8 10 18 16 17 14

2. Ms. Pearl administered a 30-item Science test to her Grade 6 pupils. The scores are shown below. What is the reliability of the whole test using Kuder-Richardson Formula 21? Is the test reliable? Justify.

Pupils – A B C D E F G H I J K L
Scores – 25 22 30 22 17 15 18 24 27 18 23 26

IV. Essay:

1. In your own opinion, which is better: a valid test or a reliable test? Why?
2. Why do you think students’ scores on a particular test sometimes vary?
3. Discuss what makes test items/test results invalid.



Test-retest reliability
Test-retest reliability measures the consistency of results when you repeat the same test on the
same sample at a different point in time. You use it when you are measuring something that you
expect to stay constant in your sample.

A test of colour blindness for trainee pilot applicants should have high test-retest reliability, because
colour blindness is a trait that does not change over time.

Why it’s important


Many factors can influence your results at different points in time: for example, respondents
might experience different moods, or external conditions might affect their ability to respond
accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time.
The smaller the difference between the two sets of results, the higher the test-retest reliability.
How to measure it
To measure test-retest reliability, you conduct the same test on the same group of people at two
different points in time. Then you calculate the correlation between the two sets of results.

Test-retest reliability example


You devise a questionnaire to measure the IQ of a group of participants (a property that is
unlikely to change significantly over time). You administer the test two months apart to the same
group of people, but the results are significantly different, so the test-retest reliability of the IQ
questionnaire is low.

Improving test-retest reliability

 When designing tests or questionnaires, try to formulate questions, statements and tasks in a
way that won’t be influenced by the mood or concentration of participants.
 When planning your methods of data collection, try to minimize the influence of external
factors, and make sure all samples are tested under the same conditions.
 Remember that changes can be expected to occur in the participants over time, and take these
into account.

Interrater reliability
Interrater reliability (also called interobserver reliability) measures the degree of agreement
between different people observing or assessing the same thing. You use it when data is collected
by researchers assigning ratings, scores or categories to one or more variables.

In an observational study where a team of researchers collect data on classroom behavior, interrater
reliability is important: all the researchers should agree on how to categorize or rate different types of
behavior.

Why it’s important


People are subjective, so different observers’ perceptions of situations and phenomena naturally
differ. Reliable research aims to minimize subjectivity as much as possible so that a different
researcher could replicate the same results.

When designing the scale and criteria for data collection, it’s important to make sure that
different people will rate the same variable consistently with minimal bias. This is especially
important when there are multiple researchers involved in data collection or analysis.

How to measure it
To measure interrater reliability, different researchers conduct the same measurement or
observation on the same sample. Then you calculate the correlation between their different sets
of results. If all the researchers give similar ratings, the test has high interrater reliability.
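As a rough illustration (the ratings below are hypothetical, not taken from the wound-healing example), two raters' numeric scores can be correlated directly; for purely categorical ratings, an agreement index such as Cohen's kappa is often used instead.

```python
# Illustrative sketch: two raters score the same ten cases on a 1-5 scale.
from scipy.stats import pearsonr

rater_a = [3, 4, 2, 5, 4, 3, 1, 2, 5, 4]   # hypothetical ratings, rater A
rater_b = [3, 4, 3, 5, 4, 3, 2, 2, 5, 4]   # hypothetical ratings, rater B

r, _ = pearsonr(rater_a, rater_b)
print(round(r, 2))   # a value close to 1 means the raters largely agree
# For categorical ratings, sklearn.metrics.cohen_kappa_score is one common alternative.
```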
Interrater reliability example
A team of researchers observe the progress of wound healing in patients. To record the stages of
healing, rating scales are used, with a set of criteria to assess various aspects of wounds. The
results of different researchers assessing the same set of patients are compared, and there is a
strong correlation between all sets of results, so the test has high interrater reliability.

Improving interrater reliability

 Clearly define your variables and the methods that will be used to measure them.
 Develop detailed, objective criteria for how the variables will be rated, counted or categorized.
 If multiple researchers are involved, ensure that they all have exactly the same information and
training.


Parallel forms reliability


Parallel forms reliability measures the correlation between two equivalent versions of a test. You
use it when you have two different assessment tools or sets of questions designed to measure the
same thing.

Why it’s important


If you want to use multiple different versions of a test (for example, to avoid respondents
repeating the same answers from memory), you first need to make sure that all the sets of
questions or measurements give reliable results.

In educational assessment, it is often necessary to create different versions of tests to ensure that
students don’t have access to the questions in advance. Parallel forms reliability means that, if the same
students take two different versions of a reading comprehension test, they should get similar results in
both tests.

How to measure it
The most common way to measure parallel forms reliability is to produce a large set of questions
to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the
results. High correlation between the two indicates high parallel forms reliability.
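A small simulation can make this concrete. In the sketch below (all data are invented for illustration), responses to a 20-item pool are generated from a single latent trait, the pool is split at random into two 10-item forms, and the form totals are correlated.

```python
# Rough sketch: simulate a question pool, split it into two forms, correlate totals.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
ability = rng.normal(size=100)                    # latent trait, one value per respondent
difficulty = rng.normal(size=20)                  # one difficulty per item
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
answers = (rng.random((100, 20)) < p_correct).astype(int)

items = rng.permutation(20)                       # random split of the pool
score_a = answers[:, items[:10]].sum(axis=1)      # form A totals
score_b = answers[:, items[10:]].sum(axis=1)      # form B totals

r, _ = pearsonr(score_a, score_b)
print(round(r, 2))   # a high correlation indicates parallel forms reliability
```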

Parallel forms reliability example


A set of questions is formulated to measure financial risk aversion in a group of respondents. The
questions are randomly divided into two sets, and the respondents are randomly divided into two
groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The
results of the two tests are compared, and the results are almost identical, indicating high parallel
forms reliability.

Improving parallel forms reliability

 Ensure that all questions or test items are based on the same theory and formulated to measure
the same thing.

Internal consistency
Internal consistency assesses the correlation between multiple items in a test that are intended to
measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers,
so it’s a good way of assessing reliability when you only have one data set.

Why it’s important


When you devise a set of questions or ratings that will be combined into an overall score, you
have to make sure that all of the items really do reflect the same thing. If responses to different
items contradict one another, the test might be unreliable.

To measure customer satisfaction with an online store, you could create a questionnaire with a set of
statements that respondents must agree or disagree with. Internal consistency tells you whether the
statements are all reliable indicators of customer satisfaction.

How to measure it
Two common methods are used to measure internal consistency.

Average inter-item correlation: For a set of measures designed to assess the same construct,
you calculate the correlation between the results of all possible pairs of items and then calculate
the average.

Split-half reliability: You randomly split a set of measures into two sets. After testing the entire
set on the respondents, you calculate the correlation between the two sets of responses.
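Cronbach's alpha is a closely related summary of internal consistency; the sketch below (a minimal illustration with made-up ratings) computes it from a respondents-by-items matrix.

```python
# Minimal Cronbach's alpha sketch for a set of items measuring one construct.
import numpy as np

def cronbach_alpha(item_scores):
    items = np.asarray(item_scores, dtype=float)   # respondents x items
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum() # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 1-5 agreement ratings from five respondents on four items.
ratings = [[4, 5, 4, 5],
           [2, 1, 2, 2],
           [3, 3, 4, 3],
           [5, 4, 5, 4],
           [1, 2, 1, 2]]
print(round(cronbach_alpha(ratings), 2))           # near 1 suggests high consistency
```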

Internal consistency example


A group of respondents are presented with a set of statements designed to measure optimistic and
pessimistic mindsets. They must rate their agreement with each statement on a scale from 1 to 5.
If the test is internally consistent, an optimistic respondent should generally give high ratings to
optimism indicators and low ratings to pessimism indicators. The correlation is calculated
between all the responses to the “optimistic” statements, but the correlation is very weak. This
suggests that the test has low internal consistency.

Improving internal consistency

 Take care when devising questions or measures: those intended to reflect the same concept
should be based on the same theory and carefully formulated.

Which type of reliability applies to my research?


It’s important to consider reliability when planning your research design, collecting and
analyzing your data, and writing up your research. The type of reliability you should calculate
depends on the type of research and your methodology.

What is my methodology? Which form of reliability is relevant?

 Measuring a property that you expect to stay the same over time: test-retest reliability.
 Multiple researchers making observations or ratings about the same topic: interrater reliability.
 Using two different tests to measure the same thing: parallel forms reliability.
 Using a multi-item test where all the items are intended to measure the same variable: internal consistency.

If possible and relevant, you should statistically calculate reliability and state this alongside
your results.

Classroom assessments are a big responsibility on educators’ plates. There are plenty of
possible formats out there: summative, formative, essay, multiple choice – the list goes on and
on. Rather than settling for a form response, many teachers design their own assessments.
Whether pre-made or not, when developing classroom assessment tools, teachers should take
the following criteria into account:

 Purpose: How will it be used?


 Impact: How will it impact instruction? Will it shrink the curriculum?
 Validity: Is it designed to measure what it was supposed to measure?
 Fairness: Will all the students have the same opportunities to show what they have learned?
 Reliability: Is the scoring system consistent with that of the school and state benchmarks?
 Significance: Does it address the contents that are valued?
 Efficiency: Is the test consistent with the time available for the students to complete it?

Teacher-designed tests offer clear advantages:


 They are better aligned with classroom objectives.
 They present consistent evaluation material, having the same questions for all the students in
the class.
 They are easy to store and offer accessible material for parents to consult.
 They are easy to administer.

And an important drawback:

 Some teachers may not have the necessary abilities to design their own test, and therefore
evaluations may be less reliable.

The type of test chosen must be consistent with the content of the course. Take the time to
plan tests carefully and to decide which type of test suits the content you taught. Most teachers
favor objective testing because it saves time when marking and is much more reliable. It is
highly recommended that classroom tests should not contain a wide variety of test items,
because some students may find difficulty in shifting ways of processing information.
Additionally, each of the items should evaluate whether the student has mastered the
objectives, separating those who have from those who haven’t. A test won’t be effective if students can
guess the answers or gain a perception of what the answer may be by means of clues or aids
extracted from the questions.

Not sure yet whether a teacher-designed test is the best answer to the question of best
assessment format for your classroom? Check our other articles on evaluation formats and
how to guarantee that you’re implementing the one that’s the best fit for your classroom.
STANDARDIZED TEST
A standardized test is any form of test that (1) requires all test takers to answer the same
questions, or a selection of questions from a common bank of questions, in the same
way, and that (2) is scored in a “standard” or consistent manner, which makes it
possible to compare the relative performance of individual students or groups of
students. While different types of tests and assessments may be “standardized” in this
way, the term is primarily associated with large-scale tests administered to
large populations of students, such as a multiple-choice test given to all the eighth-
grade public-school students in a particular state, for example.
In addition to the familiar multiple-choice format, standardized tests can include true-
false questions, short-answer questions, essay questions, or a mix of question
types. While standardized tests were traditionally presented on paper and completed
using pencils, and many still are, they are increasingly being administered on computers
connected to online programs (for a related discussion, see computer-adaptive
test). While standardized tests may come in a variety of forms, multiple-choice and true-
false formats are widely used for large-scale testing situations because computers can
score them quickly, consistently, and inexpensively. In contrast, open-ended essay
questions need to be scored by humans using a common set of guidelines or rubrics to
promote consistent evaluations from essay to essay—a less efficient and more time-
intensive and costly option that is also considered to be more subjective. (Computerized
systems designed to replace human scoring are currently being developed by a variety
of companies; while these systems are still in their infancy, they are nevertheless
becoming the object of growing national debate.)
While standardized tests are a major source of debate in the United States, many test
experts and educators consider them to be a fair and objective method of assessing the
academic achievement of students, mainly because the standardized format, coupled
with computerized scoring, reduces the potential for favoritism, bias, or subjective
evaluations. On the other hand, subjective human judgment enters into the testing
process at various stages—e.g., in the selection and presentation of questions, or in the
subject matter and phrasing of both questions and answers. Subjectivity also enters into
the process when test developers set passing scores—a decision that can affect how
many students pass or fail, or how many achieve a level of performance considered to
be “proficient.” For more detailed discussions of these issues, see measurement
error, test accommodations, test bias and score inflation.
Standardized tests may be used for a wide variety of educational purposes. For
example, they may be used to determine a young child’s readiness for kindergarten,
identify students who need special-education services or specialized academic
support, place students in different academic programs or course levels, or award
diplomas and other educational certificates. The following are a few representative
examples of the most common forms of standardized test:
 Achievement tests are designed to measure the knowledge and skills students
learned in school or to determine the academic progress they have made over a
period of time. The tests may also be used to evaluate the effectiveness of schools
and teachers, or identify the appropriate academic placement for a student—i.e.,
what courses or programs may be deemed most suitable, or what forms of academic
support they may need. Achievement tests are “backward-looking” in that they
measure how well students have learned what they were expected to learn.
 Aptitude tests attempt to predict a student’s ability to succeed in an intellectual or
physical endeavor by, for example, evaluating mathematical ability, language
proficiency, abstract reasoning, motor coordination, or musical talent. Aptitude tests
are “forward-looking” in that they typically attempt to forecast or predict how well
students will do in a future educational or career setting. Aptitude tests are often a
source of debate, since many question their predictive accuracy and value.
 College-admissions tests are used in the process of deciding which students will be
admitted to a collegiate program. While there is a great deal of debate about the
accuracy and utility of college-admissions tests, and many institutions of higher
education no longer require applicants to take them, the tests are used as indicators
of intellectual and academic potential, and some may consider them predictive of
how well an applicant will do in a postsecondary program.
 International-comparison tests are administered periodically to representative
samples of students in a number of countries, including the United States, for the
purposes of monitoring achievement trends in individual countries and comparing
educational performance across countries. A few widely used examples of
international-comparison tests include the Programme for International Student
Assessment (PISA), the Progress in International Reading Literacy
Study (PIRLS), and the Trends in International Mathematics and Science
Study (TIMSS).
 Psychological tests, including IQ tests, are used to measure a person’s cognitive
abilities and mental, emotional, developmental, and social characteristics. Trained
professionals, such as school psychologists, typically administer the tests, which
may require students to perform a series of tasks or solve a set of problems.
Psychological tests are often used to identify students with learning disabilities or
other special needs that would qualify them for specialized services.
Reform
Following a wide variety of state and federal laws, policies, and regulations aimed at
improving school and teacher performance, standardized achievement tests have
become an increasingly prominent part of public schooling in the United States. When
focused on reforming schools and improving student achievement, standardized tests
are used in a few primary ways:
 To hold schools and educators accountable for educational results and student
performance. In this case, test scores are used as a measure of effectiveness, and
low scores may trigger a variety of consequences for schools and teachers. For a
more detailed discussion see high-stakes test.
 To evaluate whether students have learned what they are expected to learn,
such as whether they have met state learning standards. In this case, test scores are
seen as a representative indicator of student achievement.
 To identify gaps in student learning and academic progress. In this case, test
scores may be used, along with other information about students, to diagnose
learning needs so that educators can provide appropriate services, instruction,
or academic support.
 To identify achievement gaps among different student groups, including students
of color, students who are not proficient in English, students from low-income
households, and students with physical or learning disabilities. In this case, exposing
and highlighting achievement gaps may be seen as an essential first step in the effort
to educate all students well, which can lead to greater public awareness and changes
in educational policies and programs.
 To determine whether educational policies are working as intended. In this case,
elected officials and education policy makers may rely on standardized-test results to
determine whether their laws and policies are working or not, or to compare
educational performance from school to school or state to state. They may also use
the results to persuade the public and other elected officials that their policies are in
the best interest of children and society.
Debate
While debates about standardized testing are wide-ranging, nuanced, and sometimes
emotionally charged, many debates tend to be focused on the ways in which the tests
are used, and whether they present reliable or unreliable evaluations of student
learning, rather than on whether standardized testing is inherently good or bad
(although there is certainly debate on this topic as well). Most test developers and
testing experts, for example, caution against using standardized-test scores as an
exclusive measure of educational performance, although many would also contend that
test scores can be a valuable indicator of performance if used appropriately and
judiciously. Generally speaking, standardized testing is more likely to become an object
of debate and controversy when test scores are used to make consequential decisions
about educational policies, schools, teachers, and students. The tests are less likely to
be contentious when they are used to diagnose learning needs and provide students
with better services—although the line separating these two purposes is notoriously
fuzzy in practice (thus, the ongoing debates).
While an exhaustive discussion of standardized-testing debates is beyond the scope of
this resource, the following questions will illustrate a few of the major issues commonly
discussed and debated in the United States:
 Are numerical scores on a standardized test misleading indicators of student
learning, since standardized tests can only evaluate a narrow range of achievement
using inherently limited methods? Or do the scores provide accurate, objective, and
useful evidence of school, teacher, or student performance? (Standardized tests don’t
measure everything students are expected to learn in school. A test with 50 multiple-
choice questions, for example, can’t possibly measure all the knowledge and skills a
student was taught, or is expected to learn, in a particular subject area, which is one
reason why some educators and experts caution against using standardized-test
scores as the only indicator of educational performance and success.)
 Are standardized tests fair to all students because every student takes the same test
and is evaluated in the same way? Do the tests have inherent biases that may
disadvantage certain groups, such as students of color, students who are unfamiliar
with American cultural conventions, students who are not proficient in English, or
students with disabilities that may affect their performance?
 Is the use of standardized tests providing valuable information that educators and
school leaders can use to improve instructional quality? Is the pervasive overuse of
testing actually taking up valuable instructional time that could be better spent
teaching students more content and skills?
 Do the benefits of standardized testing—consistent data on school and student
performance that can be used to inform efforts to improve schools and teaching—
outweigh the costs—the money spent on developing the tests and analyzing the
results, the instructional time teachers spend prepping students, or the time students
spend taking the test?
 Do math and reading test scores, for example, provide a full and accurate picture of
school, teacher, and student performance? Do standardized tests focus too narrowly
on a few academic subjects?
 Does the narrow range of academic content evaluated by standardized tests cause
teachers to focus too much on test preparation and a few academic subjects (a
practice known as “teaching to the test”) at the expense of other worthwhile
educational pursuits, such as art, music, health, physical education, or 21st century
skills, for example?
 Do standardized tests, and the consequences attached to low scores, hold schools,
educators, and students to higher standards and improve the quality of public
education? Do the tests create conditions that undermine effective education, such as
cheating, unhealthy forms of competition, or unjustly negative perceptions of public
schooling?
 Should some of the most important decisions in public education—such as whether
to reduce or increase school funding or fire teachers and principals—be made
entirely or primarily on the basis of test scores? Are standardized-test scores, which
could potentially be misleading or inaccurate, too limited a measure to use as a basis
for such consequential decisions?
Bless the tests: Three reasons for standardized testing
Aaron Churchill
3.18.2015

A torrent of complaints has been levelled against testing in recent months.


Some of the criticism is associated with the PARCC exams, Ohio’s new
English and math assessments for grades 3–8 and high school. The
grumbling over testing isn’t a brand new phenomenon. In fact, it’s worth noting
that in 2004, Ohioans were grousing about the OGTs! In the face of the latest
iteration of the testing backlash, we should remember why standardized tests
are essential. The key reasons, as I see them, are objectivity, comparability,
and accountability.
Reason 1: Objectivity
At their core, standardized exams are designed to be objective measures.
They assess students based on a similar set of questions, are given under
nearly identical testing conditions, and are graded by a machine or blind
reviewer. They are intended to provide an accurate, unfiltered measure of
what a student knows.
Now, some have argued that teachers’ grades are sufficient. But the reality is
that teacher grading practices can be wildly uneven across schools—and
even within them. For instance, one math teacher might be an extraordinarily
lenient grader, while another might be brutally hard: Getting an A means
something very different. Teacher grading can be subjective in other ways,
including favoritism towards certain students, and it can find its basis in non-
achievement factors like classroom behavior, participation, or attendance.
But when students take a standardized exam, a much clearer view of
academic mastery emerges. So while standardized exams are not intended to
(and should not) replace the teacher grade book, they do provide an objective,
“summative” assessment of student achievement. Standardized assessments
of achievement can be used for comparison and accountability purposes, both
of which are discussed in turn.

Reason 2: Comparability
The very objectivity of standardized exams yields comparability of student
achievement, a desirable feature for parents and practitioners alike. Most
parents, for example, would like to know whether their child is meeting state
benchmarks, or how she compares to statewide peers. Statewide
standardized exams give parents this important information. Meanwhile,
school-shopping parents have every right to inspect and compare the
standardized test results from a range of schools, including charters, district
schools, and STEM schools, before selecting a school for their child.

School practitioners also use statewide test results to benchmark their


students’ achievement across school and district lines. For instance, the
principal of East Elementary could compare the achievement of her students
against those attending West Elementary, the district average, the county
average, and the statewide average. How do her students stack up? Only a
statewide standardized test could tell.
Interestingly, proposals have been floated to allow schools to select their own
assessment—a pick-your-own-assessment policy. This is a flawed idea and
should be rejected. It would undermine the comparability principle of statewide
testing.

First, to be clear, standardized exams are not all the same. Consider an
obvious example: Ohio’s old state tests and the PARCC exams are both
standardized exams, yet they are as different as night and day. Meantime, a
pick-your-own-assessment policy would open a Pandora’s box of confusion
over how to interpret the results. Imagine that Columbus City Schools selects
NWEA as its testing vendor and reports an 80 percent proficiency rate. Now
let’s say Worthington City Schools (suburban Columbus) selects PARCC and
reports a 50 percent proficiency rate. Should we infer that Columbus students
are actually achieving at higher levels than Worthington? Or is the test just
different? Based solely on these test data, we’d have no clue.
State assessment policy should not amount to a Choose Your Own
Adventure for districts and schools. Instead, Ohio legislators must continue to
implement a single, coherent system of standardized exams that provides
comparable results.
Reason 3: Accountability
Like it or not, standardized exam data remain the best way to hold schools
accountable for their academic performance. To its great credit, Ohio is
implementing a cutting-edge school accountability system. The accountability
metrics include robust measures often referred to as “student growth” or
“value-added” measures, along with conventional proficiency results and
college-admissions results. All of these outcome measures are based on
standardized test results.

The information from these accountability measures enables policymakers to


identify the schools that need intervention, up to closure. For example, the
charter school automatic closure law uses state exam results—both school-
level value added and proficiency—to determine which schools must close. In
addition, districts can go into state oversight via the Academic Distress
Commission if they are low-performing along test-based outcomes. Another
use of standardized testing data is coming in the area of deregulation. One
priority bill being considered in the Senate (SB 3) would give “high-performing”
districts certain flexibilities and freedoms from state mandates. How are these
high performers identified? Answer: Through state accountability measures,
based on standardized test scores. 
Outside of standardized test results, no objective method exists for
policymakers to identify either poor-performing schools needing intervention
or high-performing schools deserving rewards. Consider the alternative: Who
would want policymakers to intervene in a school based on their “gut feeling”
or reward a school based on anecdotes? Statewide standardized exams are
essential for upholding a fair and objective accountability system.

In a utopian world, one could wish away standardized tests. All schools would
be great, and every student would be meeting their potential. But we live in
reality. There are good schools and rotten ones; there are high-flying students
and pupils who struggle mightily. We need hard, objective information on
school and student performance, and the best available evidence comes from
standardized tests. Policymakers need to be careful not to undermine the
integrity of the state’s standardized tests. 
