
13 Achievement Assessment

JENNIFER WHITE, NANCY MATHER, DEBORAH ANNE SCHNEIDER, AND JENNIFER BRADEN KIRKPATRICK

Achievement tests are instruments designed to measure performance in a single academic domain or across multiple academic domains. They may be administered to groups or individuals. The information derived from achievement tests may be used for a variety of purposes in education, including formative evaluation, summative evaluation, course or program placement, and/or special education placement. The results may also be used to help identify specific learning disabilities, document an individual's strengths and weaknesses, design instructional programs, monitor progress, and conduct research. The most common types of these tests are comprehensive achievement batteries that measure many aspects of achievement; single-subject area tests, such as reading, writing, or mathematics; curriculum-based measurements (CBMs), designed to provide ongoing evaluation of a student's progress toward curriculum-based achievement goals; and informal tests of achievement, such as teacher-made tests.

Comprehensive achievement batteries measure an individual's performance across major academic areas (reading, written language, and mathematics), whereas standardized single-subject tests measure performance in only one achievement area, though typically in much greater detail and depth. Both of these types of assessments are norm-referenced and produce scores that provide an estimate of an individual's group standing or rank relative to peers. CBMs are typically brief, time-efficient probes of achievement that are closely aligned to curricular and learning objectives. They may be standardized or nonstandardized and are designed to provide information that can be used to inform teaching on an ongoing basis (Hosp, Hosp, & Howell, 2016; Hosp & Suchey, 2014). As an example, a CBM probe of reading may constitute a one-minute timed test of reading fluency at grade level or at the student's instructional level, whereas a CBM probe of mathematics may include a one-minute timed test of subtraction problems, also at grade level or at the student's instructional level. Finally, informal measures of achievement include various types of teacher-developed assessments, such as oral and written exams. The results of these assessments may be used formatively, to monitor student progress and develop and revise instructional goals, or summatively, to assess student achievement at the end of a unit or course.

In this chapter, we provide a more thorough description of comprehensive achievement tests, single-subject achievement tests, and curriculum-based measurements. We also discuss advances in technology, issues related to achievement testing, matters of culture and diversity, and misuses and misinterpretations of achievement testing. Finally, we include several interpretive and practical recommendations for achievement testing.

COMMONLY USED COMPREHENSIVE ACHIEVEMENT TESTS

Three examples of widely used norm-referenced achievement tests are the Woodcock-Johnson Tests of Achievement – Fourth Edition (WJ IV ACH; Schrank, Mather, & McGrew, 2014a), the Wechsler Individual Achievement Test – Third Edition (WIAT-III; Wechsler, 2009), and the Kaufman Test of Educational Achievement – Third Edition (KTEA-3; Kaufman & Kaufman, 2014). See Table 13.1 for an overview of the major norm-referenced achievement tests.

Table 13.1 Major norm-referenced achievement tests

Woodcock-Johnson Tests of Achievement IV (age range: 2 years to 90+ years)
  Standard scores: Broad Reading, Basic Reading Skills, Reading Fluency, Reading Comprehension, Reading Rate, Phoneme-Grapheme Knowledge, Broad Mathematics, Math Calculation Skills, Math Problem Solving, Broad Written Language, Written Expression, Basic Writing Skills, Academic Skills, Academic Fluency, Academic Applications, Academic Knowledge, Brief Achievement, Broad Achievement
  Domain/cluster subtests: 20 subtests in the standard and extended batteries covering Reading, Mathematics, Writing, and Academic Knowledge (Science, Social Studies, and Humanities); Oral Language subtests are in a separate battery
  Normative sample: 7,416 individuals ages 2 to 95 years, representing 46 states and the District of Columbia (matching demographics of the US population)
  Psychometrics: cluster reliability over 0.90, ages 5 to adult; test reliability over 0.80; content validity – correlations with the WIAT-III (grades 1 to 8) and KTEA-II (ages 8 to 12) of 0.50 to 0.90
  Scoring: online only
  Alternate forms: 3 alternate forms of the Standard Battery only
  Time: core battery (Tests 1 to 6), 40 minutes; 5 to 10 minutes per subtest; Test 6 (Writing Samples), 15 to 20 minutes

Wechsler Individual Achievement Test III (age range: 4 years to 50 years 11 months)
  Standard scores: Oral Language, Total Reading, Basic Reading, Reading Comprehension, Oral Reading Fluency, Written Expression, Mathematics, Math Fluency, Total Achievement
  Domain/cluster subtests: 16 subtests covering Reading, Mathematics, Written Expression, and Oral Language
  Normative sample: 2,775 students pre-K to grade 12, stratified by grade, age, sex, race/ethnicity, parent education, and geographic region to represent US demographics; adult norms also available
  Psychometrics: internal consistency reliability .80 for all subtests but Listening Comprehension (.75); validity – stronger correlations among math composites, with intercorrelations ranging from 0.46 to 0.93 among the 8 composite scores
  Scoring: online using scoring software (Q-Global) or by hand
  Alternate forms: none
  Time (average student, grades K to 12): 1 to 15 minutes per subtest; 13 to 30 minutes for each composite

Kaufman Test of Educational Achievement 3 (age range: 4 years to 25 years)
  Standard scores: Reading, Math, Written Language, Academic Skills Battery Composite, Sound-Symbol, Decoding, Reading Fluency, Reading Understanding, Oral Language, Oral Fluency, Comprehension, Expression, Orthographic Processing, Academic Fluency
  Domain/cluster subtests: 19 subtests covering Reading, Mathematics, Written Language, and Oral Language
  Normative sample: 3,000 individuals ages 4 to 25 (age norms) and 2,600 individuals pre-K to grade 12, based on two (fall and spring) samples; norming sample is representative of the US population with respect to race, ethnicity, geographic region, special education status, and gifted and talented status
  Psychometrics: internal consistency reliability range 0.72 to 0.98; Academic Skills Battery composite 0.98 across grades and forms
  Scoring: online using scoring software (Q-Global) or by hand
  Alternate forms: 2 alternate forms
  Time (pre-K to grade 12): 3 to 18 minutes per subtest; 7 to 35 minutes per composite; 16 to 81 minutes for the Academic Skills Battery

Woodcock-Johnson Tests of Achievement – Fourth Edition

The WJ IV ACH (Schrank et al., 2014a) is a companion instrument to the Woodcock-Johnson IV Tests of Cognitive Abilities (WJ IV COG; Schrank, McGrew, & Mather, 2014b) and Tests of Oral Language (WJ IV OL; Schrank, Mather, & McGrew, 2014b). These three instruments form the Woodcock-Johnson IV (WJ IV; Schrank, McGrew, & Mather, 2014a), a comprehensive system of individually administered tests that is designed, based on Cattell-Horn-Carroll (CHC) theory, to measure important broad and narrow factors. Overall measures include general intellectual ability, specific CHC abilities, oral language abilities, and achievement. Depending on the purpose of the assessment, the WJ IV batteries may be used independently,
in conjunction with each other, or with other assessment instruments. All of the tests are contained in two easel test books called the Standard Battery and the Extended Battery. A compact disk (CD) is provided with the technical manual, and an audio recording is provided for the Spelling of Sounds test for standardized administration. The Extended Battery can be used with any of the three forms of the Standard Battery and includes tests that provide greater breadth of coverage in each academic area.

Normative data and psychometric properties. The WJ IV ACH was normed for use with individuals ranging from the preschool to geriatric ages. Grade norms are reported for each tenth of a year from grades K.0 through 17.9. The age norms are reported for each month from ages two through eighteen and then by one-year intervals from nineteen through ninety-five plus years of age. Complete technical information is available in the Woodcock-Johnson IV: Technical Manual (McGrew, LaForte, & Schrank, 2014).

Unique features. Two unique features of the WJ IV ACH are the variation and comparison procedures. For the intra-achievement variation procedure, the results from six core tests (two reading tests, two written language tests, and two mathematics tests) can be compared to determine relative strengths and weaknesses among the measures of achievement. An individual's obtained standard score for each test is compared to a predicted standard score that is based on the average of the other five core tests. For example, the standard score on Test 1: Letter-Word Identification would be compared to a predicted score that is derived from the average standard score obtained from the other five core tests. Additional tests and clusters can also be included in this comparison procedure.

For the ability-achievement comparison procedure, the Academic Knowledge cluster, which consists of the orally administered Science, Social Studies, and Humanities tests, can be compared to all other areas of achievement (reading, written language, and mathematics). This comparison helps evaluators determine if an individual's levels of reading, written language, and mathematics achievement are commensurate with their overall level of acquired academic knowledge and whether or not a more comprehensive evaluation should be considered. For example, the comparison can reveal that a student has significantly higher academic knowledge than basic reading skills, which can suggest the existence of a reading disability and should be explored more extensively.


Wechsler Individual Achievement Test – Third Edition

The WIAT-III (Wechsler, 2009) is another standardized, comprehensive assessment used in schools, private practice, and clinical settings. The WIAT-III provides information about the achievement of individuals, prekindergarten (pre-K) to fifty years, in reading, writing, math, listening, and speaking. Similar to the WJ IV ACH, this assessment can be used to assess specific skill areas or a broad range of academic achievement, based on the subtests administered. The WIAT-III is often used in conjunction with the Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V; Wechsler, 2014) to provide a comprehensive evaluation of academic skills and intellectual abilities, as well as patterns of strengths and weaknesses between and within cognitive and achievement profiles.

To guide evaluators, the test kit includes two administration manuals: (1) the Examiner's manual, which includes guidelines for administration, scoring, and interpretation of results; and (2) a CD containing the Technical manual, which describes the development, standardization, reliability, and validity of the assessment. Administration materials correspond to specific subtests and include the Stimulus book for subtests including visual stimuli, the Oral Reading Fluency booklet for reading passages, the Word Card and Pseudo Word Card for reading isolated words and nonwords, and an audio CD for listening comprehension.

Normative data and psychometric properties. The WIAT-III was most recently standardized in 2008 on 2,775 students in pre-K through grade 12. Adult norms are also available, based on a normative sample that was collected a year after the initial release of the WIAT-III. The WIAT-III technical manual (Breaux, 2009) provides detailed information concerning the instrument's psychometric properties and norming data.

Unique features. For the WIAT-III, the subtests that contribute to each composite score vary, depending on the grade of the individual being tested. For example, thirteen of the sixteen subtests contribute to the Total Achievement composite score, but the Early Reading Skills subtest and Alphabet Writing Fluency subtest only contribute to the Total Achievement composite for grades pre-K–1; the Oral Reading subtest score is available only for grades 2–12; and the Spelling subtest only contributes to the Total Achievement composite score in grades 2–12. This type of grade-dependent clustering is also present in the composite scores of Total Reading, Math Fluency, and Written Expression.

Another feature of the WIAT-III is that, depending on the purpose of the assessment, practitioners have the ability to interpret scores both broadly, through analysis of standard scores, and narrowly, through item-level skill analysis. The item-level skills analysis is available for seven subtests and identifies the skills involved in each item. For example, if an individual scores poorly on the Word Reading subtest, an item-level analysis can be completed through Q-Global. This breaks down each missed item into the specific features of the words read (i.e., morphology, vowel types, consonant types, etc.), allowing evaluators to determine which skills the participant is having difficulty with and which specific skill areas to target for intervention. The pattern of strengths and weaknesses analysis is available if the WIAT-III has been paired with the WISC-V, the Wechsler Preschool and Primary Scale of Intelligence – Fourth Edition (WPPSI-IV), the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV), the Differential Ability Scales – Second Edition (DAS-II), or the Kaufman Assessment Battery for Children – Second Edition (KABC-II) and will provide information comparing cognitive strengths and weaknesses to achievement strengths and weaknesses.

Intervention goal statements that provide examples of annual goals with short-term objectives are also available through the Q-Global program. These goals are based on the specific subtest skills on which the individual made one or more errors and include recommended intervention tasks to assist the practitioner with developing goals for individualized education plans and selecting academic interventions. As with all scoring software, an evaluator would also consider additional factors such as the student's background, educational history, classroom performance, and available resources when interpreting results and developing program goals.

Kaufman Test of Educational Achievement – Third Edition

A third comprehensive, standardized assessment is the KTEA-3 (Kaufman & Kaufman, 2014). Similar to the WJ IV and WIAT-III, this assessment is an individually administered battery with measures of reading, math, written language, and oral language. As with other standardized achievement tests, the KTEA-3 is designed for use in initial evaluations and reevaluations to gain information about specific academic skills and/or broad academic abilities, as well as to measure progress or response to intervention.

The KTEA-3 includes the following materials: an administration manual, scoring manual, stimulus books, written expression booklets, record forms, response booklets, and a stopwatch. The KTEA-3 includes a flash drive that contains the Technical and Interpretive Manual (Kaufman, Kaufman, & Breaux, 2014) and the audio files for administering the Listening Comprehension subtest. It also includes demonstrations of administration for several of the subtests, forms and normative data for hand scoring, qualitative observation forms, and letter checklists.

Normative data and psychometric properties. Normative data were collected for the KTEA-3 over two years (2011–2013). Half were tested in the fall and half were tested in the spring to create fall and spring norms. Approximately half of the norm group received Form A and half received Form B, to establish parallel forms. The KTEA-3 Technical manual (Kaufman, Kaufman, & Breaux, 2014) provides detailed information concerning the instrument's psychometric properties and norming data.

Unique features. The KTEA-3 content has been mapped to Common Core Standards and provides measures of the eight specific learning disability areas specified by the Individuals with Disabilities Education Improvement Act (IDEA; 2004), as well as areas of impairment listed in the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) (American Psychiatric Association, 2013). Additionally, the Administration Manual provides guidance for practitioners using a CHC approach to assessment as well as to those using an Information Processing Approach (Kaufman & Kaufman, 2014).

Online scoring software (Q-Global) is available for the KTEA-3. Optional error analysis with norm tables is also available. In this analysis, the number of errors made by an individual is compared to the number of errors made by grade-level peers who attempted the same items. Two types of error coding are included: item-level and within-item-level analysis. In the case of item-level analysis, some subtests provide error categories that correspond to each missed item. For example, on the written expression subtest, practitioners can determine whether an individual missed an item due to errors in categories such as capitalization, punctuation, structure, word form, or task. Qualitative analyses of errors explain why an item was missed rather than just assigning an error category. For instance, on subtests such as Spelling, Nonsense Word Decoding, and Letter and Word Recognition, an examiner has the option to determine if errors were made for reasons such as using an incorrect vowel or consonant sound, confusing single and double consonants, or making errors on consonant blends and digraphs. By understanding the nature of individual errors, an evaluator can gain a much greater understanding of an individual's specific instructional needs. A special issue of the Journal of Psychoeducational Assessment, devoted to research investigations on the kinds of errors students make on the KTEA-3 subtests, provides useful data for interpreting achievement errors – within the contexts of cognitive profiles and educational interventions – for both normal and clinical samples (Breaux et al., 2017).

Summary. Comprehensive batteries such as the WJ IV ACH, WIAT-III, and KTEA-3 are useful in evaluating student achievement across multiple content areas; however, there are situations in which using a comprehensive assessment of multiple domains is not the most appropriate choice. Single-subject achievement tests that focus on a specific academic area provide a more in-depth understanding of that particular skill and can also be more time- and cost-efficient than administering a comprehensive achievement battery.

SINGLE-SUBJECT ACHIEVEMENT TESTS

A variety of content-specific achievement tests are available for a more in-depth assessment in specific academic domains, such as reading, writing, and mathematics. Three examples of norm-referenced, content-specific
achievement tests are the Woodcock Reading Mastery Test – Third Edition (WRMT-III; Woodcock, 2011), the Test of Written Language – Fourth Edition (TOWL-4; Hammill & Larsen, 2009), and the Key Math – Third Edition (Key Math-3) Diagnostic Assessment (Connolly, 2007).

The WRMT-III provides an evaluation of a wide variety of reading readiness and achievement skills within nine subtests, including phonemic awareness, oral reading fluency, word identification, listening comprehension, passage comprehension, and rapid automatized naming. This comprehensive reading test is designed for individuals aged four years and six months to seventy-nine years and eleven months, as well as those in grades K–12.

The TOWL-4 is a diagnostic test of written expression that measures conventional, linguistic, and conceptual aspects of student writing such as vocabulary, spelling, punctuation, sentence building, and story composition. This assessment was intended for use with individuals between the ages of nine years and seventeen years and eleven months.

Finally, the Key Math-3 provides specific measurement of a range of essential math skills, from rote counting to factoring polynomials. All subtests are categorized into three broad math abilities (basic concepts, operations, and applications) to provide information about overall math achievement, as well as an individual's performance on particular math skills.

These types of single-subject assessments are valuable for identifying specific strengths and weaknesses in a particular subject area, developing detailed instructional goals, documenting growth and monitoring progress, and supporting decisions regarding additional educational services and supports.

CURRICULUM-BASED MEASUREMENTS

CBMs are brief, curriculum-aligned assessments of students' achievement, most often in the foundational skills of reading and math. These instruments have been shown to provide valid and reliable insights into students' progress and provide data that can be used to assess response to or effectiveness of instruction and inform instructional planning (Fuchs, 2016; Fuchs & Fuchs, 2002).

Curriculum-based measurements – which may be norm-referenced, criterion-referenced, or both – are generally used to measure fluency in basic skills in a particular area over time. CBMs are considered general outcome measures and do not provide an extensive or thorough understanding of students' achievement in a broad domain. They can, however, provide practitioners with an overall indicator of student progress in a particular skill area. In schools, they often serve as critical data for making decisions in a Multi-Tiered System of Supports (MTSS) or a Response to Intervention (RTI) model for addressing students' specific academic needs. Within these models, the information obtained from CBMs is often used as part of a process of assessing individual student skills, growth, and response to instruction through frequent progress monitoring (Jones, Southern, & Brigham, 1998).

CBMs have a variety of uses in education, and administration frequency may vary depending on the assessment purpose. Typically, CBMs are administered based on student need – three times a year for universal screening and at regular intervals, often one to two times per week, for students in need of intensive intervention. In most cases, teachers count the number of correct responses and errors and then chart each student's score on a graph. When students also participate in the recording and tracking of their CBM data, they make more progress and have a greater understanding of and involvement in their own learning processes (Davis & Fuchs, 1995).

Extensive research supports the use of CBMs in the areas of reading, math, and writing, most commonly on timed tasks (Fuchs, 2016; Fuchs et al., 2001). CBMs used in these domains may include measurement of reading fluency on specific levels of text, solving math fact problems, and spelling words. The speed and accuracy with which tasks are performed have been linked to academic success, especially with respect to oral reading fluency (ORF). ORF is frequently evaluated using CBMs and provides a strong proxy measurement of overall reading proficiency, including comprehension (Fuchs et al., 2001; Shinn et al., 1992; Van Norman, Nelson, & Parker, 2018). Many CBMs use a measure of words read correctly per minute (WCPM), which is used to assess the rate and accuracy of ORF. Assessments that also include measurements of prosody and passage comprehension, in addition to rate and accuracy, provide a more precise measurement of ORF and comprehension ability (Valencia et al., 2010). To date, Hasbrouck and Tindal (2017) have published the most comprehensive set of ORF norms, which provide benchmark guidelines for students in grades 1–6. Several popular, evidence-based CBM tools have been adopted by schools and districts for educators to use in their classrooms. Table 13.2 provides web addresses for several commercially available CBM tools.

Strengths of Curriculum-Based Measurements

The greatest benefits of CBM measures are that they are quick, easy to administer and score, sensitive to growth, and provide immediate feedback regarding student performance. This information allows educators to frequently gauge the success of an intervention as well as the student's response to instruction, so that instruction can be adjusted and altered accordingly, if needed. The information produced by CBMs can also be used as a piece of data in determining special education placement and services. CBM data in isolation, however, are not adequate to determine that a student has a disability. The main utility is to assist educators in finding interventions that work for students, as well as determining whether a student exhibits a need for specialized instruction.
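The scoring arithmetic behind the one-minute reading probes described above is simple enough to sketch. The benchmark value below is hypothetical; in practice it would come from published grade-level ORF norms such as those mentioned earlier.

```python
# Minimal sketch of scoring a one-minute oral reading fluency (ORF) probe.
# The benchmark value is hypothetical; in practice it would come from
# published grade-level ORF norms.

def score_orf_probe(words_attempted: int, errors: int, seconds: float = 60.0):
    """Return words correct per minute (WCPM) and accuracy for a timed probe."""
    words_correct = words_attempted - errors
    wcpm = words_correct * (60.0 / seconds)          # normalize to one minute
    accuracy = words_correct / words_attempted if words_attempted else 0.0
    return wcpm, accuracy

if __name__ == "__main__":
    HYPOTHETICAL_BENCHMARK_WCPM = 100    # illustrative grade-level target
    wcpm, accuracy = score_orf_probe(words_attempted=112, errors=8)
    print(f"WCPM = {wcpm:.0f}, accuracy = {accuracy:.0%}")
    if wcpm < HYPOTHETICAL_BENCHMARK_WCPM:
        print("Below benchmark: flag for more frequent progress monitoring.")
```

Charting the resulting WCPM values over repeated administrations produces the progress graph that teachers and students use for decision making.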


Specific training is required to ensure the fidelity of CBM administration and scoring procedures. Reliability checks should be in place to monitor fidelity. The process, however, does not require specialized knowledge and teachers, paraprofessionals, and even peer tutors can learn to administer CBMs correctly (VanDerHeyden, Witt, & Gilberson, 2007).

CBMs are also cost-effective, as teachers can create materials or only need to print or copy the required materials. The flexibility in materials allows for adaptation to each student's learning goals. A further benefit of CBMs is the easy production of a number of alternate forms, allowing for repeated testing. Frequent administration and measurement help to track a student's progress over time, as opposed to measurement from one assessment at one point in time.

Table 13.2 Websites that provide information on curriculum-based measurements (CBMs)

  AIMSweb                                          www.aimsweb.com
  CBM Warehouse                                    www.interventioncentral.org/cbm_warehouse
  DIBELS                                           http://dibels.uoregon.edu
  Edcheckup                                        www.edcheckup.com
  FAST                                             www.fastbridge.org
  McGraw-Hill                                      www.mhdigitallearning.com
  National Center on Student Progress Monitoring   www.studentprogress.org

Weaknesses of Curriculum-Based Measurements

Despite their usefulness in measuring student growth and informing instructional planning, CBMs have several important limitations. Because CBMs are general outcome indicators, they only provide a general overview of a student's fluency in a given area. While CBMs produce a snapshot of a student's progress relative to specific curricular goals, they are not intended to provide an in-depth picture of student achievement across a domain. While they may be used as part of a multimethod, multisource, multisetting approach to assessment typically used in the schools (NASP, 2016), they are not meant to be used in isolation to make diagnostic inferences. Furthermore, the focus of many CBMs is on the measurement of the acquisition of fluency in basic skills, making them more appropriate for use in elementary settings than in secondary schools, where the subject matter becomes more complex and difficult to assess.

CBMs are fluency measures and thus timed tasks in which the speed of production is key. This may render them ineffective for students with significant weaknesses in processing speed and may produce anxiety in struggling students (Deeney & Shim, 2016). Moreover, CBMs may not adequately capture the progress of students who tend to work slowly and carefully. The validity and reliability of inferences produced by CBMs also vary greatly due to the lack of standardization among certain CBM materials, especially those that are teacher-designed. While commercially produced CBMs are often standardized, problems may occur when these assessments are not administered with fidelity. A final consideration is that potential differences exist for identifying students for additional intervention when using different instruments and cut scores to make screening decisions (Ford et al., 2017). For example, students have been shown to read fewer words using DIBELS Next than with Formative Reading Assessment System for Teachers (FAST) or AIMSWeb (Ford et al., 2017). Such differences in results could affect the type, frequency, and intensity of interventions prescribed. Thus, while there are a great number of advantages, these weaknesses need to be considered when using CBMs for assessment and progress monitoring.

TECHNOLOGICAL ADVANCES

Information and computer technology–based (ICT-based) assessments of achievement use digital technologies to generate, deliver, score, and/or interpret tests (Singleton, 2001). In this section, we provide a brief overview of the history of ICT-based assessments of achievement and a description of their current uses, as well as a discussion of the advantages and disadvantages of ICT-based achievement assessment.

Overview

Though rudimentary ICT-based assessment has been attested to as early as the 1960s (Rome et al., 1962), early computers lacked the processing power and storage capacity sufficient to facilitate ICT-based testing in the classroom environment. Consequently, their use in assessment remained limited through the mid-twentieth century. The use of computers to perform simple assessments of declarative knowledge began to develop in earnest in the United States in the 1970s; however, computer-based assessments of achievement generally remained constrained to the evaluation of content knowledge through the mid-1990s (Shute & Rahimi, 2017). By the late 1990s, computing capacity had grown to the point where it became possible to assess not only declarative knowledge but also problem-solving ability and other more complex academic skills using digital technologies (Shute & Rahimi, 2017). Today's ICT-based assessments have become even more sophisticated. While some continue to assess simple content knowledge much in the same way as paper-and-pencil tests, others immerse students in virtual worlds or simulations designed to stealthily evaluate critical thinking and practical application of learned skills. Many ICT assessments adapt to students' instructional level, streamlining the assessment process and improving the accuracy of evaluation (Shute et al., 2016).
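The adaptation just described, and elaborated in the Advantages section below, can be illustrated with a deliberately simplified rule in which difficulty moves up one step after a correct response and down one step after an error. Operational computer-adaptive tests select items with item response theory models; the step rule, difficulty scale, and simulated student in this sketch are all hypothetical.

```python
# Minimal sketch of a step-rule adaptive assessment: difficulty increases after
# a correct response and decreases after an incorrect one. Operational
# computer-adaptive tests select items with item response theory models; the
# step rule, difficulty scale, and simulated student here are hypothetical.

import random

def run_adaptive_probe(answer_item, n_items: int = 20,
                       start: int = 5, lowest: int = 1, highest: int = 10) -> int:
    """Administer n_items, moving difficulty one step after each response.

    answer_item(difficulty) -> bool supplies the student's response.
    The final difficulty level serves as a rough proficiency estimate.
    """
    difficulty = start
    for _ in range(n_items):
        if answer_item(difficulty):
            difficulty = min(highest, difficulty + 1)   # correct: harder item next
        else:
            difficulty = max(lowest, difficulty - 1)    # incorrect: easier item next
    return difficulty

if __name__ == "__main__":
    TRUE_LEVEL = 7  # hypothetical student proficiency
    simulated_student = lambda d: random.random() < (0.9 if d <= TRUE_LEVEL else 0.2)
    print("Estimated proficiency level:", run_adaptive_probe(simulated_student))
```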


Contemporary ICT-based assessments of achievement may be used for diagnostic, formative, or summative purposes. When used for diagnostic purposes, ICT-based assessments of achievement allow the evaluator to identify and target strengths and weaknesses in student achievement, often at key points during the instructional cycle, such as the beginning and end of a term. Diagnostic assessments may be used at the individual level to identify areas in need of remediation or they may be used at the cohort level to detect common gaps in students' learning. Similarly, formative ICT-based assessments provide feedback on students' learning, so teachers can adjust instruction at the individual or group level. Summative ICT-based assessments, by contrast, are used to evaluate student learning relative to expected achievement targets or learning goals, often at the end of a unit or course.

Whether it is for formative, summative, or diagnostic purposes, the use of ICT-based assessment of achievement has grown exponentially since its inception. An article in Education Week indicated that ICT-based assessments are rapidly displacing print assessments and their use is expected to increase by 30 percent in only three years (Molnar, 2017). Because the stakes of these assessments are often high, both in terms of monetary cost and in terms of educational decisions affected by student outcomes (e.g., placement, tracking, teacher evaluation, and school funding), it is important to understand both the advantages and the potential disadvantages of ICT-based assessments of achievement.

Advantages

ICT-based assessments of achievement have several potential advantages over traditional paper-and-pencil assessments. One important potential advantage is the availability of computer-adaptive testing. ICT-based assessments may be designed such that item difficulty and/or content are adaptive in response to student performance on previous items, streamlining the assessment and ensuring that the item difficulty remains consistent with the student's actual level of proficiency. This increased efficiency leads to reduced fatigue, boredom, and frustration for students. On a computer-adaptive assessment, when a student answers an item or items correctly, the next item may increase in difficulty, whereas when a student answers an item or items incorrectly, the next item may decrease in difficulty. It is therefore possible to quickly establish, with fewer items, a student's present level of proficiency. Furthermore, the potential for administration of out-of-grade–level items improves both measurement accuracy and test efficiency for students who perform significantly above or below their grade-level peers (Wei & Lin, 2015). By contrast, most paper-and-pencil tests require all students to answer the same questions, without respect to the proficiency level of the individual student. As a result, such tests may provide little information about individual students whose proficiency levels differ substantially from the level for which the test was designed.

As ICT-based tests of achievement have become more sophisticated, item pools have grown larger, potentially improving content validity by providing broader construct coverage (Huff & Sireci, 2001). Furthermore, innovative item types have become increasingly available, improving construct validity by permitting the measurement and evaluation of higher-order skills. For example, the use of audio, video, and simulated performance tasks in item design has allowed for the measurement of facets of achievement that might be far more difficult to evaluate using more conventional means of assessment (Huff & Sireci, 2001). Furthermore, such items may provide content in multiple modalities, potentially improving accessibility among diverse learners. More sophisticated ICT-based assessments may also permit the evaluation of complex problem-solving skills in a manner less easily achieved using paper-and-pencil–based assessments (Greiff et al., 2013).

On a practical level, most ICT-based standardized tests of achievement allow numerous students to be tested at the same time with less intrusion on instructional time than comparable paper-and-pencil tests. Data produced by these tests are often immediately available and can frequently be aggregated across classes, schools, and/or school districts. Furthermore, in the case of most standardized ICT-based assessments, large item pools are used and it is impossible to preview item content, making it much more difficult to teach to the test. This is particularly the case with computer-adaptive tests, whose content is individualized in response to student responses.

Disadvantages

While ICT-based tests of achievement offer many potential advantages, they have some potential disadvantages as well. Perhaps the most obvious of these is that schools may lack the ability to meet technical requirements. For example, some schools may not have enough computers capable of hosting the tests and others, particularly those in rural areas, might lack the bandwidth necessary to support online delivery.

Concerns also exist in regard to validity, in particular construct and content validity, that is, the degree to which the assessments measure what they purport to measure and the degree to which they provide adequate coverage of the constructs measured. Ironically, this is particularly true of computer-adaptive assessments of achievement and those with innovative item types, such as simulated performance tasks (Huff & Sireci, 2001). As to computer-adaptive tests, poor item-selection algorithms may contribute to inadequate coverage of a construct and, with performance task and other innovative item types, the probability that the items measure content outside of the construct may increase (Huff & Sireci, 2001). In addition, student characteristics can influence the validity of online
assessment results. For example, some students experience heightened test anxiety when using ICT-based assessments and others may struggle to maintain attention given the potentially distracting aspects of the medium. Still others may lack the computer literacy or keyboarding skills required to be successful.

Another potential disadvantage of ICT-based testing is the reduced opportunity for practitioners to make qualitative observations regarding the student's testing behaviors. When testing in person, an evaluator can observe when a student is struggling with a particular task or topic and garner insights into the nature of the difficulty. Observations of behaviors such as hesitating, guessing, fidgeting, losing focus, and failing to use strategies are often key factors in accurate test interpretation. More than eight decades ago, Monroe (1932) observed: "Two children, reading the same paragraph, may make the same number of errors, and yet their mistakes may be wholly different in nature. Their reading performances may be quantitatively the same but qualitatively unlike" (p. 34).

As with paper-and-pencil assessments, ICT-based assessments also have some security threats. Someone may provide a student with inappropriate assistance during test administration or the student may be able to access outside resources, such as consulting their smartphones for information. Furthermore, without careful monitoring, an individual may take an exam for another person. Finally, ICT-based assessments, like paper-and-pencil–based assessments, may be poorly designed or inadequately validated. When selecting such assessments, it is therefore essential to review the instrument's technical specifications to ensure that the psychometric properties are sound.

Examples of Online Assessments

Many ICT-based assessments of achievement have been developed to measure and evaluate reading skills. Brief descriptions of several instruments follow to illustrate examples of the types of assessments that are available. One example of an online reading assessment is the FAST, which is a suite of progress-monitoring tools designed for students in kindergarten to grade 5 (www.fastbridge.org). These assessment tools measure various reading skills and are individualized for each student. The FAST provides a CBM-Reading assessment, which monitors a student's progress and provides a measure of oral reading fluency; an Early Primary Reading assessment (kindergarten to grade 3), which includes print concepts, phonological awareness (blending and segmenting), letter sounds and names, decoding sight words, and sentence reading; and an Adaptive Reading tool that is similar to the testing format of many statewide assessments. These adaptive tests are all individualized, making them an efficient and effective way to monitor student reading progress.

Renaissance STAR Reading is another example of a computerized, adaptive, standardized (norm-referenced) reading achievement test for students in grades 1–12 (www.renaissance.com/products/assessment/star-360/star-reading-skills). The assessment may be administered in either English or Spanish. The assessment identifies the skills that students have mastered and suggests future content. The test is adaptive, so item content and difficulty are adjusted in response to student responses. Furthermore, it can be used repeatedly to measure progress without repetition of content. A variety of district, school, and individual student reports are available.

The MindPlay Universal Screener is an online diagnostic reading assessment that assesses an individual's reading skills within five to thirty minutes (www.mindplay.com). It can be used with a single student, a classroom, or an entire school district. One unique feature of this screener is that, after the assessment, the program creates an individualized prescriptive plan for each student. A student can then receive targeted instruction through the MindPlay Virtual Reading Coach. This online instructional and practice program addresses the student's specific areas of need and is designed to be used for thirty minutes, four to five days a week. The instruction is delivered by speech language pathologists and reading specialists.

A few online assessments address one specific area of reading. For example, MOBY.READ (AMI) provides a measure of reading accuracy and rate. This program is an iPad app where the student reads passages aloud. The app records the reading and then calculates the words read correctly per minute, the accuracy, and the expressiveness. The app can identify students in need of intervention, as well as track student progress. A teacher may print out reports and compare readings across time.

Another example of an online assessment is the Partnership for Assessment of Readiness for College and Careers (PARCC; https://parcc-assessment.org). The PARCC is a set of assessments designed to measure student performance in English-language arts and mathematics. It is designed to determine if a student is on a successful pathway to college. It includes a paper-based version, as well as an ICT-based version.

ALEKS is a comprehensive program for mathematics that includes both assessment and instructional components (www.aleks.com). It was developed from research at New York University and the University of California, Irvine, by software engineers, mathematicians, and cognitive scientists. ALEKS is an artificial intelligence engine that assesses each student's knowledge individually and continuously so that instruction is only provided in topics that
the student is ready to learn. The ALEKS Assessment asks the student about twenty to thirty questions to determine the current level of math knowledge. The questions are determined on the basis of the student's answers to all the previous questions. Thus, each student receives an individualized set of assessment questions. Once the assessment is completed, the student enters the Learning Mode, where a choice of appropriate topics to learn is presented.

Technological advances, including the examples provided in this section, have expanded assessment options for administrators and can be a viable option for formative, summative, and diagnostic assessment. Whether it be a technology-based assessment or a traditional paper-and-pencil test, factors can impact the reporting of results and interpretation of scores. The following sections of this chapter will focus on situations of noncredible reporting and factors to consider when interpreting results.

FACTORS IMPACTING VALIDITY

A number of factors can impact the validity of achievement tests, including lack of motivation or effort on the part of the examinee (Adelman et al., 1989), noncredible performance by the examinee, and administrator error (e.g., inappropriate test administration, scoring errors, or misinterpretation of results by the examiner). As with all standardized assessments, achievement assessment is built on the foundation of examiner competence in administration, scoring, and interpretation. Similarly, when interpreting achievement test results, examiners assume a reasonable, valid effort on the part of the examinee. This speaks to the importance of the examiner establishing rapport with the examinee prior to administering the test and having an awareness of what the student is able to do both in and out of the classroom (Adelman et al., 1989).

Another concern with all types of testing is the possibility of intentional falsification of results or noncredible performance by the examinee. Research has indicated that rates of noncredible performance in college populations for students being evaluated for learning disabilities may be as high as 15 percent (Harrison & Edwards, 2010; Sullivan, May, & Galbally, 2007). Related to this, examinees who feign performance on psychoeducational tests often do so in a way that is not detectable by the examiner (Harrison & Edwards, 2010; Harrison, Edwards, & Parker, 2008). As a result, researchers recommend adding measures of performance validity (PVTs) to psychoeducational batteries, especially when there is a perceived benefit from being labeled as having a learning disability (e.g., access to additional resources or accommodations) (DeRight & Carone, 2015; Harrison & Edwards, 2010; Harrison et al., 2010; Harrison et al., 2008). PVTs are typically measures that are easily passed, even by those with a diagnosed impairment; thus, poor performance is indicative of feigned poor performance (DeRight & Carone, 2015; Harrison et al., 2010; Harrison et al., 2008). Whereas the majority of the research in this area has been with college or adult populations, some data suggest that feigned impairment also occurs in children as young as eight or nine years old (Kirkwood et al., 2010; Lu & Boone, 2002). To minimize this factor when evaluating children, DeRight and Carone (2015) advised: "In both research and clinical practice, a combination of multitest and multimethod approaches is the current gold standard in the evaluation of test-taking effort with children" (p. 19).

In our practice, we have seen one example of such a case. A twenty-one-year-old woman claimed that she had dyslexia and that she needed extended time on all examinations. Apparently, she had read or been told that a major characteristic of dyslexia was poor spelling and reversed and transposed letters. On the WJ III Writing Samples test, she misspelled simple words (e.g., spelling the word "to" as "ot," the word "the" as "hte," the word "sun" as "snu," and "fresh air" as "frehs iar"). Three of her written responses are presented in Figure 13.1.

When she completed the WJ III Spelling test, however, she obtained a standard score of 105 and was able to spell numerous words correctly, with no reversals or transpositions. Examples of words that she spelled correctly are provided in Figure 13.2. After closer examination, we determined that this young woman suffered from anxiety regarding test-taking and that she clearly did not have dyslexia. Certainly, if she could not spell the word the correctly, she would not have been able to spell the word congenial correctly!

As with examinees, intentional misrepresentation is rare among examiners; it is extremely unusual for an examiner to falsify test results. Nonvalid scores more often occur if an examiner has failed to administer tests that appropriately address the referral question(s) or has made an error in test administration and/or scoring. An example of a scoring error from a recent report on the core tests of the WJ IV ACH is in Figure 13.3.

In this report, the examiner failed to notice the discrepancies between the standard scores and percentile ranks on the Spelling and Passage Comprehension tests. An average standard score would convert to an average percentile rank (25th to 75th percentile). A standard score of 105 would result in a percentile rank of 63, not 25. In this case, the evaluator's discussion focused on the percentile ranks, so the interpretation of the tests' results was inaccurate and misleading. This type of error can occur if an evaluator fails to review the obtained scores critically or does not verify that the tables created for a report are accurate. Another error that occurs is a miscalculation when totaling scores.

Examiners need to be alert to potential threats to the validity of assessment results. A careful review of scoring and administration procedures, as well as the interpretation of performance, can mitigate examiner error. Sound assessment practices using multiple data sources, including, when called for, the use of Symptom Validity Testing, can alert examiners that an examinee may not be putting forth the required level of effort, whether due to noncredible performance or a simple lack of motivation.
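One safeguard against the kind of reporting error discussed above is a mechanical consistency check: on a mean-100, standard-deviation-15 scale, each standard score implies an approximate percentile rank under the normal curve. The sketch below applies such a check to three of the scores shown in Figure 13.3; the five-point tolerance is an arbitrary illustrative choice, and published norm tables can depart slightly from the normal approximation used here.

```python
# Minimal sketch of a consistency check between reported standard scores and
# percentile ranks on a mean-100, SD-15 scale. The 5-point tolerance is an
# arbitrary illustrative choice; published norm tables can depart slightly
# from the normal approximation used here.

from math import erf, sqrt

def expected_percentile(standard_score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentile rank implied by a standard score under a normal distribution."""
    z = (standard_score - mean) / sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))

def flag_inconsistencies(reported: dict, tolerance: float = 5.0) -> None:
    """reported maps test name -> (standard score, reported percentile rank)."""
    for test, (ss, pr) in reported.items():
        implied = expected_percentile(ss)
        if abs(implied - pr) > tolerance:
            print(f"{test}: SS {ss} implies PR ~{implied:.0f}, but PR {pr} was reported")

if __name__ == "__main__":
    flag_inconsistencies({                      # values from Figure 13.3
        "Letter-Word Identification": (127, 98),
        "Spelling": (105, 25),                  # flagged: SS 105 implies PR ~63
        "Passage Comprehension": (105, 24),     # flagged
    })
```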


Figure 13.1 WJ III Writing Samples test: three example responses

Figure 13.2 WJ III Spelling test: spelling samples

Figure 13.3 WJ IV ACH test: scoring error example

  Six Core Tests               Standard Score (68% band)   RPI      Percentile Rank
  Letter-Word Identification   127 (122–131)               100/90   98
  Applied Problems             120 (115–125)               99/90    84
  Spelling                     105 (101–109)               95/90    25
  Passage Comprehension        105 (99–110)                94/90    24
  Calculation                  111 (106–116)               97/90    53
  Writing Samples              104 (99–109)                93/90    30

MISUSES AND MISUNDERSTANDINGS

As noted, one common error in the reporting of achievement test results is misreporting or misinterpretation of the obtained scores. Substantial misunderstanding often exists among educators regarding the meaning of test scores (Gardner, 1989). The types of scores that can be particularly confusing to untrained educators are (1) grade and age equivalents, (2) percentile ranks, and (3) standard scores. Although much of this discussion will seem rather obvious to seasoned evaluators, teachers and parents may be confused or lack understanding of this information.

Grade and Age Equivalents

Grade-equivalent scores are expressed as a whole number and decimal representing the grade and month of the school year. For example, 3.7 would represent grade 3, month 7. These scores are derived from the median raw score attained by sample participants at a particular grade level. Age-equivalent scores are much the same; however, they are reported using a whole number and hyphen representing the year and month at which the median sample participant attained a particular raw score. For example, 8-4 would represent eight years and four months.
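The derivation just described can be sketched directly: a grade equivalent is found by locating the obtained raw score among the median raw scores earned at each grade placement in the norming sample. The median values below are hypothetical, and the linear interpolation is a simplification of how published tests tabulate these values.

```python
# Minimal sketch of deriving a grade-equivalent score: the obtained raw score
# is located among the median raw scores earned at each grade placement in a
# norming sample. The median values below are hypothetical, and the linear
# interpolation is a simplification of published norm tables.

MEDIAN_RAW_BY_GRADE = [(3.0, 20), (4.0, 26), (5.0, 31), (6.0, 35)]  # (grade placement, median raw score)

def grade_equivalent(raw_score: float) -> float:
    """Interpolate the grade placement whose median raw score matches raw_score."""
    if raw_score <= MEDIAN_RAW_BY_GRADE[0][1]:
        return MEDIAN_RAW_BY_GRADE[0][0]
    for (g0, m0), (g1, m1) in zip(MEDIAN_RAW_BY_GRADE, MEDIAN_RAW_BY_GRADE[1:]):
        if m0 <= raw_score <= m1:
            return g0 + (g1 - g0) * (raw_score - m0) / (m1 - m0)
    return MEDIAN_RAW_BY_GRADE[-1][0]

if __name__ == "__main__":
    print(f"Raw score 29 -> grade equivalent {grade_equivalent(29):.1f}")
    # ~4.6 means the student scored as well as the typical student midway
    # through grade 4, not that grade 4 or 5 material is necessarily appropriate.
```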


Grade- and age-equivalent scores are somewhat problematic in that, despite a common misconception, they do not represent a standard to be attained or the grade at which a student should be receiving instruction (Gardner, 1989). If a student obtains a fifth-grade score on a computation test, it does not necessarily follow that the student can perform fifth-grade–level computations; rather, the score indicates that the student performed computations as well as the median fifth-grade student. A fifth-grade–equivalent score also should not be taken to indicate that fifth-grade instructional materials are appropriate for a student, as instructional materials are not typically aligned to grade-equivalent scores and a great deal of variability in difficulty exists among instructional materials produced by different publishers.

Although grade- and age-equivalent scores may provide some useful information concerning the relative achievement of students, they are not equal-interval scores and thus provide a very imprecise measure of growth. Additionally, a standard score in the average range may be represented by a grade- or age-equivalent score one grade above or below the student's current level if the student scores at either the top or the bottom of the average range. Owing to the high likelihood of misuses of these types of scores, age- and grade-equivalent scores must be interpreted with caution.

Percentile Rank

A percentile rank shows the percentage of obtained scores in a particular sample that is equal to or less than the specified obtained score. Percentile ranks are typically expressed using a range from 1 to 99, ranking the individual's obtained score within a distribution of 100, or from 0.1 to 99.9, ranking the individual's obtained score within a distribution of 1,000. For example, a rank at the 25th percentile would indicate that 25 percent of the participants taking a test had a score equal to or less than that of a specified individual. A percentile rank of 99.9 would indicate that the individual's score equaled or exceeded those of 999 out of 1,000 sample participants.

People who are unfamiliar with test interpretation may confuse percentile rank with percent correct. A rank at the 50th percentile does not indicate that an individual answered 50 percent of items correctly but rather that the individual's obtained score equaled or exceeded that of 50 percent of the norm group sample. To minimize the possibility of confusion, it is best not to abbreviate a percentile rank with the percent sign (e.g., 25th %ile). As with age- and grade-equivalent scores, percentile ranks are norm-referenced and do not provide interval-level data for establishing improvement in achievement. Thus, they also provide an imprecise measure of growth. Although test developers assign different qualitative descriptors to score ranges and, by association, to percentile ranks, typically the "average range" is considered to be from the 25th to the 75th percentile.

Standard Scores

Like percentile ranks, standard scores are norm-referenced and indicate an individual's relative group standing: They compare the individual to others of the same age or grade who took the same test. Unlike percentile ranks, standard scores are expressed in standard deviation units, that is, the number of standard deviations from the mean of the sample at which a particular obtained score falls. Achievement tests frequently have a mean of 100 and a standard deviation of 15 for their composite scores; however, there is some variation, particularly with respect to subtests, which sometimes use scaled scores that have a mean of 10 and a standard deviation of 3.

Both parents and teachers can be confused about how to interpret statements about standard scores and percentile ranks. Consider this example from Schneider and colleagues (2018) regarding the clarity and interpretation of these two statements:

1. Josie's score on the Woodcock-Johnson IV Spelling test was 95, which corresponds to a percentile rank of 37.
2. Josie can spell about as well as most children her age.

The first statement is quite precise but not particularly clear – at least not to an audience of nonexperts. One can imagine the thoughts of an intelligent but psychometrically naive parent: What is this test, the Woodcock-Johnson Eye-Vee Spelling test? Does it tell us all we need to know about a person's ability to spell? Is 95 a good score? What is a percentile rank? Does that mean Josie came in 37th place? … 'cause there aren't that many kids in her class. Or does it mean she got 37 percent correct? That does not sound like a good performance – we called that an F when I was in school. That's the thing about spelling tests, if you don't study in advance, you can really bomb 'em. I know a few times I sure did. Did Josie have the opportunity to study the spelling words in advance? If not, I don't see how the test is fair.

The second statement avoids these possible sources of confusion. Although it is in some ways less precise than the first statement, it has the virtue of being easy to understand correctly, keeping the focus squarely on what the reader actually needs to know (i.e., that spelling is not a problem for Josie). (p. 4)

Both teachers and parents may also become confused when examining standard scores to understand how much progress a student has made. For example, in grade 3, Rebecca obtained a standard score of 80 on the WIAT-III Math Fluency subtest. In grade 5, Rebecca had a standard score of 80 on the same test. At the school meeting, her father remarked that she had made absolutely no progress as her score had stayed exactly the same. The truth, however, is that Rebecca did make progress as she kept her place in the group standing across the two years (e.g., a fifth-grade student would need to answer more items correctly than a third-grade student to obtain a standard score of 80).
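The relationship between standard scores and percentile ranks can be made concrete with a small calculation. The sketch below is illustrative only: the function names are ours, and it assumes that scores are approximately normally distributed in the norm sample, whereas publishers derive actual percentile ranks from their norm tables.

```python
import math

def percentile_rank_from_sample(score, norm_sample):
    """Empirical percentile rank: the percentage of obtained scores in the
    norm sample that are equal to or less than the given score."""
    at_or_below = sum(1 for s in norm_sample if s <= score)
    return 100.0 * at_or_below / len(norm_sample)

def percentile_rank_from_standard_score(score, mean=100.0, sd=15.0):
    """Approximate percentile rank for a standard score, assuming normally
    distributed scores (mean 100, SD 15 by default; use mean=10, sd=3 for
    scaled subtest scores)."""
    z = (score - mean) / sd  # distance from the mean in standard deviation units
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF

# A standard score of 95 is one-third of a standard deviation below the mean,
# which corresponds to roughly the 37th percentile (the Josie example above).
print(round(percentile_rank_from_standard_score(95)))                # 37

# A scaled score of 7 (mean 10, SD 3) is one SD below the mean: about the 16th percentile.
print(round(percentile_rank_from_standard_score(7, mean=10, sd=3)))  # 16
```

The same arithmetic underlies the Rebecca example: an unchanged standard score of 80 across two years means an unchanged position relative to the norm group, even though the raw performance required to hold that position increased.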


Additional Factors

Several other factors can also affect test interpretation or cause confusion. One arises from placing too much emphasis on a single score or measure. A second stems from the various verbal labels assigned to test scores by the test authors or publishers. A third involves the misinterpretation of composite scores. A fourth revolves around the misuse of age or grade norms. A fifth and final factor centers on content validity, particularly among reading comprehension tests.

Drawing conclusions from one score or one test. Reliance on a single score, or even the results of a single composite test, is inadvisable for educational decision-making. Numerous factors contribute to an individual's achievement at any point in time and all measurement is subject to bias and error. Reflecting on the weaknesses inherent in testing, Linn (2000) cautioned: "Don't put all of the weight on a single test. Instead, seek multiple indicators" (p. 15). Brooks, Holdnack, and Iverson (2011), in their work with adults with traumatic brain injuries, also cautioned against the use of single subtest scores when measuring impairments in individuals. They emphasized that a low score on a subtest given in isolation may appear meaningful but a single low score from a composite or battery of tests is fairly common in the general population. This supports the use of multiple methods and sources of data, including factors such as client demographic characteristics, as well as the need for examiners with a sound background in psychometrics who are able to take into account factors such as the intercorrelations between subtests and the impact of interpreting a single score versus a battery of tests concurrently.

Furthermore, the evaluator also needs to consider contextual factors. In order to understand an individual's test performance, scores must be interpreted within the full context of gathered information, which would often include background information (e.g., prior services and interventions); classroom work samples; behavioral observations; interviews; and parent, teacher, and examinee self-reports. Gardner (1989) explains that a test score is just a numeric description of a sample of performance at a given point in time, but the score does not tell us anything about why the individual performed a particular way or what caused the performance described by the score.

Qualitative classification of achievement. An additional factor that may create misunderstanding regarding test scores is the different qualitative descriptors used by the test authors and publishers to classify achievement scores. Different publishers describe the same norm-referenced score using different labels. What one test publisher describes as "low average," another test publisher describes as "below average." As Dr. John Willis (2015) noted in a presentation: "My score is 110! I am adequate, average, high average, or above average. I'm glad that much is clear!" He went on to say: "It is essential that [we] know (and be reminded) precisely what classification scheme(s) we are using with the scores," as failure to do so increases the probability of inaccurate interpretation and use of testing data. Additionally, examiners should be well versed in the psychometric properties of tests, as well as various theoretical models related to disability identification when making clinical decisions and recommendations.

Interpretation of composite scores. Achievement tests routinely have individual subtest scores that are combined into composite scores. On most tests, the composite scores are not means or averages of the obtained scores across subtests; rather, they are a weighted composite that shows how the individual performed compared to those in the norm group across measures. Thus, if a composite consists of three subtests, the overall score reflects how the individual performed across all three measures, when compared to peers. Consequently, there is often not a readily apparent equivalence among subtest scores and composite scores. An individual who had a standard score of 70 on each of the four subtests comprising a composite may have a composite standard score lower than 70 (unless the particular test determines the cluster score as the mean of the four subtests) because of the decreased likelihood that an individual would perform uniformly low on each of the four subtests comprising the composite.

Another consideration regarding composite scores is that sometimes they mask the underlying concern or an individual's specific areas of strengths and weaknesses. For example, a reading battery may be composed of three different subtests: basic reading skills, reading fluency or rate, and reading comprehension. A composite score may obscure a weakness in one of these areas if individual subtest scores are not also considered. As an illustration, Manuel, a sixth-grade student, was referred for a reading evaluation. Although his overall reading composite score fell in the Average range, his individual subtest scores showed that he had a specific weakness in reading fluency; his reading fluency score was significantly lower than those of the other subtests. Without examining specific subtest scores, the evaluator may draw erroneous conclusions and fail to provide appropriate recommendations for intervention or accommodations.
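The arithmetic behind the point that uniformly low subtest scores can produce an even lower composite can be sketched briefly. The following is a simplified illustration rather than any publisher's actual norming procedure: it assumes the composite is formed by summing subtest z-scores and rescaling the sum by its own standard deviation, and the average intercorrelation of .60 among subtests is a hypothetical value.

```python
import math

def composite_standard_score(subtest_scores, avg_r, mean=100.0, sd=15.0):
    """Illustrative composite: convert each subtest standard score to a z-score,
    sum the z-scores, and rescale that sum by the standard deviation of a sum of
    k equally intercorrelated variables, sqrt(k + k*(k - 1)*avg_r). Real batteries
    norm the sum against the standardization sample, but the logic is similar."""
    k = len(subtest_scores)
    z_sum = sum((s - mean) / sd for s in subtest_scores)
    sd_of_sum = math.sqrt(k + k * (k - 1) * avg_r)
    return mean + sd * (z_sum / sd_of_sum)

# Four subtests, each exactly two standard deviations below the mean (SS = 70),
# with a hypothetical average intercorrelation of .60 among the subtests:
print(round(composite_standard_score([70, 70, 70, 70], avg_r=0.60)))  # about 64
```

Because a uniformly low profile is rarer than any single low score when the subtests are imperfectly correlated, the composite falls farther from the mean than the subtest scores themselves; only if the subtests were perfectly correlated (avg_r = 1.0) would the composite also equal 70.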


Use of age or grade norms. Many achievement tests provide both age and grade norms, so an evaluator can decide whether to compare the student to individuals of the same chronological age or to those of the same grade placement. If the results of an achievement test are going to be compared to the results of an intelligence test (which is sometimes done in evaluations for specific learning disabilities), then the achievement test should be scored with age norms, as most intelligence tests only provide age norms.

The selection of age or grade norms becomes most problematic in cases of retention. Should the student be compared to age peers who are in a higher grade or to grade peers? Some would argue grade peers, as the student has not yet been exposed to the next grade level of material. Others would argue age peers, as it provides a more accurate indication of how far behind the student really is. The best advice is to report both, adding an explanation to assist with interpretation.

Reading comprehension tests and what they purport to measure. A variety of tests are available for evaluating reading comprehension. Two major concerns, however, have been raised about these measures. One is that the comprehension questions can often be answered correctly without reading the passage. Another is that the results from various tests are often widely discrepant, owing to the different cognitive and linguistic abilities on which they place demands.

Several studies have shown that many commonly used reading comprehension measures have problems with passage independence, that is, students may be able to answer the items correctly without having read the associated passages. Consequently, these instruments may end up measuring acquired knowledge rather than reading comprehension. Keenan and Betjemann (2006) examined the content validity of the Gray Oral Reading Test – Fourth Edition (GORT-4; Wiederholt & Bryant, 2001) by assessing whether or not students could answer the questions correctly without reading the passages. Undergraduate students (n = 77) were able to answer 86 percent of the questions correctly without reading the associated passages, and a small group of students ages seven to fifteen (n = 10) were able to answer the questions with 47 percent accuracy, also without having read the associated passages.

Based on these findings, Keenan and Betjemann (2006) concluded that these items were unlikely to be sensitive measures for assessing reading comprehension or identifying reading disability. Additionally, they found that performance on these items correlated closely with performance on other comprehension tests and noted that the field of comprehension assessment needs to be more concerned about the passage independence of items. In the most recent version, the GORT-5 (Wiederholt & Bryant, 2012), the publishers noted that the comprehension questions had been completely revised and studies were conducted to demonstrate and ensure that the questions were passage-dependent.

Authors of another study (Coleman et al., 2010) investigated the content validity of the comprehension section of the Nelson-Denny Reading Test (Brown, Fishco, & Hanna, 1993) and obtained similar results. The authors asked university students with and without learning disorders to answer the multiple-choice comprehension questions without reading the associated passages. They found that the students' accuracy rates were well above chance for both forms of the test, as well as for both groups of students.

Kendeou, Papadopoulos, and Spanoudis (2012) examined the cognitive processes underlying performance on three different reading comprehension tests administered to students in grades 1 and 2: the WJ III Passage Comprehension test (WJPC), a cloze-based format that requires students to read a passage and fill in a missing word in a sentence (Woodcock, McGrew, & Mather, 2001); the CBM-Maze (see interventioncentral.org), a curriculum-based measurement timed test that requires students to read a passage with every sixth word omitted and choose the missing word from three options; and a test of reading recall that requires students to read a text and then recall it orally from memory. All three tests placed processing demands on decoding, whereas the CBM-Maze test also placed demands on fluency and vocabulary, and the WJPC placed additional demands on working memory. The authors noted that cognitive processes underlying performance on tests of reading comprehension are dependent on a number of factors, including test format; test length; and the availability of the text, that is, whether the person can continue to view the text while answering the questions. The authors concluded that, owing to the different cognitive processes on which performance on tests of comprehension may depend, an individual's performance on one measure would not necessarily correlate well with performance on another. For this reason, as well as the others previously discussed, it is important to use multiple measures of reading comprehension, as opposed to only one measure.

In another study, Keenan, Betjemann, and Olson (2008) examined the content validity of some of the most popular reading comprehension measures, including the GORT, the WJPC, the Reading Comprehension test from the Peabody Individual Achievement Test (PIAT; Markwardt, 1997), and the retellings and comprehension questions from the Qualitative Reading Inventory (QRI; Leslie & Caldwell, 2001). They found that the tests only had modest intercorrelations and that decoding skill accounted for most of the variance on both the WJPC and the PIAT.

More recently, Keenan and Meenan (2014) reaffirmed that reading comprehension tests are not interchangeable and that they do not necessarily provide equivalent measures of the construct. Nine hundred ninety-five children were assessed with the GORT-3, the WJPC, the PIAT, and the QRI-3. Results were more consistent for younger students, for whom weaknesses were most often caused by poor decoding skills, than for older students, for whom weaknesses had more variable causes. When examining the bottom 10 percent of performers, Keenan and Meenan (2014) found that the average overlap between the tests in diagnosing reading disabilities was only 43 percent and inconsistencies in scores were just as apparent in the top performers. Furthermore, the authors found that working memory was more important for tests with short texts (WJPC and PIAT) than for tests with longer texts (GORT-3 and QRI-3). Thus, format differences among the tests created differences in the types of skills that they assessed, reducing the correlations among scores.


The authors reiterated the suggestion that evaluators use more than one test to assess reading comprehension. They also noted that it is important to evaluate component skills such as listening comprehension, vocabulary, and working memory to identify the source of deficits.

DIVERSITY AND CULTURAL ISSUES

Because the results of achievement tests play a central role in educational decision-making and may be associated with high stakes for examinees and schools alike, the validity of the inferences derived from these tests is of utmost importance. This is of particular concern with respect to culturally and ethnically diverse learners, as well as those who are non-native speakers of the language of instruction. While issues of cultural and ethnic bias in assessment have been better studied in relation to cognitive testing than to achievement testing, modest evidence suggests that such bias, as well as other related factors, including linguistic barriers to item comprehension, teacher expectancy effects, and stereotype threat, may negatively affect achievement test performance among members of cultural, ethnic, and linguistic minorities.

Ethnic or Cultural Bias

Ethnic or cultural bias in achievement testing may occur when characteristics of a test item unrelated to the achievement construct being assessed differentially affect the responses of members of different ethnic and cultural groups (Banks, 2006). For example, a test designed to evaluate examinees' English-language arts achievement might contain items whose distractors employ constructs common in a regional or ethnic dialect, potentially causing speakers of that dialect to respond inaccurately, irrespective of their mastery of the English-language arts construct assessed. To illustrate this concept, one culturally sensitive item derived from a widely used standardized assessment of English-language arts achievement asked examinees to identify the grammatically correct phrase among options; however, it contained a distractor that employed a construction commonly used in African American English (AAE) – a construction that would be perfectly grammatical to speakers of that sociolect (Banks, 2006). Not surprisingly, this item functioned differently for African American students than for matched Hispanic or white students, demonstrating an increased probability of an incorrect response among members of that population (Banks, 2006).

Other types of culturally sensitive item content may also promote differential functioning. In a 2006 study of ethnic and cultural bias in assessment, Banks performed simultaneous item bias test (SIBTEST) analyses on a large set of fifth-grade reading cluster data derived from the Terra Nova Test (CTB/McGraw-Hill, 1999). The results of these analyses revealed that, while matched Black, Hispanic, and white examinees did not differ significantly in their overall cluster scores, Black examinees were disproportionately likely to respond incorrectly to items whose distractors contained culturally sensitive information. In a follow-up study using the same dataset, Banks (2012) found evidence to indicate that inferential reading items were more susceptible to cultural bias than were literal items, suggesting that such items are more likely to draw on culturally bound knowledge, potentially disadvantaging students whose cultural experiences are not well-aligned with those reflected in the test items.

Linguistic Impediments

Linguistic impediments to valid assessment of achievement are another important concern, especially when evaluating English-language learners (Martiniello, 2009). Abedi and colleagues (Abedi, 2002; Abedi et al., 2001; Abedi & Leon, 1999; Abedi, Leon, & Mirocha, 2003) have performed numerous studies of extant testing data derived from major standardized tests of achievement and have consistently found that English-language proficiency is positively correlated to standardized test performance across academic domains, inclusive of mathematics and science. Furthermore, the relationship between test performance and linguistic competence does not appear to be a simple function of differential access to content knowledge based on language mastery. In their research, Abedi and colleagues found that the greater the linguistic complexity of the test item, even among items for which language was presumed to be irrelevant to the achievement construct, the greater the difference in performance tended to be between examinees classified as English-language learners and those who were not. In fact, in one study of extant achievement testing data from several locations in the United States, Abedi (2002) revealed that English-language learners and non–English-language learners had measurable differences in overall math performance, whereas performance on the computation subtest was nearly identical between the groups. These findings suggest that English-language proficiency was an important moderator of achievement test performance independent of construct mastery, even in mathematics, a subject for which linguistic competence should be of minimal importance.

Testing Accommodations

Testing accommodations are often discussed as a means by which to improve the access of English-language learners to assessment content and, by consequence, promote test performance more reflective of their knowledge of the relevant achievement construct. Unfortunately, research has not generally borne out the efficacy of the most commonly used accommodations in effecting improvements in achievement test performance among English-language learners.


In a meta-analysis of the extant literature, Kieffer and colleagues (2009) found that only one (providing dictionaries or glossaries) of the seven commonly used testing accommodations studied had a statistically significant positive effect on achievement test performance among English-language learners. (The accommodations studied included extra time, dual-language booklets, dual-language questions, Spanish-language tests, simplified English tests, bilingual dictionaries or glossaries, and English dictionaries or glossaries.) Their apparent lack of efficacy notwithstanding, none of the common accommodations were found to threaten the validity of the inferences produced by the assessments.

Teacher Expectancy Effects and Stereotype Threats

Factors independent of the tests themselves, including teacher expectancy effects and stereotype threat, may also affect the performance of ethnically and culturally diverse examinees on standardized evaluations of achievement and these factors merit consideration when examining the results of achievement tests in academic decision-making. Expectancy effects can be characterized as the results of a self-fulfilling prophecy: Biased perceptions lead to biased behavior and the effects of the biased behavior further reinforce the initial biased perceptions (Babad, Inbar, & Rosenthal, 1982). In the case of teacher–student relationships, expectancy effects may result from biases on the part of teachers, which may then color those teachers' interactions with their students, ultimately moderating student achievement outcomes (Hinnant, O'Brien, & Ghazarian, 2009). Teachers' biases may also be the result of implicit or explicit prejudice, leading to subtle behavioral changes that may negatively affect students' achievement. As an example, a multilevel analysis of the effects of prejudices on students' achievement in more than forty classrooms revealed that teachers' implicit prejudices explained part of the ethnic achievement gap among classrooms via teacher expectancy effects (Van den Bergh et al., 2010).

Stereotype threat is another form of self-fulfilling prophecy relevant to achievement assessment. Stereotype threat may occur when individuals of a negatively stereotyped group risk confirming the stereotype, not because of any inherent or acquired trait or quality but because of the effects of negative expectancies or anxiety produced by the stereotype itself (Steele & Aronson, 1995). As an example, Spencer, Steele, and Quinn (1999) found that women participants significantly underperformed equally qualified men participants on a difficult test of mathematics achievement when primed with a suggestion that the test produced gender differences favoring men; however, when women participants were given a difficult test of mathematics achievement and primed with the suggestion that the test did not produce gender differences, their achievement was not significantly different from that of equally qualified men participants. Stereotype threat is relevant to the interpretation of the results of standardized assessments of achievement because it has been shown to disadvantage members of minority groups on high-stakes tests (Kellow & Jones, 2008). A meta-analysis of more than 100 experimental studies examining the effects of stereotype threat on the achievement and cognitive test performance of both women and members of racial and ethnic minorities revealed that stereotype cues had substantial negative impacts on the achievement and cognitive test performance of members of these groups, serving as an important moderator of outcomes (Nguyen & Ryan, 2008).

Limitations of Standardized Assessments of Achievement

Whereas the authors and publishers of standardized assessments of achievement strive to minimize bias and maximize the validity of their instruments, some bias in assessment is nevertheless unavoidable – and sometimes factors independent of the assessments themselves produce performance that is unreflective of a student's mastery of an achievement construct. Therefore, educators and clinical practitioners alike should bear the limitations of standardized tests of achievement in mind when making decisions concerning a student's education or placement. Consistent with the American Psychological Association's Code of Fair Testing Practices in Education (APA, 2004), educators and clinical practitioners should avoid basing any educationally meaningful decision on a single test or source of information. They should also consider the limitations of the tests used, including any potential sources of bias or threats to validity. Use of multiple sources of data provides the best overall picture of an examinee and helps reduce problems caused by threats to validity such as examiner error and examinee effort or feigning. Examiners should evaluate the totality of the information required for making appropriate educational inferences while considering the rationale, procedures, and evidence for performance standards or cut scores. Finally, they should use appropriate testing practices for the student, particularly in the case of students with limited linguistic proficiency or those with identified disabilities; however, any deviation from standardization should be described and considered with respect to the validity of the inferences made based on the assessment.

INTERPRETIVE OR PRACTICAL RECOMMENDATIONS

As has been discussed in this chapter, the purposes of achievement testing are varied. In some cases, testing is done to determine which students are struggling, so interventions can be provided in a timely fashion; in other cases, evaluation is more in-depth and focused on determining why a particular student is struggling and informing solutions that will address and help resolve the specific referral question(s). This type of clinical evaluation requires specific training and expertise, and achievement testing in this context is often performed with concurrent assessment of cognitive abilities.


Although it is relatively easy to administer and score a test properly, it is far more difficult to interpret the data produced by the test and draw meaningful conclusions.

In order to interpret the results of achievement tests validly, an evaluator must have an understanding of the deep and reciprocal relationships between language abilities and achievement. Language abilities, including both oral and written, underlie achievement in all other areas, as instruction is typically delivered via written and oral modalities. Furthermore, achievement assessment tends to draw on oral and written language abilities, even in areas, such as mathematics, where linguistic abilities are less relevant to the achievement construct.

Across domains, achievement tests draw on skills and cognitive factors that may not appear immediately relevant to the achievement construct. For example, passage reading tasks place demands on a variety of skills and abilities, including phonological awareness, orthographic awareness, syntactic knowledge, memory, breadth and depth of vocabulary knowledge, and background knowledge. Even a timed test of math facts knowledge assesses more than simple arithmetic knowledge, as demands are placed on processing speed and the rapid interpretation of symbols. Therefore, when interpreting the results of achievement testing, an evaluator should consider a variety of underlying cognitive factors.

As part of the assessment process, an evaluator must also decide which tests to administer based on the referral question and the goals of the assessment. A poorly defined referral question can make this task far more complex and difficult than it needs to be (Kaufman, Raiford, & Coalson, 2016). Therefore, evaluators should endeavor to clarify an ambiguous referral question so that review of existing data and subsequent data collection clarifies the nature of the student's difficulties and appropriately informs the goals of the assessment process. In some instances, achievement testing may be sufficient to answer a referral question. In many others, an achievement test or tests would be only a part of a more comprehensive evaluation, which might also include various cognitive, oral language, and/or behavioral measures.

Once appropriate assessments have been selected, the evaluator should refer to the examiner's manual of the assessment they are using to decide if standard accommodations and/or modifications are appropriate for a student during the assessment process (e.g., large print for students with a visual impairment). While educators should use appropriate accommodations for students with disabilities in classroom assessments, accommodations and modifications on standardized individual assessments are often not appropriate. While legal definitions and usage of assessment vary from state to state, appropriate assessment accommodations and modifications are an integral part of a protocol for appropriately assessing those whose disabilities affect their ability to take a test. Thurlow, Lazarus, and Christensen (2013) described assessment accommodations as "an essential part of the validity argument for assessments; to obtain valid results for students with disabilities, it is necessary to provide accommodations that help them show what they know and can do without interference from the barriers that their disabilities pose, as long as the accommodations do not change what the assessment is intended to measure" (p. 103).

The challenge for practitioners is determining whether accommodations are appropriate for an individual on a particular assessment. For example, for classroom achievement testing, a student with attention-deficit/hyperactivity disorder may benefit from taking the test individually, away from distractions, and with frequent breaks. These accommodations do not alter the content of the assessment but provide the student with an optimal testing environment. Another example of an appropriate accommodation would be the use of large print on a CBM reading probe for a student with a visual impairment. This accommodation would allow the student to access the testing material without changing the nature of the assessment. On the other hand, an accommodation of extended time on a standardized fluency assessment would likely not be appropriate for any student, regardless of disability, because it would invalidate the assessment by failing to provide an accurate measure of fluency or the student's reading accuracy and speed. Practitioners must thoroughly understand the assessment protocols they use, the purposes of the assessment, and the needs of the student in order to determine which accommodations and modifications may be appropriate and when to use them in an evaluation. Use of accommodations not specified in examiners' manuals will limit the validity of normative comparisons and should only be used after tests are administered in a standardized way to "test the limits." Testing the limits allows the evaluator to test the conditions under which the student's more optimal performance is achieved (Sattler, 2008). Sattler emphasizes that testing the limits may be used after scores have been obtained from a complete standardized administration of an assessment, so as not to provide any cues that may assist a student on any items. Further, he recommends that testing the limits may include "providing additional cues or aids," "changing the stimulus modality" (e.g., question format), or "eliminating time limits" (pp. 206–207). Sattler provides additional suggested procedures for testing the limits.

Within the evaluation process, the results of achievement testing often play a key role in helping to determine whether or not a student may be eligible for special services or which accommodations a student may need in an academic setting. In school settings, these types of determinations are made by a multidisciplinary team of professionals who consider data from a variety of sources. If a student qualifies for special education, they are provided with an Individual Education Plan (IEP) that describes individualized goals based on identified student needs. As part of the IEP, the team would also consider whether the student is in need of specific accommodations, such as extended time on certain types of tests or the use of some kind of assistive technology when completing an assessment.
Downloaded from https://www.cambridge.org/core. University of Western Ontario, on 12 Dec 2019 at 18:50:56, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781108235433.013
ACHIEVEMENT ASSESSMENT 177

the use of some kind of assistive technology when complet- language background (CSE Technical Report 536). Los Angeles:
ing an assessment. National Center for Research on Evaluation, Standards, and
Although achievement test results play a central role in Student Testing.
the diagnosis of a disability, as with any type of clinical Abedi, J., & Leon, S. (1999). Impact of students’ language back-
ground on content-based performance: Analyses of extant data.
assessment, the evaluator should use the results to inform
Los Angeles: University of California, National Center for
practical recommendations and accommodations that can
Research on Evaluation, Standards, and Student Testing.
be implemented with fidelity in the current setting. Abedi, J., Leon, S., & Mirocha, J. (2003). Impact of student lan-
Cruickshank (1977) advised that “Diagnosis must take sec- guage background on content-based performance: Analyses of
ond place to instruction, and must be made a tool of extant data (CSE Technical Report 603). Los Angeles: National
instruction, not an end in itself” (p. 194). A careful analysis Center for Research on Evaluation, Standards, and Student
of the test results can help an evaluator develop a plan that Testing.
includes a description of the implementation and Adelman, H. S., Lauber, B. A., Nelson, P., & Smith, D. C. (1989).
monitoring of targeted interventions that are designed to Toward a procedure for minimizing and detecting false positive
increase student achievement. In order to write targeted diagnoses of learning disability. Journal of Learning Disabilities,
recommendations, an evaluator needs to have extensive 22, 234–244.
American Psychiatric Association. (2013). Desk reference to the
knowledge of effective academic interventions.
diagnostic criteria from DSM-5. Washington, DC: American
Cruickshank further explained that “A variety of programs
Psychiatric Publishing.
must be available for children who have a variety of needs” APA (American Psychological Association). (2004). Code of fair test-
(p. 194). Fortunately, numerous effective interventions ing practices in education. Washington, DC: Joint Committee on
exist and the National Association of School Testing Practices.
Psychologists includes in their domains of practice Babad, E. Y., Inbar, J., & Rosenthal, R. (1982). Pygmalion,
knowledge of data-based decision-making as well as appro- Galatea, and the Golem: Investigations of biased and unbiased
priateuse of evidence-based interventions (NASP, n.d.). teachers. Journal of Educational Psychology, 74, 459–474.
Thus, school psychology programs today are requiring Banks, K. (2006). A comprehensive framework for evaluating
students to take courses related to linking academic assess- hypotheses about cultural bias in educational testing. Applied
ment data to evidence-based interventions (Joseph, Measurement in Education, 19, 115–132.
Banks, K. (2012). Are inferential reading items more susceptible
Wargelin, & Ayoub, 2016).
to cultural bias than literal reading items? Applied Measurement
In addition, the evaluator must communicate with others
in Education, 25, 220–245.
regarding the nature of instructional interventions, as well Breaux, K. C. (2009). Wechsler individual achievement test:
as how, where, and when those interventions should occur. Technical manual (3rd ed.). San Antonio, TX: Pearson.
For one student, systematic instruction could be provided Breaux, K. C., Bray, M. A., Root, M. M., & Kaufman, A. S. (Eds.)
every day after school in a math lab; for another student, (2017). Special issue on studies of students’ errors in reading,
instruction should be delivered daily for forty-five minutes writing, math, and oral language. Journal of Psychoeducational
in a resource setting by a special education teacher; for still Assessment, 35. https://doi.org/10.1177/0734282916669656
another, the parents may decide to provide tutoring for Brooks, B. L., Holdnack, J. A., & Iverson, G. L. (2011). Advanced
their son or daughter at home. clinical interpretation of the WAIS-IV and WMS-IV: Prevalence
This chapter provided an introduction to commonly of low scores varies by level of intelligence and years of educa-
tion. Assessment, 18, 156–167.
used comprehensive achievement tests, single-subject
Brown, J. I., Fishco, V. V., & Hanna, G. (1993). Nelson-Denny
achievement tests, CBMs, technological advances,
reading test (forms G and H). Austin, TX: PRO-ED.
threats to validity of results including noncredible Coleman, C., Lindstrom, J., Nelson, J., Lindstrom, W., & Gregg, K.
reporting, and interpretive and practical recommenda- N. (2010). Passageless comprehension on the Nelson Denny
tions. Achievement testing is an integral component of Reading Test: Well above chance for university students.
assessment. When used and interpreted appropriately, Journal of Learning Disabilities, 43, 244–249.
the results can help document a student’s current aca- Connolly, A. (2007). Key Math-3 Diagnostic Assessment. Austin,
demic performance levels and can provide key informa- TX: Pearson.
tion for improving student outcomes by informing the Cruickshank, W. M. (1977). Least-restrictive placement:
need for specific academic interventions, instructional Administrative wishful thinking. Journal of Learning Disabilities,
or testing accommodations and/or modifications, and 10, 193–194.
CTB/McGraw-Hill. (1999). Teacher’s guide to Terra Nova: CTBS
appropriate educational services.
battery, survey, and plus editions, multiple assessments.
Monterey, CA: Author.
REFERENCES Davis, L. B., & Fuchs, L. S. (1995). “Will CBM help me learn?”:
Students’ perception of the benefits of curriculum-based mea-
Abedi, J. (2002). Standardized achievement tests and English lan- surement. Education and Treatment of Children, 18(1), 19–32.
guage learners: Psychometrics issues. Educational Assessment, Deeney, T. A., & Shim, M. K. (2016). Teachers’ and students’ views
8, 231–257. of reading fluency: Issues of consequential validity in adopting
Abedi, J., Hofstetter, C, Baker, E., & Lord, C. (2001). NAEP math one-minute reading fluency assessments. Assessment for
performance test accommodations: Interactions with student Effective Instruction, 41(2), 109–126.


DeRight, J., & Carone, D. A. (2015). Assessment of effort in children: A systematic review. Child Neuropsychology, 21, 1–24.
Ford, J. W., Missall, K. N., Hosp, J. L., & Kuhle, J. L. (2017). Examining oral passage reading rate across three curriculum-based measurement tools for predicting grade-level proficiency. School Psychology Review, 46, 363–378.
Fuchs, L. S. (2016). Curriculum based measurement as the emerging alternative: Three decades later. Learning Disabilities Research and Practice, 32, 5–7.
Fuchs, L. S., & Fuchs, D. (2002). Curriculum-based measurement: Describing competence, enhancing outcomes, evaluating treatment effects, and identifying treatment nonresponders. Peabody Journal of Education, 77(2), 64–84.
Fuchs, L. S., Fuchs, D., Hosp, M., & Jenkins, J. R. (2001). Oral reading fluency as an indicator of reading competence: A theoretical, empirical, and historical analysis. Scientific Studies of Reading, 5, 239–256.
Gardner, E. (1989). Five common misuses of tests. ERIC Digest No. 108. Washington, DC: ERIC Clearinghouse on Tests, Measurement, and Evaluation.
Greiff, S., Wüstenberg, S., Holt, D. V., Goldhammer, F., & Funke, J. (2013). Computer-based assessment of complex problem solving: Concept, implementation, and application. Educational Technology Research and Development, 61, 407–421.
Hammill, D. D., & Larsen, S. C. (2009). Test of written language (4th ed.). Austin, TX: PRO-ED.
Harrison, A. G., & Edwards, M. J. (2010). Symptom exaggeration in post-secondary students: Preliminary base rates in a Canadian sample. Applied Neuropsychology, 17, 135–143.
Harrison, A. G., Edwards, M. J., Armstrong, I., & Parker, K. C. H. (2010). An investigation of methods to detect feigned reading disabilities. Archives of Clinical Neuropsychology, 25, 89–98.
Harrison, A. G., Edwards, M. J., & Parker, K. C. H. (2008). Identifying students feigning dyslexia: Preliminary findings and strategies for detection. Dyslexia, 14, 228–246.
Hasbrouck, J., & Tindal, G. (2017). An update to compiled ORF norms (Technical Report No. 1702). Eugene, OR: Behavioral Research and Teaching, University of Oregon.
Hinnant, J. B., O'Brien, M., & Ghazarian, S. R. (2009). The longitudinal relations of teacher expectations to achievement in the early school years. Journal of Educational Psychology, 101, 662–670.
Hosp, M. K., Hosp, J. L., & Howell, K. W. (2016). The ABCs of CBM: A practical guide to curriculum-based measurement (2nd ed.). New York: Guilford Press.
Hosp, J. L., & Suchey, N. (2014). Reading assessment: Reading fluency, reading fluently, and comprehension – Commentary on the special topic. School Psychology Review, 43, 59–68.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20, 16–25.
Jones, E. D., Southern, W. T., & Brigham, F. J. (1998). Curriculum-based assessment: Testing what is taught and teaching what is tested. Intervention in School and Clinic, 33, 239–249.
Joseph, L. M., Wargelin, L., & Ayoub, S. (2016). Preparing school psychologists to effectively provide services to students with dyslexia. Perspectives on Language and Literacy, 42(4), 15–23.
Kaufman, A. S., & Kaufman, N. L. (2014). Kaufman test of educational achievement (3rd ed.). San Antonio, TX: Pearson.
Kaufman, A. S., Kaufman, N. L., & Breaux, K. C. (2014). Technical and interpretive manual. Kaufman Test of Educational Achievement – Third Edition (KTEA-3) Comprehensive Form. Bloomington, MN: NCS Pearson.
Kaufman, A. S., Raiford, S. E., & Coalson, D. L. (2016). Intelligent testing with the WISC-V. Hoboken, NJ: John Wiley & Sons.
Keenan, J. M., & Betjemann, R. S. (2006). Comprehending the Gray Oral Reading Test without reading it: Why comprehension tests should not include passage-independent items. Scientific Studies of Reading, 10, 363–380.
Keenan, J. M., Betjemann, R. S., & Olson, R. K. (2008). Reading comprehension tests vary in the skills that they assess: Differential dependence on decoding and oral comprehension. Scientific Studies of Reading, 12, 281–300.
Keenan, J. M., & Meenan, C. E. (2014). Test differences in diagnosing reading comprehension deficits. Journal of Learning Disabilities, 47, 125–135.
Kellow, J. T., & Jones, B. D. (2008). The effects of stereotypes on the achievement gap: Reexamining the academic performance of African American high school students. Journal of Black Psychology, 34(1), 94–120.
Kendeou, P., Papadopoulos, T. C., & Spanoudis, G. (2012). Processing demands of reading comprehension tests in young readers. Learning and Instruction, 22, 354–367.
Kieffer, M. J., Lesaux, N. K., Rivera, M., & Francis, D. J. (2009). Accommodations for English language learners taking large-scale assessments: A meta-analysis on effectiveness and validity. Review of Educational Research, 79, 1168–1201.
Kirkwood, M. W., Kirk, J. W., Blaha, R. Z., & Wilson, P. (2010). Noncredible effort during pediatric neuropsychological exam: A case series and literature review. Child Neuropsychology, 16, 604–618.
Leslie, L., & Caldwell, J. (2001). Qualitative reading inventory–3. New York: Addison Wesley Longman.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29, 4–16.
Lu, P. H., & Boone, K. B. (2002). Suspect cognitive symptoms in a 9-year old child: Malingering by proxy? The Clinical Neuropsychologist, 16, 90–96.
Markwardt, F. C. (1997). Peabody individual achievement test – revised (normative update). Bloomington, MN: Pearson Assessments.
Martiniello, M. (2009). Linguistic complexity, schematic representations, and differential item functioning for English language learners in math tests. Educational Assessment, 14, 160–179.
McGrew, K. S., LaForte, E. M., & Schrank, F. A. (2014). Woodcock-Johnson IV: Technical manual [CD]. Itasca, IL: Houghton Mifflin Harcourt.
Molnar, M. (2017). Market is booming for digital formative assessments. Education Week, May 24. http://edweek.org/ew/articles/2017/05/24/market-is-booming-for-digital-formative-assessments.html
Monroe, M. (1932). Children who cannot read. Chicago, IL: University of Chicago Press.
NASP (National Association of School Psychologists). (n.d.). NASP practice model: 10 domains. www.nasponline.org/standards-and-certification/nasp-practice-model/nasp-practice-model-implementation-guide/section-i-nasp-practice-model-overview/nasp-practice-model-10-domains
NASP (National Association of School Psychologists). (2016). School psychologists' involvement in assessment. Bethesda, MD: Author.
Nguyen, H. H. D., & Ryan, A. M. (2008). Does stereotype threat affect test performance of minorities and women? A meta-analysis of experimental evidence. Journal of Applied Psychology, 93, 1314–1334.


Rome, H. P., Swenson, W. M., Mataya, P., McCarthy, C. E., Pearson, J. S., Keating, F. R., & Hathaway, S. R. (1962). Symposium on automation techniques in personality assessment. Proceedings of the Staff Meetings of the Mayo Clinic, 37, 61–82.
Sattler, J. M. (2008). Assessment of children: Cognitive foundations. CA: Author.
Schneider, J. W., Lichtenberger, E. O., Mather, N., & Kaufman, N. L. (2018). Essentials of assessment report writing. Hoboken, NJ: John Wiley & Sons.
Schrank, F. A., Mather, N., & McGrew, K. S. (2014a). Woodcock-Johnson IV tests of achievement. Itasca, IL: Houghton Mifflin Harcourt.
Schrank, F. A., Mather, N., & McGrew, K. S. (2014b). Woodcock-Johnson IV tests of oral language. Itasca, IL: Houghton Mifflin Harcourt.
Schrank, F. A., McGrew, K. S., & Mather, N. (2014a). Woodcock-Johnson IV. Itasca, IL: Houghton Mifflin Harcourt.
Schrank, F. A., McGrew, K. S., & Mather, N. (2014b). Woodcock-Johnson IV tests of cognitive abilities. Itasca, IL: Houghton Mifflin Harcourt.
Shinn, M. R., Good, R. H., Knutson, N., & Tilly, D. W. (1992). Curriculum-based measurement of oral reading fluency: A confirmatory analysis of its relation to reading. School Psychology Review, 21(5), 459–479.
Shute, V. J., Leighton, J. P., Jang, E. E., & Chu, M. W. (2016). Advances in the science of assessment. Educational Assessment, 21(1), 34–59.
Shute, V. J., & Rahimi, S. (2017). Review of computer-based assessment for learning in elementary and secondary education. Journal of Computer Assisted Learning, 33(1), 1–19.
Singleton, C. H. (2001). Computer-based assessment in education. Educational and Child Psychology, 18(3), 58–74.
Spencer, S. J., Steele, C. M., & Quinn, D. M. (1999). Stereotype threat and women's math performance. Journal of Experimental Social Psychology, 35, 4–28.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797–811.
Sullivan, B. K., May, K., & Galbally, L. (2007). Symptom exaggeration by college adults in Attention-Deficit Hyperactivity Disorder and Learning Disorder assessments. Applied Neuropsychology, 14, 189–207.
Thurlow, M., Lazarus, S., & Christensen, L. (2013). Accommodations for assessment. In J. Lloyd, T. Landrum, B. Cook, & M. Tankersley (Eds.), Research-based approaches for assessment (pp. 94–110). Upper Saddle River, NJ: Pearson.
Valencia, S. W., Smith, A. T., Reece, A. M., Li, M., Wixson, K. K., & Newman, H. (2010). Oral reading fluency assessment: Issues of construct, criterion, and consequential validity. Reading Research Quarterly, 45, 270–291.
Van den Bergh, L., Denessen, E., Hornstra, L., Voeten, M., & Holland, R. W. (2010). The implicit prejudiced attitudes of teachers: Relations to teacher expectations and the ethnic achievement gap. American Educational Research Journal, 47, 497–527.
VanDerHeyden, A. M., Witt, J. C., & Gilbertson, D. (2007). A multi-year evaluation of the effects of a response to intervention (RTI) model on identification of children for special education. Journal of School Psychology, 45, 225–256. https://doi.org/10.1016/j.jsp.2006.11.004
Van Norman, E. R., Nelson, P. M., & Parker, D. C. (2018). A comparison of nonsense-word fluency and curriculum-based measurement of reading to measure response to phonics instruction. School Psychology Quarterly, 33, 573–581. https://doi.org/10.1037/spq0000237
Wechsler, D. (2009). Wechsler individual achievement test (3rd ed.). San Antonio, TX: Psychological Corporation.
Wechsler, D. (2014). Wechsler intelligence scale for children (5th ed.). San Antonio, TX: Psychological Corporation.
Wei, H., & Lin, J. (2015). Using out-of-level items in computerized adaptive testing. International Journal of Testing, 15, 50–70.
Wiederholt, J. L., & Bryant, B. R. (2001). Gray oral reading test (4th ed.). Austin, TX: PRO-ED.
Wiederholt, J. L., & Bryant, B. R. (2012). Gray oral reading test (5th ed.). Austin, TX: PRO-ED.
Willis, J. (2015). The historical role and best practice in identifying Specific Learning Disabilities. Paper presented at the New York Association of School Psychologists annual conference. Verona, NY, October.
Woodcock, R. W. (2011). Woodcock reading mastery test (3rd ed.). San Antonio, TX: Pearson.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock–Johnson III tests of achievement. Itasca, IL: Riverside.
