
Technical Manual for the

NWEA Measures of Academic Progress


and Achievement Level Tests

September 2003
Copyright © 2003 Northwest Evaluation Association

All rights reserved. No part of this manual may be reproduced or utilized in any form or by
any means, electronic or mechanical, including photocopying, recording, or by any
information storage and retrieval system, without permission in writing from NWEA.

NWEA
12909 SW 68th Parkway, Suite 400
Portland, Oregon 97223

Phone: (503) 624-1951


Fax: (503) 639-7873

Website: www.nwea.org

General Info: munse@nwea.org



Table of Contents

Foreword ............................................................................................................. 1

Acknowledgments ............................................................................................... 2

Introduction......................................................................................................... 3
Other Resources ...................................................................................................................................... 3
Organization of This Manual.................................................................................................................. 4
Covalent Assessments............................................................................................................................. 5

Educational Assessment ...................................................................................... 6


The Principles That Guide NWEA Assessments................................................................................... 6
MAP and ALT in a Comprehensive Assessment Plan .......................................................................... 9

Testing for a Purpose: NWEA Assessment Design .............................................. 9


Test Structure .......................................................................................................................................... 9
Achievement Level Tests (ALT)..................................................................................................... 10
Measures of Academic Progress (MAP)......................................................................................... 12
Testing Modality ................................................................................................................................... 15

The Item Banks and Measurement Scales .......................................................... 17


The Measurement Model ...................................................................................................................... 17
Scale Development and Item Calibration............................................................................................. 19
Phase I: Developing the Measurement Scales ............................................................................... 19
Phase II: Maintaining the Measurement Scales ............................................................................. 20
Item Development................................................................................................................................. 21
Item Writing and Editing ................................................................................................................. 21
Item Bias Review ............................................................................................................................. 22
Content Category Assignment......................................................................................................... 23
Field Testing..................................................................................................................................... 23
Item Analyses and Calibration......................................................................................................... 24
Periodic Review of Item Performance ................................................................................................. 27

The Assessment Process .................................................................................... 28


Test Development ................................................................................................................................. 29
Test Design....................................................................................................................................... 29
ALT Test Specifications ............................................................................................................. 29
MAP Test Specifications ............................................................................................................ 31
Content Definition............................................................................................................................ 32
Item Selection................................................................................................................................... 34
Test Construction ............................................................................................................................. 35

Test Administration............................................................................................................................... 36

The Testing Environment ................................................................................................................ 37
Test Scoring and Score Validation ....................................................................................................... 40
Achievement Level Tests (ALT)..................................................................................................... 40
Measures of Academic Progress (MAP)......................................................................................... 41
Report Generation ................................................................................................................................. 43
Achievement Level Tests (ALT)..................................................................................................... 43
Measures of Academic Progress (MAP)......................................................................................... 43
Results Interpretation ............................................................................................................................ 44

Customizing Assessments ..................................................................................46


Localizing NWEA Assessments........................................................................................................... 46
Making Global Use of Localized Assessments.................................................................................... 47

Operational Characteristics of the Assessments ..................................................49


Validity .................................................................................................................................................. 51
Reliability of Scores.............................................................................................................................. 54
Precision of Scores................................................................................................................................ 56
Norm-Referenced Precision............................................................................................................. 58
Curriculum-Referenced Precision ................................................................................................... 58
A Final Note on MAP and ALT Operating Characteristics ................................................................ 59

Appendix ...........................................................................................................60
Initial Scale Development..................................................................................................................... 60
Item Bank Overview ............................................................................................................................. 60

Glossary.............................................................................................................63

References .........................................................................................................64


Foreword
In the constant debate about the role of assessment in education, one question is central:

How do we efficiently and accurately measure how much
students have achieved and how quickly they are learning?

NWEA was founded in 1976 by a group of school districts looking for practical answers to
this question. Since then, NWEA has developed assessment tools that enable educational
agencies to measure the achievement of virtually all their students with a great deal of
accuracy in a short period of time. Two of these tools are the Measures of Academic Progress
(MAP) and the Achievement Level Tests (ALT).

Both MAP and ALT are designed to deliver assessments matched to the capabilities of each
individual student. In ALT, the assessment for a student is one of a set of pre-designed
paper-and-pencil tests that vary in difficulty. The test that the student takes is based on past
performance or a short locator test. In MAP, the assessment for a student is dynamically
developed and administered on a computer. The test adjusts to match the performance of
the student after each item is given.

This manual details the technical measurement characteristics of the ALT and MAP
assessments. These include item development, test development, the nature of the
measurement scales, and the appropriate use of scores from the assessments. Since many
different tests are developed within MAP and ALT (to match local content and curriculum
standards), this manual emphasizes those elements common to all of the tests developed in
the two systems.

Since MAP and ALT deliver a variety of tests to a particular group of students, their
measurement characteristics are somewhat different than those of a traditional single-form
test. Throughout this manual, we detail the aspects of MAP and ALT that make them
unique. We also highlight the areas in which traditional statistical test evaluation procedures
need to be enhanced to tell the whole story of these unique high-quality testing systems.

As with any technical manual, this one will not answer all questions. In fact, it is likely to
spur as many new questions as it answers. For additional information, please consult the
NWEA web site at www.nwea.org.

Allan Olson
G. Gage Kingsbury, PhD.

September 1, 2003

Acknowledgments
Organizations do not write documents; people do. Therefore it is imperative to thank the
individuals who created this document and nurtured it from blank pages to its completed
form. In this manual, many ideas have been drawn from nationally respected testing
guidelines such as the APA/AERA/NCME standards, the Association of Test Publishers’
guidelines, and the American Council of Education guidelines. The unique nature of the
Measures of Academic Progress (MAP) and Achievement Level Tests (ALT) also required the
development of additional processes for evaluating the quality of tests. This development
depended on the contributions of the individuals mentioned below.

Many of the original ideas in this document came from Brian Bontempo of Mountain
Measurement. Brian shaped the document and will hopefully see his vision in the final
manual.

The manual has also enjoyed the thoughts, contributions, and review of the NWEA
Research Team, including Carl Hauser, Ron Houser, and John Cronin. They provided
much of the information used in the manual and consistently worked to maintain the
integrity and accuracy of all information presented.

As with most NWEA documents, much of the inspiration and many of the best ideas and
suggestions have come from individuals in our member agencies and board of directors. If
this document is useful, it is due to these contributions.


Introduction
This manual provides decision makers and testing professionals the technical information
necessary to understand the theoretical framework and design of the NWEA Measures of
Academic Progress (MAP) and Achievement Level Tests (ALT). It provides a technical
description of the psychometric characteristics and the educational measurement systems that
are embedded in the MAP and ALT assessments. It is not designed to be a technical
description or administration guide for the computer and information systems that develop
and deploy the MAP and ALT assessments. Please refer to the appropriate administration
documentation for this information.

This document is written for three primary audiences:

 Educators who are new to NWEA will find the document useful in evaluating the
appropriateness of NWEA assessments for their setting.

 Educators currently using NWEA assessments will find the document useful in
understanding and interpreting the results of the assessments.

 Measurement professionals will find information within this document to help
evaluate the quality of NWEA assessments.

Other Resources
This manual is one element of a series of documents that pertain to the MAP and ALT
assessments. To obtain additional information, consult the MAP administration training
materials, The ALT Administration Guide, and the NWEA RIT Scale Norms document. These
documents provide greater detail about how to administer the assessments and interpret the
results. Taken together, they provide a comprehensive description of the MAP and ALT
assessment systems.

Parents, teachers, test proctors, curriculum coordinators, assessment coordinators, and
administrators should find the answers to many of their questions about the MAP and ALT
assessments within the series of documents. Some readers may wish to seek additional
information on various topics such as NWEA, educational testing, or psychometric theory.
References to additional sources of information are provided throughout this manual.


Organization of This Manual


In an effort to provide useful information to all levels of practitioners, this document
intertwines both basic and advanced information into each section:

 The Introduction includes this description of the manual and an introduction to the
use of families of assessments that are bound together by shared measurement
characteristics. These covalent assessments underlie NWEA assessments and much
of modern testing practice.

 Educational Assessment describes the characteristics that a good educational
assessment tool must have. It also discusses the psychometric principles considered
in the development of NWEA assessments.

 Testing for a Purpose: NWEA Assessment Design introduces the measurement
principles used in the design of the assessments. It also describes the design of ALT
and MAP tests.

 The Item Banks and Measurement Scales details the development of the item banks
and measurement scales underlying the MAP and ALT assessments. It includes
discussions of the item development process and the field testing process. It
concludes with the theory and application of psychometrics to integrate the items
into the NWEA measurement scales. This section also includes information
concerning Item Response Theory (IRT), the psychometric theory used in MAP and
ALT.

 The Assessment Process describes the basic steps of the assessment development
process and explains how these steps are conducted in a psychometrically sound
manner. This section starts with the test development process and moves through
test administration, scoring, reporting, and results interpretation.

 The section Customizing Assessments identifies the points in the development process
at which an educational agency can customize the assessments to align with their
local curriculum and match their assessment objectives. Since the MAP and ALT
assessments are designed to be both locally appropriate and globally comparable,
there is a science behind the localization process followed for all MAP and ALT
development. This section covers that science and explains the factors that agencies
must consider when localizing their assessments. This section also includes an
explanation of how these localized assessments provide the capacity for global
comparison.

 Operational Characteristics of the Assessments details the psychometric characteristics
of the NWEA assessments. This section includes evidence from a variety of studies
concerning the validity, reliability, and precision of the scores from MAP and ALT
tests.


Covalent Assessments
One of the unique characteristics of the ALT and MAP assessments is that they are classes of
assessments that share common features. These classes are termed covalent assessments.
Covalent assessments are defined as classes of assessments for which test scores are
interchangeable due to common item pools, measurement scales, and design characteristics.
NWEA has developed over 2000 MAP tests and 900 ALT series that share these common
characteristics. While the content specifications of the tests differ somewhat from one
educational agency to another, the measurement characteristics are interchangeable. The
scores from one district can be readily compared to the scores in another district.
One interesting aspect of the development of technical specifications for a class of
assessments is that the outcomes tend to be more robust than those for a single test. The
information in this manual is applicable to all of the covalent assessments. It is also
applicable to tests that might be developed following a change in content standards or
curricular focus. As a result, educational agencies using covalent tests never lose continuity of
data as they improve their curriculum and instruction.
A commonly asked question concerning ALT and MAP tests is “How is it possible for the
scores on these assessments to be interchangeable, since students take different questions in
different school districts and even in the same classroom?” To answer this question it is
useful to describe some of the characteristics that are common to all NWEA assessments.
All of the ALT and MAP tests share these characteristics:
 Common item pools in which all items are pretested before use.
 Common cross-grade measurement scales for all items in a content area.
 Common test design including numbers of items and distribution of difficulties.
 Common psychometric characteristics or psychometric characteristics that differ by
design.
These shared characteristics allow scores from different covalent tests to be directly
comparable. This holds for students taking ALT and MAP tests across grades, across schools,
across districts, and across test forms. A substantial amount of evidence concerning the
stability of scores across different tests is included in the section Operational Characteristics of
the Assessments.
This notion of interchangeable test items and interchangeable test scores is central to all the
research concerning adaptive testing that has taken place over the past 30 years. The results
of this research are now commonly accepted testing practice. Adaptive tests are now used in
high-stakes testing settings such as medical licensure and certification, college entry, and
armed services placement examinations. For more information about adaptive tests and the
interchangeable use of different sets of questions in a variety of practical settings, see
Drasgow and Olson-Buchanan (1999).


Educational Assessment
In a traditional assessment, all of the students in a single grade are given a single test form.
This test is commonly designed with a wide range of item difficulties so that it is able to
measure the achievement of most students to some extent. There are two common problems
with this type of test: non-optimal psychological characteristics and low measurement
accuracy.

Psychological characteristics. A traditional wide-range test includes items that span a wide
difficulty range. Most students encounter a number of items that are too easy for them and a
number of items that are too difficult for them. These items tend to either bore the students
or frustrate them. As a result, students may not be measured well by a wide-range test
because of their psychological reactions to the test.

Measurement characteristics. Because of the design of a wide-range test, a student
encounters only a portion of the test that is challenging without being frustrated. This is the
portion of the test that provides the most information about the student’s achievement level.
Many of the items in a wide-range test provide less information than they would if they were
correctly targeted near the performance level of the student. For example, a student who is
achieving at a level lower than his classmates may only see a few items in a wide-range test
that he has the knowledge to answer. The answers to the other questions provide little
information about the student’s capabilities (except to indicate that the test was too
difficult).

To improve the psychological and measurement characteristics of a test, we can design a test
that increases the percentage of students who are challenged without being frustrated. This is
the underlying motivation behind the development of the NWEA MAP and ALT tests. By
improving the fit of the test to the student, we can obtain more information about the
student without causing boredom or frustration.

The Principles That Guide NWEA Assessments


Educators need detailed information about individual student achievement. They use this
information to place students in appropriate classes and to form instructional groups within
a class. They also use it to help parents and other stakeholders understand the strengths of a
particular student’s educational development as well as the areas that need further
instruction. When constructed and used properly, educational assessments are efficient tools
that yield consistent, precise information concerning student achievement and growth.

Before discussing how NWEA assessments achieve these aims efficiently, it is useful to
discuss why reliability, validity, precision, and efficiency are important. To do this, it is
helpful to consider how a ruler is used to measure the length of a piece of wood. The


principles and qualities of measurement in the physical world apply directly to measurement
in education.

Reliability is a primary requirement of measurement. Reliability can be defined as the
consistency of measures obtained from a measurement tool. For example, when the length of
a small, straight piece of wood is measured with a ruler, one expects that additional
measurements of the same piece of wood with the same ruler will have similar results. With
repeated observations, you can determine that the ruler measurements are consistent and
reliable.

Educational assessments that yield consistent results with repeated testing or within different
portions of the same test are considered reliable. Those that provide inconsistent results are
unreliable and of less use to educators. Therefore, one of the primary goals in creating an
educational assessment is to create an assessment that has reliable scores.

Validity is another quality that is important in designing and evaluating assessments. Validity
can be defined as the degree to which an educational assessment measures what it purports to
measure. This attribute of a test and its scores is fundamental to the quality of the
measurement.

While many aspects of validity have been discussed in the past, the two most critical include
the degree to which:

 An assessment measures what is expected to be taught in the classrooms in which it
is used.

 The scores from an assessment in a content area correspond to other indicators of
student achievement in that same content area.

While the relationship of MAP and ALT scores to other indicators of student achievement
is discussed in the section Operational Characteristics of the Assessments, the relationship of
the assessments to what is expected to be taught requires further discussion here.

The extent to which a measurement is an accurate, complete indicator of the quality to be
measured defines its validity as an expression of the content being assessed. In educational
assessment, this is highly dependent upon the local curriculum. For example, a generic
mathematics assessment may contain content that differs greatly from that which is expected
to be taught in the local classroom. On the other hand, a mathematics assessment designed
to include content that is included in the local curriculum fits the content needs of the
educational agency more completely. The latter, more valid assessment is clearly more
valuable to educators in understanding student achievement.

Precision is a third important aspect of any assessment tool. Precision is the level of detail
that the assessment can render. A ruler capable of measuring length to the nearest 1/16 of an
inch is more precise than a ruler that measures to the nearest inch. Think of a ruler that has
tick marks every 1/16 of an inch. The actual length of a measured piece of wood is
somewhere between the two nearest tick marks of the ruler, for example between 8 5/16
and 8 3/8 inches. With this ruler, the amount of error in any single measurement is less
than 1/16 of an inch if the person taking the measurements is consistent.


In educational assessment, precision or error can be thought of in the same manner. It is
common to refer to the amount of error to be expected in a test score as the Standard Error
of Measurement (SEM). A more precise instrument (an instrument with a lower SEM)
allows educators to see the difference between two similar students better than a less precise
instrument. It is for this reason that measurement precision is valuable to educators, and
therefore one of the primary goals in designing an educational assessment.

Precision is affected by the amount of information obtained from each test question and the
number of questions on a test. Test makers can optimize the precision available for a test of a
given length by developing a test that yields a high amount of information from each item.
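To make the link between item information and precision concrete, the short Python sketch below uses the standard item information function of the Rasch model that underlies the NWEA scales (described later in The Item Banks and Measurement Scales): each item contributes P(1 - P) information at a given achievement level, and the standard error of measurement is the reciprocal of the square root of the total information. The function names and difficulty values are illustrative only, not drawn from this manual.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def test_sem(theta, item_difficulties):
    """SEM at achievement level theta: one over the square root of the total
    item information, where each Rasch item contributes p * (1 - p)."""
    probs = [rasch_p(theta, b) for b in item_difficulties]
    info = sum(p * (1.0 - p) for p in probs)
    return 1.0 / math.sqrt(info)

# Two 11-item tests of equal length for a student at theta = 0 (logits):
targeted = [0.2 * k for k in range(-5, 6)]    # difficulties within one logit of the student
wide_range = [0.8 * k for k in range(-5, 6)]  # difficulties spread across eight logits
print(round(test_sem(0.0, targeted), 2))      # smaller SEM: each item yields more information
print(round(test_sem(0.0, wide_range), 2))    # larger SEM for the same number of items
```

Under these hypothetical difficulties, the targeted test produces a noticeably smaller SEM even though both tests contain the same number of items, which is the point developed in the sections that follow.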

Efficiency is defined as the degree to which a measurement tool can be used to perform the
measurement needed in the shortest possible period of time. If you use a ruler to measure the
width of a piece of paper, it is a very efficient tool. If you try to use the ruler to measure the
distance from Cleveland to Canton, it is an inefficient tool for the task.

The number of questions and the types of questions on a test directly relate to the time it
takes students to complete the test. The precision of an assessment is also directly related to
the number of questions on the assessment. The more questions there are on an assessment,
the more precise the results are, but the more administration time the assessment requires.

The degree to which the difficulty of questions on a test is matched to a student’s
achievement level also directly affects the precision of the assessment. The better the match,
the more efficient a test is for any required level of precision.

A test that provides precise results with a relatively small number of items is considered
efficient. Factors that affect efficiency include the quality of the questions on a test, and the
targeting of the difficulty of the questions on the test to the achievement level of a particular
student.

With the increasing demands on schools to bring students to higher standards, it is
important to spend as much time teaching students as possible. At the same time, educators
need more information about students to help them achieve more. Since additional
assessment is time consuming, the need for student achievement information must be
balanced with the need to minimize the time that students spend in assessment situations. In
order to help educators meet these competing needs, assessment tools need to be as efficient
as possible.


MAP and ALT in a Comprehensive Assessment Plan


One approach to gathering student performance information is to include paper and pencil
tests such as ALT and computerized adaptive tests such as MAP in a comprehensive
assessment plan. These assessments incorporate a strong underlying measurement model and
a large bank of test questions to allow choices that fit the local curriculum. This allows the
MAP and ALT tests to meet or exceed any practical standards of reliability and content
validity. At the same time, MAP and ALT provide students with relatively short tests,
preserving classroom time for instruction. Incorporated within a comprehensive assessment
program, ALT and MAP can provide extremely useful information at the student, classroom,
school, district, and state levels.
Testing for a Purpose: NWEA Assessment Design
The NWEA assessments were designed using a small number of guiding principles. Each
principle relates to creating an assessment for a specific educational purpose. The principles
are:

 A test should be challenging for a student. It should not be frustrating or boring.

 A test should be an efficient use of student time. It should provide as much
information as possible for the time it takes to administer.

 A test should provide a reflection of a student’s achievement that is as accurate as
needed for the decisions to be made.

 A test should consist of content the student should have had an opportunity to
learn.

 A test should provide information about a student’s growth as well as the current
achievement level.

 A test should provide results to educators and other stakeholders as quickly as
possible while maintaining a high level of integrity in the reported results.

These principles are carried into the design of each NWEA assessment. This can be seen
clearly in the choice of testing structure and testing modality.

Test Structure
In the past, educational tests included a single test form per grade level. This type of test is
commonly called a wide-range test because the questions have a wide variety of difficulty
levels. A problem with this type of test is that in trying to give a sampling of content that is
appropriate for many students, the test as a whole is appropriate for very few students.
Almost any student taking a wide-range test encounters some items that are too easy, some
that are reasonably challenging, and some that are too difficult.

In any single grade level, the difference in achievement between low and high achieving
students is quite large. In building a single test to assess all of the students in a grade, there
must be some extremely hard questions and some extremely easy questions. In fact, the range


of difficulty is commonly large enough to reduce the degree to which the test is targeted to
any single student. As a result, high achievers find this kind of test to be a snap, while low
achievers are frustrated by its difficulty.

To improve the efficiency of assessments, NWEA has developed two types of tests that target
test difficulty to student achievement. These two types of tests are Achievement Level Tests
(ALT) and Measures of Academic Progress (MAP).

Achievement Level Tests (ALT)


The NWEA ALT assessments are different than wide-range grade-level tests. Rather than
constructing one test for every grade, ALT consists of a series of tests that are aligned with
the difficulty of the content rather than the age or grade of the student. The items of a single
ALT level have a small, targeted range of difficulty, designed to enhance the match of the test
to the set of students who take it. The range of difficulty for a single ALT level is far smaller
than the range of difficulty for a wide-range test. This allows a lot of information to be
collected concerning a student’s achievement, provided the correct level is administered.

Figure 1 compares the SEM from a wide-range test to that from an ALT series of eight levels.
Since error and precision are inversely related, the test with the smaller SEM is the test with
the greatest precision. Figure 1 shows that the ALT series tends to result in less error than the
wide-range test, across the entire measurement range of the tests. Figure 1 also shows that the
measurable score range of the ALT series is twice as large as that of the wide-range test.

While the SEM of the test score is an important characteristic, the student’s proportion of
correct responses is also important in identifying the usefulness of a test score. The vertical
lines in Figure 1 indicate the range in which the student’s score falls between near chance
performance (25% to 30% correct, depending on the content area) and 92% correct.
Beyond this range, the student will have very unstable scores since the test is far too easy or
far too difficult. Figure 1 shows that the ALT series has a much broader measurement range
with reasonable precision than the wide-range test.


Figure 1: Measurement error of typical wide-range test vs. ALT series

[Figure: the vertical axis shows measurement error and the horizontal axis shows RIT score (130 to 290). The plot marks the measurable range of a typical wide-range test (using the same percent-correct rules) and the much wider measurable range of the ALT series, levels 1 through 8.]

ALT assessments are designed to provide better targeting of test difficulty to student
achievement. Figure 2 shows the percentage of students receiving tests of different relative
difficulty. The example is based on 3,000 fourth-grade students taking a reading test in
California. A test that has an average item difficulty equal to the student’s achievement is
perfectly targeted. A test that is too difficult or too easy will not provide as much
information.


Figure 2 shows that ALT results in better targeting of the tests than using a single wide-range
test. In this example, the average absolute difference between the item difficulty and the
student achievement was 8.1 for ALT and 14.5 for the wide-range test. The greatest
targeting error seen for ALT was 23.4 RIT points (see the section The Measurement Model
for information about the RIT scale), while for the wide-range test it was 57.0 RIT points.
Figures 1 and 2 clearly show that the ALT system results in tests that are closer in difficulty
to the achievement of the students and yields more accurate scores over a wider range of
achievement.
Figure 2: Comparison of item difficulty targeting in ALT and wide-range tests

[Figure: the vertical axis shows proportion of students (0.00 to 0.10) and the horizontal axis shows the difference between student score and item difficulty (-60 to +60 RIT); separate distributions are plotted for the wide-range test and ALT.]
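Targeting statistics of the kind quoted above are straightforward to compute. The sketch below uses hypothetical values, not the California data: for each student it takes the absolute difference between the student's achievement and the mean difficulty of the items that student received, then reports the average and maximum of those differences.

```python
def targeting_summary(student_rits, mean_item_difficulties):
    """Mean and maximum absolute difference (in RIT points) between each
    student's achievement and the average difficulty of that student's test;
    smaller values indicate better targeting."""
    diffs = [abs(s - d) for s, d in zip(student_rits, mean_item_difficulties)]
    return sum(diffs) / len(diffs), max(diffs)

students = [187, 203, 221]    # hypothetical student RIT scores
test_means = [190, 205, 214]  # hypothetical mean difficulty of each student's test
print(targeting_summary(students, test_means))  # (4.0, 7)
```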

Measures of Academic Progress (MAP)


The other testing structure used in NWEA assessments is computerized adaptive testing
through MAP. The basic concept behind adaptive testing is relatively simple. Given a pool of
items with calibrated item difficulties, a student is presented with an item of reasonable
difficulty based on what is known about the student’s achievement. After the student answers
the item, his or her achievement is estimated. The next item chosen for the student to see is
one that is matched to this new achievement level estimate. If the student missed the
previous item, an easier item is presented. If the student answered the previous item
correctly, a more difficult item is presented.

This process of item selection based on all the responses that have been made by the student
repeats itself again and again. With each item presented, the precision of the student’s
achievement level estimate increases. This results in smaller and smaller changes in item


difficulty as the test pinpoints the student’s actual achievement level. This process continues
until the test is complete.
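The logic of this item-selection loop can be sketched in a few lines of Python. The sketch below is a simplified simulation under the Rasch model, not the MAP item-selection algorithm itself: achievement is re-estimated by a plain grid-search maximum-likelihood routine, the next item is simply the unused item closest in difficulty to the current estimate, and the item bank, difficulties, and random seed are hypothetical.

```python
import math
import random

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def estimate_theta(responses, difficulties, lo=-6.0, hi=6.0, step=0.01):
    """Maximum-likelihood achievement estimate found by grid search. With
    all-correct or all-incorrect responses the estimate sits at a bound,
    which is why early estimates in an adaptive test swing widely."""
    best_theta, best_ll = lo, float("-inf")
    for k in range(int((hi - lo) / step) + 1):
        theta = lo + k * step
        ll = 0.0
        for r, b in zip(responses, difficulties):
            p = rasch_p(theta, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

def adaptive_test(item_bank, true_theta, test_length=20, start_theta=0.0):
    """Simulate an adaptive test: pick the unused item closest in difficulty
    to the current estimate, score a simulated response, then re-estimate."""
    available, used, responses = list(item_bank), [], []
    theta_hat = start_theta
    for _ in range(test_length):
        item = min(available, key=lambda b: abs(b - theta_hat))
        available.remove(item)
        used.append(item)
        responses.append(1 if random.random() < rasch_p(true_theta, item) else 0)
        theta_hat = estimate_theta(responses, used)
    return theta_hat

random.seed(1)
bank = [round(-3.0 + 0.1 * k, 1) for k in range(61)]  # hypothetical difficulties, -3 to +3 logits
print(round(adaptive_test(bank, true_theta=1.2), 2))  # final estimate lands near the simulated achievement
```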

Figure 3 shows the student score and error band around the score as each item is presented in
a MAP mathematics test. The characteristics of an adaptive test are very clear in this figure.
Early in the test, the student’s score varies substantially from item to item and has a very
wide error band. As the test continues, the student’s achievement level estimate stabilizes,
and the associated error band shrinks substantially.
Figure 3: Student achievement level estimates and error bands following each item administered in a MAP mathematics
test

[Figure: the vertical axis shows student RIT score (150 to 250), with horizontal bands marking the basic, proficient, and advanced categories; the horizontal axis shows the number of test questions administered (0 to 20). The plotted estimates stabilize and the error band narrows with each additional item.]

In the test shown in Figure 3, three achievement categories have been identified (basic,
proficient, and advanced). Figure 3 illustrates that at the beginning of the test the error band
around the score overlaps all three categories, indicating that no confident decision can be
made about the student’s actual category. By the end of the test, the error band has decreased
to the point that it is completely contained within the proficient category. At this point a
confident decision can be made concerning the student’s actual achievement category.
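A rough sketch of this kind of band-based decision rule is shown below. The cut scores, SEM values, and category names are hypothetical; the point is simply that a category is reported only when the score plus or minus one SEM falls entirely within a single category.

```python
def classify(score, cuts):
    """Assign a RIT score to a category given ascending cut scores
    (cuts[0] = first proficient score, cuts[1] = first advanced score)."""
    if score < cuts[0]:
        return "basic"
    if score < cuts[1]:
        return "proficient"
    return "advanced"

def confident_category(score, sem, cuts):
    """Report a category only when the band from score - SEM to score + SEM
    lies entirely within one category; otherwise return None."""
    low, high = classify(score - sem, cuts), classify(score + sem, cuts)
    return low if low == high else None

cuts = (205, 225)  # hypothetical cut scores in RIT points
print(confident_category(212, 9, cuts))  # None: the band (203 to 221) crosses a cut score
print(confident_category(210, 3, cuts))  # 'proficient': the band (207 to 213) fits one category
```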

One advantage to administering MAP is that every student receives a unique test, custom
tailored to the performance of the student throughout the assessment. This means that the
targeting of item difficulty is typically even better than that of a particular ALT test. This
also means that the precision of MAP is greater than that of ALT, although the actual difference is
quite small.


Figure 4 displays the average SEM for scores from MAP and ALT for a set of language usage
assessments. Notice that for the most part, MAP and ALT have nearly identical SEM values.
This is in keeping with the underlying test design and measurement theory. The
development and implementation of computerized adaptive testing in education has been
researched over the last 30 years (Weiss and Kingsbury, 1984) and has evolved to a point
where it is now an extremely useful testing approach for many organizations that have large
item banks and a solid computer infrastructure for test administration.
Figure 4: Comparison of SEMs for MAP and ALT

[Figure: MAP and ALT measurement error for language usage by RIT, spring 2001 (ALT n = 294,672; MAP n = 75,226). The vertical axis shows measurement error and the horizontal axis shows the RIT scale (130 to 270).]


Figure 5 shows the targeting of the MAP tests (the difference between average item difficulty
and student achievement) compared to a typical wide-range test. You can see that the MAP
tests are substantially better targeted to student achievement than the wide-range test. The
average absolute difference between the student achievement level and the average difficulty
of the items on the test was 2.0 RIT points, and the largest difference observed was 22 RIT
points. Comparing Figure 5 to Figure 2 shows that the MAP system results in better
targeting than the ALT system.
Figure 5: Comparison of item difficulty targeting in MAP and wide-range tests

[Figure: the vertical axis shows proportion of students (0.00 to 0.24) and the horizontal axis shows the difference between student score and item difficulty (-60 to +60 RIT); separate distributions are plotted for the wide-range test and MAP.]

These figures show clearly that ALT and MAP provide almost all students with tests that are
better targeted to their skill levels. As a result, the test scores from MAP and ALT are more
precise than those from a comparable wide-range test. This is in keeping with several of the
principles described at the beginning of this section.

Testing Modality
In order to make this system even more efficient and informative, NWEA uses two different
test modalities: paper and pencil delivery and computer delivery. An educational agency
must decide which testing modality is appropriate for them.

ALT tests are delivered in paper form. MAP tests are delivered via computer. Since the
process for selecting items in an adaptive test requires rapid estimation of the student’s
achievement and rapid calculation to select the item to be administered to the student, a
computer is required for MAP administration.


There are several factors an educational agency should consider in deciding which testing
modality is right for them:

 Availability of resources—Since MAP requires computers for test administration,
organizations must have workstations available for use. ALT does not require each
student to be tested by computer, but requires a different set of resources, including
logistic support for printing and distributing test materials, control systems for
processing answer sheets, and disposal or storage procedures for test materials
following testing.

 Time frame available for test administration—MAP can be simultaneously
administered to only as many students as there are computers available. There are
rarely enough computers at a particular site to test all students simultaneously.
Therefore, agencies must schedule test administration sessions with individual classes
or students. ALT can be administered simultaneously to many students. Since the
MAP test is created dynamically from a large item pool, however, the need for
simultaneous testing is not as great as with ALT.

 Time frame available for scoring—MAP scoring and results validation happen in
less than 24 hours. ALT requires machine scoring that is done by either the
organization or NWEA. The ALT scoring process normally requires approximately
two weeks.

 Need to retest students with invalid tests—Sometimes (less than 6 percent of the
time) a student achieves a score outside the valid measurement range for the ALT
level administered. In this situation, the student is retested with a level test more
appropriate to the student’s achievement or with the MAP system. Using the MAP
system, retests are rarely needed and usually occur when hardware fails or the
student becomes ill during testing.

 Desire to control item selection—ALT allows an organization to hand pick the
items for their ALT series. Since MAP builds a custom test for each student during
test administration, organizations cannot control the exact content administered to
each student. Nonetheless, organizations choosing MAP do have a great deal of
control over the types of items presented to students. This is due to the robust
blueprinting and item selection process available within MAP.

By considering all of these factors, educational agencies can choose the testing modality that
is most useful to them.


The Item Banks and Measurement Scales
To develop assessments that have a high degree of validity and reliability for a variety of
different educational agencies, it is important to have large banks of high-quality test items.
These items need to be connected to an underlying measurement scale using procedures that
ensure consistency of scores from one set of items to another. This section outlines the
process that NWEA follows to create items, incorporate them into the NWEA measurement
scales, and maintain the scales and items over time. These processes and the theory behind
them allow the development of valid, reliable, and precise assessments from the NWEA item
banks. An understanding of these processes helps in interpreting and explaining the results of
the assessments.

The Measurement Model


Before discussing how items are added to the NWEA measurement scales, it is important to
understand the measurement model that is used to create the scales. Item Response Theory
(IRT) (Lord & Novick, 1968; Lord, 1980; Rasch, 1980) defines models that guide both the
theoretical and practical aspects of NWEA scale development.

For the purpose of understanding these measurement models, consider again the analogy of
the ruler used in the Introduction. The NWEA measurement scales are very much like a
ruler. The NWEA reading scale is a ruler of reading. With an actual ruler, the tick marks are
placed evenly and accurately on the ruler by using a standardized device such as an inch-long
piece of wood to draw each tick mark. Locating the tick marks for an educational scale such
as the reading scale is not this easy, but it is quite possible using IRT.

One important element of IRT is the mathematical model that grounds it. NWEA uses the
one-parameter logistic model (1PL), which is also known as the Rasch model. The model is
as follows:

P_ij = e^(θ_j - b_i) / (1 + e^(θ_j - b_i))

This probabilistic model estimates the probability (P_ij) that a student (j) will answer a
question (i) correctly, given the difficulty of the question (b_i) and the achievement level of
the student (θ_j). The constant, e, is equal to the base of the natural log
(approximately 2.718).
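As a minimal illustration, the probability of a correct response can be computed directly from this formula. The short Python sketch below (the function name and values are illustrative, with achievement and difficulty expressed in logits) shows that a student located exactly at an item's difficulty has an even chance of answering it, while a student one logit above it succeeds about 73 percent of the time.

```python
import math

def rasch_probability(theta, b):
    """P(correct) for a student with achievement theta on an item of
    difficulty b, both in logits, under the one-parameter (Rasch) model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

print(round(rasch_probability(0.0, 0.0), 2))  # 0.5: student exactly at the item's difficulty
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73: student one logit above the item
```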


A benefit of the use of an IRT model is that the student achievement levels and item
difficulties are aligned on a single interval scale. This measurement scale is “equal interval”
in nature, meaning that the distance between tick marks is equal, like a ruler. The units
(equivalent to the ruler’s tick marks) are called log-odds units or logits. Logit values are
generally centered at zero and can theoretically vary from negative infinity to positive
infinity. To simplify interpretation and eliminate negative scores, the RIT (Rasch unIT) scale
was developed. Since the RIT score is a linear transformation of the logit scale, it is still an
IRT scale and maintains all of the characteristics of the original scale. In order to calculate a
RIT score from a logit score, a simple transformation is employed:

Y_RITs = (X_logits × 10) + 200

Therefore, if a logit value is zero, then the RIT value equals 200. This scale has positive
scores for all practical measurement applications and is not easily mistaken for other
common educational measurement scales. The RIT scale is used throughout ALT and MAP.
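As a simple illustration of this transformation (a sketch of the arithmetic only, not NWEA software), the conversion can be written in both directions:

```python
def logit_to_rit(logit):
    """RIT = (logit x 10) + 200, the linear transformation described above."""
    return logit * 10.0 + 200.0

def rit_to_logit(rit):
    """Inverse transformation, for moving a RIT score back to the logit metric."""
    return (rit - 200.0) / 10.0

print(logit_to_rit(0.0))    # 200.0
print(logit_to_rit(1.5))    # 215.0
print(rit_to_logit(215.0))  # 1.5
```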

IRT measurement scales have a number of useful characteristics when they are applied and
maintained properly. The characteristics that have been most important to the development
of the NWEA measurement scales and item banks are:

 Item difficulty calibration is sample free. This means that if different sets of
students who have had an opportunity to learn the material answer the same set of
questions, the resulting difficulty estimates for the items are equivalent aside from
sampling fluctuation.

 Achievement level estimation is sample free. This means that if different sets of
questions are given to a student who has had an opportunity to learn the material,
the student’s score is equivalent aside from sampling fluctuation.

 The item difficulty values define the test characteristics. This means that once the
item difficulty estimates for the items to be used in a test are known, the precision
and the measurement range of the test are determined.

These simple properties of IRT result in a variety of test development and delivery
applications. Since IRT enables one to administer different items to different students and
obtain comparable results, the development of targeted tests becomes practical. This is the
cornerstone for the development of level testing and adaptive testing systems.

These IRT properties also enable the building of item banks with items that extend beyond a
single grade level or school district. This has enabled NWEA to develop measurement scales
and item banks that extend from elementary school to high school. By combining these
properties of IRT with appropriate scale development procedures, NWEA has also developed
scales and item banks that endure across time and generalize to be useful in a wide variety of
educational settings.


Scale Development and Item Calibration


The development of the measurement scales and item banks has two primary phases. In the
initial phase, the boundaries for the scale are established, initial field testing of items is
performed, and scale values are established. The second phase includes maintenance of the
scale, extension of content, and addition of new items.

Phase I: Developing the Measurement Scales


The development of each NWEA measurement scale follows the same set of steps:

 Identify the content boundaries for the measurement scale. During this step, a
variety of content structures from different agencies for the domain of interest are
reviewed and joined to create a content index that is as detailed as, but broader in
scope than, any single description of the content area.

 Develop items that sample the content with a wide range of difficulty. Groups of
classroom teachers are trained to write high-quality test questions and participate in
a multi-day workshop to produce test items related to each element of the content
domain.

 Develop the networked sampling design to be used in field testing. To assure
robustness of difficulty estimation, each item is included in several test forms to
allow the item to be seen in a variety of contexts and positions. The network design
used in most NWEA scale development is the four-square design (Wright, 1977).

 Identify samples of students appropriate for the items to be tested. Each test form
is scheduled to be administered to 300-400 students in different classrooms, schools,
and grades.

 Administer the field test. Students take the field tests in settings that mimic the
actual test as closely as possible. Tests are teacher-proctored and presented without
fixed time limits. Responses are entered on the same type of answer sheet used for
the operational tests.

 Test for dimensionality. Once field tests are administered, samples of the test
forms are used to investigate whether responses are affected by more than one
primary dimension of achievement. These analyses include factor analytic
procedures and content area calibration procedures (Bejar, 1980).

 Estimate item difficulties. Once field test information has been collected, a
conditional maximum-likelihood procedure is used to calculate item difficulty
estimates for the items in each test form (Baker, 2001; Warm, 1989). This results in
a set of estimates and fit statistics for each form in which a particular item appears.

 Test items for model fit. Fit statistics (point biserial, revised mean square fit) are
calculated for each item on each test form (a computational sketch of these screening
statistics appears at the end of this subsection). In addition, the percentage of students
answering each item correctly and the percentage of students omitting each item are
calculated. Each form is then reviewed, and items are eliminated from further


consideration if they have poor fit statistics or if they are answered correctly by too
high or too low a percentage of the students in the sample.

 Triangulate item difficulties. The remaining items that appear on multiple forms
are used in a process of triangulation to identify linking values that result in the
most plausible difficulty values for each item across all of the forms (Ingebo, 1997).
This process results in a single difficulty estimate for any particular item that is the
best representation of the difficulty of the item across all of the forms. At the
completion of this process, the difficulty estimates for all of the items administered
are identified along a single, underlying difficulty continuum.

 Apply Logit-to-RIT transformation. In the development of any IRT scale, a single
linear transformation is allowed. This gives the scale the numerical characteristics
desired by the developers. Once the items have been identified with triangulated
difficulty estimates, the linear transformation described earlier is used to transfer the
item difficulty values from the Logit scale to the RIT scale.

Once these steps have been carried out, the initial scale development is complete. NWEA has
used this same process to create its scales in reading, mathematics, language usage, general
science, and science concepts and processes. More details about the original development of
the NWEA scales can be found in Ingebo (1997).

Phase II: Maintaining the Measurement Scales


Once the initial phase of scale development has been completed, the scale can be used to
design tests, score student assessments, and measure student growth. At this point, the
highest priority shifts to maintaining the measurement scale. A primary element in this
maintenance is establishing scale stability. Items are always added to the item banks using
processes that assure that the original scale remains intact.

The process of calibrating the difficulty of a particular item to an existing scale can be
compared to calibrating a new bathroom scale using objects of known weight. In the same
manner, adding new items to the original measurement scale requires the use of items with
known difficulty estimates. New items must be added to the scale in a way that does not disrupt
the original scale. The field testing process is designed to collect data to permit this seamless
addition of items to the original scale.

Newly-developed items are field tested by placing them into a test together with active items.
Students’ scores are then estimated using only the active items to anchor them to the original
measurement scale. Using the fixed student achievement estimates, the difficulty of each of the
developmental items is estimated using a one-step estimation procedure. This procedure is
contained within a proprietary item calibration program designed for the purpose. Since these
item difficulty estimates are based on student achievement level estimates from items on the
original scale, they are also on the original scale. This procedure allows virtually unlimited
expansion of the item banks within the range of difficulty represented by the original items. It
also allows for the careful extension of the original measurement scale to easier and more
difficult test questions.
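A minimal sketch of this anchored, fixed-ability calibration is shown below. It illustrates the general approach rather than NWEA's proprietary calibration program: the achievement estimates obtained from the active anchor items are held fixed, and the difficulty of the developmental item is chosen to maximize the likelihood of the observed responses, which places it on the original scale. All names and data are hypothetical, and values are expressed in logits.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def calibrate_new_item(anchored_thetas, responses, lo=-6.0, hi=6.0, step=0.01):
    """Grid-search maximum-likelihood difficulty for one developmental item,
    holding the students' anchored achievement estimates fixed."""
    best_b, best_ll = lo, float("-inf")
    for k in range(int((hi - lo) / step) + 1):
        b = lo + k * step
        ll = 0.0
        for theta, r in zip(anchored_thetas, responses):
            p = rasch_p(theta, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        if ll > best_ll:
            best_b, best_ll = b, ll
    return best_b

thetas = [-1.2, -0.5, 0.0, 0.3, 0.8, 1.1, 1.6, 2.0]  # anchored achievement estimates (logits)
scores = [0, 0, 1, 0, 1, 1, 1, 1]                    # responses to the developmental item
print(round(calibrate_new_item(thetas, scores), 2))  # estimated difficulty on the original scale
```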


By using IRT to create a scale and by anchoring the difficulty estimates of the items to this
scale, NWEA assures that the scale is constant from one set of items to another, and from
one set of examinees to another. This means that two students can receive different sets of
items and the achievement estimates will connect to the common measurement scale. As a
result, educational decisions based on the results are comparable. Both MAP and ALT take
advantage of this measurement characteristic of the scales to allow different students to take
different sets of test questions. This also allows longitudinal assessments of student
performance to be made without the types of discontinuities that have plagued many test
publishers when they change their norms or re-center their scale scores.

Item Development
The item development process used by NWEA has been refined over the past 25 years as the
item banks in reading, mathematics, language usage, general science, and science concepts
and processes have expanded. The methods that NWEA uses to incorporate items into the
NWEA scales and maintain the scales are in widespread use throughout the measurement
community, although particular aspects of the process are unique to the NWEA item banks.

There are four basic phases of item development:

 Item writing and editing

 Item bias review

 Item index assignment

 Field testing

After field testing, new items are analyzed for quality before they are made available for use
in tests. If items have the statistical characteristics required by this quality analysis, they are
made active and may be used within MAP and ALT assessments. The item banks are also
monitored to ensure that the items and scale perform in a stable manner from setting to
setting and from year to year.

This section explains each of the stages of item development, the methods used to
incorporate the test items into the item banks, and the process used to maintain the quality
of the measurement scales and items over time.

Item Writing and Editing


The first phase of item development is simply writing the items. In order to write items for
use in the item banks, item writing workshops are held with groups of classroom teachers.
These teachers may come from a number of different educational agencies and attend the
workshop for the explicit purpose of writing test questions. By using active teachers to write
items, the items can more accurately reflect the curriculum actually being taught in the
classroom.


At each workshop, the teachers are first taught the basic guidelines for constructing multiple-
choice items (Haladyna, 1994; Osterlind, 1998; Roid & Haladyna, 1997). Writers are
instructed on item writing terminology and the need for clarity in the items. They are taught
how to make the stem of an item unambiguous and concise and are directed on how to write
parallel distracters. Writers are instructed to use only four response options for reading and
language usage and five options for mathematics and science. The teachers are encouraged to
write the stem of an item using positive wording and to use completely independent
distracters.

Following training, writers are assigned some general topic areas such as algebra or
computation. They then write items in these areas that would be appropriate for the students
in their classes. Teachers write approximately ten to fifteen items each day. As the items are
written, they are exchanged with other teachers who check the items for technical accuracy,
grammatical accuracy, completeness, readability, potential for bias, and clarity. Once the
items have been reviewed, the item writer specifies the grade range for which the item is
appropriate and the grade range in which the item should be field tested.

After the item writing session, the items are reviewed and edited by NWEA staff. This review
includes editing the grammar, format, and style of the items. It also includes reviewing and
editing the presentation of the items to allow seamless computer administration.
Modifications that may impact the correctness of the item are also reviewed before the item
is moved on to the next stage of development, bias review.

Item Bias Review


The purpose of the bias review is to ensure that all of the items contained in the NWEA item
banks are written in such a way that students of all ethnicities, races, and genders will
interpret them in a similar manner. To ensure that items are free of race, ethnicity, and
gender bias, two different steps are taken. During the item development process, other
teachers review the items for potential bias. The teachers look at each item to identify
whether it contains words or passages that do not pertain to the skill being tested but might
be misleading or misunderstood by students with particular racial or ethnic backgrounds.

In addition, NWEA holds item bias review panels. The focus of these panels is to see that all
students have a fair opportunity to answer the item correctly, without being distracted or
misled by the context of the item. During these events, many of the items in the item banks
are reviewed by a panel of stakeholders from a variety of racial and ethnic backgrounds. This
panel examines each item using the same guidelines used by the teachers who originally
wrote them. Items that have potential bias are edited by the review panel or sent back to the
original item author to be revised. In addition, the panel may suggest wording changes—
such as the names of individuals contained within the item text—that might help the set of
reviewed items reflect greater diversity. Occasionally items are rejected and removed from
active use in the item banks.


Content Category Assignment


NWEA also assigns every item to a content category. By doing so, the performance of
students in specific content areas can be tracked. This enables NWEA to relate student
performance to the NWEA Learning Continuum and to provide agencies with reports that
contain student performance within the various content categories.

The schemas for the content categories within each bank were developed by NWEA staff and
content specialists. In building each content index, the content of the items within the bank
was analyzed, and a logical, hierarchical structure for the content of each subject was
developed. This structure was based on an amalgamation of the curriculum structures used
in a variety of NWEA districts. From time to time, the content structures are updated as
more precise ways of differentiating the content of the items within a bank are discovered.
They are also updated whenever a new content area is introduced into the bank. Currently,
there are approximately 150 to 200 content categories in each subject area. Each content
category serves as the reference point for up to 200 items.

The process by which an individual item is assigned to a content category is simple.
Generally, the item authors are provided with a copy of the content structure and are
instructed on how to interpret it. During item writing, the item authors identify and
document which index category they believe to be most appropriate for the item. After the
item writing session, these assignments are reviewed by NWEA staff and any questionable
assignments are discussed and altered if needed.

Field Testing
Once the items have been written, edited, reviewed for possible bias, and assigned to an item
index category, they are field tested. The purpose of field testing the items is to collect data
that will be used to analyze the quality of the items and incorporate them into the
measurement scales. Since the collection of field-test data uses student and teacher time,
every effort is made to make field testing as unobtrusive as possible.

There are three different ways that items are field tested. In the first method, students are
given a separate mini-test following the completion of their actual assessment. This mini-test,
called a field-test form, contains between ten and twenty field-test items.

The second field-test method involves administering assessments that contain several field-
test items placed within the assessment so that they are transparent to the student. When the
assessment is scored, only the data from the active items are used, so a student’s score is not
influenced by the presence of field-test items. By constructing and administering the field
tests in this manner, student time is minimally impacted and student scores are unaffected.

NWEA has also field tested items by administering a special test consisting of between thirty
and forty items, most of which are field-test items. Students do not receive scores when they
take these assessments. Field testing in this manner allows a great number of items to be
evaluated in a short period of time but uses student time without providing scores that are
useful in the classroom. For this reason, this method is used only when necessary.


To ensure that the quality of the data is high, field-test items are administered only in the
grade or grades suggested by the author. This ensures that the sample of students taking any
field-test item is reflective of the sample of students who will be taking the item after it
becomes active.

The size of the student sample also affects the quality of the data. Each item is administered
to a sample of at least 300 students. Ingebo (1997) has shown that this sample size is
adequate for very accurate item calibrations and item fit statistics.

Another essential aspect of quality data collection is student motivation. By embedding the
field-test items in a live test that is scored and reported, they appear identical to active items.
As a result, students are equally motivated to answer field-test and active items.

Finally, the environment for data collection should be free from the influence of other
confounding variables such as cheating or fatigue. Since the field-test data are collected
within the normal NWEA test administration process, which is designed to equalize or
minimize the impact of outside influences, the environment is optimal for data collection.

The field-test processes provide excellent field-test data that in turn allow the addition of
high-quality items to the item banks. The items are administered to a sizable sample of
students, and the data students provide are collected in a manner that motivates the students
to work seriously in an environment free from external influences on the data.

Item Analyses and Calibration


Once items have been written, edited, and field tested, the next step in the item development
process is to analyze how they perform. Two statistical indices are used to help identify unusual
items.

The first index is the adjusted point-biserial correlation. This index quantifies the
relationship between the performance of students on a specific item and the performance of
students on the test as a whole.

The overall test scores of students who answer a particular item correctly are expected to be
higher than the test scores of those who do not. When this is the case, the point-biserial
correlation is positive and can approach a theoretical maximum of 1.0. If lower-performing
students answer the item correctly more often than higher-performing students do, the
point-biserial is low or even negative. A low point-biserial value for an item usually indicates
that the item is not working with the scale. Items with negative point-biserial values may in
fact be working against the scale.

There are many reasons why a seemingly reasonable item might not be working well. One
example might be a mathematics question with very difficult vocabulary or sentence
structure. Another might be a reading question that could be answered without reading the
passage for the item.
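
A minimal sketch of the statistic follows. It assumes that the adjustment is the usual
correction in which the item's own score is removed from each student's total before
correlating, which may differ in detail from the operational computation.

    def adjusted_point_biserial(item_scores, total_scores):
        """Corrected item-total (point-biserial) correlation: the item's score is
        removed from each student's total so the item is not correlated with itself."""
        rest = [t - i for i, t in zip(item_scores, total_scores)]
        n = len(item_scores)
        mean_i = sum(item_scores) / n
        mean_r = sum(rest) / n
        cov = sum((i - mean_i) * (r - mean_r) for i, r in zip(item_scores, rest)) / n
        var_i = sum((i - mean_i) ** 2 for i in item_scores) / n
        var_r = sum((r - mean_r) ** 2 for r in rest) / n
        return cov / (var_i ** 0.5 * var_r ** 0.5)

    # Hypothetical data: 1/0 item scores and number-correct totals for six students
    print(round(adjusted_point_biserial([1, 1, 0, 1, 0, 0], [38, 35, 30, 29, 22, 20]), 2))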

The second index is the adjusted Root-Mean-Square Fit (RMSF). This index is a measure of
how well the item fits the scale. The RMSF is calculated by comparing actual student
performance to the theoretical values that are expected from the model. A high value for the
RMSF indicates that the item does not fit the model well. Similar to the point-biserial, the
RMSF indicates whether high-performing students are missing an item more often than
expected or low-performing students are answering an item correctly more often than
expected. This index can reveal more subtle difficulties with an item than the point-biserial.

For instance, an item with two correct answers may have an adequate point-biserial, but it
may have a high RMSF value. Items with a high RMSF value are reviewed graphically using
the item response function for the item. The item response function is a plot that shows the
probability of correct response to an item against the achievement level of the student. When
reviewing an item, the empirical item response function is plotted on the same scale as the
theoretical function. When there are large discrepancies between the two curves, there is a
lack of fit between the item and the scale. By reviewing the response functions, a more
comprehensive understanding of item performance can be gained.
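
The fit review can be approximated with the sketch below. The statistic computed is the square
root of the mean squared standardized residual between observed responses and Rasch-expected
probabilities, which behaves like the RMSF described above (values near 1.0 indicate good fit,
values above about 2.0 indicate misfit); the operational formula and its adjustment may differ.
The same pass over the data also produces the binned empirical proportions that are plotted
against the theoretical response function in figures such as Figures 6 and 7.

    import math
    from collections import defaultdict

    def rasch_prob(theta_rit, difficulty_rit):
        """Probability of a correct response under the Rasch model, assuming the
        RIT metric corresponds to 10 points per logit."""
        return 1.0 / (1.0 + math.exp((difficulty_rit - theta_rit) / 10.0))

    def item_fit_review(thetas_rit, responses, difficulty_rit, bin_width=10):
        """Return an RMSF-like statistic and binned empirical vs. theoretical
        proportions for one item (a rough stand-in for the operational indices)."""
        squared_std_residuals = 0.0
        bins = defaultdict(lambda: [0, 0])             # bin start -> [correct, total]
        for theta, x in zip(thetas_rit, responses):
            p = rasch_prob(theta, difficulty_rit)
            squared_std_residuals += (x - p) ** 2 / (p * (1.0 - p))
            b = int(theta // bin_width) * bin_width
            bins[b][0] += x
            bins[b][1] += 1
        rmsf = math.sqrt(squared_std_residuals / len(responses))
        curves = {b: (correct / total, rasch_prob(b + bin_width / 2.0, difficulty_rit))
                  for b, (correct, total) in sorted(bins.items())}
        return rmsf, curves

    # Hypothetical field-test responses to one item calibrated near 244 RIT
    print(item_fit_review([215, 225, 235, 245, 255, 265], [0, 0, 1, 1, 0, 1], 244.0))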

Figures 6 and 7 show theoretical and empirically observed response functions for two items
(each item was a difficult mathematics item, field tested with approximately 400 students).
Figure 6 shows the results for an item that has poor fit to the measurement model (indicated
by an RMSF above 2.0). Upon review, the item was identified as being vaguely worded. This
item was rejected for use in the item banks. Figure 7 shows the results from an item with
good fit to the measurement model. This item was approved for use in the item banks.


Figure 6: Item response function for a poorly-performing item

[Plot: Theoretical and Empirical Response Functions (RIT=244, RMSF=3.36). Probability of a
correct response (0.0 to 1.0) plotted against student RIT score (200 to 300); the empirical
curve departs markedly from the theoretical curve.]

Figure 7: Item response function for a well-performing item

[Plot: Theoretical and Empirical Response Functions (RIT=244, RMSF=0.80). Probability of a
correct response (0.0 to 1.0) plotted against student RIT score (200 to 300); the empirical
curve closely follows the theoretical curve.]

Upon the completion of item analyses, items that perform well are added to the item banks
for use in future test development. Those that do not perform well are flagged for review.


During the review, minor content problems with the item are often uncovered and
corrected. From there, the item is field tested once again. If it performs poorly again and
subsequent review reveals no obvious problem with the item, it is either eliminated from the
item bank or retested in a different grade. These quality assurance procedures allow the item
banks to grow while filtering out those items with performance difficulties.

Periodic Review of Item Performance


After the items have been developed and incorporated into the NWEA scales, NWEA
monitors the items, the banks, and the scales to maintain their quality. Although the
measurement scale underlying the items is theoretically invariant, sometimes the meaning of
an item can change as society evolves or as curriculum changes. For instance, a word may
come into common usage for a short time, either with or without its original definition.
(The word “radical” is a classic example from the 1980s.) This type of change may alter the
difficulty of an item from its initial difficulty estimate for a period of time. (Similar difficulty
drift may arise with items that test monetary skills with obsolete prices.)

This is also the case when the curriculum for a subject changes dramatically on a national
scale, such as the increased emphasis on integrated mathematics over the last twenty years. In
an effort to keep the scale up to date, NWEA conducts periodic reviews of the content
within the item banks. Generally, these reviews are targeted at a specific issue, such as a set of
words that may have come into common usage. When items with such content difficulties
are discovered, they are prohibited from use, although they may be revised and re-field tested
if the content is otherwise worthwhile.

In addition to inspecting the performance of specific items, studies are performed periodically
to determine whether the scale itself is fluctuating or drifting across time. This is
done by recalibrating the items after several years have passed since the initial calibration. A
recent study (Kingsbury, 2003) investigated possible drift over a 22-year time period, using
over 1,000 reading items and 2,000 mathematics items administered to over 100,000
students taking ALT and MAP tests. This study replicated two primary findings seen in
earlier studies:

 The difficulty values of items have not changed across time more than would be
expected from normal sampling variation.

 The measurement scales have not drifted by more than 0.01 standard deviations
over the quarter of a century in which they have been used.

Periodic reviews of item performance and scale stability ensure that the item calibrations are
appropriate and the scales are stable. This is one aspect of the information that is needed to
ensure that educational agencies can construct valid, reliable, and precise assessments using
the item banks.
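
A drift check of this kind amounts to comparing each item's original calibration with its
recalibrated value; the summary below is an assumed illustration of the idea, not the analysis
reported in Kingsbury (2003).

    def summarize_drift(original_rits, recalibrated_rits):
        """Summarize calibration drift for a set of items: the mean displacement and
        the standard deviation of the item-by-item differences, in RIT points."""
        diffs = [new - old for old, new in zip(original_rits, recalibrated_rits)]
        n = len(diffs)
        mean_shift = sum(diffs) / n
        sd = (sum((d - mean_shift) ** 2 for d in diffs) / (n - 1)) ** 0.5
        return mean_shift, sd

    # Hypothetical calibrations of five items, estimated several years apart
    print(summarize_drift([192.0, 205.5, 214.0, 227.5, 241.0],
                          [191.6, 205.9, 214.2, 227.1, 241.3]))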


The Assessment Process


Note: This section describes the rationale behind assessment administration,
scoring, reporting, and results interpretation, as well as the technical specifications
pertinent to each phase. There are several other documents that supplement this
section. Consult the MAP administration training materials and the ALT
Administration Guide to understand how to administer the assessments. For
information on how to score and generate reports for ALT consult the User’s
Manual for the NWEA Scoring and Reporting System.

Once an educational agency decides to incorporate NWEA assessments into its assessment
strategy, it begins a process of creating an assessment system that meets its needs and is as
efficient as possible. One of the first decisions is to determine which testing mode to use.
Next, the tests are developed and then administered. Following administration, the tests are
scored and checked to verify the integrity of the test data. Next, student, class, and school
reports are generated for use by parents, students, teachers, and administrators. Lastly, the
results of the assessments are interpreted and used to inform educational decisions. Since the
quality of each stage in the process is dependent on the quality of all of the previous stages,
these stages can be thought of as a pyramid, as shown in Figure 8.
Figure 8: The NWEA assessment stages


Test Development
Once a testing mode is selected, the first stage in the NWEA assessment process is test
development. The NWEA test development process was established years ago by following
methods pioneered by experts in the fields of psychometrics and educational assessment.
Since then, the process has been refined to maximize the quality and efficiency of NWEA
assessments and to include the delivery of the assessments via computer. The process is
guided by the standards for educational and psychological testing as established by
professional organizations in the field (AERA/APA/NCME, 1999).

The test development process consists of a set of decisions that must be made and activities
that must be performed to create a functional test, as shown in Figure 9. The first step is to
design the test. The next is to determine the aspects of the curriculum that will be tapped by
the assessment. After that, the test items are selected from the NWEA item bank. Lastly, the
items are assembled into tests and proofread for a variety of characteristics. At that point, the
test development process is complete and the tests are ready to be administered. Details on
each phase of the test development process are provided next.
Figure 9: The phases of the test development process

[Flowchart: Start → Test Design → Content Definition → Item Selection → Test Construction → Finish]

Test Design
The first phase in the test development process is test design. In this phase, the psychometric
test specifications—such as the test length and item difficulty range—are established. NWEA
has developed sets of optimal test specifications to use when designing and constructing these
assessments. These specifications are used by all of the agencies administering MAP and ALT
assessments. They have been designed with a range and flexibility applicable to a wide variety
of educational situations to provide consistent, precise measurement.

ALT Test Specifications


For ALT, the test specifications developed during test design are:

 Difficulty range of each level test.

 Difficulty overlap of adjacent level tests.

 Number of items on each level test (also known as test length).


These specifications adjust to fit the nature of the test content and the use of the scores. By
defining these specifications carefully, NWEA can create a level test series with desirable
measurement characteristics. These specifications allow NWEA to predict with some
accuracy the expected performance of the ALT system in terms of measurement precision
even before the test is administered to a single student.

Difficulty range—Each ALT series contains five to eight individual test forms composed of
items that increase in difficulty from one form (or level) to the next. This allows an
individual level to match the performance of a particular student while the series spans the
complete range of student achievement. The distribution of item difficulty is typically
uniform for any single ALT test, and the range of difficulty is the same for every level test in
the ALT series. The size of the difficulty range can be varied when the series is designed, and
any change applies uniformly to every level test in the series.

As the size of the range of difficulty increases, the test's capacity to assess a broader range of
achievement levels increases. However, a wider range also makes the targeting of the test to each
student less precise, which reduces the precision of each student's score. In addition, increasing the
range of item difficulty in a particular level increases the chance that a student will see items
that are too difficult or too easy.

The standard range of difficulty for an NWEA ALT test is 20 RIT points. The easiest and
most difficult levels may extend beyond this range as needed to assess students accurately at
the extremes. These levels tend to have a range of about 40 RIT points.

Difficulty overlap—Between adjacent level tests, there is always an overlap in item difficulty.
In other words, the hardest items of the easier level test are always more difficult than the
easiest items of the harder level test. The difficulty overlap used in the ALT tests is half of
the difficulty range, or 10 RIT points. This design allows the entire range of common
student performance to be assessed with a series of seven to nine levels, as shown in Figure
10.
Figure 10: ALT test structure

[Chart: Test Widths and Overall Structure of a Typical ALT Series. Eight level tests (L-1
through L-8), each spanning 20 RIT points, are staggered along the RIT scale from 150 to 240
so that adjacent levels overlap by 10 RIT points.]
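
Given the standard specifications of a 20 RIT point range per level and a 10 RIT point overlap
between adjacent levels, the staggered structure sketched in Figure 10 can be generated
mechanically; the starting value of 150 RIT and the count of eight levels are assumptions taken
from the figure.

    def alt_level_ranges(num_levels=8, start=150, width=20, overlap=10):
        """Generate (low, high) RIT bounds for an ALT series in which each level
        spans `width` RIT points and adjacent levels overlap by `overlap` points."""
        step = width - overlap
        return [(start + i * step, start + i * step + width) for i in range(num_levels)]

    for level, (low, high) in enumerate(alt_level_ranges(), start=1):
        print(f"L-{level}: {low}-{high} RIT")
    # L-1: 150-170 RIT, L-2: 160-180 RIT, ..., L-8: 220-240 RIT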

Test length—Since test precision and test length are related, the number of items on a test
has a significant impact on score precision. More items generally result in more precision in
the test scores. This is also true of the precision of goal scores. Although the final
specification determined during ALT test design is the number of items that the test
contains, often NWEA starts with a desired precision and works back toward the number of
items needed to achieve that precision.

Most level tests are designed to have a SEM of approximately three RIT points in the central
portion of the measurement range for the level. Similarly, most level tests aim for a SEM of
about five RIT points for each goal score to be reported. The desired precision of the overall
score can be achieved with 40 items in a level, and the desired precision of the goal scores can
be attained with about seven items per goal. Therefore, NWEA ensures that every ALT series
contains at least seven items per goal area and at least 40 items overall. It takes students
slightly less than one minute, on average, to answer each item, so these tests can be
administered in a very time efficient manner. (Note that all MAP and ALT tests are
administered without time limits. As long as a student is working productively on a test, they
are allowed to continue.)
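
The connection between test length and precision can be illustrated with the standard Rasch
result that the standard error of an achievement estimate is the reciprocal square root of the
total item information. Converted to RIT units on the assumption of 10 RIT per logit, roughly
40 well-targeted items yield an SEM of about three RIT points. The sketch below is an
approximation under those assumptions, not the operational SEM calculation.

    import math

    def approximate_sem_rit(num_items, offsets_rit=(0,)):
        """Approximate the RIT-scale SEM of a maximum-likelihood score for a test whose
        item difficulties sit at the given offsets (in RIT points) from the student's
        achievement level. Assumes the Rasch model and 10 RIT per logit."""
        total_information = 0.0
        for k in range(num_items):
            offset_logits = offsets_rit[k % len(offsets_rit)] / 10.0
            p = 1.0 / (1.0 + math.exp(offset_logits))
            total_information += p * (1.0 - p)          # Rasch item information
        return 10.0 / math.sqrt(total_information)       # back to RIT units

    # 40 items spread uniformly within +/- 10 RIT of the student's achievement level
    print(round(approximate_sem_rit(40, tuple(range(-10, 10))), 1))   # about 3.3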

The optimal test design for ALT differs slightly depending on the subject. Within a subject
area, however, the same test design is used for almost all NWEA tests. The optimal test
design for an ALT series includes:

 Five to eight level tests.

 Eight or fewer major goal areas per test.

 40-50 items per test.

 At least seven items per goal area.

 20 RIT point range in item difficulty per test.

 10 RIT point overlap in item difficulty between adjacent level tests.

MAP Test Specifications


For MAP, test design involves setting the following psychometric specifications:

 Size of the item pool.

 Distribution of item difficulty by the test blueprint.

 Desired length of the test.

 Scoring algorithm.

 Item selection algorithm.

These specifications allow NWEA to predetermine the expected performance of the MAP
system in terms of measurement precision before the test is administered to a single student.

Scoring—All MAP assessments employ a common scoring algorithm. During the assessment, a
Bayesian scoring algorithm is used to inform item selection. Bayesian scoring
for item selection prevents the artificially dramatic fluctuations in student achievement at the
beginning of the test that can occur with other scoring algorithms. Although the Bayesian
scoring works well as a procedure for selecting items during test administration, Bayesian
scores are not appropriate for the calculation of final student achievement scores. This is
because Bayesian scoring uses information other than the student’s responses to questions
(such as past performance) to calculate the achievement estimate. Since only the student’s
performance today should be used to give the student’s current score, a maximum-likelihood
algorithm is used to calculate a student’s actual score at the completion of the test.
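
The contrast between the two estimators can be sketched for the Rasch model as follows, again
assuming 10 RIT per logit. The normal prior, its parameters, and the scoring grid are
illustrative assumptions rather than NWEA's operational values.

    import math

    def log_likelihood(theta, difficulties, responses):
        """Rasch log-likelihood of 1/0 responses; theta and difficulties are in RIT."""
        ll = 0.0
        for b, x in zip(difficulties, responses):
            p = 1.0 / (1.0 + math.exp((b - theta) / 10.0))
            ll += math.log(p) if x else math.log(1.0 - p)
        return ll

    GRID = [130 + 0.5 * i for i in range(341)]            # candidate scores, 130-300 RIT

    def provisional_bayes_score(difficulties, responses, prior_mean=200.0, prior_sd=15.0):
        """Posterior-mode (Bayesian) estimate: the log-likelihood plus a normal
        log-prior, which keeps early-test estimates from swinging wildly."""
        return max(GRID, key=lambda t: log_likelihood(t, difficulties, responses)
                   - 0.5 * ((t - prior_mean) / prior_sd) ** 2)

    def final_ml_score(difficulties, responses):
        """Maximum-likelihood estimate based only on the day's responses."""
        return max(GRID, key=lambda t: log_likelihood(t, difficulties, responses))

    # Hypothetical early-test state: two items answered, both correctly. The ML
    # estimate runs to the top of the grid, the kind of swing the prior damps; the
    # ML estimate is used only once the full test has been completed.
    items, answers = [205.0, 212.0], [1, 1]
    print(provisional_bayes_score(items, answers), final_ml_score(items, answers))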

Item selection—All MAP assessments employ a common item selection algorithm. The item
selection algorithm works by initially identifying the ten items within the first content area
that provide the most information concerning the student. These items are the items with
difficulties closest to the current achievement level estimate for the student. After the ten
items are selected, one of the ten is selected at random and administered to the student. By
targeting items in this manner, NWEA maximizes the information obtained from each item.
This maximizes the efficiency of the assessment while also balancing the usage of the items
with similar psychometric characteristics. Once the item is administered, the process repeats
itself, except that only items from the second goal area are selected. This continues until an
item from each of the major goal areas has been administered, at which point an item from the
first goal area is selected once again.
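
A rough sketch of this randomesque selection follows. The pool structure and goal names are
hypothetical, and constraints described elsewhere in this manual (such as excluding items a
student answered on a previous test event) are omitted.

    import random

    def select_next_item(pool, goal_area, theta_rit, administered_ids, k=10):
        """Randomesque item selection: find the k unused items in the goal area whose
        difficulties are closest to the current achievement estimate, then administer
        one of those k at random.

        pool -- list of dicts such as {"id": 17, "goal": "algebra", "difficulty": 223.0}
        """
        candidates = [item for item in pool
                      if item["goal"] == goal_area and item["id"] not in administered_ids]
        candidates.sort(key=lambda item: abs(item["difficulty"] - theta_rit))
        return random.choice(candidates[:k])

    # Hypothetical pool covering three goal areas; goals are cycled one item at a time
    goals = ["number sense", "algebra", "geometry"]
    pool = [{"id": i, "goal": goals[i % 3], "difficulty": 180 + (i % 60)} for i in range(180)]
    seen, theta = set(), 215.0
    for turn in range(6):
        item = select_next_item(pool, goals[turn % len(goals)], theta, seen)
        seen.add(item["id"])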

Item pool structure—To provide each student a challenging test with an accurate score, the
MAP system requires the use of large pools of items with a difficulty range appropriate for all
of the students being tested. MAP item pools generally contain 1,200-2,400 items. The
distribution of item difficulty for a good item pool reflects the distribution of student
achievement so that each student being tested is challenged, and therefore measured
accurately. MAP item pools have items that challenge virtually every student tested.

Another important factor in item pool design is the distribution of item difficulty in each
reported goal area. If there is a gap in the pool of items in a major goal area at a particular
difficulty level, students at that achievement level are administered less than optimal items.
These items might be slightly easier or more difficult than desired. This reduces the amount
of information obtained from each item, thus reducing the test’s efficiency and precision. To
maintain consistency across goal areas and difficulty levels, the MAP system contains a pool
sculpting tool that maximizes the fit of the item pool to measurement and reporting needs.

Test length—As in ALT, the test length for MAP is determined during test design. The
desired precision of the assessment is the most important factor in determining test length.
Due to its adaptive nature, MAP tends to be slightly more efficient for most students than
ALT. This combination of efficiency and flexibility allows for a fairly standardized test length
for MAP. Generally speaking, MAP assessments that include goal score reporting contain 40
to 50 items.

Content Definition
One of the most important steps in developing a high-quality assessment is to make sure that
the content of the assessment is a valid representation of the content that is to be taught in
the classroom. There are two aspects to defining the content for an educational achievement
test. The first aspect is defining the curriculum, which is a complete description of the
content taught in the classroom. Almost all NWEA agencies have a pre-existing documented
curriculum that can be easily used in the test development process.

Having identified the content to be included in instruction, the next step is to identify how
it is sampled for the test. This is done by detailing the goals and objectives that make up the
span of content that could be included in a reasonable assessment. Figure 11 shows an
example of a goals and objectives document for a reading test aligned to the NWEA
Learning Continuum.
Figure 11: A sample goals and objectives document for a reading test

1. Word Meaning
– A. Use context clues
– B. Use synonyms, antonyms and homonyms
– C. Use component structure
– D. Interpret multiple meanings
2. Literal Comprehension
– A. Recall details
– B. Interpret directions
– C. Sequence details
– D. Classify facts
– E. Identify main idea
3. Inferential Comprehension
– A. Draw inferences
– B. Recognize cause and effect
– C. Predict events
– D. Summarize and synthesize
4. Evaluative Comprehension
– A. Distinguish fact and opinion
– B. Recognize elements of persuasion
– C. Evaluate validity and point of view
– D. Evaluate conclusions
– E. Detect bias and assumptions

No assessment measures all elements of a curriculum, nor should an assessment try. A good
curriculum contains a mix of elements that may be best assessed by formal assessments,
classroom observation, evaluation of major projects, and a variety of other methods. For
example, listening skills can be assessed only clumsily by a formal test. It is
necessary for educational agencies to identify the specific elements of the curriculum that the
test will include.

To accomplish this, the agency needs to specify the percentage of test items to be included
on the test from each curricular goal and sub-goal. Most MAP and ALT assessments have
about four to eight goals with five or six sub-goals each, and contain between 40 and 50
items. Some educational agencies base the distribution on criteria such as the emphasis given to
certain content within the overall curriculum. This creates the test blueprint to be used for item
selection. Like the blueprint that an architect follows, the specification of goals and objectives
provides a plan that guides the rest of the test development process.


Item Selection
The next step in the test development process is to select the items to appear on the
assessments. During this step, the test blueprint and test design are used as a guide to select
the items from the item banks that comprise the test or item pool.

ALT—To select items for ALT, item selection typically occurs at a single meeting. During
this meeting, teachers and other stakeholders are instructed on the principles that guide item
selection. The group reviews the items in the item bank and hand-picks the items
individually for each level test of the ALT series. By conducting item selection in this
manner, organizations can customize the assessment content to local requirements.

The principles that guide ALT item selection are fairly straightforward. Each level test of the
ALT series must have a set of items with a particular range and distribution of difficulty. In
addition, the test blueprint needs to be followed precisely. Each level test must contain at
least seven items per major goal. In an effort to provide even greater content specificity,
organizations are instructed to include a wide variety of sub-goals as well. Lastly, the items
are reviewed to ensure that there are no questions that give away information needed to
answer other questions.

MAP—Item selection for MAP is quite different. Rather than selecting the individual items
to comprise a test, it is necessary to construct a pool of items that the computer selects from
during the test administration. MAP item pools can have between 1,200 and 2,400 active
items. Since these pools are so large, it is more efficient to select items using automated
processes.

During item selection, NWEA staff members work with an agency’s test blueprint to select
the index categories that best match the test blueprint. Staff members familiar with the
indexes and the item banks review each of the selected index categories and assign each index
to a major goal of the test blueprint. From there, all items associated with the index in this
manner become candidates for use. Depending on the size of this candidate population, either
all of the items are packaged for use or the population is reduced with a pool sculpting tool
designed to create a subset of items that has optimal measurement characteristics and fit to the
test blueprint.


Test Construction
The final phase of the test development process is to construct the tests. The purpose of this
phase is to package the items into deliverable tests, and where necessary, set some of the test
specifications that were determined during test design. In addition to constructing the tests,
the tests are reviewed as a step in quality control. Although the actual test construction is
executed entirely by NWEA staff, the review process is a joint effort between NWEA staff
and the NWEA partner agency.

ALT—For ALT, two documents are created during test construction. The first is the actual
test form. This document contains an introduction to the test, the instructions for the test,
the test identification information, and the text and response options for each of the items.
The second product created during ALT test construction is the series of files, called TPS
files, that contain the information necessary for the NWEA Scoring and Reporting Software
(SRS) to function.

In creating the test form, the items chosen during item selection are placed into the sequence
in which they will appear. This initial draft has the items that will appear on the test and
additional information that will not appear on the final form such as the correct answer, the
item difficulty, and the item identification number. ALT forms are arranged so that the
easiest items appear toward the beginning of the test and the hardest items appear toward the
end.

Once this initial draft is created, it is reviewed for content and localization by the educational
agency. The content review ensures that all important content is included and well balanced
in the assessment. It also allows the agency to assure that the geographic references and the
names of people appearing on the tests are appropriate for its students. During this stage, final item
substitutions are made to correct any problems identified.

Once the initial draft has been reviewed and approved, a print master proof is produced.
This master is reviewed by both NWEA staff members and the educational agency to ensure
that all graphics and formatting issues are resolved, such as widows and orphans, pagination,
textual font, and margins. Once the test form is ready, a final print master of the test form is
produced.

The TPS file is the other document produced for each test, and its development parallels the
development of the test form. At each stage in the process, a TPS file is produced that
matches the current version of the test form. This file contains the item identifiers, the
answer key to each of the items, the names of the goals and objectives, and the unique goal to
which each item is associated. In addition, the raw-score-to-RIT-score conversion table is
included.

MAP—For MAP, the test construction phase entails the creation and packaging of several
computer files that are used by the test administration software during the test event. Among
these is the database that includes the item pool with the text, response options, answer key,
and calibrated item difficulty for each of the items in the pool. There are also files containing
the test specifications such as the number of items to be administered, the blueprint, bitmap
graphics, audio files, reading passage files, and the on-screen calculator.

Once the test files are created, NWEA staff members check the test by taking three sample
tests: one simulating high performance, one simulating low performance, and one simulating
average performance. During these sample tests, a thorough inspection of the tests’
functionality occurs. The item selection algorithm is checked, the scoring routine is checked,
and the goal scoring routine is checked. In addition, the data being collected during the test
are examined for completeness and accuracy. Upon completion of the reviews, the test
construction phase is complete and the test is ready to be administered.

Test Administration
NWEA helps each agency administer the assessments in a manner that is efficient,
psychometrically sound, and fair to each student. Although complete details about the
administration of NWEA assessments can be found in the MAP administration training
materials, an overview of the technical aspects of the administration process is provided next.

ALT—Prior to test administration, the appropriate level test form for each student is
identified to ensure that the test is appropriately challenging and informative. If the test is
too easy, students will be bored and will not demonstrate what they know because the test
questions are not difficult enough. If the test is too hard, students will be frustrated, and
again, will not demonstrate what they know because the test questions are too difficult.

To identify a challenging test for a student who has previously taken NWEA assessments, the
SRS software is used. Some districts choose to conduct this process themselves, while others rely
on the expertise and resources of NWEA to conduct this step on their behalf. The SRS software
uses a student’s valid test scores from the previous three years in the assessed subject to
identify the predicted score for the upcoming test with the following procedure:

1. The test score from each previous term and year is converted to a standardized score.
This score identifies how far above or below the mean a student scored for the term
and grade in which the test was taken.

2. The most recent of these standardized scores is duplicated so that it has twice the
weight of any other score.

3. The average of all of these standardized scores is calculated. This averaged
standardized score tells us how many standard deviations the student ranks above or
below the district average over the last few tests.

4. In order to obtain the predicted score for the current year, the average standardized
score is multiplied by the standard deviation of scores in the student’s current grade.
This result is added to the average score for the student’s current grade. This
provides a predicted score for the student that is as much above or below the average
district performance as the student’s past performances.
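
The four steps above can be illustrated with a small worked sketch. The district means,
standard deviations, and prior scores used here are hypothetical; only the arithmetic follows
the description in the text, and the second function anticipates the level-selection step
described in the next paragraph.

    def predict_score(prior_scores, current_mean, current_sd):
        """Predict a student's upcoming RIT score from (score, term mean, term sd)
        tuples listed in chronological order, double-weighting the most recent term."""
        z_scores = [(score - mean) / sd for score, mean, sd in prior_scores]
        z_scores.append(z_scores[-1])                  # duplicate the most recent score
        average_z = sum(z_scores) / len(z_scores)
        return current_mean + average_z * current_sd

    def pick_level(predicted_score, level_mean_difficulties):
        """Choose the level test whose mean difficulty is closest to the prediction."""
        return min(level_mean_difficulties, key=lambda mid: abs(mid - predicted_score))

    # Hypothetical history: (student RIT, district mean, district SD) for three terms
    history = [(204, 201, 12.0), (209, 206, 12.5), (214, 210, 13.0)]
    predicted = predict_score(history, current_mean=214, current_sd=13.5)
    print(round(predicted, 1), pick_level(predicted, [160, 170, 180, 190, 200, 210, 220, 230]))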


Once the predicted score is obtained for the student, SRS selects the level test that is closest
in mean difficulty to the student’s predicted score. This information is printed on an answer
sheet for the student. If a student’s teacher has a strong reason to believe that the student has
changed markedly in achievement since the last test, then he or she may wish to administer a
different level test.

If a student is being administered her or his first ALT test in a given subject, then a short
locator test must be administered to determine which level test to provide to the student. A
locator test is a short wide-range test (containing 16 items) that determines which level test is
appropriate for a student.

MAP—Since MAP is able to provide a test that is consistently challenging to all students
regardless of their achievement level, agencies need not concern themselves with selecting the
appropriate test form for each student. The MAP system selects the first item to be given to a
student based on the student’s past performance or the grade level in which the student is
enrolled. Specifically, the first item is selected so that it is 5 RIT points below the student’s
previous performance. If no previous score is available, the first item is selected so that it is 5
RIT points below the mean performance for students in the same grade from the norming
study. If no grade level mean is available, the difficulty of the first item is set to an arbitrary
value based on the subject and grade.

The Testing Environment


Test proctors conduct test administration and do so only after they have been trained by
their assessment coordinator or NWEA staff member. Classroom teachers are generally the
proctors for ALT. For MAP, however, the proctor is typically not the classroom teacher. The
proctor can be a technical assistant, educational assistant, or some other staff member
assigned and trained for that task. NWEA encourages the classroom teacher to be in the lab
with the proctor.

Trained proctors have four tasks to perform when administering a test:

 Ensure that the testing environment meets administration standards.

 For ALT, provide each student with the appropriate test and answer sheet.

 For MAP, ensure that each student is at the correct computer.

 Provide the test instructions to the students.

 Monitor students while they take the test.

The first task that test proctors perform as part of test administration is to ensure that the
testing environment meets administration standards. The proper testing environment helps
to ensure that the test-taking experience is consistent and the results are fair and accurate.
Since each educational agency has different facilities and different resources available for
testing, the testing environments in each agency differ as well. By enabling proctors to make
the testing environment secure, comfortable, and free from distractions and inappropriate
resources such as textbooks, notes, and instructional posters, the setting provides all students
with an equal opportunity to perform on the assessment.

In order to make the assessments fair for all students, those students with an Individualized
Education Plan (IEP) may be granted special modifications that should be planned for ahead
of time. There are six types of modifications that can be granted to students with an IEP:

 Changes in the timing or scheduling of the assessment.

 Changes in how the test directions are presented.

 Changes in how the test questions are presented.

 Changes in how the student responds to test questions.

 Changes in the test setting.

 Changes in the references and tools that are provided during the test.

Proctors also provide the test instructions to students and monitor the students as they
complete the test. Proctors are provided with specific test instructions that they read to the
students before taking the test, modifying the text only when it does not pertain to either the
test takers or the testing environment.

Once testing begins, the proctor monitors students to ensure that they work independently,
do not disturb each other, and do not attempt to cheat on the test. Proctors are explicitly
instructed not to help students with any problems or to read problems to students unless that
is part of a student’s IEP testing modification.

Proctors are instructed to invalidate the test if a student:

 Copies or receives verbal help from another student.

 Answers randomly without reading the questions.

 Refuses to take or continue the test.

 Seems unable to comprehend directions or questions.

 Exhibits disabling anxiety.

There are three administration specifications that are worthy of note. First, there is no time
limit placed on students for the administration of the assessment regardless of whether it is
an ALT or a MAP test. Second, students are not permitted to skip questions on MAP tests,
nor can they return to earlier questions. Last, the MAP system tracks the items that students
answer. Each time that a student takes the exam, the MAP system ensures that only fresh
items, items never taken previously by the student, are administered. The MAP system
allows up to four test administrations per student per year per test.

During the administration of a MAP test, the test proctors are responsible for navigating the
computer through test start up, student breaks, and unexpected system failures. Before a
student sits down to take the test, the testing district or agency must submit a Class Roster
File (CRF) to NWEA with all of the pertinent identifying information for each student. This
includes the student IDs, names, and grade levels. Once these data are processed by NWEA, they
are transmitted to the test sites and proctors can schedule tests for students. When students are
ready to test, proctors must log into each student’s workstation and load the test. This entails
locating the student name in the CRF and initiating the test session. Once the start-up
screen appears, the student can begin testing.

Students normally complete each test without interruption. However, if a student needs to
take a break, he or she must raise his or her hand and indicate this to the test proctor.
Proctors then pause the test at the student’s workstation. Upon completing the break, the
proctor resumes the test for the student. When the test is resumed, the student is presented
with a different item than the one that was displayed on-screen prior to the break. This is
done to prevent students from looking up answers while taking their break.

In the event of a system failure, such as a power failure, the proctor can resume tests that
were in progress prior to the failure. The MAP system is designed in such a way that data are
written to the test record following the completion of every item. This design means that no
item responses are lost. When a proctor restarts the system and the student’s test session, the
proctor is automatically given the option to resume the student’s test where the student left
off or to start a new test. This decision is left up to the test proctor, who may choose to
initiate a new test if the student was only at the beginning of an assessment or may choose to
resume the test if the student was well into the test.

Sometimes a student may not complete the assessment during a scheduled test session. For
example, a student may become ill and need to leave school. The proctor can terminate the
test with the option to resume the test later. If the proctor chooses to not make the test
resumable, the terminated test is invalidated and a new test is generated.

Once a student completes a MAP assessment, the student’s overall RIT score and goal scores
are displayed on the screen. The proctor may print these scores if he or she chooses, but this
is not necessary since the scores appear later within various reports. After the student reviews
the scores, the MAP system advances to the next test assigned to this student or to the student
selection screen.

Several different mechanisms are in place to ensure the technical quality of the MAP
administration. During training, proctors are instructed to complete an Item Report Form
whenever they encounter a questionable item. These reports are monitored and adjustments
are made to the system as needed. Upon any modifications to the administrative software or
systems, tests are repackaged and go through the quality assurance process described above.
This process includes thorough testing of the scoring algorithms and item selection
algorithms.

Although test administration facilities may differ slightly from one agency to another,
standardization of the test administration procedures is simple and straightforward. NWEA
provides test proctors with the training and resources necessary to administer the MAP and
ALT assessments in a manner that promotes optimal performance and is fair to all students
taking the assessments.


Test Scoring and Score Validation


Test scoring and score validation follow test administration. It is the responsibility of NWEA
and the educational agency to ensure that test scores are accurate. Since scoring and
validation are done using computerized software, these systems have been rigorously tested to
ensure that scores rendered from the system are calculated and displayed correctly.

As with other IRT-based testing systems, the scoring procedures are somewhat different from
those in more traditional testing systems. The proportion of items answered correctly does not,
by itself, provide useful information. Rather, a computer calculates the students' scores taking
into account not only the students' performance on the test items but also the difficulty of the items
that each student is administered. In a manner similar to a diving or gymnastics competition,
the student’s score is based on the difficulty of the items attempted and the student’s
performance on each item.

Achievement Level Tests (ALT)


For ALT, students’ answer sheets are scanned and scored using the SRS software. Some
districts choose to conduct this process themselves, while others rely on NWEA to conduct
this procedure.

Before scoring the tests, the test identification information on the answer sheets is validated.
The agency administering the test is notified of any incomplete or erroneous student
identifiers, student names, student grade levels, or ALT test levels. After these errors are
corrected, the answer sheets are scanned and scored.

Today's scanning technology is highly accurate. There are occasions when a student fails to shade
in a response darkly enough for the scanner to detect it; apart from such cases, the scanning
equipment converts the answer sheet data into digital data with very few errors. The only time
that human intervention is needed in the scanning process is when a student has obviously
filled in the wrong section of the answer sheet, in which case the test data are transferred to
the appropriate section.

The scoring algorithm contained within the SRS is also quite accurate. This algorithm
compares student responses to the key for the items and identifies whether the student
answered the item correctly or incorrectly. Anytime updates to the SRS are made, sample
tests are scored using the new system. The accuracy of the new system is verified when these
scores are compared to known scores.

NWEA has additional procedures that can validate the key for each item. These procedures
compare each item’s known statistics to the statistics that are calculated from the data being
scored. Any items with incongruent statistics are investigated to assure that the key for the
item is coded correctly. Any items with invalid keys are rescored using the correct key.

After the tests are scored, the results are validated. A student’s ALT score is invalidated if:


 The percentage of items that the student answered correctly is less than or equal to
the percentage that would be obtained by chance guessing plus five percent (25% for
mathematics and science and 30% for reading and language usage).

 The percentage of items that the student answered correctly is greater than or equal
to 95%.

 The SEM is greater than 5.3 RIT points.

 The student omitted more than half of the items.

All students who receive invalid test results are flagged for retesting. The SRS produces a list
of these students. Agencies are instructed to provide these students with a different test from
the ALT test series. If students have underachieved on the test they were provided, they are
administered a test two levels below their first one. Students who overachieve are provided
with a test two levels above their first one. Agencies typically test any students who were
absent during the initial testing along with the students who require retesting. Once all
students have been tested or retested, the scoring process for these students happens as it did
the first time around.

To summarize, ALT scoring and validation is a three-stage process. First, the student and test
identifying information is validated. Second, the test data are scanned and scored. Third, the
test scores are validated. Any students who have invalid test scores are then retested and their
test results are scored. Once all tests have been scored, reports can be generated.
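
As a rough illustration, the invalidation rules and retesting directions above could be
expressed as follows; the function and its field names are hypothetical.

    def validate_alt_score(subject, pct_correct, pct_omitted, sem_rit):
        """Apply the ALT invalidation rules described above. Returns the retesting
        direction for an invalid score, or None if the score appears valid."""
        chance_plus_five = 25.0 if subject in ("mathematics", "science") else 30.0
        if pct_correct <= chance_plus_five:
            return "retest two levels lower"       # performance at or near chance
        if pct_correct >= 95.0:
            return "retest two levels higher"      # the test was too easy
        if sem_rit > 5.3:
            return "retest: SEM too large"
        if pct_omitted > 50.0:
            return "retest: too many omitted items"
        return None

    print(validate_alt_score("reading", pct_correct=28.0, pct_omitted=5.0, sem_rit=3.1))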

Measures of Academic Progress (MAP)


For MAP, student scores are calculated by the computer during test administration. In
addition, the computer rescores the test event to double-check its correctness immediately
following the assessment. Shortly after the assessment, the results are transmitted back to the
central MAP database, and score validation takes place during this data collection process. A
MAP score is invalidated if:

 The student completed the test in less than six minutes.

 The SEM is greater than 5.5 RIT (unless the student’s score is greater than 240
RIT).

 The SEM is less than 1.0 RIT, which is an indicator of some kind of technical
difficulty with the assessment.

In addition to calculating the students’ overall scores, both MAP and ALT assessments
provide additional score information for each assessment:

 RIT Score—A RIT score is an objective indicator of a student's overall achievement
level in a particular subject. Although theoretical RIT scores range in value from
negative infinity to positive infinity, typical scores fall between 150 and 300. RIT
scores are equal-interval in nature, meaning that the distance between 150 RITs and
151 RITs is the same as the distance between 230 RITs and 231 RITs.

 SEM—The SEM is a measure of the accuracy or precision of an assessment.
Assessments that are more accurate have a smaller SEM. The SEM of an NWEA
assessment is calculated using maximum-likelihood procedures. Although the SEM
can theoretically range from zero to infinity, typical values fall between 2.5 and 3.5
RIT points for an ALT or MAP test.

 RIT Range—The SEM is used to calculate what a student’s expected score would be
on repeated testing. The range of expected scores is called the RIT range. If the SEM
of a score is three RIT, then the student has a 68% chance of scoring within +/-
three RIT points of his or her RIT score.

 Goal Performance—A student’s achievement level in each of the goal areas of the
test is calculated. This is done using only the student’s performance on items from a
single goal area. Since there are so few items administered in a single goal
(approximately seven items per goal), goal scores have a relatively high SEM. It is for
this reason that there are only three possible goal scores, HI, AV, and LO. Goal
performance of LO means that the student is performing at the 33rd percentile or
lower. Goal performance of AV means that the student is performing between the
34th and 66th percentile. Goal performance of HI means that the student is
performing at or above the 67th percentile.

 Percentile Rank—The percentile rank is a normative statistic that indicates how


well a student performed in comparison to the students in the norm group. The
most recent norm sample was a group of approximately 1,000,000 students from
across the United States. A student’s percentile rank indicates that the student scored
as well as, or better than, the percent of students in the norm group.

 Percentile Range—The percentile range includes the percentile ranks that are
included in the RIT range. As a result, there is a 68% probability that a student’s
percentile ranking will fall within this range if the student tested again relatively
soon. The percentile range is often asymmetric due to the fact that percentile ranks
are not an equal interval measure of performance.

 Lexile Score—This score is only provided on reading assessments. The Lexile score
is an assessment of the student’s performance on the Metametrics Lexile framework
scale that can be used to assist selection of appropriate reading materials for the
student. Lexile scores are calculated directly from the student’s RIT score using a
transformation identified in a series of research studies. More information on the
Lexile framework can be found at www.lexile.com.
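To make the RIT range, percentile range, and goal performance definitions above concrete, the sketch below shows how they could be computed from a RIT score and its SEM. The norm lookup table and its values are hypothetical placeholders (the actual values come from the RIT Scale Norms document); the 33rd and 67th percentile cut points are the ones stated above.

    # Hypothetical percentile lookup for a single subject and grade; real values
    # come from the RIT Scale Norms (NWEA, 2002) document.
    NORM_TABLE = {170: 10, 180: 21, 190: 35, 200: 50, 210: 65, 220: 79, 230: 90}

    def rit_range(rit, sem):
        """68% expected-score range: one SEM on either side of the RIT score."""
        return rit - sem, rit + sem

    def percentile_rank(rit):
        """Nearest-entry lookup in the (hypothetical) norm table."""
        nearest = min(NORM_TABLE, key=lambda k: abs(k - rit))
        return NORM_TABLE[nearest]

    def percentile_range(rit, sem):
        """Percentile ranks at the two ends of the RIT range; usually asymmetric."""
        low, high = rit_range(rit, sem)
        return percentile_rank(low), percentile_rank(high)

    def goal_performance(goal_percentile):
        """LO at or below the 33rd percentile, HI at or above the 67th, AV between."""
        if goal_percentile <= 33:
            return "LO"
        if goal_percentile >= 67:
            return "HI"
        return "AV"

    print(rit_range(205, 3))         # (202, 208)
    print(percentile_range(205, 3))  # (50, 65) with this placeholder table
    print(goal_performance(72))      # HI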


Report Generation

Achievement Level Tests (ALT)


Once ALT test data are scored and validated, SRS can generate reports for a teacher, school,
or district that summarize the data for an individual student, class, school, or district. A
complete description of the ALT reports that are available is found in the Scoring and
Reporting System User’s Manual (NWEA, 1996).

The completeness and accuracy of the reports are largely dependent on the data provided by
NWEA agencies through the Student Master Files (SMF). These files tell the SRS which
students are in each class, school, and district. Most errors produced by the SRS stem from
errors in the information provided on the SMF and the student answer sheets. Whenever an
agency reports an error, SMF information can be reprocessed and reports can be reproduced.
In order to minimize these types of errors, each test site is provided with a copy of the SMF
specifications and given training in their use.

To ensure the confidentiality of the reports and all test-related information, NWEA uses
reputable couriers—Federal Express and UPS—and tracks all documents that are shipped to
agencies.

Measures of Academic Progress (MAP)


Once MAP test data have been scored and validated, educators can access the results through
NWEA’s secure reports website. Prior to the close of a testing season, teachers can view the
results of the assessments for all students in their class through the Teacher Report. After the testing season closes, teachers can view detailed information about each student on the Individual Student Progress Report. Teachers can also view information on every student in their class in the Class Report. Administrators can view information on students, classes, grades, or schools by ordering a Class Report, Grade Report, or District Summary Report.

As with the SMF process, the accuracy of MAP reports is largely dependent on the data
provided by the agency administering the tests. The MAP system uses the Class Roster File (CRF) for student
class information and the Special Programs File (SPF) to identify all of the special programs
in which each student participates. Each MAP agency is trained concerning the details of
completing these files. Additionally, files coming from the MAP agencies are inspected upon
receipt. Although these validations prevent many errors, there are still students who are listed
in the wrong class or school. When this occurs, NWEA works with the agency to correct the
error.

In order to ensure the confidentiality of the MAP reports, the reports website has a robust
authentication system. MAP Coordinators are provided with a six-character login and password that they must use in order to enter the system. The reports website is secured via 128-bit encryption using Secure Sockets Layer (SSL). This level of security is equivalent to the
level provided by most financial and medical institutions in the United States.


Results Interpretation
“The test developer should set forth clearly how the test scores are intended to be
interpreted and used.” – Standard 1.2 APA/AERA/NCME Standards (1999).

In order to be useful, the results of assessments must be interpreted appropriately. In fact, one way that NWEA's mission, "Partnering to help all kids learn," is fulfilled is by helping member agencies interpret the reports well and use the test data to provide a better educational experience for all students. Some of the training elements and documentation that enhance the appropriate use of data are:

• A workshop series on the assessment process, from administration to longitudinal data use.

• On-line resources including the NWEA Learning Continuum, technical documentation, and research briefs.

• Publications detailing administration procedures, modifications and accommodations, and annotated reports with interpretation guidelines.

• Research, test administration, and software support (on site, telephone, and e-mail).

NWEA conducts over 800 workshops each year to train educational agencies to interpret test
scores appropriately and to help these agencies make use of the information to improve each
student’s education. Each of these workshops is designed to meet the needs of a specific
audience. Table 1 details three workshops concerning the use of assessment data.


Table 1. NWEA data assessment workshops

Stepping Stones to Using Data
  Length: 1 day
  Audience: Educators
  Topics covered:
  • Understanding test results statistics
  • Recognizing diversity in the achievement of students within a classroom
  • Using test results to develop flexible classroom groupings
  • Introduction to the Lexile Scale
  • Introduction to the NWEA Learning Continuum
  • Conferencing with students and parents about test results

Climbing the Data Ladder
  Length: 2 days
  Audience: Educators who have attended the Stepping Stones to Using Data workshop and have one fall and spring of test data
  Topics covered:
  • Using Lexile scores
  • Relating test results to the NWEA Learning Continuum
  • Understanding growth patterns
  • Setting academic goals with students and parents
  • Teaching with flexible classroom groupings
  • Relating test results to state standards
  • Using data to guide instructional practice

Leading with Data
  Length: 2 days
  Audience: District Superintendents, Principals, Curriculum Coordinators, Assessment Coordinators
  Topics covered:
  • Using test results as a measure of student growth
  • Using growth data for program evaluation
  • Using Leader's Edge: Growth Analysis Tools
  • Using growth data for school improvement

A set of on-line resources designed to help educators interpret the reports is also readily
available. The MAP reports website contains annotated sample reports that provide concise
definitions of the information provided on each report in the MAP series. Similar reports are
provided in paper form to districts using ALT reports generated from the SRS system. In
addition, the document RIT Scale Norms (NWEA, 2002) outlines the normal performance of
students taking MAP and ALT tests. The information in this document is fundamental to
using percentile scores appropriately.

Finally, one person at each test site is trained to be a contact person. Contacts are trained to
provide on-site report interpretation support to teachers, parents, students, and other
stakeholders. If the test site contact is unavailable or unable to answer questions, NWEA staff
members are available via phone or e-mail.


Customizing Assessments
One criticism of large-scale standardized tests is that they assess achievement from a regional,
state, or national perspective without capturing the nuances of each local educational system.
Noting this weakness, NWEA designed a testing system that customizes the content of
assessments to improve the pertinence of the results to local educators. At the same time, NWEA's testing system has always kept its global capabilities in view. MAP and ALT are designed to provide results that are both locally applicable and globally comparable.
This section details how this is accomplished.

Localizing NWEA Assessments


There are two primary ways in which agencies have the opportunity to make the assessment
locally applicable. They can define the content to meet their needs, and for those agencies
that desire an extremely high degree of customization, they can actually write test items for
the item banks. By defining the content to meet their needs, each educational agency makes
the assessment content more locally pertinent and facilitates reports that are aligned with the
structure and language of their local curriculum standards.

When defining the content of the assessment, agencies have the opportunity to undertake
three levels of localization:

 They can use the tests aligned to the NWEA Learning Continuum.

 They can use a test that has been aligned with their state’s standards.

 They can develop a test based entirely on their local curriculum.

The NWEA-Learning-Continuum-aligned test is a logical synthesis of the goals and objectives of a cross section of educational agencies. While this test is a good synthesis of existing content standards, it is the test that is the least localized.

Agencies wishing to choose an assessment customized more to their local needs may wish to
use a state-aligned assessment. At this point, NWEA has constructed eight different state-
aligned ALT assessments and 48 different state-aligned MAP assessments. The content for
each of these assessments was designed following the same procedure. The published content
standards established by each State Department of Education were reviewed by NWEA staff
familiar with the item banks and the item indexes.

In creating the goals and objectives for the state-aligned assessment, each major goal was titled as one of the individual content standards or a logical combination of two or more standards. Each of the sub-goals within a major goal was titled from the standards within each of the respective content standards. From there, the blueprint was created by evenly distributing the content across each major goal.

Item selection was conducted by using the item indexes to select the items from the bank
that most clearly reflected the content standards for the state, after which each item index
was mapped to a single major goal. By developing the test in this fashion, the content of a
state-aligned assessment is solidly aligned with the content standards for the state. Once
constructed and administered, it is capable of providing feedback on each student that is
useful for all educators in helping to monitor student growth toward the state’s standards.

Sometimes a local educational agency may wish to construct an assessment that is totally
customized to meet its own needs. In this situation, a unique goals and objectives document,
a unique test blueprint, and a unique test design are created. The agency may also wish to be
involved in item selection. If so, they may either select the individual items themselves (for
ALT) or work with the item indexes to select items from the item banks to construct the
item pools (for MAP).

By providing these three levels of content customization, NWEA can assist agencies with a
variety of different needs. Agencies wishing to implement a solution quickly can choose
either the NWEA-Learning-Continuum-aligned assessment or the state-aligned assessment
and have confidence that the results will be valid, reliable, and useful to them in many
different ways. On the other hand, agencies with specialized needs may further localize the
assessment.

Some agencies may desire an even higher level of customization. These agencies may write
additional test items to fill perceived gaps and to provide coverage of specific topics. The
development process for these items is identical to the development process normally
followed to add items to the item banks. Like all items that appear on MAP and ALT
assessments, these items are calibrated to the underlying measurement scales using the
psychometric techniques outlined earlier. Agencies choosing to write their own items are
limited only by their capacity when customizing assessments for their local needs.

Making Global Use of Localized Assessments


As mentioned earlier in this section, one of the fundamental benefits of using NWEA
assessments is that test results are globally comparable. That is, the results of any two
assessments of the same subject are directly comparable. These results can be from the same
student at two different testing occasions or from two different students in two different
educational settings.

You may wonder how this is possible considering the amount of effort to make the
assessments locally applicable. The psychometric ingenuity that makes this possible stems
from the use of IRT as the framework for the development of the NWEA assessments and
scales. An explanation of how this framework and the NWEA assessment system promote
global comparisons of locally applicable test results may be useful to those interested in using
test results to make comparisons of any kind, especially those beyond a single educational setting. Before proceeding, you may wish to review the section The Measurement Model, which describes the IRT framework.

A simplistic approach to global comparison is to administer the exact same test to every
student. This is not appropriate, considering the wide range of student achievement levels,
the wide variety of content taught and assessed worldwide, and the small amount of time
allowed for test administration. To overcome this situation, the testing community
developed a method whereby only a sample of appropriate items needed to be administered
to each student in order to infer the student’s overall achievement level. This method hinges
on being able to directly place the difficulty of every test item and the achievement level of
every student on a common scale.

The IRT framework allows for the creation and maintenance of the scale. It also makes it
possible to calculate the difficulty of new test items on the scale. Since each item is field
tested on a wide variety of students, the difficulty of the item on the NWEA scale is a global
value. The item difficulty is applicable to all students regardless of age, grade, achievement
level, or curriculum. In turn, the responses of all students to an individual item are directly
comparable.
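The common scale described here rests on the one-parameter (Rasch) IRT model used to build the NWEA scales (see the Appendix). The sketch below shows the standard Rasch response function, the resulting test information, and the information-based SEM; treating ability and difficulty in logits (of which the RIT scale is a linear rescaling) is a simplification for illustration, not a description of NWEA's production code.

    import math

    def p_correct(theta, b):
        """Rasch (1PL) probability that a student of ability theta answers an
        item of difficulty b correctly; both values are on the same scale."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def test_information(theta, item_difficulties):
        """Fisher information of a set of items at ability theta; under the
        Rasch model each item contributes p * (1 - p)."""
        return sum(p * (1.0 - p)
                   for p in (p_correct(theta, b) for b in item_difficulties))

    def standard_error(theta, item_difficulties):
        """The SEM is the reciprocal of the square root of the test information."""
        return 1.0 / math.sqrt(test_information(theta, item_difficulties))

Because every field-tested item carries a difficulty estimate on this common scale, any combination of items drawn from the banks yields scores on the same metric.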

Extending this, the results of assessments containing any combination of items with difficulty
estimates on the common scale are also directly comparable. Since all NWEA assessments
contain items with difficulty estimates on the common scale, the results of these assessments
are also directly comparable. By having only one IRT measurement scale for each subject
area, NWEA enables districts to construct exams using any combination of items that suits
their local needs while still having the capacity to compare the results globally.

The concepts explained above apply to the NWEA Learning Continuum as well. The
continuum was constructed by determining the average difficulty of the items pertaining to
various content topics. As explained above, the difficulty of each item is a global difficulty,
thereby allowing the average difficulty of items pertaining to a specific content area to be
globally applicable.

In summary, agencies using MAP and ALT create an assessment system containing locally
applicable content, test results, and reports by engaging in content definition, item selection,
and, sometimes, the item development process. By using IRT models in conjunction with
sound field testing processes, these tests provide globally-comparable test results and a
globally-useful learning continuum.


Operational Characteristics of the Assessments

Each NWEA assessment conforms to the tenets outlined within this document. There are four basic subject areas: mathematics, reading, language usage, and science. Within science, there are two different assessments: one is called Science Concepts and Processes, and the other is called Science Topics (Life, Earth/Space, and Physical Science). The following tables list the optimal test specifications and the optimal difficulty range for each test subject.
Table 2: Common ALT Specifications by Subject

Specification                                     Mathematics     Reading         Language usage  Science
Number of major goals                             7               6               6               6
Number of different ALT levels                    8               8               7               5
Number of items per test                          50              40              40              40
Number of response options per item               5               4               4               5
Size of the difficulty range of each level        20 RIT points   20 RIT points   20 RIT points   20 RIT points
Size of the difficulty overlap between levels     10 RIT points   10 RIT points   10 RIT points   10 RIT points
Administration time                               Untimed         Untimed         Untimed         Untimed

Table 3: The difficulty range of the test items in each level of a common ALT Test

Subject          Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8
Mathematics      Min-180   170-190   180-200   190-210   200-220   210-230   220-240   230-Max
Reading          Min-170   160-180   170-190   180-200   190-210   200-220   210-230   220-Max
Language usage   Min-180   170-190   180-200   190-210   200-220   210-230   220-Max
Science          Min-191   181-200   191-210   201-220   211-Max


Table 4: Common MAP Specifications

Specification                                     Mathematics   Reading   Language usage   Science
Number of major goals                             7             6         6                6
Number of response options per item               5             4         4                5
Number of items per test                          50            40        40               40
Minimum number of items per goal                  7             7         7                7
Minimum number of items in test pool              1500          1200      1200             1200
Minimum number of items in test pool per goal     200           200       200              200

Difficulty of initial item: 5 RIT points below the student's previous score, or 5 RIT points below the grade-level mean
Ability estimation: Bayes for item selection and maximum likelihood for the final ability estimate
Item selection algorithm: Ordered cycle of goal areas; select the 10 items from the chosen goal area that maximize test information; select one of the 10 items at random
Administration time: The test is untimed, but most schools schedule 75-minute blocks
Number of tests allowed for a single student in a given school year: 4
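The item selection algorithm summarized in Table 4 (ordered cycle of goal areas, ten most informative items, random choice among the ten) can be sketched as follows. The item structure, goal names, and information function are simplified assumptions for illustration, not the MAP implementation.

    import itertools
    import math
    import random
    from collections import namedtuple

    Item = namedtuple("Item", ["item_id", "goal", "difficulty"])

    def item_information(theta, difficulty):
        """Rasch item information at the provisional ability estimate; ability
        and difficulty are treated as being on a common logit-like scale."""
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        return p * (1.0 - p)

    def select_next_item(theta, pool, goal_cycle, administered_ids):
        """Move to the next goal area in the ordered cycle, rank its unused
        items by information at the current ability estimate, keep the ten
        most informative, and choose one of those ten at random."""
        goal = next(goal_cycle)
        candidates = [it for it in pool
                      if it.goal == goal and it.item_id not in administered_ids]
        candidates.sort(key=lambda it: item_information(theta, it.difficulty),
                        reverse=True)
        return random.choice(candidates[:10])

A goal cycle of this kind can be built with itertools.cycle over the test's major goal areas, which keeps the goal areas in balance as items are administered.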

In addition to the typical NWEA tests, NWEA has constructed a number of other tests from
the same item banks. A description of each test follows.

The Survey Test is a 20-item, fixed-length adaptive assessment that is administered on the computer via the MAP system. NWEA offers the Survey Test in each of the four subject areas, aligned with the state's curriculum. Generally, these tests take students about 20-30 minutes
to complete. This test is designed to provide a quick, overall assessment of a student’s
achievement level and does not provide goal scores.

NWEA also offers five different end-of-course assessments in mathematics. These assessments include Algebra I, Algebra II, Geometry, Integrated Math I, and Integrated Math II. Each of these assessments is available in MAP or ALT. Each test is aligned with the NWEA goal structure and contains 40 items. These assessments are designed to measure students' overall achievement at the end of a mathematics class and also provide goal area information.


Validity
“Validity is the most fundamental consideration in developing and evaluating
tests.”—Chapter 1, Standards for Educational and Psychological Testing
(APA/AERA/NCME, 1999).

Validity was defined earlier as the degree to which an educational assessment measures what
it intends to measure. From the development process, it is clear that MAP and ALT contain
content that is appropriate to measure the achievement level and growth of students in the
subjects of mathematics, reading, language usage, and science. It is also important to present
a body of evidence that indicates that the scores from MAP and ALT can be used to make
accurate statements about student capabilities and growth. In order to support this claim,
NWEA provides the following forms of validity evidence:

• Research pertaining to the equivalency of scores across the two testing modalities (ALT and MAP).

• The MAP administration training materials, which provide instructions on the appropriate and inappropriate uses of the various NWEA assessments.

• This technical manual as a document that explains the ways in which NWEA promotes valid interpretation of test results.

• This technical manual as a document that describes the processes used to ensure that the content of the exams is valid.

• This technical manual as a document that describes the processes used to ensure that test scores are constructed in a valid manner.

• Concurrent validity statistics that correlate NWEA test results with other major national or state educational assessments.

Over the years, the testing field has presented substantial evidence supporting the hypothesis
that tests administered under traditional paper-and-pencil and computerized adaptive testing
modalities can be equivalent (Zara, 1992). Kingsbury and Houser (1988) and Kingsbury
(2002) investigated MAP and ALT scores and found that they were equivalent in each study.
The difference between the test-retest correlation of ALT to ALT tests and ALT to MAP
tests was less than 0.05 and the largest observed mean difference was 1.5 RIT. These
findings are important because they provide evidence that ALT and MAP may be considered
equivalent testing forms.

A test score cannot simply be valid or invalid. Instead, a test score is valid or invalid for a
particular use. ALT and MAP survey-with-goals test scores are valid for measuring the
achievement level and growth of students. They are also valid for course placement, parent conferences, district-wide testing, identifying the appropriate instructional level for students, and screening students for special programs.


The content of MAP and ALT assessments is valid for their intended uses. The items are
written by classroom teachers. The manner in which the goals and objectives for each test are
developed promotes a high degree of alignment between the curriculum and the test content.
For ALT, teachers pick the items for their test, maximizing the relevancy of the test for their
setting. For MAP, the item selection algorithms promote the selection of a group of items
that most completely align with the goals and objectives of the test as well as the achievement
level of the students. The statistical and experimental procedures used to develop the item
bank, measurement scales, and the assessment development process are a direct result of
NWEA’s goal to create an assessment system capable of providing tests that are locally
applicable. Taken together, the item development process, test content definition process,
and test construction process provide strong evidence that the content of MAP and ALT
assessments is valid for its intended use.

The NWEA scales and the scores that stem from tests containing scaled items were
constructed in a manner that is widely accepted as valid. The Educational Assessment section
outlined the paradigm of item response theory and explained how NWEA scale and test
construction procedures are guided by this theory. The judicious use of this paradigm, coupled with additional research and experimental design, yields scales, tests, and scores with a variety of valid interpretations.

A primary element of evidence supporting the validity of the NWEA assessment scores for
their intended uses is the series of concurrent validity studies that have compared scores from
a variety of state assessments to the scores from MAP and ALT assessments. Table 5 displays
the summary of the outcomes of these studies. MAP and ALT test scores consistently
correlate highly with other measures of academic achievement for each state in which a study
has been performed.

In looking closely at the trend in correlations throughout the grades, NWEA test scores tend
to be more similar to other test scores in the upper grades than the lower grades. This is most
likely due to the increase in the reliability of scores obtained from students in higher grades.
From the data provided in Table 5, it is clear that NWEA test scores are strongly related to other major educational test scores and that they are valid for similar uses.

Overall, there is substantial evidence supporting the validity of the NWEA assessments for
measuring the achievement level and growth of students in the major subject areas. The
content is valid. The comparability of the two test administration modalities is high. The
scores are constructed in a valid manner. Test users are instructed about the appropriate uses
of the test scores. Finally, the tests are correlated with other major tests indicating that they
are valid for similar uses. NWEA has a great deal of confidence in the validity of the
assessments for their intended uses.


Table 5: Concurrent Validity Statistics for the NWEA Assessments


last updated 10/11/02
Type Grade Level
Content Area
Validity Data Set Year Term 2 3 4 5 6 7 8 9 10
Concurrent Stanford Achievement Test, 2001 spring Reading r .86 .87 .87 .86 .86 .87 .87 .82
9th Edition (SAT9) scale
N 5,550 7,840 7,771 7,724 3,832 3,885 3,557 4,759
scores and ALT scores from
same students spring Language r .78 .84 .82 .82 .82 .83 .83 .82
N 5,633 7,806 7,916 7,793 3,799 3,828 3,509 4,438
spring Mathematics r .80 .85 .85 .87 .88 .87 .87
N 5,666 7,878 7,929 7,794 3,834 3,841 3,508
Concurrent Colorado State Assessment 2000 spring Reading r .84 .87 .86
Program (CSAP) scale scores
N 3,488 3,486 6,337
and ALT scores from same
students spring Mathematics r .91
N 5,023
Concurrent Iowa Tests of Basic 1999 fall Reading r .77 .84 .80
Skills (Form K) and Meridian
N 1,456 1,473 1,373
Checkpoint Level Tests
fall Language r .77 .79 .79
N 1,441 1,466 1,397
fall Mathematics r .74 .83 .84
N 1,425 1,460 1,365

Concurrent Indiana Statewide Testing for 2000 fall Reading (ALT) r .79 .84 .86
Educational Progress-Plus /Lang Arts
(ISTEP) N 4,096 4,296 3,828
(ISTEP+) and ALT scores
from same students fall Lang Usage r .78 .82 .84
(ALT) /Lang
Arts (ISTEP) N 4,096 4,296 3,828
fall Mathematics r .74 .86 .90
N 4,133 4,299 3,829
Concurrent Washington Assessment of 1998 spring Reading r .81 .80
Student Learning and ALT N 2,286 2,271
scores from same students
spring Mathematics r .80 .85
N 2,203 2,266
Concurrent Washington Assessment of 1999 spring Reading r .75
Student Learning (grd 10, spr
N 1,003
2000) and ALT scores (grd 9,
spr 1999) from same students Mathematics r .81
N 849
Concurrent Wyoming Comprehensive 2000 spring Reading r .76 .79
Assessment System and ALT
N 1,452 1,247
scores from same students
spring Lang Usage r .60 .68
N 1,063 1,002
spring Mathematics r .79 .81
N 1,458 1,552


Reliability of Scores
In assessing the psychometric soundness of an assessment, one of the most widely used
indicators is the reliability of the assessment. Reliability is an indicator of the consistency of
test scores and is expressed in the same manner as a correlation coefficient. Possible values of
most reliability coefficients range from 0.00 to 1.00. Values in excess of 0.70 are generally
considered acceptable and values above 0.90 are considered good.

The reliability of MAP and ALT scores has been calculated in two different manners. One
method uses marginal reliability (Samejima, 1994), which may be applied to any tests
constructed using IRT. Marginal reliability is one of the most appropriate methods of
calculating reliability for adaptive tests. It uses the test information function to determine the
expected correlation between the scores of two hypothetical tests taken by the same student.
It also allows the calculation of reliability across multiple test forms, as in ALT and MAP.
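One common, closely related way to obtain a marginal reliability of this kind is to treat it as the proportion of observed score variance that is not error variance, averaging the squared standard errors over students. The short function below illustrates that idea; it is offered as a sketch of the general approach, not as NWEA's exact procedure.

    def marginal_reliability(scores, sems):
        """Approximate marginal reliability:
        (observed score variance - mean error variance) / observed score variance."""
        n = len(scores)
        mean_score = sum(scores) / n
        observed_var = sum((s - mean_score) ** 2 for s in scores) / (n - 1)
        mean_error_var = sum(se ** 2 for se in sems) / n
        return (observed_var - mean_error_var) / observed_var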

Marginal reliability is, therefore, a very appropriate measure of the overall psychometric
soundness of the NWEA assessments. Table 6 displays the marginal reliability for the three
major subjects in the NWEA assessment suite by grade level. All of the reliabilities are
between 0.89 and 0.96. The reliability of the MAP assessments slightly exceeds that of ALT.
This is not unexpected, given the continuously adaptive characteristics of the MAP system.
The reliability of the MAP and ALT assessments is consistently high across all subjects and
from grades two through ten.

The test-retest reliability of the ALT and MAP assessments has also been investigated in
several large-scale studies. As the name suggests, test-retest reliability is the correlation
between the scores of two different tests taken by the same student. NWEA districts test
students multiple times throughout their educational career. The correlation between the
pairs of scores of students from spring to fall, spring to spring, and fall to spring can
therefore be calculated.
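The coefficients reported in Table 7 are ordinary Pearson correlations between matched pairs of scores; a minimal sketch with hypothetical fall and spring scores for five students follows.

    def pearson_r(x, y):
        """Pearson correlation between paired score lists (e.g., fall and
        spring RIT scores for the same students)."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    fall = [192, 201, 188, 210, 199]      # hypothetical fall RIT scores
    spring = [198, 206, 195, 214, 203]    # hypothetical spring RIT scores
    print(round(pearson_r(fall, spring), 3))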

These correlations serve as a test-retest reliability indicator over a long period of time. This
provides an indicator of score consistency throughout the grade spans covered by the
assessments. Table 7 displays the test-retest reliability of the NWEA assessments. Values
range between 0.79 and 0.94 for all test-retest pairs except for those that involve second
graders. It is generally known that the assessment of second graders is inconsistent for many
reasons including the mixed reactions that these students have to taking multiple-choice test
items. For these reasons, NWEA test users should expect less consistency in the test scores of
second graders.


Table 6: Marginal Reliability Estimates for NWEA Assessments


last updated 10/11/02
Type Test Grade Level
Content Area
Reliability Data Set Year Type Term 2 3 4 5 6 7 8 9 10
Marginal NWEA Norms Study 1999 MAP Reading Fall r .95 .95 .95 .94 .94 .94 .94 .94 .94
- (Source for means (Surv w/ Goals) N 4,662 39,590 39,960 40,671 35,508 36,318 34,121 7,620 1,639
and standard
Spring r .95 .95 .94 .94 .94 .94 .94 .93 .94
deviations to
N 10,308 48,566 52,602 54,254 52,696 53,679 43,600 16,619 3,829
calculate marginal
reliabilities) Mathematics Fall r .92 .93 .94 .94 .94 .94 .95 .95 .95
N 4,511 37,022 37,237 37,933 33,131 33,664 31,742 7,910 3,313
Spring r .93 .94 .94 .94 .95 .94 .96 .96 .95
N 9,863 47,635 52,580 53,753 52,581 53,631 43,093 16,725 5,583
Lang Usage Fall r .94 .94 .94 .94 .94 .94 .94 .93 --
N 4,292 20,769 21,593 21,980 20,035 19,869 18,630 3,553
Spring r .94 .94 .94 .94 .94 .94 .93 .92 --
N 4,758 19,676 23,167 25,304 23,389 24,290 21,038 5,914
Marginal NWEA Norms Study 1996 ALT Reading Spring r -- .94 .94 .93 .93 .93 .94 .90 --
N 24,623 25,447 27,512 29,664 26,500 24,676 5,045
Mathematics Spring r -- .93 .94 .94 .94 .95 .95 .94 --
N 27,190 28,628 30,109 32,147 28,244 27,380 5,261
Lang Usage Spring r -- .93 .93 .91 .91 .91 .92 .89 --
N 8,954 9,591 9,810 7,587 7,645 8,344 1,641

Table 7: Test-Retest Reliability Estimates for NWEA Assessments

Type Test Grade Level


Content Area
Reliability Data Set Year Type Term 2 3 4 5 6 7 8 9 10
Test-Retest NWEA Norms Study 1999 ALT Reading Fall to r .76 .85 .88 .89 .89 .89 .89 .84 --
Spring N 4,253 27,460 30,091 34,525 30,079 28,386 26,190 9,231
Mathematics Fall to r .70 .79 .86 .89 .91 .93 .93 .87 .82
Spring N 4,177 26,522 30,100 34,073 29,730 28,077 24,432 8,788 1,598
Lang Usage Fall to r .77 .85 .89 .89 .90 .90 .90 .87 --
Spring N 3,795 14,173 17,285 19,037 16,825 16,822 15,991 3,514
Test-Retest NWEA Norms Study 1999 ALT Reading Spring r .87 .88 .89 .89 .89 .87 .85 .84 --
to Fall N 4,632 15,472 16,106 15,517 15,003 14,299 3,752 1,315
Mathematics Spring r .79 .84 .87 .91 .91 .92 .89 .89 --
to Fall N 4,585 15,456 16,682 15,302 14,739 13,540 3,864 1,612
Lang Usage Spring r .89 .89 .90 .90 .90 .89 .88 -- --
to Fall N 3,749 10,596 11,223 10,623 10,853 10,667 1,445
Test-Retest NWEA Norms Study 1999 ALT Reading Sprg - r .81 .85 .89 .87 .88 .87 .84 .84 --
Sprg N 6,326 22,908 22,294 24,085 26,813 23,756 6,709 2,576
Mathematics Sprg - r .72 .82 .87 .89 .91 .91 .83 .85 --
Sprg N 6,654 23,318 23,183 24,117 26,964 23,828 6,565 2,732
Lang Usage Sprg - r .84 .86 .88 .89 .89 .89 .87 -- --
Sprg N 3,749 10,488 11,035 10,386 11,151 10,101 1,588
Test-Retest NWEA Norms Study 2002 ALT & Reading Fall to r .80 .87 .90 .91 .91 .91 .91 .90 .92
MAP Spring N 5,470 48,033 53,797 55,451 52,257 52,804 46,925 14,798 3,121
Mathematics Fall to r .77 .84 .88 .91 .93 .94 .93 .90 .89
Spring N 5,963 49,806 54,971 56,500 54,325 53,730 46,425 8,971 1,410
Lang Usage Fall to r -- .88 .90 .91 .92 .92 .92 .91 .90
Spring N 35,994 38,970 38,747 36,826 38,350 33,513 11,393 2,590
Test-Retest NWEA Norms Study 2002 ALT & Reading Sprg - r .87 .89 .90 .91 .91 .90 .89 .86 .84
MAP Sprg N 18,512 50,241 50,782 52,507 54,207 44,580 10,684 2,621 1,790
Mathematics Sprg - r .83 .87 .90 .91 .93 .93 .85 .79 --
Sprg
N 19,467 50,536 51,322 53,357 54,170 43,956 12,905 4,939
Lang Usage Sprg - r .89 .89 .90 .91 .91 .92 .90 .89 .88
Sprg
N 11,197 29,555 31,587 31,317 31,321 28,875 8,500 2,438 1,508


Precision of Scores
Another indicator of exam performance is the precision of the assessment as measured by the
SEM. Figure 12 displays the SEM of the three major NWEA subjects by RIT and by test
modality. Notice that the SEM of most assessments is somewhere between 3 and 3.5 RIT
points. The measurement error for scores at the far extremes of the score range tends to
increase. For RIT scores as high as 260, the SEM is still less than 8 RIT points.

In evaluating the psychometric soundness of this precision, one must consider the concept of
test efficiency as it was explained in the section Educational Assessments. Considering that the
length of the NWEA assessment typically varies between 40 and 50 items, this level of
precision is quite impressive. It is safe to say that the NWEA assessments are rather efficient
tests.
Figure 12: The SEM of NWEA assessments by RIT score and test modality, Spring 2001. Each of the three panels plots the standard error of measurement (in RIT points) against the RIT score for one subject, with separate curves for ALT and MAP: Mathematics (ALT n = 437,741; MAP n = 117,831), Reading (ALT n = 436,643; MAP n = 155,609), and Language Usage (ALT n = 294,672; MAP n = 75,226).


It is useful to put meaning to the SEM values that are seen in the graphs in Figure 12. As
with student scores, there are several useful ways to interpret the standard error seen in the
MAP and ALT tests. Two of the most useful ways to look at measurement error (or
precision) are the norm-referenced approach and the curriculum-referenced approach.

Norm-Referenced Precision
One approach to describing the precision of a test score is to compare it to the variability of
achievement in an appropriate sample of students. In this case, an appropriate sample of
students is the norming sample used in the 2002 study that established the RIT scale norms.
This sample included slightly over 1,050,000 students taking approximately 3,040,000 tests
in 321 school districts spread throughout 24 states.

In this norming study, the standard deviation of students’ spring mathematics scores ranged
from 19.58 to 12.52 depending on the grade examined. The average standard error for
students’ scores in the same study averaged 3.1 RIT points. As a result, the ratio of the
standard error of the test scores to the standard deviation of mathematics achievement in the
sample was between 0.16 and 0.25.

This means that most students' scores are located within 0.32 to 0.50 standard deviations of the achievement distribution for their grade, and a 95% confidence interval about a student's mathematics score spans between 0.63 and 0.98 standard deviations. The results for reading, language usage, science
concepts and processes, and general science are virtually identical to those in mathematics.
This level of accuracy, seen in all content areas, is substantially higher than the level needed
for most educational decisions.
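The ratios quoted in this section follow directly from the reported values; the short check below uses only the numbers given in the text (1.96 is the usual multiplier for a 95% confidence interval).

    sem = 3.1                        # average standard error of a score, in RIT
    sd_low, sd_high = 12.52, 19.58   # smallest and largest grade-level standard deviations

    ratio_high = sem / sd_low        # about 0.25 (grade with the smallest spread)
    ratio_low = sem / sd_high        # about 0.16 (grade with the largest spread)

    # Width of a +/- 1 SEM band in standard deviation units: about 0.32 to 0.50
    one_sem_band = (2 * ratio_low, 2 * ratio_high)

    # Width of a 95% confidence interval (+/- 1.96 SEM) in standard deviation
    # units: about 0.62 to 0.97, matching the 0.63 and 0.98 quoted above to
    # within rounding.
    ci_width = (2 * 1.96 * ratio_low, 2 * 1.96 * ratio_high)

    print(round(ratio_low, 2), round(ratio_high, 2))
    print(one_sem_band, ci_width)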

Curriculum-Referenced Precision
Another way of describing the precision of the scores for MAP and ALT is to consider the
difference in performance that is expected from students at the extremes of the confidence
interval around a student’s score. If we identify a third grade student with a mathematics
score of 185, the extremes of a 95% confidence interval would be 179 and 191.

A student at the low end of this range would typically be challenged by multiplying a 3-digit
number by a 1-digit number with regrouping. A student at the upper end of this range
would be slightly less challenged by the multiplication problem above, but would be
challenged by multiplying a 3-digit number by a 2-digit number with regrouping.

Clearly, the difference between the two extremes of the confidence interval defines a barely noticeable difference in the student's capabilities. Once again, the accuracy of MAP and ALT scores is quite adequate for virtually any educational decision.


A Final Note on MAP and ALT Operating Characteristics


It is clear from the reliability, validity, and precision information that the MAP and ALT
systems produce scores for students with the following characteristics:

• They are quite consistent, both within the test being given and across two tests given at different times.

• They are highly related to external measures of achievement.

• They have a level of precision that surpasses the level needed for educational decision making.

These operating characteristics support the application of MAP and ALT scores for a wide
variety of educational uses in a variety of settings.

One element of the MAP and ALT systems that might not be clear is that they have the capacity to adjust to the requirements of the purpose for which assessment is being done. The section
Testing For a Purpose discussed the need to have specific types of tests for specific educational
purposes. Both the MAP and ALT systems have the capacity to deliver tests that fit the
purpose for which assessment is being done. One advantage of families of covalent
assessments is that they can be changed without need for new field testing and norming. A
second advantage is that the characteristics of the altered test can be designed to meet a
particular educational need before the test is actually given.

As an example, the precision of a MAP score is dependent on the length of the test given to
the student. If there is a need for additional precision, the length of a MAP test can be
adjusted to shrink the standard error of the scores. If there is a need for decisions about
performance categories to be made for all students with a desired level of confidence, the
number of items given to each student can be customized so that the level of confidence
determines when the test is terminated. If there is a need to delve more deeply into content
areas in which a student needs additional instruction, the content blueprint of the test can be
designed to shift dynamically as the student takes the test.

This ability to match the design of the assessment to the educational purpose is one of the
strongest features of the system. It enables ALT and MAP to be used for a variety of purposes
from low-stakes classroom testing to high-stakes statewide testing. The system meets many
current educational needs in ways that a conventional wide-range test cannot. As the needs of
education change, MAP and ALT will also allow the assessments to change without
distorting the longitudinal information available about our students.


Appendix

Initial Scale Development


The creation of the initial scales was conducted for each of the NWEA subject areas using the 1PL IRT model (Lord and Novick, 1968; Rasch, 1980). The development of the original scales was a multi-stage procedure. The first several stages concerned the utility of the 1PL model with the items to be used. A series of experiments concerning the reliability of student scores, the stability of the item difficulties, and the factor structure of the data set convinced the original researchers that the 1PL model was appropriate for use with the items in the pool (Ingebo, 1997). The original set of field trials to create the initial scales used a very conservative four-square linking design (Wright, 1977) that allowed NWEA to create and recursively compare multiple difficulty estimates for each item. This resulted in a very strong scale.
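For readers who want to relate the Rasch logit metric used in these calibrations to the RIT scale, the connection is a simple linear rescaling. The slope of 10 and center of 200 used below reflect the common description of the RIT metric ("Rasch unIT") and should be treated as illustrative assumptions rather than values drawn from this manual.

    def logit_to_rit(theta_logit, slope=10.0, center=200.0):
        """Linear rescaling of a Rasch logit value onto the RIT metric.
        The slope and center are assumed values for illustration."""
        return center + slope * theta_logit

    def rit_to_logit(rit, slope=10.0, center=200.0):
        """Inverse transformation, from RIT back to logits."""
        return (rit - center) / slope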

Item Bank Overview


Table A1 shows the number of items in the item banks for each subject area. In addition,
this table shows the mean and range of item difficulties in each item bank. It should be
noted that different items in mathematics have been calibrated with and without the use of a
calculator. There is some overlap between these categories, since many items have been
calibrated in one field test without calculators, and in another field test with calculators.
These items have two separate calibrations in the item bank, but are counted as one item in
the table. This distinction in the mode of administration is quite important, since the use of
calculators has been shown (Kingsbury and Houser, 1990) to reduce the difficulty of
mathematics questions dramatically (approximately six RIT points, or slightly under one year
of growth in mathematics for an average student). For this reason, mathematics items should
never be used with calculators available unless they were calibrated with calculators available.
Table A1: Basic statistics for item banks used to develop level tests

Number Item difficulty


Item bank of Items Minimum Maximum Average
Language 4050 155 243 198.21
Math 8461 130 279 205.72
Reading 5969 141 247 200.18
Science concepts 1121 168 253 201.97
General science 692 161 237 203.10

As can be seen in Table A1, the item banks used to create the level tests and MAP tests are
quite large. Since an ALT series uses approximately 300 unique items and a MAP series uses
approximately 1200 items, a wide variety of ALT tests and MAP tests may be created from the banks. It is quite unlikely that any two districts will create the same tests, even if they use
the same test design and blueprint.

The range of difficulty of the items in the item banks is quite broad. In the table it can be seen
that the range of item difficulty in any one pool varies from 75 to 149 points on the RIT scale.
To describe the breadth of difficulty in the item pools, it is helpful to compare them to
achievement growth in students. In the Portland, Oregon Public Schools (a fairly large
metropolitan school district) students grow from a mean reading achievement level of
approximately 192 in the fall of the third grade to approximately 223 by the spring of the eighth
grade. Similar growth patterns are seen in the other subject areas, and in every case the range of
difficulty in the item banks is at least twice as great as the change in mean observed student
achievement from the beginning of the third grade to the end of the eighth grade.

In each level test and MAP test, items are chosen from several different content areas according
to a preset test blueprint. The major goal areas, the number of items in each major goal area, and
the subgoals that are included in each item pool are shown in Table A2. Note that the total
number of items in Table A2 does not match Table A1, because not all items are linked to any
specific goal structure.
Table A2: Goal area coverage in each item bank

Language
Writing Process Number of items 626
Difficulty Range 162-234
Mean Difficulty 198.66
Composition Structure Number of Items 688
Difficulty Range 162-234
Mean Difficulty 197.97
Grammar/Usage Number of Items 850
Difficulty Range 159-243
Mean Difficulty 195.77
Punctuation Number of Items 651
Difficulty Range 161-235
Mean Difficulty 200.68
Capitalization Number of Items 577
Difficulty Range 155-235
Mean Difficulty 198.54
Mathematics
Number/Numeration Systems Number of Items 1011
Difficulty Range 158-279
Mean Difficulty 206.57
Operations/Computation Number of Items 1630
Difficulty Range 130-255
Mean Difficulty 195.43
Equations/Numerals Number of Items 420
Difficulty Range 168-255
Mean Difficulty 212.84
Geometry Number of Items 462
Difficulty Range 157-264
Mean Difficulty 208.64

Measurement Number of Items 608
Difficulty Range 148-269
Mean Difficulty 204.05
Problem Solving Number of Items 614
Difficulty Range 154-253
Mean Difficulty 208.02
Statistics/Probability Number of Items 524
Difficulty Range 153-264
Mean Difficulty 204.54
Applications Number of Items 804
Difficulty Range 158-269
Mean Difficulty 210.45
Reading
Word Meaning Number of Items 841
Difficulty Range 148-243
Mean difficulty 195.49
Literal Comprehension Number of Items 1259
Difficulty Range 141-247
Mean Difficulty 199.03
Interpretive Comprehension Number of Items 1135
Difficulty Range 143-240
Mean Difficulty 201.21
Evaluative Comprehension Number of Items 776
Difficulty Range 157-240
Mean Difficulty 205.60
Science Concepts
Concepts Number of Items 566
Difficulty Range 173-234
Mean Difficulty 203.51
Processes Number of Items 528
Difficulty Range 168-229
Mean Difficulty 200.79
General Science
Life Sciences Number of Items 254
Difficulty Range 171-240
Mean Difficulty 203.71
Earth/Space Sciences Number of items 203
Difficulty Range 168-230
Mean Difficulty 199.60
Physical Sciences Number of Items 235
Difficulty Range 166-280
Mean Difficulty 208.42


Acronym Glossary
ALT Achievement Level Tests

CRF Class Roster File

IEP Individualized Education Plan

IRT Item Response Theory

MAP Measures of Academic Progress

NWEA Northwest Evaluation Association

RIT Rasch unIT

RMSF Root-Mean-Square Fit

SEM Standard Error of Measurement

SMF Student Master Files

SPF Special Programs File

SRS Scoring and Reporting Software

TPS Test Printing Script


References
American Educational Research Association, American Psychological Association, National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, D.C.: American Educational Research Association.

Baker, F. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse
on Assessment and Evaluation.

Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based
on item parameters. Journal of Educational Measurement, 17, 283-296.

Drasgow, F. & Olson-Buchanan, J. B. (1999). Innovations in computerized assessments. Mahwah, NJ: Lawrence Erlbaum Associates.

Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ:
Lawrence Erlbaum Associates.

Ingebo, G. S. (1997). Probability in the measure of achievement. Chicago, IL: MESA Press.

Kingsbury, G. G. & Houser, R. L. (1988). A comparison of achievement level estimates from computerized adaptive testing and paper-and-pencil testing. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Kingsbury, G. G. & Houser, R. L. (1990). The impact of calculator usage on the difficulty of
mathematics questions. Unpublished manuscript.

Kingsbury, G. G. (April, 2002). An empirical comparison of achievement level estimates from adaptive tests and paper-and-pencil tests. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Kingsbury, G. G. (April, 2003). A long-term study of the stability of item parameter estimates.
Paper presented at the annual meeting of the American Educational Research Association,
Chicago, IL

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum Associates.

Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Menlo Park, CA: Addison-Wesley.

Northwest Evaluation Association. (1996). Scoring and reporting system user’s manual.
Portland, OR: Northwest Evaluation Association.

Northwest Evaluation Association. (2002, August). RIT scale norms. Portland, OR:
Northwest Evaluation Association.


Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed response, performance, and other formats. Boston, MA: Kluwer Academic Publishers.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL:
MESA Press.

Roid, G. H. & Haladyna, T. M. (1997). A technology for test-item writing. New York:
Academic Press.

Samejima, F. (1994). Estimation of reliability coefficients using the test information function and its modifications. Applied Psychological Measurement, 18(3), 229-244.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory with tests of finite length. Psychometrika, 54, 427-450.

Weiss, D. J. & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 14, 97-116.

Zara, A. R. (April, 1992). A comparison of computerized adaptive and paper-and-pencil versions of the national registered nurse licensure examination. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
