
Recommendation 1.
Screen all students to identify those at risk for potential mathematics difficulties and provide interventions to students identified as at risk.

The panel recommends that schools and districts systematically use universal screening to screen all students to determine which students have mathematics difficulties and require research-based interventions. Schools should evaluate and select screening measures based on their reliability and predictive validity, with particular emphasis on the measures' specificity and sensitivity. Schools should also consider the efficiency of the measure to enable screening many students in a short time.

Level of evidence: Moderate

The panel judged the level of evidence supporting this recommendation to be moderate. This recommendation is based on a series of high-quality correlational studies with replicated findings that show the ability of measures to predict performance in mathematics one year after administration (and in some cases two years).29

Brief summary of evidence to support the recommendation

A growing body of evidence suggests that there are several valid and reliable approaches for screening students in the primary grades. All these approaches target aspects of what is often referred to as number sense.30 They assess various aspects of knowledge of whole numbers—properties, basic arithmetic operations, understanding of magnitude, and applying mathematical knowledge to word problems. Some measures contain only one aspect of number sense (such as magnitude comparison) and others assess four to eight aspects of number sense. The single-component approaches with the best ability to predict students' subsequent mathematics performance include screening measures of students' knowledge of magnitude comparison and/or strategic counting.31 The broader, multicomponent measures seem to predict with slightly greater accuracy than single-component measures.32

Effective approaches to screening vary in efficiency, with some taking as little as 5 minutes to administer and others as long as 20 minutes. Multicomponent measures, which by their nature take longer to administer, tend to be time-consuming to administer to an entire school population. Timed screening measures33 and untimed screening measures34 have been shown to be valid and reliable.

For the upper elementary grades and middle school, we were able to locate fewer studies. They suggest that brief early screening measures that take about 10 minutes and cover a proportional sampling of grade-level objectives are reasonable and provide sufficient evidence of reliability.35 At the current time, this research area is underdeveloped.

29. For reviews see Jiban and Deno (2007); Fuchs, Fuchs, Compton et al. (2007); Gersten, Jordan, and Flojo (2005).
30. Berch (2005); Dehaene (1999); Okamoto and Case (1996); Gersten and Chard (1999).
31. Gersten, Jordan, and Flojo (2005).
32. Fuchs, Fuchs, Compton et al. (2007).
33. For example, Clarke and Shinn (2004).
34. For example, Okamoto and Case (1996).
35. Jiban and Deno (2007); Foegen, Jiban, and Deno (2007).
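Single-component measures such as magnitude comparison reduce to very simple item formats and scoring rules. The sketch below shows one way such a probe might be generated and scored in Python; the item ranges, item count, and scoring rule are illustrative assumptions, not drawn from any measure cited in this guide.

```python
import random

def magnitude_comparison_items(n_items, lo=1, hi=20, seed=7):
    """Generate hypothetical items: pairs of unequal numbers to compare."""
    rng = random.Random(seed)
    items = []
    while len(items) < n_items:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if a != b:
            items.append((a, b))
    return items

def score(items, responses):
    """Award one point per item where the student named the larger number."""
    return sum(1 for (a, b), r in zip(items, responses) if r == max(a, b))

items = magnitude_comparison_items(8)
perfect = [max(pair) for pair in items]
print(score(items, perfect))  # prints 8
```

A real probe would be timed and administered individually; the point here is only how simple the scoring of a single-component measure is.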

How to carry out this recommendation

1. As a district or school sets up a screening system, have a team evaluate potential screening measures. The team should select measures that are efficient and reasonably reliable and that demonstrate predictive validity. Screening should occur in the beginning and middle of the year.

The team that selects the measures should include individuals with expertise in measurement (such as a school psychologist or a member of the district research and evaluation division) and those with expertise in mathematics instruction. In the opinion of the panel, districts should evaluate screening measures on three dimensions.

• Predictive validity is an index of how well a score on a screening measure earlier in the year predicts a student's later mathematics achievement. Greater predictive validity means that schools can be more confident that decisions based on screening data are accurate. In general, we recommend that schools and districts employ measures with predictive validity coefficients of at least .60 within a school year.36

• Reliability is an index of the consistency and precision of a measure. We recommend measures with reliability coefficients of .80 or higher.37

• Efficiency is how quickly the universal screening measure can be administered, scored, and analyzed for all the students. As a general rule, we suggest that a screening measure require no more than 20 minutes to administer, which enables collecting a substantial amount of information in a reasonable time frame. Note that many screening measures take five minutes or less.38 We recommend that schools select screening measures that have greater efficiency if their technical adequacy (predictive validity, reliability, sensitivity, and specificity) is roughly equivalent to that of less efficient measures. Remember that screening measures are intended for administration to all students in a school, and it may be better to invest more time in diagnostic assessment of students who perform poorly on the universal screening measure.

Keep in mind that screening is just a means of determining which students are likely to need help. If a student scores poorly on a screening measure or screening battery—especially if the score is at or near a cut point—the panel recommends monitoring her or his progress carefully to discern whether extra instruction is necessary.

Developers of screening systems recommend that screening occur at least twice a year (e.g., fall, winter, and/or spring).39 This panel recommends that schools alleviate concern about students just above or below the cut score by screening students twice during the year. The second screening in the middle of the year allows another check on these students and also serves to identify any students who may have been at risk and grown substantially in their mathematics achievement—or those who were on track at the beginning of the year but have not shown sufficient growth. The panel considers these two universal screenings to determine student proficiency as distinct from progress monitoring (Recommendation 7), which occurs on a more frequent basis (e.g., weekly or monthly) with a select group of intervention students in order to monitor response to intervention.

36. A coefficient of .0 indicates that there is no relation between the early and later scores, and a coefficient of 1.0 indicates a perfect positive relation between the scores.
37. A coefficient of .0 indicates that there is no relation between the two scores, and a coefficient of 1.0 indicates a perfect positive relation between the scores.
38. Foegen, Jiban, and Deno (2007); Fuchs, Fuchs, Compton et al. (2007); Gersten, Clarke, and Jordan (2007).
39. Kaminski et al. (2008); Shinn (1989).

2. Select screening measures based on the content they cover, with an emphasis on critical instructional objectives for each grade.

The panel believes that content covered in a screening measure should reflect the instructional objectives for a student's grade level, with an emphasis on the most critical content for the grade level. The National Council of Teachers of Mathematics (2006) released a set of focal points for each grade level designed to focus instruction on critical concepts for students to master within a specific grade. Similarly, the National Mathematics Advisory Panel (2008) detailed a route to preparing all students to be successful in algebra. In the lower elementary grades, the core focus of instruction is on building student understanding of whole numbers. As students establish an understanding of whole numbers, rational numbers become the focus of instruction in the upper elementary grades. Accordingly, screening measures used in the lower and upper elementary grades should have items designed to assess students' understanding of whole and rational number concepts—as well as computational proficiency.

3. In grades 4 through 8, use screening data in combination with state testing results.

In the panel's opinion, one viable option that schools and districts can pursue is to use results from the previous year's state testing as a first stage of screening. Students who score below or only slightly above a benchmark would be considered for subsequent screening and/or diagnostic or placement testing. The use of state testing results would allow districts and schools to combine a broader measure that covers more content with a screening measure that is narrower but more focused. Because of the lack of available screening measures at these grade levels, districts, county offices, or state departments may need to develop additional screening and diagnostic measures or rely on placement tests provided by developers of intervention curricula.

4. Use the same screening tool across a district to enable analyzing results across schools.

The panel recommends that all schools within a district use the same screening measure and procedures to ensure objective comparisons across schools and within a district. Districts can use results from screening to inform instructional decisions at the district level. For example, one school in a district may consistently have more students identified as at risk, and the district could provide extra resources or professional development to that school. The panel recommends that districts use their research and evaluation staff to reevaluate screening measures annually or biannually. This entails examining how screening scores predict state testing results and considering resetting cut scores or other data points linked to instructional decisionmaking.

Potential roadblocks and solutions

Roadblock 1.1. Districts and school personnel may face resistance in allocating time resources to the collection of screening data.

Suggested Approach. The issue of time and personnel is likely to be the most significant obstacle that districts and schools must overcome to collect screening data. Collecting data on all students will require structuring the data collection process to be efficient and streamlined.

The panel notes that a common pitfall is a long, drawn-out data collection process, with teachers collecting data in their classrooms "when time permits." If schools are allocating resources (such as providing an intervention to students with the 20 lowest scores in grade 1), they must wait until all the data have been collected across classrooms, thus delaying the delivery of needed services to students.

Furthermore, because many screening measures are sensitive to instruction, a wide gap between when one class is assessed and another is assessed means that many students in the second class will have higher scores than those in the first because they were assessed later.

One way to avoid these pitfalls is to use data collection teams to screen students in a short period of time. The teams can consist of teachers, special education staff including such specialists as school psychologists, Title I staff, principals, trained instructional assistants, trained older students, and/or local college students studying child development or school psychology.

Roadblock 1.2. Implementing universal screening is likely to raise questions such as, "Why are we testing students who are doing fine?"

Suggested Approach. Collecting data on all students is new for many districts and schools (this may not be the case for elementary schools, many of which use screening assessments in reading).40 But screening allows schools to ensure that all students who are on track stay on track, and collective screening allows schools to evaluate the impact of their instruction on groups of students (such as all grade 2 students). When schools screen all students, a distribution of achievement from high to low is created. If students considered not at risk were not screened, the distribution of screened students would consist only of at-risk students. This could create a situation where some students at the "top" of the distribution are in reality at risk but not identified as such. For upper-grade students whose scores were high on the previous spring's state assessment, additional screening typically is not required.

Roadblock 1.3. Screening measures may identify students who do not need services and fail to identify students who do need services.

Suggested Approach. All screening measures will misidentify some students as either needing assistance when they do not (false positives) or not needing assistance when they do (false negatives). When screening students, educators will want to maximize both the number of students correctly identified as at risk—a measure's sensitivity—and the number of students correctly identified as not at risk—a measure's specificity. As illustrated in table 3, screening students to determine risk can result in four possible categories indicated by the letters A, B, C, and D. Using these categories, sensitivity is equal to A/(A + C) and specificity is equal to D/(B + D).

Table 3. Sensitivity and specificity

                                       ACTUALLY AT RISK
                                   Yes                   No
STUDENTS IDENTIFIED    Yes   A (true positives)    B (false positives)
AS BEING AT RISK       No    C (false negatives)   D (true negatives)

The sensitivity and specificity of a measure depend on the cut score used to classify children at risk.41 If a cut score is high (where all students below the cut score are considered at risk), the measure will have a high degree of sensitivity because most students who truly need assistance will be identified as at risk.

40. U.S. Department of Education, Office of Planning, Evaluation and Policy Development, Policy and Program Studies Service (2006).
41. Sensitivity and specificity are also influenced by the discriminant validity of the measure and its individual items. Measures with strong item discrimination are more likely to correctly identify students' risk status.

But the measure will have low specificity, since many students who do not need assistance will also be identified as at risk. Similarly, if a cut score is low, the sensitivity will be lower (some students in need of assistance may not be identified as at risk), whereas the specificity will be higher (most students who do not need assistance will not be identified as at risk).

Schools need to be aware of this tradeoff between sensitivity and specificity, and the team selecting measures should be aware that decisions on cut scores can be somewhat arbitrary. Schools that set a cut score too high run the risk of spending resources on students who do not need help, and schools that set a cut score too low run the risk of not providing interventions to students who are at risk and need extra instruction. If a school or district consistently finds that students receiving intervention do not need it, the measurement team should consider lowering the cut score.

Roadblock 1.4. Screening data may identify large numbers of students who are at risk, and schools may not immediately have the resources to support all at-risk students. This will be a particularly severe problem in low-performing Title I schools.

Suggested Approach. Districts and schools need to consider the amount of resources available and the allocation of those resources when using screening data to make instructional decisions. Districts may find that on a nationally normed screening measure, a large percentage of their students (such as 60 percent) will be classified as at risk. Districts will have to determine the resources they have to provide interventions and the number of students they can serve with those resources. This may mean not providing interventions at certain grade levels or providing interventions only to students with the lowest scores, at least in the first year of implementation.

There may also be cases when schools identify large numbers of students at risk in a particular area and decide to provide instruction to all students. One particularly salient example is in the area of fractions. Multiple national assessments show many students lack proficiency in fractions,42 so a school may decide that, rather than deliver interventions at the individual child level, it will provide a school-wide intervention to all students. A school-wide intervention can range from a supplemental fractions program to professional development involving fractions.

42. National Mathematics Advisory Panel (2008); Lee, Grigg, and Dion (2007).
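The arithmetic behind Roadblock 1.3 and the cut-score tradeoff discussed above can be checked numerically. The sketch below uses hypothetical scores and risk labels; sliding the cut score upward flags more students, raising sensitivity at the cost of specificity, exactly as described.

```python
# Hypothetical (screening score, truly at risk?) pairs for twelve students.
students = [(5, True), (8, True), (11, True), (14, True), (16, False),
            (18, True), (22, False), (26, False), (29, True), (33, False),
            (37, False), (41, False)]

def rates(cut):
    """Flag students scoring below `cut` as at risk; return (sensitivity, specificity).

    sensitivity = A / (A + C): true positives over all truly at-risk students.
    specificity = D / (B + D): true negatives over all truly not-at-risk students.
    """
    a = sum(1 for s, risk in students if s < cut and risk)       # true positives
    c = sum(1 for s, risk in students if s >= cut and risk)      # false negatives
    d = sum(1 for s, risk in students if s >= cut and not risk)  # true negatives
    b = sum(1 for s, risk in students if s < cut and not risk)   # false positives
    return a / (a + c), d / (b + d)

for cut in (10, 20, 30, 40):
    sens, spec = rates(cut)
    print(f"cut = {cut:2d}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With this invented roster, a cut score of 10 misses most truly at-risk students (low sensitivity, perfect specificity), while a cut score of 30 catches all of them but flags half the students who are not at risk.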
Appendix D.
Technical information on the studies

Recommendation 1. Screen all students to identify those at risk for potential mathematics difficulties and provide interventions to students identified as at risk.

Level of evidence: Moderate

The panel examined reviews of the technical adequacy of screening measures for students identified as at risk when making this recommendation. The panel rated the level of evidence for recommendation 1 as moderate because several reviews were available for evidence on screening measures for younger students. However, there was less evidence available on these measures for older students. The panel relied on the standards of the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education194 for valid screening instruments along with expert judgment to evaluate the quality of the individual studies and to determine the overall level of evidence for this recommendation.

Relevant studies were drawn from recent comprehensive literature reviews and reports195 as well as literature searches of databases using key terms (such as "formative assessment"). Journal articles summarizing research studies on screening in mathematics,196 along with summary information provided by the Research Institute on Progress Monitoring197 and the National Center on Progress Monitoring, were also used.198

The studies of screening measures all used appropriate correlational designs.199 In many cases, the criterion variable was some type of standardized assessment, often a nationally normed test (such as the Stanford Achievement Test) or a state assessment. In a few cases, however, the criterion measure was also tightly aligned with the screening measure.200 The latter set is considered much weaker evidence of validity.

Studies also addressed inter-tester reliability,201 internal consistency,202 test-retest reliability,203 and alternate form reliability.204 Many researchers discussed the content validity of the measure.205 A few even discussed the consequential validity206—the consequences of using screening data as a tool for determining what requires intervention.207 However, these studies all used standardized achievement measures as the screening measure.

In recent years, a number of studies of screening measures have also begun to report sensitivity and specificity data.208

194. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999).
195. For example, the National Mathematics Advisory Panel (2008).
196. Gersten et al. (2005); Fuchs, Fuchs, Compton et al. (2007); Foegen et al. (2007).
197.
199. Correlational studies are not eligible for WWC review.
200. For example, Bryant, Bryant, Gersten, Scammacca, and Chavez (2008).
201. For example, Fuchs et al. (2003a).
202. For example, Jitendra et al. (2005).
203. For example, VanDerHeyden, Witt, and Gilbertson (2003).
204. For example, Thurber, Shinn, and Smolkowski (2002).
205. For example, Clarke and Shinn (2004); Gersten and Chard (1999); Foegen, Jiban, and Deno (2007).
206. Messick (1988); Gersten, Keating, and Irvin (1995).
207. For example, Compton, Fuchs, and Fuchs (2007).

Because sensitivity and specificity provide information on the false negative and false positive rates, respectively, they are critical in determining the utility of a measure used in screening decisions linked to resource allocation. Note that work on sensitivity and specificity in educational screening is in its infancy and no clear standards have been developed.

The remainder of this section presents evidence in support of the recommendation. We discuss the evidence for measures used in both the early elementary and upper elementary grades and conclude with a more in-depth example of a screening study to illustrate critical variables to consider when evaluating a measure.

Summary of evidence

In the early elementary grades, measures examined included general outcome measures reflecting a sampling of objectives for a grade level that focused on whole numbers and number sense. These included areas of operations and procedures, number combinations or basic facts, concepts, and applications.209 Measures to assess different facets of number sense—including measures of rote and strategic counting, identification of numerals, writing numerals, recognizing quantities, and magnitude comparisons—were also prevalent.210 Some research teams developed measures focused on a single aspect of number sense (such as strategic counting),211 and others developed batteries to create a composite score from single proficiency measures.212 Still others developed a broader measure that assessed multiple proficiencies in their screening.213 An example of a single proficiency embedded in a broader measure is having students compare magnitudes of numbers. As an individual measure, magnitude comparison has predictive validity in the .50 to .60 range,214 but having students make magnitude comparisons is also included in broader measures. For example, the Number Knowledge Test (NKT)215 requires students to name the greater of two verbally presented numbers and includes problems assessing strategic counting, simple addition and subtraction, and word problems. The broader content in the NKT provided stronger evidence of predictive validity216 than did single proficiency measures.

Further information on the characteristics and technical adequacy of curriculum-based measures (CBM) for use in screening in the elementary grades was summarized by Foegen, Jiban, and Deno (2007). They explained that measures primarily assessed the objectives of the operations or the concepts and applications standards for a specific grade level. A smaller number of measures assessed fluency in basic facts, problem solving, or word problems. Measures were timed, and administration time varied between 2 and 6 minutes for operations probes and 2 to 8 minutes for concepts and applications. Reliability evidence included test-retest, alternate form, internal consistency, and inter-scorer reliability, with most reliabilities falling between .80 and .90, meeting acceptable standards for educational decisionmaking. Similar evidence was found for validity, with most concurrent validity coefficients in the .50 to .70 range.

208. Locuniak and Jordan (2008); VanDerHeyden et al. (2001); Fuchs, Fuchs, Compton et al. (2007).
209. Fuchs, Fuchs, Compton et al. (2007).
210. Gersten, Clarke, and Jordan (2007); Fuchs, Fuchs, Compton et al. (2007).
211. Clarke and Shinn (2004).
212. Bryant, Bryant, Gersten, Scammacca, and Chavez (2008).
213. Okamoto and Case (1996).
214. Lembke et al. (2008); Clarke and Shinn (2004); Bryant, Bryant, Gersten, Scammacca, and Chavez (2008).
215. Okamoto and Case (1996).
216. Chard et al. (2005).

Lower coefficients were found for basic fact measures, ranging from .30 to .60. Researchers have also begun to develop measures that validly assess magnitude comparison, estimation, and prealgebra proficiencies.217

A study evaluating a mathematics screening instrument—Locuniak and Jordan (2008)

A recent study by Locuniak and Jordan (2008) illustrates factors that districts should consider when evaluating and selecting measures for use in screening. The researchers examined early mathematics screening measures from the middle of kindergarten to the end of second grade. The two-year period differs from many of the other screening studies in the area by extending the interval from within a school year (fall to spring) to across several school years. This is critical because the panel believes that the longer the interval between when a screening measure and a criterion measure are administered, the more confidence schools can have that students identified have a significant deficit in mathematics that requires intervention. The Locuniak and Jordan (2008) study also went beyond examining traditional indices of validity to examine specificity and sensitivity. Greater sensitivity and specificity of a measure ensure that schools provide resources to those students truly at risk and not to students misidentified.

The various measures studied by Locuniak and Jordan (2008) also reflected mathematics content that researchers consider critical in the development of a child's mathematical thinking and that many researchers have devised screening measures to assess. Included were number sense measures that assessed knowledge of counting, number combinations, nonverbal calculation, story problems, number knowledge, and short-term and working memory. The authors used block regression to examine the added value of the math measures in predicting achievement above and beyond measures of cognition, age, and reading ability (block 1), which accounted for 26 percent of the variance in 2nd grade calculation fluency. Adding the number sense measures (block 2) increased the variance explained to 42 percent. Although the research team found strong evidence for the measures assessing working memory (digit span), number knowledge, and number combinations, the array of measures investigated indicates that the field is still attempting to understand which critical variables (mathematical concepts) best predict future difficulty in mathematics. A similar process has occurred in screening for reading difficulties, where a number of variables (such as the alphabetic principle) are consistently used to screen students for reading difficulty. Using the kindergarten measures with the strongest correlations to grade 2 mathematics achievement (number knowledge and number combinations), the researchers found rates of .52 for sensitivity and .84 for specificity.

Another feature that schools will need to consider when evaluating and selecting measures is whether the measure is timed. The measures studied by Locuniak and Jordan (2008) did not include a timing component. In contrast, general outcome measures include a timing component.218 No studies were found by the panel that examined timed and untimed versions of the same measure.

217. Foegen et al. (2007).
218. Deno (1985).