
Meta-Analysis of Criterion Validity for Curriculum-Based Measurement in Written Language

John Elwood Romig, MEd, William J. Therrien, PhD, and John W. Lloyd, PhD
University of Virginia, Charlottesville, USA

The Journal of Special Education, 2017, Vol. 51(2), 72–82
© Hammill Institute on Disabilities 2016
DOI: 10.1177/0022466916670637
journalofspecialeducation.sagepub.com

Corresponding Author: John Elwood Romig, Department of Curriculum, Instruction, and Special Education, University of Virginia, P.O. Box 400273, Charlottesville, VA 22904, USA. E-mail: romig@virginia.edu

Abstract
We used meta-analysis to examine the criterion validity of four scoring procedures used in curriculum-based measurement
of written language. A total of 22 articles representing 21 studies (N = 21) met the inclusion criteria. Results indicated
that two scoring procedures, correct word sequences and correct minus incorrect sequences, have acceptable criterion
validity with commercially developed and state- or locally developed criterion assessments. Results indicated trends for
scoring procedures at each grade level. Implications for researchers and practitioners are discussed.

Keywords
curriculum-based measurement, meta-analysis, assessment, writing, written language

Since the passage of the Education for All Handicapped Children Act (EAHCA) in 1975, Individual Education Plans (IEPs) and annual goals for students with disabilities have been a hallmark of special education. Schools are required to report progress toward these goals at least annually. The 2004 reauthorization of EAHCA, renamed the Individuals With Disabilities Education Improvement Act, required that a plan for monitoring progress toward annual goals be included in a student's IEP and that reports of progress be given to parents at least as often as peers not receiving special education services receive reports of their academic progress (i.e., report cards; Individuals With Disabilities Education Act, 2004).

Educators monitor many aspects of students' educational activities. They may be concerned about progress toward academic outcomes or social-behavioral goals (Archer & Hughes, 2010; Epstein, Atkins, Cullinan, Kutash, & Weaver, 2008). Within the academic domain, some teachers employ curriculum-based assessments (e.g., Venn, 2013), but curriculum-based measurement is probably the most widely employed method for monitoring progress and making educational decisions (Lembke, Hampton, & Hendricker, 2013). Curriculum-based measures have been employed to assess diverse aspects of reading, arithmetic, spelling, and writing (Hosp, Hosp, & Howell, 2012).

As with any measurement system, curriculum-based measures should meet standards for technical adequacy. They should be reliable and valid. In this study, we examine aspects of the technical adequacy of progress monitoring measures for written language. After a brief description of the various measures used to monitor students' writing performance, we discuss the research about psychometric qualities of measures of writing.

Written Language

Assessing written language is complicated by the multifaceted nature of the skill. Although there is some disagreement as to the exact components in writing, there is general consensus about five basic elements: handwriting, spelling, mechanics, usage, and ideation (Taylor, 2006). Often experts emphasize some of these elements of writing more than other elements. For example, Graham, Harris, and Hebert (2011) argued that presentation effects (e.g., spelling, handwriting, and grammar) should not overly factor into judgments of writing for students with disabilities. Espin (2014), however, argued that writing instruction and assessment should focus on effectively communicating a message to readers, which includes skills such as grammar, spelling, and punctuation.

Cutler and Graham (2008) conducted a national survey of primary grade teachers and found that approximately 70% of teachers reported monitoring student writing at least weekly.

Although it is encouraging that a large majority of teachers self-reported assessing student writing weekly, it is concerning that almost a third of teachers are not assessing student writing on a weekly basis.

The lack of frequency in assessing writing is understandable given the time required for assessing writing. Teachers can spend a significant amount of time assessing writing due to the fact that for many components of writing, there is not a "right answer." Ideation, for example, is largely subjective and requires the teacher to assess the ideas developed by the student in relation to the purpose for writing. These judgments take significant time for the teacher to make for one student, and the time demands are compounded when monitoring an entire class. These time demands increase as the students age and their writing becomes more complex.

Curriculum-Based Measurement

Curriculum-based measurement (CBM) can decrease the time necessary to monitor writing progress. Deno (1985; Deno, Mirkin, & Marston, 1980) led the initial development of CBM with the goal of establishing a set of measures that were technically adequate (e.g., reliable and valid), efficient to administer and score, inexpensive, sensitive to growth, and easily understood. His efforts established procedures for administering and scoring CBMs in reading, mathematics, written language, and spelling (Deno, 1985).

CBM for written language (CBM-W) involves administration of a series of probes of equivalent difficulty. Each probe includes a prompt (e.g., picture prompt, story starter, narrative prompt, essay prompt) and a timed response from the student (e.g., 3, 5, 7, or 10 min). Dozens of scoring procedures have been developed and tested by researchers (see Amato & Watkins, 2011; Gansle, Noell, VanDerHeyden, Naquin, & Slider, 2002; and Jewell & Malecki, 2005, for a variety of alternative measures), but four have emerged as frequently researched and used in the classroom.

Words Written (WW)

WW was one of the CBM-W scoring indices developed by Deno et al. (1980). In WW, the scorer simply counts the number of words in the student's writing sample. Words do not have to be spelled correctly or used in correct grammatical function. Any two conjoined letters are counted as a word. Single letter words (e.g., "I" and "a") are also counted.

Words Spelled Correctly (WSC)

WSC was also originally developed by Deno et al. (1980). This scoring procedure involves counting WSC in the writing sample regardless of the appropriateness of usage (much like a computer spell-checker would assess correctly spelled words).

Correct Word Sequences (CWSs)

CWS was developed by Videen, Deno, and Marston (1982). This scoring procedure involves counting sequences of correct words. A CWS is defined as two adjacent words that are both spelled correctly and used appropriately in a sentence. The first scored sequence in a sample is from "blank-to-first-word." The last scored sequence is from "last-word-to-blank." In this scoring procedure, there will always be one more possible sequence than words in the writing sample.

Correct Minus Incorrect Word Sequences (CIWSs)

CIWS was developed by Espin et al. (2000). The rules for this scoring procedure follow the rules for CWS. In addition, the number of incorrect word sequences is counted and subtracted from the total CWS. The difference between CWS and incorrect word sequences is the student's score for the writing sample.
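To make these four scoring rules concrete, the minimal Python sketch below (not part of the original article) scores one short sample. It assumes the judgments CBM-W scoring actually requires of a human scorer, whether each word is spelled correctly and whether each adjacent pair of words forms a correct sequence, have already been made and are passed in as flags; all names and data are illustrative.

```python
# Minimal sketch of the four CBM-W scores defined above.
# Assumption: spelling and word-sequence judgments are supplied by the scorer.

def words_written(words):
    """WW: count every word, regardless of spelling or usage."""
    return len(words)

def words_spelled_correctly(spelled_ok):
    """WSC: count words flagged as correctly spelled."""
    return sum(spelled_ok)

def correct_word_sequences(sequence_ok):
    """CWS: count correct adjacent-word sequences, including the
    blank-to-first-word and last-word-to-blank boundaries, so an
    n-word sample has n + 1 possible sequences."""
    return sum(sequence_ok)

def ciws(sequence_ok):
    """CIWS: correct sequences minus incorrect sequences."""
    correct = sum(sequence_ok)
    incorrect = len(sequence_ok) - correct
    return correct - incorrect

# Hypothetical 4-word sample: "The dog runned fast."
words = ["The", "dog", "runned", "fast"]
spelled_ok = [True, True, False, True]          # "runned" is not a real word
sequence_ok = [True, True, False, False, True]  # 5 sequences for 4 words

print(words_written(words))                 # 4
print(words_spelled_correctly(spelled_ok))  # 3
print(correct_word_sequences(sequence_ok))  # 3
print(ciws(sequence_ok))                    # 3 - 2 = 1
```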
Technical Adequacy

As with any worthwhile measure, if CBM-W is to be used as a progress monitoring tool, it must meet certain technical requirements of reliability and validity. Previous research in CBM-W has focused specifically on criterion validity and alternate form reliability. Criterion validity assesses the degree to which a measure correlates with another measure (Taylor, 2006). Although reliability of a test is certainly important and necessary, a measure that produces accurate but invalid scores is meaningless.

Establishing criterion validity of a measure in writing is difficult due to disagreement in the field about how much weight should be placed on various elements of writing. Researchers have typically reported lower criterion validity for writing measures than reading measures. For example, researchers have reported criterion validity coefficients of .85 or higher for the Woodcock Reading Mastery Test–Revised with the Woodcock–Johnson Psychoeducational Battery as a criterion measure (Taylor, 2006). In contrast, researchers have reported criterion validity coefficients of .30 to .50 for the Test of Written Language–3 with the Comprehensive Test of Nonverbal Intelligence as a criterion measure (Taylor, 2006). For CBM measures in reading, researchers have reported criterion validity coefficients that reach .70 to .80 and even approach .90 at times (Wayman, Wallace, Wiley, Tichá, & Espin, 2007). Given the criterion validity of standardized writing measures in comparison with standardized reading measures and the difficulty defining good writing, CBM measures in written language would be expected to have lower criterion validity than CBM measures in reading.

Another way of determining the validity of a measure is to consider the consequences of decision making with and without the measure (Messick, 1989).

Researchers and practitioners do not have many options for progress monitoring tools in writing that are sensitive to growth, technically adequate, and efficient to administer and score. A lack of measures does not justify using measures with a dearth of evidence supporting their technical properties. However, researchers and practitioners might accept measures with lower criterion validity in writing than they would in reading due to the numbers of available measures in each of these areas.

Fuchs (2004) described three stages of research necessary for determining the usefulness of CBM. Stage 1 requires investigating the technical features of a static score. Stage 2 requires examination of technical features of CBM slopes (i.e., rates of growth over time). Stage 3 requires examination of CBM's usefulness in instruction. Although Stages 2 and 3 are important considerations, Stage 1 is foundational for CBM-W to be used as a progress monitoring tool.

McMaster and Espin (2007) conducted a comprehensive literature review of CBM-W studies investigating the reliability and validity of these measures (i.e., Stage 1). Their review included studies across all K–12 grade levels. Their review was helpful for beginning to answer questions related to Stage 1; however, the methods of their review led to highly variable results for validity and reliability. For example, criterion validity was reported as low as −.24 and as high as .99 across the studies. As a literature review, they did not calculate aggregated statistics across studies. Instead, they presented and summarized the findings of each study. The most common scoring procedures across studies were WW, WSC, CWS, and CIWS. They found criterion validity differed for varying grade levels. Furthermore, they found that studies conducted after the initial Institute for Research on Learning Disabilities (IRLD) studies, which developed CBM, had much lower criterion validity than the IRLD studies did and that studies that reported results across grades had higher criterion validity than studies that reported results within grades.

McMaster, Ritchey, and Lembke (2011) summarized the existing CBM research for early writers in each of Fuchs's (2004) three stages. Of the 11 studies they described, all had answered Stage 1 questions, five answered Stage 2 questions, and none answered Stage 3 questions. Similar to McMaster and Espin (2007), McMaster, Ritchey, and Lembke (2011) provided a narrative description of studies and identified strengths and weaknesses without calculating statistics across studies. The most common scoring procedures in these studies were WW, WSC, and CWS.

Rationale and Research Questions

The purpose of this meta-analysis was to extend the work of McMaster and Espin (2007) and McMaster, Ritchey, and Lembke (2011). Specifically, we wanted to increase the specificity of criterion validity determinations. Calculating a criterion coefficient in a meta-analytic way has the advantage of providing a more precise result by giving a criterion validity coefficient with a confidence interval, rather than a range of the minimum and maximum coefficients found in all studies. Mean coefficients across several studies are a more stable indicator of criterion validity than results from individual studies. Therefore, this review sought to answer four basic questions regarding the use of the four most common CBM-W scoring procedures found in the two previous reviews.

Research Question 1: Overall, what is the mean weighted criterion validity coefficient of CBM-W scoring procedures?
Research Question 2: What is the mean weighted criterion validity coefficient of WW, WSC, CWS, and CIWS?
Research Question 3: Does a given scoring procedure have stronger criterion validity at varying grade levels?
Research Question 4: Does the criterion validity of scoring procedures differ when predicting performance on commercially developed norm-referenced assessments or state-developed norm-referenced assessments?

Method

Literature Review

We adopted search procedures from McMaster and Espin (2007) as a model and searched the ERIC, PsycINFO, and Science Citation Index Expanded databases using the search terms curriculum based measurement, curriculum based measure, general outcome measure, and progress monitoring. In addition, we searched the ERIC, Academic Search Complete, and PsycINFO databases using the search terms written language and curriculum based measurement. In total, our searches returned 11,802 results. Using RefWorks, we removed exact duplicates and had 9,771 results remaining. Before screening the results, two authors reviewed 200 titles and abstracts to ensure agreement with the inclusion criteria. We had 100% agreement after this initial review of results. Next, we divided the search results between two authors and used 2,195 results for intercoder agreement. We calculated kappa at .67 (with three disagreements) based on the review of these 2,195 search results. After reviewing titles and abstracts, 22 met the inclusion criteria for our study.

Inclusion Criteria

Focus of study and design. First, to be included, studies had to report quantitative measurement of criterion validity for at least one of the four forms of CBM-W under investigation (WW, WSC, CWS, CIWS) in a peer-reviewed journal. We included only peer-reviewed journal articles due to concerns regarding the rigor of non-peer-reviewed publications.

We also reasoned that strong reports and dissertations would subsequently be published in peer-reviewed sources. However, including only peer-reviewed studies opens the possibility of publication bias affecting results. Therefore, we conducted a test for publication bias. Criterion validity could be determined by commercially developed achievement measures (e.g., the Test of Written Language–3, Woodcock–Johnson III) or a state- or district-developed achievement measure. We did not require normative data be available for all state- or locally developed assessments. Studies that used locally developed measures had to use a test or rubric developed by the district and not by individual teachers. We excluded studies that used only holistic ratings, teacher ratings, developmental sentence scoring, grade point average (GPA), or classroom grades. To be included, studies had to provide information necessary to calculate a mean correlation (correlation coefficient and sample size).

Participants in the study. We included only K–12 studies. We did not require student disability status to be reported. Most of the studies we found included students with and without disabilities and did not report results for these groups separately.

Final corpus. Campbell (2010) was excluded because it only included students learning English as a second language, and it examined the utility of passage copying tasks for students in ninth to 12th grade. Passage copying is not a typical writing task for the majority of high school writers. Because the data for D. C. Parker, McMaster, and Burns (2011) were drawn from McMaster, Du, et al. (2011), we considered these two articles as one study. After making decisions about all articles in our search records, we compared our results with McMaster and Espin's (2007) to ensure we did not neglect any studies that met our criteria.

After excluding and combining studies, 21 studies remained to form the corpus of this analysis. Often, studies reported correlations for several grades individually. In addition, many studies reported correlations with several criterion measures for each grade level and each scoring procedure. In total, we examined 739 correlations for this analysis.

Coding

We assigned codes for scoring procedure, grade level, and criterion assessment type. We found 44 different types of CBM-W scoring procedures in these studies. By a wide margin, the three most common scoring procedures were WW, WSC, and CWS. Of the 22 articles, 95% (n = 21) reported criterion validity of WW, 73% (n = 16) reported criterion validity of WSC, 100% (n = 22) reported criterion validity of CWS, and 45% (n = 10) reported criterion validity of CIWS.

We initially coded grade level as it was reported in each study. Grade level ranged from K–12. However, there were too few studies at some grade levels to make analysis at the individual grade level sensible, so we grouped studies into four categories: K–2, third to fifth, sixth to eighth, and ninth to 12th. Two studies did not fit into these grade categories. Gansle, VanDerHeyden, Noell, Resetar, and Williams (2006) included students in second to fifth grades and did not report correlations for the individual grade levels; we coded this study as a third to fifth study. Cheng and Rose (2009) reported results for students in seventh to 12th grades collectively; we coded it as a ninth- to 12th-grade study.

We coded two types of assessments. We distinguished between commercially developed, norm-referenced assessments and state- or district-developed assessments. Studies could report total or subscale-subtest results for criterion measures. If we had questions about whether a commercially developed assessment was norm referenced, we searched the test manual or the Internet for normative information for the given assessment. We excluded the Test of Emerging Academic English used in Campbell, Espin, and McMaster (2013) because we were interested in criterion validity using criterion measures that assess a wide range of students, not English Language Learners only. Two raters reviewed all articles, and interrater reliability (agreements/total of agreements and disagreements) was 97.3%. We reconciled the disagreements as a group and used the reconciled codes in the subsequent analyses.
Selecting Correlations for Analysis

Due to the large number of correlations, we selected the highest reported correlation for each scoring procedure for each sample. Selecting the highest correlation for each sample gave each scoring measure the highest possible score to be included in our calculations. Some studies reported the criterion validity of CBM-W with a large number of tests and subtests, some of which had little to do with writing ability. For example, Espin, Scierka, Skare, and Halverson (1999) reported criterion validity of CBM-W with subtests of the California Achievement Test. The subtests included reading, math, language arts total, language arts expression, and language arts mechanics. We were only interested in CBM-W measures as a predictor of performance on criterion measures that assessed writing or language arts ability. By picking the highest correlations, we avoided diluting the results with irrelevant criterion measures, such as a math subtest. Selecting the highest correlation also gave the scoring procedures the best possible criterion validity under the best administration procedures (i.e., prompt type and duration) for each study. Many studies found that criterion validity increased as the duration of writing was extended. The correlation selection procedures avoided the shorter writing durations tempering the criterion validity of the scoring procedure for these studies.

We treated grade levels that were reported individually as separate samples and selected the highest correlation for each scoring procedure in each grade level. For example, if a study reported second and third grade separately (e.g., Ritchey & Coker, 2013), we selected the highest correlation for each scoring procedure (in this case WW, WSC, and CWS) in each grade level. Ritchey and Coker used two different prompts (story starter and picture story), two different writing durations (3 and 5 min), and one criterion measure (Woodcock–Johnson III Writing Samples subtest). Ritchey and Coker reported a total of 18 correlations. However, by selecting the highest correlation for each scoring procedure at each grade level, we included six for our analysis: the highest correlation for WW, WSC, and CWS in second and third grade. In total, we selected 102 correlations across all studies for our analysis.
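As a hedged illustration of this selection rule, if the 739 extracted coefficients were stored in a long-format table, the reduction to the single highest coefficient per sample and scoring procedure is a grouped maximum, as sketched below. The column names and values are hypothetical, not the authors' data set.

```python
import pandas as pd

# Hypothetical long-format extraction of coefficients (toy values).
corrs = pd.DataFrame({
    "study":   ["RitcheyCoker2013"] * 4 + ["Espin1999"] * 2,
    "grade":   [2, 2, 3, 3, 10, 10],
    "measure": ["WW", "CWS", "WW", "CWS", "CWS", "CWS"],
    "r":       [0.35, 0.52, 0.30, 0.49, 0.57, 0.61],
})

# Keep the highest coefficient for each sample (study x grade) and
# scoring procedure, mirroring the selection rule described above.
selected = (corrs.groupby(["study", "grade", "measure"], as_index=False)["r"]
                 .max())
print(selected)
```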
Calculating Mean Correlations

To determine whether there was enough variability in our data to permit an exploration of our research questions beyond reporting an overall mean weighted correlation value, we calculated Q and I² values. Cochran's Q assesses homogeneity of data points that were used to calculate a weighted mean (e.g., effect size or mean correlation; Cochran, 1954). A significant Q statistic indicates heterogeneity in the data used to calculate the weighted mean (Cochran, 1954). Higgins and Thompson (2002) recommended the I² statistic, which measures the percentage of heterogeneity across the data; they argue that Cochran's Q is underpowered to detect heterogeneity when a meta-analysis includes a small number of studies. All mean weighted correlations and Q statistics were calculated using Comprehensive Meta-Analysis (Version 2) software.
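The authors report that these quantities were computed with the Comprehensive Meta-Analysis software. The sketch below shows one standard way a weighted mean correlation with a 95% confidence interval, Cochran's Q, and I² are often obtained: a fixed-effect analysis on Fisher-z-transformed correlations with weights of n − 3. It is an illustration under those assumptions, with placeholder correlations and sample sizes; the software's exact model may differ.

```python
import numpy as np

# Placeholder data: one selected correlation per sample with its n.
rs = np.array([0.62, 0.48, 0.55, 0.70, 0.41])
ns = np.array([54, 147, 87, 163, 79])

z = np.arctanh(rs)      # Fisher z transformation
w = ns - 3.0            # conventional fixed-effect weights

# Weighted mean and 95% CI in z units, back-transformed to r.
z_bar = np.sum(w * z) / np.sum(w)
se = 1.0 / np.sqrt(np.sum(w))
r_bar = np.tanh(z_bar)
ci = np.tanh([z_bar - 1.96 * se, z_bar + 1.96 * se])

# Cochran's Q and I^2 (share of variability beyond sampling error).
Q = np.sum(w * (z - z_bar) ** 2)
df = len(rs) - 1
I2 = max(0.0, (Q - df) / Q) * 100

print(round(r_bar, 2), np.round(ci, 2), round(Q, 2), round(I2, 1))
```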
Using all selected correlations, we calculated a Q statistic of 396.229 (p < .000, I² = 76.024). The I² value calculated here indicates that 76% of the variance is likely not due to chance or error. We conducted a second Q statistic calculation using the highest reported correlation for each study regardless of scoring procedure. This analysis produced a Q statistic of 89.297 (p < .000, I² = 68.644). The I² value indicates that approximately 69% of the variance is likely not due to chance or error. The Q statistics and I² statistics for these overall correlations indicate a high level of heterogeneity within the data. These results indicated that there was enough heterogeneity within our data set to explore our research questions.
We checked for publication bias in our results by calculating Orwin's failsafe N (Borenstein, 2005). When calculating Orwin's failsafe N, we used the mean criterion validity coefficient found when combining all scoring procedures. The overall mean weighted correlation was .55. We assumed that the mean effect of all hidden studies was .00 and calculated the number of hidden studies necessary to lower the mean weighted correlation to .30. Orwin's failsafe N indicated that 29 studies would have to be unpublished (or published via non-peer-reviewed sources) with a mean effect of .00 in the hidden studies.
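For reference, Orwin's failsafe N is commonly written in the form below; this general expression comes from standard meta-analysis texts rather than being reproduced from Borenstein's (2005) chapter, so treat it as an assumption about the computation. With k observed coefficients whose mean is r̄, a criterion value r_c, and an assumed mean r̄_h for the hidden studies,

```latex
N_{fs} = k \, \frac{\bar{r} - r_c}{r_c - \bar{r}_h}
       = k \, \frac{.55 - .30}{.30 - .00}
       \approx 0.83\,k
```

so, with the values used here, the number of null-effect studies needed to pull the mean down to .30 grows in direct proportion to the number of coefficients actually observed.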
Typically, the next step in meta-analytic research would be to conduct inferential statistics to explore sources of heterogeneity through our research questions. However, we did not calculate inferential statistics due to small sample sizes. Instead, we calculated weighted correlations and confidence intervals, similar to Slavin's (1986) recommendations for a Best Evidence Synthesis.

Results

A total of 22 articles (21 studies due to combining D. C. Parker et al., 2011 and McMaster, Du, et al., 2011) were included in this meta-analysis. Studies were published between 1991 and 2015. The data in the studies represented the written language performance of 3,361 students. Fourteen studies were published after McMaster and Espin (2007). See Table 1 for a summary of each study.

To calculate the overall mean criterion validity of all CBM-W scoring procedures when combined, we selected only the single highest criterion validity coefficient from each sample. When selecting only the highest reported correlation for each study, we calculated an overall correlation of .55 (CI = [.51, .60]). This correlation represents the highest possible mean weighted correlation for all four CBM-W scoring measures combined.

The overall correlations for all four scoring measures are presented in Table 2. WW had an overall correlation of .37. WSC had an overall correlation of .44. CWSs had an overall correlation of .51. CIWSs had an overall correlation of .60.

Mean correlations were calculated for each scoring procedure in four different age groups. Results are presented in Table 3. In each grade category, CIWS had the highest criterion validity.

Mean correlations were calculated for each type of criterion assessment by combining all scoring procedures. Results are presented in Table 4. The overall correlation for state-developed assessments was .61. The overall correlation for commercially developed assessments was .54.

Table 1. Peer-Reviewed Studies Examining Criterion Validity of Curriculum-Based Measurement in Written Language.

| Study | n | Grade(s) | Type of prompt | Scoring procedure^a | Time (min) | Criterion measure |
| Parker, Tindal, and Hasbrouck (1991)^b | 54 | 6–8 | Story | WW, WSC, CWS | 6 | TOWL-Vocabulary; TOWL-Theme; TOWL-Spelling; TOWL-Word Use; TOWL-Style; TOWL-Handwriting; TOWL-Total Quotient |
| Espin, Scierka, Skare, and Halverson (1999) | 147 | 10^c | Story | WW, WSC, CWS | 3 | CAT-LA Total; CAT-LA Exp; CAT-LA Mech; CAT-Reading; CAT-Math |
| Espin et al. (2000) | 37 (3 min); 34 (5 min) | 8 | Story; Descriptive | WW, WSC, CWS, CIWS | 3, 5 | District writing test |
| Gansle, Noell, VanDerHeyden, Naquin, and Slider (2002) | 75 (third grade); 96 (fourth grade) | 3–4 | Story | WW, WSC, CWS | 3 | ITBS-Total subscale; LEAP-Write Competently; LEAP-Use Conventions of Language |
| Gansle et al. (2004) | 45 | 3–4 | Story | WW, CWS | 3 | WJ-R Writing Samples |
| Jewell and Malecki (2005) | 87 (second grade); 59 (fourth grade); 57 (sixth grade) | 2, 4, 6 | Story | WW, WSC, CWS | 3 | SAT 9–Language; SAT 9–Spelling |
| Weissenburger and Espin (2005) | 156 (fourth grade, 3 and 5 min); 184 (fourth grade, 10 min); 127 (eighth grade, 3 and 5 min); 137 (eighth grade, 10 min); 152 (10th grade, 3 min); 151 (10th grade, 5 min); 163 (10th grade, 10 min) | 4, 8, 10 | Story | WW, CWS, CIWS | 3, 5, 10 | WKCE Language Arts; WKCE Writing |
| Gansle, VanDerHeyden, Noell, Resetar, and Williams (2006) | 163 (SAT 9 total language); 118 (SAT 9 prewriting, composing, and editing) | 2–5 | Story | WW, WSC, CWS | 3 | SAT 9–Total Language; SAT 9–Prewriting; SAT 9–Composing; SAT 9–Editing |
| Espin, Wallace, et al. (2008) | 183 | 10 | Story | WW, WSC, CWS, CIWS | 3, 5, 7, 10 | MBST/MCA-Writing subtest |
| McMaster and Campbell (2008) | 25 (third grade); 43 (fifth grade); 55 (seventh grade) | 3, 5, 7 | Passage copy; Picture; Narrative; Expository | WW, WSC, CWS, CIWS | 3, 5, 7 | MCA-Writing subtest; TOWL-3 Spontaneous Writing |
| Cheng and Rose (2009)^b | 22 | 7–12 | Picture; Narrative | WW, WSC, CWS, CIWS | 3 | TOWL-3 Spontaneous Writing |
| McMaster, Du, and Pétursdóttir (2009) | 100 | 1 | Sentence copying; Story prompt; Picture word prompt; Picture prompt | WW, WSC, CWS, CIWS | 3, 5 | District rubric; TOWL-3 Spontaneous Writing |
| Coker and Ritchey (2010) | 75 (kindergarten, TEWL-2 Basic Writing); 69 (kindergarten, TEWL-2 Contextual Writing); 149 (first grade, WJ-III Spelling and Broad Writing); 154 (first grade, WJ-III Writing Samples) | K–1 | Sentence | WW, WSC, CWS | 3 | TEWL-2 Basic Writing; TEWL-2 Contextual Writing; WJ-III Spelling; WJ-III Writing Samples; WJ-III Broad Writing |
| Amato and Watkins (2011) | 447 | 8 | Story | WW, WSC, CWS, CIWS | 3 | TOWL-3 Writing Quotient |
| López and Thompson (2011) | 36 (sixth grade); 23 (seventh grade); 24 (eighth grade) | 6–8 | Story | CWS | 3 | AIMS-Writing assessment |
| McMaster, Du, et al. (2011) | 79 | 1 | Sentence copy; Picture word; Story | WW, WSC, CWS | 3, 5 | TOWL-3 Contextual Conventions; TOWL-3 Contextual Language; TOWL-3 Story Construction; TOWL-3 Total; TOWL-3 Total Combined |
| Parker, McMaster, and Burns (2011) | 85 | 1 | Picture word; Sentence copy | WW, WSC, CWS | 3 | TOWL-3 |
| Mercer, Martinez, Faust, and Mitchell (2012) | 163 | 10 | Passage copy; Expository; Narrative | WW, CWS, CIWS | 5 | ISTEP Plus EOC-Total; ISTEP Plus EOC-Reading; ISTEP Plus EOC-Writing |
| Campbell, Espin, and McMaster (2013)^b | 23–35 (varied by probe) | 10–12 | Narrative; Expository; Picture | WW, WSC, CWS, CIWS | 3, 5, 7 | MBST; TOWL-3 Spontaneous Writing |
| Ritchey and Coker (2013) | 88 (second grade); 82 (third grade) | 2–3 | Picture story; Story starter | WW, WSC, CWS | 3, 5 | WJ-III Writing Samples |
| Ritchey and Coker (2014) | 150 | 1 | Sentence writing; Picture story | WW, CWS, CIWS | 3, 5 | WJ-III Spelling; WJ-III Writing Samples; WJ-III Broad Writing |
| Dockrell, Connelly, Walter, and Critten (2015) | 236 | 3–5 | Narrative; Expository | WW, CWS | 5 | Wechsler Objective Language Dimensions |

Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; TOWL = Test of Written Language; CAT = California Achievement Test; CIWS = correct minus incorrect word sequence; ITBS = Iowa Test of Basic Skills; LEAP = Louisiana Educational Assessment Program; WJ-R = Woodcock–Johnson–Revised; SAT 9 = Stanford Achievement Test–9th Edition; WKCE = Wisconsin Knowledge and Concepts Exam; MBST/MCA = Minnesota Basic Standards Test/Minnesota Comprehensive Assessment; MCA = Minnesota Comprehensive Assessment; TOWL-3 = Test of Written Language–3; TEWL-2 = Test of Early Written Language–2; WJ-III = Woodcock–Johnson III; AIMS = Arizona Instrument to Measure Standards writing assessment; ISTEP Plus EOC = Indiana Statewide Testing for Educational Progress Plus End of Course Assessment; IEP = Individual Education Plan.
^a Only scoring procedures related to this study are reported here (WW, WSC, CWS, CIWS). Some of these studies included other scoring procedures not reported here. ^b These studies used specific samples. Parker, Tindal, and Hasbrouck (1991) used a sample of students receiving IEP services. Cheng and Rose (2009) used a sample of students receiving services for hearing impairments. Campbell et al. (2013) used a sample of students receiving services as English Language Learners. ^c CBM-W probes were administered in 10th grade and criterion measures were administered in 11th grade.

Table 2. Mean Correlations for Scoring Procedures.

| Scoring procedure | Correlation | Confidence interval | n |
| WW | .37 | [.31, .43] | 28 |
| WSC | .44 | [.37, .51] | 22 |
| CWS | .51 | [.46, .56] | 31 |
| CIWS | .60 | [.54, .66] | 15 |

Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; CIWS = correct minus incorrect word sequence.

Table 3. Mean Correlations for CBM Scoring Procedures Across Four Grade Categories.

| Grade category | Scoring procedure | Correlation | Confidence interval | n |
| K–2 | WW | .46 | [.34, .56] | 8 |
| K–2 | WSC | .52 | [.43, .61] | 8 |
| K–2 | CWS | .56 | [.47, .64] | 8 |
| K–2 | CIWS | .58 | [.49, .66] | 3 |
| 3–5 | WW | .34 | [.26, .42] | 8 |
| 3–5 | WSC | .34 | [.25, .41] | 5 |
| 3–5 | CWS | .48 | [.42, .54] | 8 |
| 3–5 | CIWS | .65 | [.54, .73] | 2 |
| 6–8 | WW | .32 | [.20, .42] | 5 |
| 6–8 | WSC | .39 | [.22, .54] | 4 |
| 6–8 | CWS | .50 | [.40, .59] | 8 |
| 6–8 | CIWS | .59 | [.49, .67] | 4 |
| 9–12 | WW | .35 | [.21, .48] | 7 |
| 9–12 | WSC | .47 | [.27, .64] | 5 |
| 9–12 | CWS | .52 | [.37, .65] | 7 |
| 9–12 | CIWS | .65 | [.49, .77] | 6 |

Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; CIWS = correct minus incorrect word sequence.

Table 4. Correlations of Combined Scoring Procedures for State- and Commercially Developed Tests.

| Assessment type | Correlation | Confidence interval | n |
| State-developed tests | .61 | [.51, .69] | 10 |
| Commercially developed tests | .54 | [.48, .59] | 19 |

Discussion

Our meta-analysis addressed questions about what Fuchs (2004) termed Stage 1 research in CBM. Overall, we found a mean correlation of r = .55 when combining all CBM-W scoring indices. This correlation is the highest possible criterion validity coefficient when aggregating the highest correlation coefficients from each study. This correlation includes all types of scoring procedures for all grade levels. Therefore, it is difficult to make conclusive statements about CBM-W based on this one statistic. In general, these findings fall in the middle of the ranges reported by McMaster and Espin (2007); however, this result allows researchers to provide a more specific and stable answer to the question of how valid CBM-W scores are. Based on the study samples included in this meta-analysis, CBM scoring procedures appear to have moderate criterion validity in relation to commercially and state- or locally developed criterion measures.

When examining each scoring procedure individually, CIWS (r = .60) had the highest criterion validity followed by CWS (r = .51), WSC (r = .44), and WW (r = .37). Although we did not conduct inferential statistics due to the small sample size, there was overlap in the confidence intervals of CIWS and CWS, CWS and WSC, and WSC and WW. CWS and CIWS appear to substantially outperform WW or WSC.

Some may find these results surprising. WW had high criterion validity at its initial development (Deno et al., 1980), but our results are consistent with the trends McMaster and Espin (2007) described: After initial IRLD studies, CBM-W measures tended to have lower criterion validity. CIWS was developed much later than the other measures and was primarily developed for use with secondary students (Espin et al., 2000). Although it has few studies examining criterion validity with younger students, it does appear to be a promising measure in all grade levels.

When examining CBM-W measures at each grade level, we identified trends for each scoring procedure. Specifically, WW had the lowest and CIWS had the highest criterion validity at each grade level. The confidence interval for WW had little overlap with the confidence interval for CIWS, indicating that CIWS likely has stronger criterion validity than WW at every grade level. In most grade levels (3–5 being the only exception), criterion validity increased as the complexity of the scoring procedure increased. For all other grade levels, CIWS was the highest followed by CWS, WSC, and WW.

Finally, no major differences were detected between using commercially developed (r = .54) and state-developed assessments (r = .61) as the criterion. CBM-W appears to predict performance similarly on both types of criterion measures included in this analysis.

Based on the results of Orwin's failsafe N, we believe these results are robust to publication bias. The failsafe N test indicated that 29 studies with a mean effect of .00 would have to be hidden from this search to lower the overall mean correlation from .55 to .30. We believe it is unlikely that 29 non-peer-reviewed manuscripts exist that meet the rest of our criteria, and if 29 studies do exist, we do not believe the overall mean effect would be .00.

When answering questions about Phase 1 of CBM research, considering context of writing assessments is helpful. Although lower than what has been reported for CBM in reading (Wayman et al., 2007), these mean correlations are similar to correlations of other standardized writing measures. In an area such as written language where few progress monitoring measures exist, these correlations could be considered strong enough to influence instructional decision making.

Future Research

The results from this study have at least four implications for future research. First, meta-analytic research should continue answering questions about Stage 1 research in CBM.

One possibility is to examine the criterion validity of specific administration procedures. For example, McMaster and Espin (2007) found that longer durations of writing typically lead to stronger criterion validity. Also, Espin et al. (2000) hypothesized that expository prompts may have higher criterion validity with secondary students because much of the academic writing they do in school is expository. Second, researchers should continue examining the criterion validity of CIWS. Our extensive search revealed relatively few studies examining the criterion validity of this scoring procedure. Because it had the highest criterion validity in all grade levels, it merits additional study.

Third, researchers should continue investigating the technical aspects of static scores as criterion measures change. The Test of Written Language–3 was commonly used as a criterion measure in these studies. However, the Test of Written Language–4 is now available. In addition, many states are changing state assessments to align with the Common Core State Standards. As new writing assessments are developed commercially and at the state level, researchers should continue to investigate the criterion validity of CBM-W measures with these new measures.

Finally, researchers should consider entering Phase 2 of CBM-W research. Some have begun this work already. For example, McMaster, Du, et al. (2011) examined the slopes of first-grade CBM data and determined that eight or nine data points are needed for reliable and stable slopes. Similar work needs to be done across grade levels and demographic groups.
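As a hedged illustration of what this Phase 2 "slope" research examines, the snippet below fits an ordinary least-squares trend line to one hypothetical student's weekly CWS scores. The scores and the choice of a simple linear fit are assumptions for illustration, not a method taken from the studies cited above.

```python
import numpy as np

# Hypothetical progress-monitoring data: weekly CWS scores for one student.
weeks = np.arange(1, 10)                      # nine weekly probes
scores = np.array([18, 21, 19, 24, 26, 25, 29, 31, 30])

# Ordinary least-squares slope: average gain in CWS per week.
slope, intercept = np.polyfit(weeks, scores, 1)
print(round(slope, 2), round(intercept, 2))   # slope ~1.65 CWS per week
```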
Implications

Based on the results of this analysis, it appears that practitioners should discontinue the use of WW as a measure of overall writing progress. Of all the CBM-W scoring measures, it provided the weakest criterion validity. Furthermore, practitioners should feel comfortable with the CWS and CIWS scoring procedures as moderate predictors of overall writing ability.

Limitations

In addition to the shortage of studies in certain grade levels, results of this analysis should be viewed in light of at least three other limitations. First, this review examined criterion validity of CBM-W measures only when using commercially developed or state-developed writing assessments. Using holistic ratings, teacher ratings, and GPA may produce different validity coefficients. However, based on findings from McMaster and Espin (2007), any differences are likely to be minimal.

Second, this analysis only included the highest coefficient for each scoring procedure in each sample. This selection procedure means that these results are the highest possible mean weighted criterion validity coefficients that can be calculated from these studies. Including more coefficients from each study could lower the criterion validity results reported here.

Finally, this analysis included only studies from peer-reviewed sources. Orwin's failsafe N bolsters confidence in the results, but it cannot eliminate the possibility of publication bias.

Conclusion

The present meta-analysis was the first known study to calculate the mean weighted criterion validity of CBM-W measures. The findings suggest that practitioners should consider incorporating CBM-W into IEP development and progress monitoring. Researchers should consider analyzing technical characteristics of CBM-W slope data.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

References marked with an asterisk indicate studies included in the meta-analysis.

*Amato, J. M., & Watkins, M. W. (2011). The predictive validity of CBM writing indices for eighth-grade students. The Journal of Special Education, 44, 195–204. doi:10.1177/0022466909333516

Archer, A. L., & Hughes, C. A. (2010). Explicit instruction: Effective and efficient teaching. New York, NY: Guilford.

Borenstein, M. (2005). Software for publication bias. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment, and adjustments (pp. 193–220). West Sussex, UK: John Wiley.

Campbell, H. M. (2010). The technical adequacy of curriculum-based measurement passage copying with secondary school English language learners. Reading & Writing Quarterly, 26, 289–307.

*Campbell, H. M., Espin, C. A., & McMaster, K. (2013). The technical adequacy of curriculum-based writing measures with English learners. Reading and Writing, 26, 431–452. doi:10.1007/s11145-012-9375-6

*Cheng, S. F., & Rose, S. (2009). Investigating the technical adequacy of curriculum-based measurement in written expression for students who are deaf or hard of hearing. Journal of Deaf Studies and Deaf Education, 14, 503–515. doi:10.1093/deafed/enp013

Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics, 10, 101–129. doi:10.2307/3001666

*Coker, D. L., & Ritchey, K. D. (2010). Curriculum-based measurement of writing in kindergarten and first grade: An investigation of production and qualitative scores. Exceptional Children, 76, 175–193. doi:10.1177/001440291007600203

Comprehensive Meta-Analysis (Version 2) [Computer software]. (2005). Englewood, NJ: Biostat.

Cutler, L., & Graham, S. (2008). Primary grade writing instruction: A national survey. Journal of Educational Psychology, 100, 907–919.

Deno, S. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219–232. doi:10.1177/001440298505200303

Deno, S., Mirkin, P. K., & Marston, D. (1980). Relationships among simple measures of written expression and performance on standardized achievement tests (Report No. 22). Minneapolis: University of Minnesota Institute for Research on Learning Disabilities.

*Dockrell, J. E., Connelly, V., Walter, K., & Critten, S. (2015). Assessing children's writing products: The role of curriculum based measures. British Education Research Journal, 41, 575–595. doi:10.1002/berj.3162

Epstein, M., Atkins, M., Cullinan, D., Kutash, K., & Weaver, R. (2008). Reducing behavior problems in the elementary school classroom: A practice guide (NCEE #2008-012). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. doi:10.1037/e602222011-001

Espin, C. A. (2014). What are goed righting! (What is good writing?). Literacy Research and Instruction, 53, 93–95. doi:10.1080/19388071.2014.869944

*Espin, C. A., Scierka, B. J., Skare, S., & Halverson, N. (1999). Criterion-related validity of curriculum-based measures in writing for secondary students. Reading & Writing Quarterly, 15, 5–27. doi:10.1080/105735699278279

*Espin, C. A., Shin, J., Deno, S. L., Skare, S., Robinson, S., & Benner, B. (2000). Identifying indicators of written expression proficiency for middle school students. The Journal of Special Education, 34, 140–153. doi:10.1177/002246690003400303

*Espin, C., Wallace, T., Campbell, H., Lembke, E. S., Long, J. D., & Ticha, R. (2008). Curriculum-based measurement in writing: Predicting the success of high-school students on state standards tests. Exceptional Children, 74, 174–193. doi:10.1177/001440290807400203

Fuchs, L. S. (2004). The past, present, and future of curriculum-based measurement research. School Psychology Review, 33, 188–192.

*Gansle, K. A., Noell, G. H., VanDerHeyden, A. M., Naquin, G. M., & Slider, N. J. (2002). Moving beyond total words written: The reliability, criterion validity, and time cost of alternate measures for curriculum-based measurement in writing. School Psychology Review, 31, 477–497.

*Gansle, K. A., Noell, G. H., VanDerHeyden, A. M., Slider, N. J., Hoffpauir, L. D., Whitmarsh, E. L., & Naquin, G. M. (2004). An examination of the criterion validity and sensitivity to brief intervention of alternate curriculum-based measures of writing skill. Psychology in the Schools, 41, 291–300. doi:10.1002/pits.10166

*Gansle, K. A., VanDerHeyden, A. M., Noell, G. H., Resetar, J. L., & Williams, K. L. (2006). The technical adequacy of curriculum-based and rating-based measures of written expression for elementary school students. School Psychology Review, 35, 435–450.

Graham, S., Harris, K. R., & Hebert, M. (2011). It is more than just the message: Presentation effects in scoring writing. Focus on Exceptional Children, 44, 1–12.

Higgins, J. P., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539–1558. doi:10.1002/sim.1186

Hosp, M. K., Hosp, J. L., & Howell, K. W. (2012). The ABCs of CBM: A practical guide to curriculum-based measurement. New York, NY: Guilford.

Individuals With Disabilities Education Act, 20 U.S.C. § 1400 (2004).

*Jewell, J., & Malecki, C. K. (2005). The utility of CBM written language indices: An investigation of production-dependent, production-independent, and accurate-production scores. School Psychology Review, 34, 27–44.

Lembke, E., Hampton, D., & Hendricker, E. (2013). Data-based decision-making in academics using curriculum-based measurement. In J. W. Lloyd, T. J. Landrum, B. Cook, & M. Tankersley (Eds.), Research-based approaches for assessment (pp. 18–31). Boston, MA: Pearson.

*López, F. A., & Thompson, S. S. (2011). The relationship among measures of written expression using curriculum-based measurement and the Arizona Instrument to Measure Skills (AIMS) at the middle school level. Reading & Writing Quarterly, 27, 129–152. doi:10.1080/10573561003769640

*McMaster, K. L., & Campbell, H. (2008). New and existing curriculum-based writing measures: Technical features within and across grades. School Psychology Review, 37, 550–566.

*McMaster, K. L., Du, X., & Pétursdóttir, A. L. (2009). Technical features of curriculum-based measures for beginning writers. Journal of Learning Disabilities, 42, 41–60. doi:10.1177/0022219408326212

*McMaster, K. L., Du, X., Yeo, S., Deno, S. L., Parker, D., & Ellis, T. (2011). Curriculum-based measures of beginning writing: Technical features of the slope. Exceptional Children, 77, 185–206. doi:10.1177/001440291107700203

McMaster, K. L., & Espin, C. (2007). Technical features of curriculum-based measurement in writing: A literature review. The Journal of Special Education, 41, 68–84. doi:10.1177/00224669070410020301

McMaster, K. L., Ritchey, K. D., & Lembke, E. (2011). Curriculum-based measurement of elementary students' writing: Recent developments and future directions. In T. E. Scruggs & M. A. Mastropieri (Eds.), Assessment and intervention: Advances in learning and behavioral disabilities (pp. 111–148). Bingley, UK: Emerald.

*Mercer, S. H., Martinez, R. S., Faust, D., & Mitchell, R. R. (2012). Criterion-related validity of curriculum-based measurement in writing with narrative and expository prompts relative to passage copying speed in 10th grade students. School Psychology Quarterly, 27, 85–95. doi:10.1037/a0029123

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: Macmillan.

*Parker, D. C., McMaster, K. L., & Burns, M. K. (2011). Determining an instructional level for early writing skills. School Psychology Review, 40, 158–167.

*Parker, R. I., Tindal, G., & Hasbrouck, J. (1991). Progress monitoring with objective measures of writing performance for students with disabilities. Exceptional Children, 58, 61–73. doi:10.1177/001440299105800106

*Ritchey, K. D., & Coker, D. L. (2013). An investigation of the validity and utility of two curriculum-based measurement writing tasks. Reading & Writing Quarterly, 29, 89–119. doi:10.1080/10573569.2013.741957

*Ritchey, K. D., & Coker, D. L. (2014). Identifying writing difficulties in first grade: An investigation of writing and reading measures. Learning Disabilities Research & Practice, 29, 54–65. doi:10.1111/ldrp.12030

Slavin, R. E. (1986). Best-evidence synthesis: An alternative to meta-analytic and traditional reviews. Educational Researcher, 15, 5–11. doi:10.3102/0013189x015009005

Taylor, R. L. (2006). Assessment of exceptional students: Educational and psychological procedures (7th ed.). Boston, MA: Pearson Education.

Venn, J. J. (2013). Assessing students with special needs. Boston, MA: Pearson Education.

Videen, J., Deno, S., & Marston, D. (1982). Correct word sequences: A valid indicator of proficiency in written expression (Report No. 84). Minneapolis: University of Minnesota Institute for Research on Learning Disabilities.

Wayman, M. M., Wallace, T., Wiley, H. I., Tichá, R., & Espin, C. A. (2007). Literature synthesis on curriculum-based measurement in reading. The Journal of Special Education, 41, 85–120. doi:10.1177/00224669070410020401

*Weissenburger, J. W., & Espin, C. A. (2005). Curriculum-based measures of writing across grade levels. Journal of School Psychology, 43, 153–169. doi:10.1016/j.jsp.2005.03.002
