Evaluation and Program Planning 83 (2020) 101852

Interpreting the effectiveness of a summer reading program: The eye of the beholder

Deborah K. Reed*, Ariel M. Aloe
Iowa Reading Research Center, University of Iowa, 103 Lindquist Center, Iowa City, IA, 52242, USA
* Corresponding author. E-mail addresses: deborah-reed@uiowa.edu (D.K. Reed), ariel-aloe@uiowa.edu (A.M. Aloe).

A R T I C L E  I N F O

Keywords:
Effect sizes
Growth modeling
Propensity score matching
Summer reading
Summative evaluation

A B S T R A C T

In applying a methods-oriented approach to evaluation, this study interpreted the effectiveness of a summer reading program from three different stakeholder perspectives: practitioners from the school district, the funding agency supporting the program, and the policymakers considering mandating summer school. Archival data were obtained on 2330 students reading below benchmark in Grades 2–5. After propensity score matching participants to peers who did not attend the summer program, the final sample consisted of 630 students. Pre-to-posttest growth models revealed positive effects in Grades 2–4 (standardized slopes of .40–.54), but fifth graders demonstrated negligible improvement (standardized slope of .15). The standardized mean differences of propensity score matched treatment and control group students indicated null effects in all grade levels (d = −.13 to .05). Achieving proficient reading performance also was not attributable to summer school participation. Findings underscore the importance of operationalizing effectiveness in summative evaluation.

1. Introduction

A primary function of summative evaluation is to determine whether the program being implemented is achieving its outcomes. In school settings, the outcomes often concern whether there are statistically and practically significant improvements in student performance (What Works Clearinghouse [WWC], 2017). This is consistent with a method-oriented approach to evaluation where the evaluator judges a program and attempts to attribute effects by employing rigorous scientific research (Boruch, 1997; Cook & Campbell, 1979). Rigor is not inherent to a particular study design. Rather, the method chosen must be appropriate for addressing the stakeholders' question(s) about the program (Clark, van Kerkhoff, Lebel, & Gallopin, 2016; Patton, 1997). When a broad stakeholder base is involved, it can be challenging to balance competing interests and provide a coherent determination of the program's effectiveness (Mielke, Vermaßen, Ellenbeck, Milan, & Jaeger, 2016).

In the present evaluation study, all stakeholders wanted to know: Is the school district's summer reading program effective at improving students' reading abilities? However, groups were asking the question from different perspectives. The school personnel were interested in whether participating elementary students improved their pre-to-posttest reading performance (hereafter referred to as the practitioner perspective). The private foundation funding the program was interested in whether students in the summer program significantly improved in their abilities when compared to their peers who did not participate (hereafter referred to as the funding agency perspective). Finally, the state government was considering enacting a policy that would require students who were not reading proficiently to attend summer reading programs, so they were interested in whether participation raised the category of students' reading performance (hereafter referred to as the policymaker perspective).

Each perspective involved different types of reading scores and statistical analyses, so an effect size calculated to answer the question of one stakeholder group would not be appropriate to answer the question of the other groups. That is, effect sizes cannot be considered absolute quantities with universal interpretation values, as might erroneously be inferred from suggestions about hinge points (Hattie, 2009) or substantively important value thresholds (WWC, 2017).

2. Literature review

2.1. Different calculations of effectiveness



Per recommendations (e.g., American Psychological Association, 2010; WWC, 2017), researchers and evaluators typically report a single approach to determining the treatment effect, but more than one measure of effect size may be necessary to improve interpretation of the estimates (Fritz, Morris, & Richler, 2012). We applied three data analytic approaches to address the questions different stakeholder groups had about the impact of the summer reading program. In the sections that follow, we describe the rationale for these analyses and the influences of each on the resulting effect sizes.

2.1.1. Practitioner perspective
One means of calculating an effect size is to analyze the raw growth of students from the start of their intervention to its completion, using the continuous scores obtained on a pre- and posttest. This approach is popular among practitioners who need to document student improvement over the course of their instruction (American Institutes for Research [AIR], 2014; Simkins & Allen, 2000; Sokolsky, Tweedie, & McMillan, 2016). Such was the case for the school district personnel who planned and delivered the summer reading program from which the data were obtained for our analyses. School personnel were interested in knowing whether the student participants improved their reading scores, so our first approach to determining effectiveness employed growth modeling.

Nevertheless, single-group, pretest-posttest designs are associated with inflated effect sizes compared to control group designs (Lipsey & Wilson, 1993). This can be due to a number of threats to internal validity such as normal maturation of the students, practice effects or lack of equivalence of the baseline and outcome tests, exposure to other instruction or educational experiences, and regression of scores to the mean (Shadish, Cook, & Campbell, 2002). A review of 13 educational research journals found that 16 of 490 articles published in 2009 reported single-group, pretest-posttest studies (Marsden & Torgerson, 2012). All claimed a causal relationship was established between the intervention and the outcome, but none acknowledged the potential for regression to the mean effects. Furthermore, few mentioned other potential confounds, including those associated with the threats to internal validity. This raises concerns for how a practitioner might interpret the effects obtained from this approach because it is not possible to know whether students would have experienced similar growth without the intervention.

2.1.2. Funding agency perspective
Causal inferences typically are reserved for true experiments or high-quality quasi-experiments. In both types of designs, effects are often calculated by comparing the differences in the posttest means of treatment and comparison groups, so they are the preferred approach of researchers and agencies that fund research or evaluations (Gersten et al., 2005; WWC, 2017). Although experimental studies are the gold standard, random assignment is not always possible in school settings (Shadish et al., 2002). Thus, quasi-experimental designs including historical cohort control groups (Walser, 2014), regression discontinuity (Campbell & Stanley, 1963; Gleason, Resch, & Berk, 2012), and propensity score matching (Pearl, 2010; Rosenbaum & Rubin, 1983) that meet the assumptions of their respective statistical models are often the closest available evidence to randomized studies. Therefore, we undertook a propensity score analysis to answer the question of the program's funder about whether the participants outperformed comparable students who did not attend summer school.

Having a comparison group can alleviate some (e.g., maturation, unplanned historical events) but not all threats to internal validity (e.g., selection bias, attrition, testing effects, participants' motivation to be involved, treatment integrity) that have been shown to impact effect sizes (Simmerman & Swanson, 2001). Pragmatically, rigorously designed randomized control trials may not be relevant to educators or ethical for students. For example, positive effects found for an intervention mean little when compared to a no-treatment control or to business-as-usual instruction that has already been identified as lacking elements associated with the desired outcome. As expressed by Cook and Beckman (2010), no-treatment controls prove only that "if learners spend time learning, they will learn" (p. 460). Similarly, multicomponent interventions, even when compared to another treatment, make it difficult to identify the exact cause of any change and make replications difficult or imprecise (Norman, 2003). Finally, a significant improvement may not be discernible to educators or may not be enough improvement for students to achieve academic proficiency (Hanushek, Kain, & Rivkin, 1998).

2.1.3. Policymaker perspective
A major impetus for adopting new programs is to increase the numbers of students attaining proficiency or meeting grade-level benchmarks on accountability measures (Baete & Hochbein, 2014). Hence, decision- and policymakers often are concerned with changes in categorical designations for scores (e.g., proficient, basic, or below basic) used for accountability purposes (National Center for Education Statistics [NCES], 2018). From this perspective, an effective intervention would be one associated with students moving to a higher performance category. This was the concern of the policymakers in the state where the summer reading program we studied took place, so our third approach to evaluating the effectiveness of the intervention was to compare the percentages of treatment and control students who achieved proficient reading performance.

Calculating effectiveness by changes in performance categories could be more meaningful or more accurately interpreted (Ellis, 2010), but such gains likely are difficult to realize in the short term. For example, the achievement-level results on the National Assessment of Educational Progress (NAEP) reading test were unchanged from 2015 to 2017, but have shown a small and significant increase in the percentage of students performing at proficient or advanced levels since 1992 (NCES, 2018). Performance categories represent a range of scores on an assessment (e.g., the NAEP proficient range includes scale scores of 238–267), so they can mask growth that is occurring within a given range. This has led to a policy debate about whether student learning should be measured by the attainment of proficiency or by individual growth from one assessment to the next (Lachlan-Haché & Castro, 2015). Sensitivity to growth would be a concern when outcomes are expected relatively quickly, as was the case with the summer reading program investigated in the current study.

2.2. Summer reading programs

Over one-third of the state education agencies in the U.S. recommend or require summer school for elementary students who are not reading on grade level (Workman, 2014), and these requirements may stipulate the implementation of programs with evidence of effectiveness (McCombs et al., 2011). Although there is a large and ever-growing body of literature available to inform the design and delivery of effective reading instruction for all students and those in need of supplementary intervention (e.g., Connor, Alberto, Compton, & O'Connor, 2014; Torgesen, 2002), there are limitations to the available research on reading programs offered during the summer when students typically are on a two- to three-month break.

First, students typically are not randomly assigned to conditions—if comparison conditions are even present in the study—which makes it difficult to disentangle the effects of the program from the characteristics of the students who are or are not in them (Cooper, Charlton, Valentine, & Muhlenbruck, 2000). Second, most estimations of effects are based on testing conducted before the end of the spring semester preceding the program and after the start of the fall semester following the program, when students were receiving other reading instruction (e.g., Wilkins et al., 2012). In studies that administered progress monitoring tests during the summer, students often were compared either to themselves during periods of instruction and no instruction (e.g., Zvoch & Stevens, 2015) or to a no-treatment control group (e.g., Johnston, Riley, Ryan, & Kelly-Vance, 2015).


A third limitation is that research on summer programs often targets voluntary reading by providing books, rather than providing classroom-based instruction (e.g., Kim et al., 2017; Wilkins et al., 2012). Relatedly, commercially available curricula rarely are designed for the narrow purpose of a summer program, so studies that include classroom instruction often involve a researcher-developed reading program suitable for the particular context (e.g., Johnston et al., 2015).

3. Purpose and research question

Because existing studies of summer reading programs have used singular approaches to estimating effectiveness, it is not known how the interpretation might change based on the kinds of scores and analytic approaches used to generate the estimates. Thus, this extant data study was conducted to examine different approaches to interpreting the outcomes of the district's summer program offered to below-level readers in elementary school. The primary research question (RQ) addressed was: How might effectiveness of the summer reading program be interpreted when considered from three different stakeholder perspectives? To be able to answer this question, we posed three secondary research questions:

1. Practitioner perspective: What are the pre-to-posttest improvements in participating students' overall reading achievement?
2. Funding agency perspective: What are the relative gains of eligible students who do and do not attend a summer reading program when analyzing scale scores on a measure of overall reading achievement and a state accountability assessment?
3. Policymaker perspective: What are the relative gains of eligible students who do and do not attend a summer reading program when analyzing categorical designations derived from the same measures?

4. Method

4.1. Participants

Archival data were obtained from a Midwestern school district serving approximately 8600 students in kindergarten through twelfth grade. Districtwide, about 64 % of students received free or reduced-price lunch (FRL), a proxy for economic disadvantage; 7 % were English learners (EL); and 19 % were receiving special education services.

The district routinely evaluated the progress of their students' reading abilities with universal screening measures administered at the beginning, middle, and end of the year. Scores on these measures were used to identify students who were eligible for summer school, which was designed as an intensive reading program for elementary students not reading proficiently. Given that the costs associated with summer school were not covered by state appropriations, the district had applied for and received a grant from a local philanthropic organization that would allow them to offer a limited number of placements to students experiencing reading difficulties.

The 11 elementary school principals were asked to review the records of students who were not meeting grade-level benchmarks on the interim assessments (N = 2,330) and decide who was most likely to benefit from participation in the summer program. Principals were to consider students' reading performance and their need for ongoing support. Hence, assignment to the treatment was not random and introduced a potential selection bias, but the method of assignment was relevant to practical constraints—a consideration noted by Cronbach (1982). Consequently, we relied on propensity score matching and weighting to make the groups comparable on the observable covariates. Moreover, only a small impact of bias has been found to result from non-randomized formation of groups (Wilson & Lipsey, 2001).

Summer school was offered to students in kindergarten and first grade, but only students in Grades 2–5 were assessed with the measures of interest. Therefore, only their data were analyzed in this study. The grade levels indicated were based on the grade students had just completed at the time they were identified for the summer reading program. Although these grades changed by the posttest (i.e., the subsequent school year), the initial grade-level label is used throughout this report to avoid confusion. We analyzed data only for students with successful matches. Before matching, control group sizes were Grade 2 = 466, Grade 3 = 534, Grade 4 = 522, and Grade 5 = 493. The treatment group sizes were Grade 2 = 104, Grade 3 = 91, Grade 4 = 87, and Grade 5 = 33. After matching control units to treatment units, the total sample sizes for each grade level were Grade 2 = 208, Grade 3 = 182, Grade 4 = 174, and Grade 5 = 66.

Prior to the start of the program, the summer school coordinator placed the invited students in classes. This was not done in any purposeful manner, other than to achieve balance in class sizes. Because the formation of classes was done before the first day of summer school, the initial attrition from students who declined their invitation to attend resulted in unequal class sizes (range of 6–22 students per class across Grades 2–5; Mdn = 11).

4.2. Intervention

The summer program was held for 29 days, with 2 h each morning devoted to reading instruction and 3 h in the afternoon devoted to enrichment activities (e.g., field trips, arts and crafts). In keeping with recommendations from a review of research (McCombs et al., 2011), the district designed its summer program as a more intensified version of its usual literacy instruction offered during the regular academic year. That is, class sizes were smaller (an average of 11–14 students in the summer versus an average of 23 in the regular academic year) and the instructional components (described in the following sections) were tailored to students' needs.

4.2.1. Guided reading
Guided reading involved teaching small, heterogeneous groups of 3–4 students (Dorn & Soffos, 2009; Fountas & Pinnell, 2001). The approximately 20-min lessons per group were developed around a text selected to match that group's abilities. Using that text, the teacher instructed the small group how to identify and determine the meaning of unfamiliar words as well as monitor their comprehension of the text. The specific skills and strategies taught each day (e.g., word identification, fluency, vocabulary, or comprehension) were determined using the students' previous days' performance on reading tasks.

4.2.2. Independent reading activities
While the teacher met with guided reading groups, the other students rotated through a computer station and reading workshop. The computer-based supplemental program was used to deliver about 30 min of individualized practice opportunities on reading skills. During the approximately 30 min of reading workshop, students read books selected in consultation with their teacher and kept a journal to record their thoughts and reactions to the literature (Calkins & Tolan, 2010). Students shared their journal entries with a peer who offered feedback. There also were times during the week to confer with the teacher and to write about the literature.

4.2.3. Interactive writing
Students who were identified for supplemental intervention also worked with the teacher for about 30 min to compose a short text (McCarrier, Pinnell, & Fountas, 2000). Lessons began with brief instruction on writing mechanics (e.g., punctuating sentences) and conventions (e.g., forming paragraphs) that were intended to address errors found in students' previous written products. After reading a story that introduced the topic for the day, the teacher presented a sentence starter such as, "In the lunchroom. . ." Students then contributed ideas and took turns adding to the response by writing the next sentence. The teacher supported the correct construction of students' sentences.


Table 1
Group Means, Standard Deviations, Effect Sizes, and t-Statistics by Covariate Before and After Propensity Weighting.
Unbalanced Balanced

Treatment Control Treatment Control

Mean SD Mean SD d t-test p-value Mean SD Mean SD d t-test p-value


Second Grade (NControl = 466, NTreatment = 104) Second Grade (NControl = 104, NTreatment = 104)
ELL 0.160 0.370 0.060 0.240 0.369 2.660 0.010 0.150 0.360 0.150 0.360 0.003 0.020 0.980
IEP 0.090 0.280 0.200 0.400 −0.290 −3.410 0.000 0.080 0.270 0.080 0.270 −0.005 −0.040 0.970
SRIScore 280.28 195.41 438.35 238.96 −0.660 −7.190 0.000 288.34 198.09 288.40 211.19 0.000 0.000 1.000
Hispanic 0.180 0.390 0.140 0.350 0.109 0.960 0.340 0.170 0.370 0.170 0.370 −0.001 0.000 1.000
AfriAmer 0.090 0.280 0.050 0.210 0.174 1.350 0.180 0.080 0.280 0.080 0.280 0.002 0.010 0.990
FreeRL 0.620 0.490 0.540 0.500 0.157 1.500 0.140 0.650 0.480 0.650 0.480 −0.001 −0.010 1.000
Female 0.440 0.500 0.500 0.500 −0.109 −1.010 0.310 0.450 0.500 0.450 0.500 0.000 0.000 1.000

Third Grade (NControl = 534, NTreatment = 91) Third Grade (NControl = 91, NTreatment = 91)
ELL 0.180 0.380 0.100 0.300 0.234 1.760 0.080 0.160 0.370 0.160 0.370 0.002 0.010 0.990
IEP 0.240 0.430 0.190 0.390 0.134 1.110 0.270 0.330 0.470 0.330 0.470 0.009 0.060 0.950
SRIScore 440.81 164.27 594.12 238.95 −0.650 −7.660 0.000 443.63 160.38 444.51 196.23 −0.005 −0.030 0.970
Hispanic 0.210 0.410 0.190 0.400 0.038 0.330 0.740 0.190 0.390 0.190 0.390 −0.004 −0.030 0.980
AfriAmer 0.050 0.230 0.050 0.210 0.030 0.250 0.800 0.060 0.230 0.050 0.230 0.002 0.010 0.990
FreeRL 0.600 0.490 0.550 0.500 0.113 1.010 0.310 0.600 0.490 0.600 0.490 −0.001 −0.010 0.990
Female 0.430 0.500 0.490 0.500 −0.115 −1.020 0.310 0.450 0.500 0.450 0.500 0.003 0.020 0.980

Fourth Grade (NControl = 522, NTreatment = 87) Fourth Grade (NControl = 87, NTreatment = 87)
ELL 0.200 0.400 0.060 0.230 0.546 3.320 0.000 0.210 0.410 0.210 0.410 −0.010 −0.060 0.950
IEP 0.170 0.380 0.190 0.390 −0.056 −0.510 0.610 0.250 0.440 0.250 0.440 0.005 0.030 0.970
SRIScore 601.14 182.60 754.26 244.87 −0.630 −6.890 0.000 603.54 184.97 601.13 229.80 0.012 0.070 0.940
Hispanic 0.190 0.400 0.130 0.340 0.160 1.270 0.200 0.190 0.400 0.200 0.400 −0.010 −0.060 0.950
AfriAmer 0.070 0.250 0.050 0.230 0.060 0.490 0.630 0.050 0.220 0.050 0.220 −0.011 −0.070 0.950
FreeRL 0.610 0.490 0.510 0.500 0.203 1.820 0.070 0.620 0.490 0.620 0.490 −0.006 −0.040 0.970
Female 0.560 0.500 0.470 0.500 0.181 1.590 0.110 0.560 0.500 0.550 0.500 0.018 0.110 0.910

Fifth Grade (NControl = 493, NTreatment = 33) Fifth Grade (NControl = 33, NTreatment = 33)
ELL 0.150 0.360 0.050 0.220 0.447 1.620 0.110 0.150 0.360 0.150 0.360 −0.003 −0.010 0.990
IEP 0.270 0.450 0.170 0.370 0.274 1.310 0.190 0.250 0.440 0.250 0.440 0.006 0.020 0.980
SRIScore 705.30 213.90 855.06 214.73 −0.688 −3.950 0.000 714.95 209.42 715.58 218.10 −0.003 −0.010 0.990
Hispanic 0.300 0.470 0.160 0.370 0.385 1.760 0.080 0.320 0.470 0.320 0.470 −0.002 −0.010 0.990
AfriAmer 0.060 0.240 0.050 0.210 0.072 0.360 0.720 0.040 0.210 0.040 0.200 0.013 0.050 0.960
FreeRL 0.670 0.480 0.530 0.500 0.277 1.630 0.100 0.660 0.480 0.660 0.480 0.002 0.010 0.990
Female 0.580 0.500 0.490 0.500 0.169 0.950 0.340 0.610 0.500 0.610 0.500 −0.005 −0.020 0.980

Note. ELL = English learner; IEP = served in special education (individualized education program); SRIScore = Scholastic Reading Inventory score at baseline; AfriAmer = African American; FreeRL = qualified for free or reduced-price lunch; d = standardized mean difference.

4.3. Measures

All students were tested in the spring prior to the delivery of the summer program (pretest), in the fall when the new school year commenced (proximal posttest), and in the following spring (distal posttest). The measures were typical parts of the district's assessment plan and were administered by school personnel following the usual procedures. The district subsequently provided us the data under a data sharing agreement.

4.3.1. Scholastic Reading Inventory (SRI)
The SRI (Scholastic Inc., 2014) was a computer-adaptive test of overall reading ability. The resulting score, expressed as a Lexile®, was intended to indicate the level of text a student could read and comprehend with 75 % or better accuracy. Lexiles® were on a developmental scale to reflect the gradual increase in students' reading ability over time. However, performance was relative to the level of difficulty at which a student was testing. For example, the range of Lexiles® representing proficiency with second-grade text was 420−650. Once the student transitioned to third-grade text, the range of Lexiles® representing proficiency was 520−820. Hence, the third-grade range started at a Lexile (520) lower than the highest level in the second-grade range (650). Test developers reported a .88 correlation between the Lexile score and the Iowa Assessment of Reading (the state summative assessment at the time of the present study). The SRI's Lexile scale scores from all three time points (pretest, proximal posttest, and distal posttest) were used for analyses.

4.3.2. Iowa Assessment of Reading (IA-R)
The IA-R (Dunbar & Welch, 2015) was one of a battery of content tests composing the state's summative assessment, formerly known as the Iowa Test of Basic Skills or ITBS (Hoover, Dunbar, & Frisbie, 2001). The paper-and-pencil versions of the IA-R were administered in two parts with an estimated working time of about 30 min each. Students read passages of varying lengths (a few lines to a full page) and genres (including literary and informational texts) before responding to 42–44 items, depending on grade level. Dunbar and Welch (2015) reported reliability coefficients of .900–.904 for the grade levels and testing points of interest. The items targeted key ideas, explicit meaning, implicit meaning, author's craft, and vocabulary. Students did not begin taking the IA-R until third grade, and the youngest participants began the study at the end of their second-grade year. Thus, the IA-R was analyzed only as a distal posttest.

4.4. Data analysis procedures

We approached our analyses using propensity scores given that (a) random assignment of participants to the summer reading program was not an option, and (b) there was not a clear cut score to be used as a running variable for regression discontinuity. Not all students who were eligible to participate in the summer reading program did in fact participate, so the eligible non-participating students formed a control group, or counterfactual. To control for selection bias (Rosenbaum & Rubin, 1983; Rosenbaum, 2010), propensity scores were created using student demographics and their previous level of reading achievement.


A propensity score is defined as the probability of being assigned to a group, given a set of observed covariates. In other words, it is the conditional probability of receiving a treatment based upon observed covariates (Rosenbaum & Rubin, 1983). All of our analyses were stratified by grade level; thus, we created propensity scores for each grade level.

The initial step in the analysis was to identify any anomalies in the variables presented in the dataset. Because our control group for each grade level was considerably larger than the treatment group, we used a combination of propensity score matching and propensity score weighting. For propensity score matching, units were matched using the nearest-neighbor algorithm available in the R package MatchIt (Ho, Imai, King, & Stuart, 2007). Then, weights were created as one over the propensity score and one over one minus the propensity score for the treatment and control groups, respectively.
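As a minimal illustration of this matching-and-weighting step (a sketch, not the authors' actual script), the following R code handles one grade level; the data frame grade2, the summer indicator, and the covariate names are hypothetical stand-ins for the district's file:

library(MatchIt)

# grade2: hypothetical file with one row per eligible student, a 0/1 indicator
# 'summer' for program attendance, and covariates mirroring Table 1.
m <- matchit(summer ~ ELL + IEP + SRIScore + Hispanic + AfriAmer + FreeRL + Female,
             data = grade2, method = "nearest")
summary(m)                  # covariate balance before and after matching (cf. Table 1)

matched <- match.data(m)    # matched sample; the 'distance' column is the propensity score
# Weights as described above: 1/e(x) for treated students, 1/(1 - e(x)) for controls.
matched$ipw <- ifelse(matched$summer == 1,
                      1 / matched$distance,
                      1 / (1 - matched$distance))

Recomputing the descriptive statistics on this matched, weighted sample then yields a balance check of the kind reported in the right panel of Table 1.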
Table 1 reports the means, standard deviations, effect sizes, and t-statistics for the set of covariates before and after balancing the groups. The left panel presents the comparability of the covariates prior to creating the weights, and the right panel presents the comparability of the covariates after accounting for the weights.

In keeping with the goal of our analyses to evaluate the effectiveness of the intervention as observed from three different perspectives, we conducted three distinct analyses. First, a growth model was fit to the treatment group to determine the group's improvement across the three administrations of the SRI (i.e., the practitioner perspective). In the second analysis, ordinary least squares (OLS) regression was used to model the treatment and control groups' posttest scores, adjusting for the effect of pretest scores (i.e., the funding agency perspective). Finally, the categorical designations of both the SRI and IA-R were analyzed by their established thresholds to compute the percentages of students in the treatment and control groups that reached at least a proficient level (i.e., the policymaker perspective). Clustering was taken into account when fitting each of the outcome models via cluster-robust standard errors. Equations for and explanations of the modeling can be found in the supplementary material.

5. Results

5.1. Practitioner perspective: growth model analysis

Student growth from pre- to posttest was modeled to evaluate the effectiveness of the summer program at increasing students' SRI reading scores, the measure used by the district to monitor students' progress. Because SRI scores are on a developmental scale, results at different grade levels cannot be compared directly. Therefore, we first present results by grade (see Table 2), and then we present the results of comparing across grades by standardizing the slopes and using the standardized slope as the effect size (Hox, 2010).
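A rough R sketch of the two unconditional models for one grade, assuming a hypothetical long-format file sri_long with one row per student per testing occasion and time coded 0, 1, and 2:

library(lme4)

# sri_long: hypothetical long-format data for the treatment group of one grade,
# with columns id, time (0 = pretest, 1 = proximal posttest, 2 = distal posttest),
# and sri (the SRI Lexile score).

# Unconditional means model: partitions SRI variance into between- and within-student parts.
m0 <- lmer(sri ~ 1 + (1 | id), data = sri_long, REML = FALSE)
vc  <- as.data.frame(VarCorr(m0))
icc <- vc$vcov[1] / sum(vc$vcov)   # intraclass correlation (cf. the ICC row of Table 2)

# Unconditional growth model: fixed intercept and linear slope, random intercepts and slopes.
m1 <- lmer(sri ~ time + (time | id), data = sri_long, REML = FALSE)
summary(m1)   # the fixed 'time' estimate corresponds to the growth rates in Table 2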
5.1.1. Unconditional means models by grade
In all grade levels, the unconditional means models indicated that the average performance of summer school students was different from zero (Grade 2: γ̂00 = 388.38, SE = 20.03, p < .05; Grade 3: γ̂00 = 521.73, SE = 21.71, p < .05; Grade 4: γ̂00 = 672.98, SE = 25.38, p < .05; Grade 5: γ̂00 = 734.23, SE = 23.43, p < .05). In examining the random effects, we also found that the estimated within-student variance (Grade 2: σ̂ε² = 35025.12; Grade 3: σ̂ε² = 77885.65; Grade 4: σ̂ε² = 24869.29; Grade 5: σ̂ε² = 12705.80) and between-students variance (Grade 2: σ̂0² = 27516.17; Grade 3: σ̂0² = 219.04; Grade 4: σ̂0² = 35002.67; Grade 5: σ̂0² = 45305.12) suggested that the average performance of students in each grade level varied over time and that students within a grade differed from each other in performance. However, the relative size of these variance components differed by grade level. The intraclass correlation coefficient (ρ̂) revealed that about 40 % of the variance in SRI scores was attributable to differences between students in Grade 2, only 1 % in Grade 3, 60 % in Grade 4, and 78 % in Grade 5.

5.1.2. Unconditional growth models by grade
In all grade levels, the unconditional growth models indicated that the average true change trajectory for performance had an intercept that was different from zero (Grade 2: γ̂00 = 274.36, SE = 27.82, p < .05; Grade 3: γ̂00 = 429.01, SE = 20.80, p < .05; Grade 4: γ̂00 = 586.02, SE = 27.94, p < .05; Grade 5: γ̂00 = 702.55, SE = 23.37, p < .05). In most grades, the slope also was different from zero (Grade 2: γ̂10 = 114.78, SE = 11.33, p < .05; Grade 3: γ̂10 = 94.01, SE = 8.53, p < .05; Grade 4: γ̂10 = 88.07, SE = 6.39, p < .05). However, in Grade 5 the slope was not different from zero (γ̂10 = 31.95, SE = 12.12, p > .05). The statistically significant results for the initial status variances (Grade 2: σ̂0² = 35547.33; Grade 3: σ̂0² = 927.20; Grade 4: σ̂0² = 41947.14; Grade 5: σ̂0² = 43572.39) and growth rate variances (Grade 2: σ̂1² = 2098.55; Grade 3: σ̂1² = 1494.59; Grade 4: σ̂1² = 755.70; Grade 5: σ̂1² = 1443.24) suggested that students at each grade level varied significantly in their performance at baseline and that there was also significant variation in their growth trajectories.

In the lowest grade levels, the relationship between the true rate of change in performance and the level at baseline was negative (Grade 2: -.50; Grade 3: -.28), so students who had limited levels of performance at baseline tended to gain at a faster rate. In the upper elementary grades, the relationship between the true rate of change in performance and the level at baseline was negative but negligible (Grade 4: -.06; Grade 5: -.06). Unlike students in Grades 2 and 3, there was almost no relationship between fourth and fifth graders' levels of performance at baseline and their growth rates. The resulting pseudo-R² values, which indicate the percent of within-student variation in performance explained by the linear trajectory, ranged widely across the grade levels examined. More variation was explained in Grade 2 (0.71) and Grade 4 (0.65). Less variation was explained in Grade 5 (0.40) and, particularly, in Grade 3 (0.19).

Table 2
SRI Unconditional Mean and Growth Model (Cluster Corrected Standard Errors in Parentheses).
Second Grade Third Grade Fourth Grade Fifth Grade

U. Mean U. Growth U. Mean U. Growth U. Mean U. Growth U. Mean U. Growth

Fixed Effects
Baseline 388.38* 274.36* 521.73* 429.01* 672.98* 586.02* 734.23* 702.55*
(20.03) (27.82) (21.71) (20.80) (25.38) (27.94) (23.43) (23.37)
Growth Rate 114.78* 94.01* 88.07* 31.95*
(11.33) (8.53) (6.39) (12.12)

Random Effects
Baseline 165.88 188.54 14.80 30.45 187.09 204.81 212.85 208.74
Growth Rate 45.81 38.66 27.49 37.99
Within 187.15 100.00 279.08 250.82 157.70 93.11 112.72 87.10
ICC 0.440 0.003 0.585 0.781

Note. SRI = Scholastic Reading Inventory; ICC = intraclass correlation coefficient; U = unconditional; Random Effects are standard deviations.


5.1.3. Cross-grade comparison
To compare the growth model results across Grades 2–5, we standardized the slope representing the growth rate for each grade level (see Table 3). Results suggested that second graders had the steepest standardized slope (.54), followed by students in third (.47) and fourth (.40) grades. However, students in fifth grade had the smallest standardized slope (.15). Although students in this grade still demonstrated an improvement, the rate was much slower than in the earlier grade levels. Overall, results suggested that the summer program was effective at improving students' performance from pre- to posttest, which was the outcome of interest to practitioners. These findings were not relative to the growth of a control group, so additional analyses were necessary to evaluate effectiveness from a funding agency perspective.

Table 3
Effect Size for Growth Model.
Grade     SRI
Second    0.54
Third     0.47
Fourth    0.40
Fifth     0.15
Note. Scores are reported as standardized slopes; SRI = Scholastic Reading Inventory.

5.2. Funding agency perspective

5.2.1. Ordinary least squares analysis
Ordinary least squares (OLS) analyses were used to compare the proximal (fall SRI) and distal (spring SRI and IA-R) posttest reading scores of students who did and did not attend the summer program, after controlling for pretest performance on the SRI. The slopes with cluster-corrected standard errors, the adjusted (partial) standardized mean differences, the variance explained, and the sample size for each model are displayed in Table 4.
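A hedged sketch of one such model for a single grade and the proximal SRI outcome; the file wdat, its column names, and the classroom clustering variable are assumptions rather than the district's actual data structure, and the effect size follows the note to Table 4 (adjusted mean difference over an unadjusted standard deviation):

library(sandwich)
library(lmtest)

# wdat: hypothetical matched/weighted file for one grade, with sri_fall (proximal
# posttest), sri_pre (pretest), summer (0/1), ipw (propensity weight), and class_id
# (the assumed clustering unit).
fit <- lm(sri_fall ~ sri_pre + summer, data = wdat, weights = ipw)

# Treatment slope with cluster-robust standard errors.
coeftest(fit, vcov = vcovCL(fit, cluster = ~ class_id))

# Adjusted standardized mean difference: adjusted treatment coefficient divided by
# the unadjusted standard deviation of the outcome.
d <- unname(coef(fit)["summer"]) / sd(wdat$sri_fall)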
No significant treatment effects were observed for any of the grade levels or outcomes. Moreover, the treatment slopes were negative for the SRI proximal and distal posttests in Grade 5 and for the distal IA-R outcome in Grade 4, indicating that the control groups on average improved more than the treatment groups on the specified measures. For all grade levels and outcomes, the pretest was significantly correlated with the posttest, indicating that there was enough power in these models to identify statistically significant slopes. Thus, the pretest was a stronger predictor of students' performance than involvement in the summer reading intervention for all grade levels.

Nevertheless, we calculated effect sizes for these models to compare the funding agency's perspective on effectiveness with the practitioner's perspective (i.e., the slopes obtained from the growth models). The largest positive adjusted standardized mean difference was observed for the second-grade SRI proximal posttest (d = 0.11). In all grades, results for the IA-R and the SRI distal posttest cannot be attributed solely to the effect of the summer intervention given that students were tested after they had been back in school about 8 months. Taken together, these results indicated that, for all but fifth grade, the effects of the summer school treatment when compared with a non-participating control group did not appear as favorable as the pre-to-posttest growth rates of only those students who attended the summer program.

5.2.2. Growth model with control group
The conditional growth model differed from the previous unconditional means and growth models by including a variable representing group membership (i.e., treatment or control) in the model. The treatment variable was not statistically significant for any of the grade levels (see Table 5), indicating that there were no differences in students' growth trajectories between the groups. To compare the growth model results across Grades 2–5, we estimated standardized mean differences accounting for the nested design. As expected given the size of the slopes, results suggested null effects in all grades (d = −.13 to .05). Neither adjusted mean differences nor growth rates indicated whether students achieved reading proficiency, which was the outcome of interest to policymakers.
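A sketch of the conditional model under the same assumed long-format structure, now pooling treatment and control students; because the text does not spell out whether the group term also interacts with time, the main-effect form below simply mirrors the single Treatment row of Table 5:

library(lme4)

# sri_long_all: hypothetical long-format file for one grade containing both groups,
# with id, time (0/1/2), sri, and the 0/1 summer indicator.
m2 <- lmer(sri ~ time + summer + (time | id), data = sri_long_all, REML = FALSE)
summary(m2)   # the 'summer' coefficient corresponds to the Treatment row of Table 5

# A plausible alternative specification would add a summer:time interaction to test
# group differences in growth rates directly.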

Table 4
Regression Results and Effect Sizes for Treatment vs. Control and Cluster Corrected Standard Errors.
SRI Proximal SRI Distal IA-R

Slope SE d Slope SE d Slope SE d

Second Grade
Intercept 119.89* 39.54 322.78* 27.92 17.96 21.09
Pretest 0.77* 0.24 0.65* 0.08 0.99* 0.13
Treatment 26.46 34.75 0.11 13.96 27.92 0.05 1.21 4.37 0.06

Third Grade
Intercept 72.55* 36.74 267.30* 47.47 39.32 23.89
Pretest 0.93* 0.06 0.83* 0.09 0.88* 0.13
Treatment 4.40 22.15 0.02 0.24 23.15 0.00 .18 4.94 0.00

Fourth Grade
Intercept 48.79 39.08 299.02* 42.83 70.86* 23.55
Pretest 0.96* 0.06 0.79* 0.13 0.74* 0.12
Treatment 20.79 22.85 0.08 4.89 26.01 0.02 −5.57 6.97 −0.16

Fifth Grade
Intercept −29.34 59.59 98.05 97.40 49.68 33.73
Pretest 1.05* 0.07 0.97* 0.02 0.78* 0.16
Treatment −28.90 28.73 −0.13 −25.85 45.28 −0.12 3.00 9.1 0.08

Note. Adjusted standardized mean differences were computed using adjusted means and unadjusted standard deviations; SRI = Scholastic Reading Inventory; IA-R =
Iowa Assessment of Reading.


Table 5
Conditional Growth Model (Cluster Robust Standard Errors in Parentheses).
                 Second Grade   Third Grade   Fourth Grade   Fifth Grade
Fixed Effects
Baseline         271.03*        429.55*       579.96*        717.44*
                 (29.94)        (10.59)       (12.64)        (35.05)
Growth Rate      114.77*        94.01*        88.01*         31.98*
                 (3.54)         (4.00)        (3.05)         (4.33)
Treatment        6.62           −1.09         13.47          −29.79
                 (26.92)        (23.51)       (26.67)        (48.09)
Random Effects
Baseline         188.93         30.45         181.97         210.42
Growth Rate      45.85          38.71         27.49          37.86
Within           100.45         251.58        162.63         86.12
Effect Size (d)  .02            −.00          .05            −.13

Table 6
Percentages of Students Who Progressed and Did Not Make Progress.
          SRI Proximal           SRI Distal             IA-R
Grade     Treatment  Control     Treatment  Control     Treatment  Control
From Basic or Below Basic to Proficient or Better
Second    5 %        –           27 %       26 %        17 %       18 %
Third     1 %        2 %         7 %        11 %        15 %       14 %
Fourth    3 %        3 %         21 %       24 %        19 %       28 %
Fifth     3 %        6 %         6 %        12 %        6 %        15 %
Students Who Did Not Positively Change Categories
Second    95 %       100 %       73 %       74 %        83 %       82 %
Third     99 %       98 %        93 %       89 %        85 %       86 %
Fourth    97 %       97 %        79 %       76 %        81 %       72 %
Fifth     97 %       94 %        94 %       88 %        94 %       85 %
Note. SRI = Scholastic Reading Inventory; IA-R = Iowa Assessment of Reading.

5.3. Policymaker perspective: categorical designations

The proximal (fall SRI) and distal (spring SRI and IA-R) posttest reading scores of students who did and did not attend the summer program were compared against established thresholds of performance on each measure. The percentages of treatment and control students who changed from basic or below basic to at least proficient performance are displayed in the top panel of Table 6, and the percentages of treatment and control students who did not reach at least proficient performance on the SRI and IA-R are displayed in the bottom panel of Table 6.

Results indicated that the largest percentage of students who positively changed their designation to at least proficient on the SRI proximal posttest was 5 % for the second-grade treatment students. Treatment students in Grades 2 and 4 accounted for the largest percentages of students who positively changed their designation to at least proficient on the SRI distal posttest (27 % and 21 %, respectively), but similar improvement was observed in the control group for the same grades (26 % and 24 %, respectively). On the IA-R administered approximately 8 months after the summer program, only third graders in the treatment group demonstrated a larger percentage of positive change in designation (15 %) than students in the control group (14 %), arguably a negligible difference. For all other grade levels, a larger percentage of students experienced a positive change of designation in the control group.
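A small sketch of this tabulation, assuming proficiency flags have already been derived from the published cut scores for each grade and measure (the cut-score lookup is omitted, and all column names are hypothetical):

# wide: hypothetical one-row-per-student file for one grade, with logical columns
# prof_pre (at least proficient at pretest) and prof_fall (at least proficient at
# the SRI proximal posttest), plus the 0/1 summer indicator.
moved_up <- !wide$prof_pre & wide$prof_fall    # basic or below basic -> at least proficient

# Percentage of students who moved up, by group (cf. the top panel of Table 6).
round(100 * tapply(moved_up, wide$summer, mean), 0)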

6. Discussion

Consistent with a method-oriented approach to summative evaluation (Boruch, 1997; Cook & Campbell, 1979), this study investigated the interpretation of a summer reading program's effectiveness from three different perspectives corresponding to different questions that stakeholders might ask about students' improvement. The practitioners inquired whether the intervention was effective at helping students improve their pre-to-posttest reading performance (AIR, 2014; Simkins & Allen, 2000; Sokolsky et al., 2016). The funding agency questioned whether summer school was effective at improving participating students' reading performance significantly more than what non-randomized, matched peers demonstrated without the intervention. Finally, the policymakers were interested in whether the intervention was effective at increasing students' reading performance to a proficient designation (Baete & Hochbein, 2014; Ellis, 2010). Each perspective required different analyses and produced different results, as discussed in the sections that follow.

6.1. Practitioner perspective

Student outcomes appeared most favorable when examining their pre-to-posttest growth over the summer. Reading development tends to be more rapid in lower grade levels (Silberglitt & Hintze, 2007) and, accordingly, students in Grades 2 and 3 with lower baseline performance on the SRI grew at a faster rate during the summer than those in Grades 4 and 5. Among the upper elementary students, there was almost no relationship between baseline SRI performance and growth. The majority of within-student variation was explained by the linear trajectory in Grades 2 (71 %) and 4 (65 %), with less explained in Grades 5 (40 %) and 3 (20 %).

Effect sizes, expressed as standardized slopes, provided a similar picture of moderate growth in Grade 2 (.54), followed by slightly smaller effects in Grades 3 (.47) and 4 (.40). Among fifth graders, the effect size was negligible (.15). The educators in the school system had asked whether students in the program successfully improved their pre-to-posttest reading performance. From this stakeholder perspective, the evaluation revealed the program was effective in Grades 2–4.

6.2. Funding agency perspective

When compared to students who were not enrolled in the summer program, the growth in reading performance of the participants was less positive. In no grade level were there statistically significant differences between the improvement of the treatment and control groups on any outcome measure. Rather, variance in posttest performance was explained by students' pretest performance rather than by whether or not they attended summer school. This may be a logical result of the schools having identified students for participation based on principals' professional judgment about those students who were most likely to improve. However, previous research has found that students who attended summer programs subsequently had better academic performance than their non-attending peers, and that these benefits were exhibited for at least two years after participation (McCombs et al., 2011).

Effect sizes were expressed as standardized mean differences, or Cohen's d (Cohen, 1988), based on the slopes of the conditional growth models. Across grades, these were negligible and not significant (range of −.13 to .05). The funding agency was interested in knowing whether their investment in summer school had an immediate or lasting impact on enrolled students as compared to their non-participating peers. From this stakeholder perspective, the results would suggest the program was not effective.

6.3. Policymaker perspective

Few students in either the treatment or control groups experienced enough growth in reading performance over the summer to achieve grade-level proficiency on the SRI proximal posttest. This was not unexpected given the difficulties in realizing these kinds of gains in the short term (Lachlan-Haché & Castro, 2015). By the distal posttests, more students were moving to proficient performance, but relatively similar improvement was observed across the treatment and control groups. More often, the control group demonstrated a higher percentage of positive movement, such as on the SRI distal posttest in Grades 3–5 and the IA-R in Grades 2 and 5. Both distal posttests were administered after students had been receiving their typical reading instruction for about 8 months during the regular academic year, so it is logical that the extended time and instruction would lead to greater improvement.


Policymakers who were recommending summer school for elementary students struggling with reading were asking whether participation in the summer program accomplished the intent of helping the students meet grade-level proficiency benchmarks. Overall, it seemed the district-designed treatment made this no more likely, and in some respects it was less likely that participating students would attain a proficient designation. From this stakeholder perspective, the evaluation revealed the summer program was not effective.

6.4. Limitations

Using extant data precluded us from monitoring teachers' implementation of the reading intervention, so it is possible that treatment integrity influenced the results. Fidelity of implementation has been found to influence the outcomes of reading interventions delivered during the regular academic year (e.g., Stein et al., 2008), so future research should consider the role that treatment integrity might play in evaluating the effectiveness of summer programs from any stakeholder perspective defined here. Nevertheless, fidelity was less of a concern in this study because there were not options for different kinds of instructional programming as are typically available during the regular school year. In the summer break, students either receive formal reading instruction or they do not.

Another potential limitation to determining the effectiveness of summer reading programs involves the measurement of students' abilities. The instruments used here were broad, standardized assessments of students' grade-level reading abilities. As such, they were ill-suited for detecting growth in the foundational skills that may be necessary to enable students to close the gap between their current abilities and grade level. Alternatively, the measures may have motivated the district educators to design and deliver the summer program in a way that inappropriately targeted reading skills that were too advanced for students' current level of performance. Reading interventions are more successful when they are carefully aligned to students' needs, so the assessments used to identify students for the summer program and plan the instruction need to provide sufficiently diagnostic information for making data-based decisions (Coyne, Kame'enui, & Simmons, 2001). The short timeframe of a summer program also may elevate the importance of measuring students' abilities with embedded assessments that are sensitive to the targeted areas and provide a clear link to instruction (Begeny et al., 2015). Embedded assessments, or periodic checks on students' performance within an instructional unit, have demonstrated adequacy as predictors of students' reading outcomes but are not as widely used or researched as non-embedded measures such as standardized assessments (Begeny et al., 2015; Oslund et al., 2012).

Finally, we note that the archival data analyzed were from a single year and did not follow the same students as they progressed through multiple grade levels (e.g., from Grade 2 through Grade 5). This limits the conclusions that can be drawn about any developmental changes in growth students may be experiencing over time. In addition, the sample for the Grade 5 treatment group was quite small, thus requiring added caution in interpreting the results.

7. Lessons learned

When the desired summative outcomes of a program vary in definition by stakeholder group, evaluators are challenged to include multiple perspectives in judging effectiveness (Clark et al., 2016; Patton, 1997). The present study highlighted the potential for different interpretations to be made based on the nature of the question posed about the intervention. The school district that designed the summer program, the private organization that funded it, and the policymakers who recommended it all were interested in whether participating in a summer reading program worked at helping below-level readers improve. They asked the question in different ways, requiring different evaluations of effectiveness (Fritz et al., 2012).

The results suggested that summer school was effective at helping students to improve their overall reading achievement from pre- to posttest, but it was not effective when participating students' growth rates were compared to those of their non-participating peers. Nor was it effective at increasing students' attainment of reading proficiency. The differing interpretations have related implications. As a policy targeting low-ability readers, summer school seems a less fruitful time for investing intervention resources than the regular academic year. To the extent that summer programs are supportive of students' ongoing reading development, they appear more beneficial when provided in earlier grade levels. It also may be that summer interventions need to better target students' specific skill deficits.

The results of this study can inform future evaluations that are designed to address different goals and limitations (Rossi, Freeman, & Lipsey, 1999). The findings also highlight the importance of helping stakeholders to navigate interpreting the effectiveness of an intervention. A single effect size can be easily misunderstood if one searches only for a positive value threshold (e.g., Hattie, 2009; WWC, 2017), so evaluators can play an important role in clearly defining the multiple perspectives on successful outcomes.

Acknowledgements

This research was supported by the Iowa Department of Education (Contract #013916). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funder.

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at https://doi.org/10.1016/j.evalprogplan.2020.101852.

References

American Institutes for Research (2014). Selecting an approach to measuring student learning growth for educator evaluation: Information for Florida school districts. Washington, DC: Author. Retrieved from www.fldoe.org.
American Psychological Association (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Baete, G. S., & Hochbein, C. (2014). Project proficiency: Assessing the independent effects of high school reform in an urban district. The Journal of Educational Research, 107, 493–511. https://doi.org/10.1080/00220671.2013.823371.
Begeny, J. C., Whitehouse, M. H., Methe, S. A., Codding, R. S., Stage, S. A., & Nuepert, S. (2015). Do intervention-embedded assessment procedures successfully measure student growth in reading? Psychology in the Schools, 52, 578–593. https://doi.org/10.1002/pits.21843.
Boruch, R. F. (1997). Randomized experiments for planning and evaluation: A practical guide. Thousand Oaks, CA: Sage.
Calkins, L., & Tolan, K. (2010). A guide to reading workshop: Grades 3–5. Portsmouth, NH: Heinemann.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Chicago, IL: Rand McNally.
Clark, W. C., van Kerkhoff, L., Lebel, L., & Gallopin, G. C. (2016). Crafting usable knowledge for sustainable development. Proceedings of the National Academy of Sciences of the United States of America, 113, 4570–4578. https://doi.org/10.1073/pnas.1601266113.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York, NY: Academic Press.
Connor, C. M., Alberto, P. A., Compton, D. L., & O'Connor, R. E. (2014). Improving reading outcomes for students with or at-risk for reading disabilities: A synthesis of the contributions from the Institute of Education Sciences Research Centers (NCSER 2014-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, US Department of Education. Retrieved from http://ies.ed.gov/.
Cook, D. A., & Beckman, T. J. (2010). Reflections on experimental research in medical education. Advances in Health Sciences Education, 15, 455–464. https://doi.org/10.1007/s10459-008-9917-3.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cooper, H., Charlton, K., Valentine, J. C., & Muhlenbruck, L. (2000). Making the most of summer school: A meta-analytic and narrative review. Monographs of the Society for Research in Child Development, 65, i–vi, 1–127.
Coyne, M. D., Kame'enui, E. J., & Simmons, D. C. (2001). Prevention and intervention in beginning reading: Two complex systems. Learning Disabilities Research and Practice, 16, 62–73. https://doi.org/10.1111/0938-8982.00008.


References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York, NY: Academic Press.
Connor, C. M., Alberto, P. A., Compton, D. L., & O'Connor, R. E. (2014). Improving reading outcomes for students with or at-risk for reading disabilities: A synthesis of the contributions from the Institute of Education Sciences research centers (NCSER 2014-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education. Retrieved from http://ies.ed.gov/
Cook, D. A., & Beckman, T. J. (2010). Reflections on experimental research in medical education. Advances in Health Sciences Education, 15, 455–464. https://doi.org/10.1007/s10459-008-9917-3
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cooper, H., Charlton, K., Valentine, J. C., & Muhlenbruck, L. (2000). Making the most of summer school: A meta-analytic and narrative review. Monographs of the Society for Research in Child Development, 65, 1–127.
Coyne, M. D., Kame'enui, E. J., & Simmons, D. C. (2001). Prevention and intervention in beginning reading: Two complex systems. Learning Disabilities Research and Practice, 16, 62–73. https://doi.org/10.1111/0938-8982.00008
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Dorn, L., & Soffos, C. (2009). Comprehensive intervention model: A systemic design for reversing reading failure. Boston, MA: Allyn & Bacon.
Dunbar, S., & Welch, C. (2015). Iowa assessments: Research and development guide, forms E and F. Boston: Houghton Mifflin Harcourt. Retrieved from https://itp.education.uiowa.edu
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge, UK: Cambridge University Press.
Fountas, I. C., & Pinnell, G. S. (2001). Guiding readers and writers: Teaching comprehension, genre, and content literacy. Portsmouth, NH: Heinemann.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology, 141, 2–18. https://doi.org/10.1037/a0024338
Gersten, R., Fuchs, L. S., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. S. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71, 149–164. https://doi.org/10.1177/001440290507100202
Gleason, P. M., Resch, A. M., & Berk, J. A. (2012). Replicating experimental impact estimates using a regression discontinuity approach (NCEE Reference Report 2012-4025). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. Retrieved from http://ncee.ed.gov
Hanushek, E. A., Kain, J. F., & Rivkin, S. G. (1998). Does special education raise academic achievement for students with disabilities? (Working Paper No. 6690). Cambridge, MA: National Bureau of Economic Research.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8), 1–28. Retrieved from http://www.jstatsoft.org/v42/i08/
Hoover, H. D., Dunbar, S. B., & Frisbie, D. A. (2001). Iowa tests of basic skills, forms A, B, and C. Itasca, IL: Riverside Publishing.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications. Mahwah, NJ: Lawrence Erlbaum Associates.
Johnston, J., Riley, J., Ryan, C., & Kelly-Vance, L. (2015). Evaluation of a summer reading program to reduce summer setback. Reading and Writing Quarterly, 31, 334–350. https://doi.org/10.1080/10573569.2013.857978
Kim, J. S., Burkhauser, M. A., Quinn, D. M., Guryan, J., Kingston, H. C., & Aleman, K. (2017). Effectiveness of structured teacher adaptations to an evidence-based summer literacy program. Reading Research Quarterly, 52, 443–467. https://doi.org/10.1002/rrq.178
Lachlan-Haché, L., & Castro, M. (2015). Proficiency or growth?: An exploration of two approaches for writing student learning targets. Washington, DC: American Institutes for Research. Retrieved from www.air.org
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment. The American Psychologist, 48, 1181–1209. https://doi.org/10.1037/0003-066x.48.12.1181
Marsden, E., & Torgerson, C. J. (2012). Single group, pre- and post-test research designs: Some methodological concerns. Oxford Review of Education, 38, 583–616. https://doi.org/10.1080/03054985.2012.731208
McCarrier, A., Pinnell, G. S., & Fountas, I. C. (2000). Interactive writing: How language & literacy come together, K–2. Portsmouth, NH: Heinemann.
McCombs, J. S., Augustin, C. H., Schwartz, H. L., Bodilly, S. J., McInnis, B., Lichter, D. S., ... Cross, A. B. (2011). Making summer count: How summer programs can boost children's learning. New York: The Wallace Foundation. Retrieved from www.wallacefoundation.org
Mielke, J., Vermaßen, H., Ellenbeck, S., Milan, B. F., & Jaeger, C. (2016). Stakeholder involvement in sustainability science: A critical view. Energy Research & Social Science, 17, 71–81. https://doi.org/10.1016/j.erss.2016.04.001
National Center for Education Statistics (2018). The nation's report card: 2017 reading results. Washington, DC: Institute of Education Sciences, U.S. Department of Education. Retrieved from www.nationsreportcard.org
Norman, G. (2003). RCT = results confounded and trivial: The perils of grand educational experiments. Medical Education, 37, 582–584. https://doi.org/10.1046/j.1365-2923.2003.01586.x
Oslund, E. L., Hagan-Burke, S., Taylor, A. B., Simmons, D. C., Simmons, L., Kwok, O. M., & Coyne, M. D. (2012). Predicting kindergarteners' response to early reading intervention: An examination of progress-monitoring measures. Reading Psychology, 33, 78–103. https://doi.org/10.1080/02702711.2012.630611
Patton, M. Q. (1997). Toward distinguishing empowerment evaluation and placing it in a larger context. The American Journal of Evaluation, 18, 147–163. https://doi.org/10.1177/109821409701800114
Pearl, J. (2010). The foundations of causal inference. Sociological Methodology, 40(1), 75–149.
Rosenbaum, P. R. (2010). Design of observational studies (2nd ed.). New York: Springer.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. https://doi.org/10.1017/cbo9780511810725.016
Rossi, P., Freeman, H. E., & Lipsey, M. W. (1999). Evaluation: A systematic approach. Thousand Oaks, CA: Sage.
Scholastic Inc. (2014). Scholastic reading inventory. New York: Author.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Silberglitt, B., & Hintze, J. M. (2007). How much growth can we expect? A conditional analysis of R-CBM growth rates by level of performance. Exceptional Children, 74, 71–84. https://doi.org/10.1177/001440290707400104
Simkins, S., & Allen, S. (2000). Pretesting students to improve teaching and learning. International Advances in Economic Research, 6(1), 100–112. https://doi.org/10.1007/BF02295755
Simmerman, S., & Swanson, H. L. (2001). Treatment outcomes for students with learning disabilities: How important are internal and external validity? Journal of Learning Disabilities, 34, 221–236. https://doi.org/10.1177/002221940103400303
Sokolsky, G., Tweedie, J., & McMillan, A. (2016). Instructional guide to reporting Title I, Part D data in the CSPR for SY 2015–2016. Washington, DC: National Technical Assistance Center for the Education of Neglected or Delinquent Children and Youth (NDTAC). Retrieved from www.neglected-delinquent.org
Stein, M., Berends, M., Fuchs, D., McMaster, K., Saenz, L., Yen, L., ... Compton, D. (2008). Scaling up an early reading program: Relationships among teacher support, fidelity of implementation, and student performance across different sites and years. Educational Evaluation and Policy Analysis, 30, 368–388. https://doi.org/10.3102/0162373708322738
Torgesen, J. K. (2002). The prevention of reading difficulties. Journal of School Psychology, 40, 7–26. https://doi.org/10.1016/S0022-4405(01)00092-9
Walser, T. M. (2014). Quasi-experiments in schools: The case for historical cohort control groups. Practical Assessment, Research, & Evaluation, 19(6). Available online: http://pareonline.net/getvn.asp?v=19&n=6
What Works Clearinghouse (2017). What Works Clearinghouse procedures and standards handbook (Version 4.0). Washington, DC: Institute of Education Sciences.
Wilkins, C., Gersten, R., Decker, L., Grunden, L., Brasiel, S., Brunnert, K., ... Jayanthi, M. (2012). Does a summer reading program based on Lexiles affect reading comprehension? (NCEE 2012-4006). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
Wilson, D. B., & Lipsey, M. W. (2001). The role of method in treatment effectiveness research: Evidence from meta-analysis. Psychological Methods, 6, 413–429. https://doi.org/10.1037/1082-989x.6.4.413-429
Workman, E. (2014). Third-grade reading policies. Education Commission of the States: Reading/literacy: Preschool to third grade. Retrieved from http://ecs.org/clearinghouse/01/03/47/10347.pdf
Zvoch, K., & Stevens, J. J. (2015). Identification of summer school effects by comparing the in- and out-of-school growth rates of struggling early readers. The Elementary School Journal, 115, 432–456. https://doi.org/10.1086/68

Deborah K. Reed, Ph.D., is the Director of the Iowa Reading Research Center (IRRC) and an Associate Professor at the University of Iowa. Her evaluation and research interests include effective practices for reading instruction, intervention, and assessment as well as the use of data-based decision making within reading programs.

Ariel M. Aloe, Ph.D., is an Associate Professor in Psychological and Quantitative Foundations at the University of Iowa and the IRRC's Associate Director for Statistical Methods. He is interested in meta-analytic methods and statistical solutions to complex problems by combining results from studies with multiple outcomes.
