
Received: 4 January 2022 Revised: 27 June 2023 Accepted: 15 August 2023

DOI: 10.1002/tea.21900

RESEARCH ARTICLE

The relationships between elementary students' knowledge-in-use
performance and their science achievement

Tingting Li 1,2 | I-Chien Chen 2 | Emily Adah Miller 3 | Cory Susanne Miller 1 | Barbara Schneider 4 | Joseph Krajcik 1

1 CREATE for STEM Institute, College of Education, Michigan State University, East Lansing, Michigan, USA
2 Department of Counseling, Educational Psychology & Special Education, College of Education, Michigan State University, East Lansing, Michigan, USA
3 Mary Frances Early College of Education, University of Georgia, Athens, Georgia, USA
4 College of Education, Michigan State University, East Lansing, Michigan, USA

Correspondence
Tingting Li, CREATE for STEM Institute, College of Education, Michigan State University, East Lansing, MI, USA.
Email: litingt1@msu.edu

Funding information
George Lucas Educational Foundation, Grant/Award Number: APP# 13987

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2023 The Authors. Journal of Research in Science Teaching published by Wiley Periodicals LLC on behalf of National Association for Research in Science Teaching.

Abstract
This longitudinal study examines the relationship between students' knowledge-in-use performance and their performance on third-party designed summative tests within a coherent and equitable learning environment. Focusing on third-grade students across three consecutive project-based learning (PBL) units aligned with the Next Generation Science Standards (NGSS), the study includes 1067 participants from 23 schools in a Great Lakes state. Two-level hierarchical linear modeling estimates the effects of post-unit assessments on end-of-year summative tests. Results indicate that post-unit assessment performances predict NGSS-aligned summative test performance. Students experiencing more PBL units demonstrate greater gains on the summative test, with predictions not favoring students from diverse backgrounds. This study underscores the importance of coherence, equity, and the PBL approach in promoting knowledge-in-use and science achievement. A systematically coherent PBL environment across multiple units facilitates the development of students' knowledge-in-use, highlighting the significance of designing science and engineering practices (SEPs) and crosscutting concepts coherently and progressively, with intentional revisitation of disciplinary core ideas (DCIs). The study also investigates how the PBL approach fosters equitable learning environments for diverse demographic groups, offering equitable opportunities through equity-oriented design. Contributions include a coherent assessment system that tracks and supports learning aligned with the NGSS, emphasizing the predictive power of post-unit assessments and continuous monitoring and tracking. The implications of context similarity and optimal performance expectations within units are discussed. Findings inform educators, administrators, and policymakers about the benefits of NGSS-aligned PBL systems and the need for coherent and equitable learning and assessment systems supporting knowledge-in-use development and equitable opportunities for all learners.

KEYWORDS
coherent learning system, elementary science, equity-oriented design, knowledge-in-use, project-based learning

1 | INTRODUCTION

In an era marked by rapid change and information saturation, it becomes increasingly nec-
essary for individuals to acquire and apply 21st-century skills (Trilling & Fadel, 2009). Such a
paradigm shift demands a reimagining of science education, transitioning from rote memoriza-
tion to nurturing knowledge-in-use proficiencies (National Research Council [NRC], 2012;
Pellegrino & Hilton, 2012). Knowledge-in-use signifies adaptive thinking that enables students
to apply their knowledge to explain novel and real-world phenomena or tackle complex prob-
lems by figuring out context-specific solutions, including iteratively considering and making use
of more than one practice or idea (Li et al., 2023; NRC, 2012).
As we mark the 10th anniversary of the Next Generation Science Standards (NGSS; NGSS
Lead States, 2013), it is critical to underscore the need for further empirical evidence demon-
strating the effectiveness of a coherent NGSS-aligned learning system in fostering students'
knowledge-in-use—a fundamental aspect of NGSS. Although our understanding of “knowl-
edge-in-use” has considerably evolved over the past decade (e.g., Anderson et al., 2018;
Baumfalk et al., 2019; He, Chen, et al., 2023; He, Zhai, et al., 2023; Nordine et al., 2021), the cre-
ation of supportive learning environments, particularly at a systematic level, remains an elusive
task (Charara et al., 2021).
Our research aims to address these issues by investigating the design of a coherent
NGSS-aligned project-based learning (PBL) system for elementary science. This system seeks to
support the development of knowledge-in-use while also effectively assessing student learning.
We build on the foundational work of A Framework for K-12 Science Education (hereafter the
Framework; NRC, 2012) and the NGSS, delving into how science education environments can more effec-
tively develop and assess knowledge-in-use. Previous research indicates our system can signifi-
cantly support students' academic science learning and social-emotional learning (Krajcik
et al., 2023). Consequently, our study investigates how such a system can effectively aid
students' development of knowledge-in-use. We delve into the relationship between students'
performances on post-unit assessments and summative science achievement over time.
We address overarching questions such as: How can we support all students, including those
historically marginalized in STEM, in developing knowledge-in-use over time? And, how can
we effectively gather evidence showing that learners have developed knowledge-in-use?
Our study underscores the role of coherence, a characteristic of high-quality educational sys-
tems (Fulmer et al., 2018; NRC, 2006; Roseman et al., 2010). Coherence entails a systematic,
sequential building of understanding through experiences over time (Miller & Krajcik, 2019).
Within the system, “all of the elements have to be built on a shared vision of what is important
for students to know and understand about science, how instruction affects that knowledge and
understanding over time, and what can be taken as evidence that learning has occurred”
(NRC, 2006, p. 4). Despite the widespread acknowledgment of coherence, empirical evidence
connecting it to student learning is sparse (NRC, 2006). We address this gap by investigating
the design and implementation of coherent learning systems in classroom environments to pro-
mote knowledge-in-use.
In addition, we seek to unpack the potential of PBL environments in fostering elementary
students' long-term development of knowledge-in-use. PBL, a prominent inquiry-based learning
approach, has been recognized as an effective approach to design productive learning environ-
ments (Craig & Marshall, 2019; Haas et al., 2021; Han et al., 2015; Zhao & Wang, 2022). PBL
engages students in long-term, problem-solving projects, thus promoting the development of
knowledge-in-use (Krajcik & Shin, 2023). Research supports this approach for immersing stu-
dents in authentic inquiries, enabling them to collaboratively design artifacts as representations
of their investigations' outcomes (Ford & Forman, 2006; Li, 2021; Reiser, 2014). Evidence indi-
cates that PBL can cultivate a deep understanding of scientific concepts (Hardy et al., 2006),
amplify proficiency in applying scientific practices (Walker & Sampson, 2013), and enhance
students' overall achievement in science (Mupira & Ramnarain, 2018). Students who are engaged in
PBL environments often exhibit superior performance on assessments measuring integrated
knowledge (e.g., Lee et al., 2013) and knowledge-in-use proficiency (e.g., Anderson et al., 2018;
Schneider et al., 2022). However, existing studies mainly employ summative assessments, which
are valuable for system monitoring and program effectiveness evaluation (NRC, 2014). Rela-
tively little is known about what and how well students have learned during a long-term and
persistent PBL involvement (He, Chen, et al., 2023). Expanding our understanding of this aspect
of PBL affordances is crucial for monitoring student learning, providing timely feedback to
instructors, and guiding students in reflecting on their learning (National Academies of Sci-
ences, Engineering, and Medicine [NASEM], 2019).
Moreover, PBL is recognized for its potential to incorporate elements of coherence and sus-
tainability to foster students' development of knowledge-in-use proficiencies (Fortus et al., 2015).
Although coherent PBL curricula have shown their capacity to boost students' knowledge-in-use
(Miller et al., 2023; Zhao & Wang, 2022), most current research focuses on the impact of single
PBL unit interventions, typically employing pre- and post-tests around these interventions
(e.g., Craig & Marshall, 2019; Han et al., 2015). The influence of a systematically coherent PBL
environment involving multiple, sequential units on students' scientific achievement, and more
importantly, their knowledge-in-use proficiency, remains less understood. To date, only one study
explored students' knowledge-in-use proficiency change within a comprehensive, long-term PBL
system (He, Chen, et al., 2023). However, He and colleagues' study focused on high school stu-
dents. No study has investigated how a coherent long-term PBL system supports elementary stu-
dents' knowledge-in-use development. Our work addresses this research gap by exploring third
graders' knowledge-in-use development over the course of a year in a coherent learning system.
Here we define a system as coherent and aligned curriculum materials for teachers and students,
embedded assessment tasks, post-unit tests, a summative test, and professional learning.
These components work together to support teachers in instruction and students in learning.
Our study also contributes to the ongoing dialogue on inclusive science education for histor-
ically marginalized students in STEM. PBL design should offer learning opportunities for stu-
dents across diverse demographic groups, including gender, economic status, English language
learner (ELL), race/ethnicity, and region (e.g., Krajcik et al., 2023). Previous research (Harris
et al., 2015) found that PBL can foster achievement among diverse middle school students.
However, for elementary students, limited but growing empirical evidence suggests that PBL
can support learning across various demographic groups (Krajcik et al., 2022). We extend the
findings of previous studies (Li et al., 2021; NASEM, 2022) by investigating how PBL can foster
inclusive learning environments across diverse demographic groups.
Furthermore, our research delves into enhancing assessment strategies to systematically col-
lect, track, and monitor student learning against the science standards (detailed in Section 4.6).
Assessments serve as a primary feedback mechanism in a coherent learning system, aiding in the
guiding of instructional decisions, holding schools accountable for meeting learning goals, and
monitoring program effectiveness (NRC, 2006). For an assessment system to function optimally,
it requires a variety of measurement approaches and multiple assessment strategies to provide
sources of evidence to support educational decision-making across different levels of the learning
system (NRC, 2006). In addition, the assessment system, which is part of the large learning sys-
tem, must link tightly to curriculum, instruction, and professional learning so that each element
is built with a shared vision of the student learning goals (NGSS Lead States, 2013; NRC, 2006,
2012). In the assessment system, post-unit assessments are closer to the students' unit learning
experience, providing proximal learning information, whereas science achievement tests are often
distal in content, typically providing information about summative performance (Ruiz-Primo
et al., 2002). Therefore, examining the relationships between students' ongoing performances on
post-unit assessments, which express the unit learning goals, and their end-of-year summative test can pro-
vide evidence for designing a coherent learning system (Miller & Krajcik, 2019).

2 | STUDY BACKGROUND AND RESEARCH QUESTIONS

Our research, part of the 2018–2019 efficacy study of the Multiple Literacies in
Project-Based Learning (ML-PBL) project, focuses on three coherent, consecutive third-grade ML-
PBL curriculum units. Each unit, along with its corresponding post-unit assessment, builds
toward the NGSS performance expectations (PEs; Miller & Krajcik, 2019). The overarching goal
of this study is to examine the ability of the post-unit assessments to predict achievement on an
end-of-year summative test, developed independently by a third party. The predictive capacity
embedded in the post-unit assessments aims to determine students' likelihood of achieving the
designated criteria in the final summative assessment. Our expectation is that these assessments
will display a predictive characteristic, thereby informing and supporting a coherent design of
the learning system (NRC, 2006, 2014). This success would, in turn, affirm the coherence of the
ML-PBL learning system and its capacity to support effective standards-aligned learning system
design, inclusive of all students' science learning needs.

FIGURE 1 ML-PBL learning system. Post-U1 denotes the Unit 1 post-unit assessment, and so on.
ML-PBL, multiple literacies in project-based learning.
The ML-PBL learning system, including curriculum materials, assessments, and professional
learning for both teachers and students (Figure 1), is tailored for students in Grades 3–5. The
units were developed leveraging key design principles such as PBL (Krajcik & Shin, 2023),
coherent design (Fortus et al., 2015), and equity-oriented principles (Miller, Reigh, et al., 2021).
The first third-grade unit (the Squirrel unit1) has been recognized with an NGSS Design Badge,
awarded after a rigorous external expert review process, validating it as a high-quality, NGSS-
aligned curriculum and showcasing the effectiveness of these principles. The design principles
applied to the Squirrel unit were consistently implemented across the remaining units. In
Section 4, we detail these design principles. In addition, we established the
ML-PBL assessment system (Figure 2) with different types of assessments to capture students'
knowledge-in-use (see details in Section 4.6).
ML-PBL strives to create contexts that encourage students to make sense of phenomena and
tackle engineering problems, which can boost student engagement, social-emotional learning,
and science achievement. A randomized control trial revealed that students in the treatment
group significantly outperformed those in the control group in the end-of-year summative test,
across demographic backgrounds (Krajcik et al., 2023). Building on these findings, this study
delves into the learning process of students from the treatment group, aiming to elucidate how
a coherent PBL system supports the development of knowledge-in-use. We seek to deepen our
understanding of students' science learning process across the first three units. We focus on the
first three units of the ML-PBL because of weather complications—specifically, heavy snowfall
caused school closures—that restricted most teachers from conducting Unit 4 during the spring
semester of 2019. Only 5 teachers, instructing approximately 147 students, taught Unit 4; these
5 teachers constitute only 12% of the treatment teachers. Furthermore, among the 147 students
taught, only 65 had available reading benchmark data. Given this lack of representation, we
decided to exclude Unit 4 from the analysis. We aim to explore the process through the lens of three
research questions (RQs) and associated assumptions.

1 https://www.nextgenscience.org/resources/grade-3-ml-pbl-why-do-i-see-so-many-squirrels-i-cant-find-any-stegosauruses

FIGURE 2 Grade 3 ML-PBL unit learning structure (from https://sprocket.lucasedresearch.org/course/science3). ML-PBL, multiple literacies in project-based learning.

RQ1. Considering students' prior unit learning experiences, does their performance on
post-unit assessments cumulatively predict their science achievement on the end-
of-year summative test? By "considering," we refer to the predictive effect of unit
learning, as measured by post-unit assessments, on the end-of-year summative test,
taking students' prior unit learning into account. Due to the inter-unit coherent design principles, we hypothe-
size that the more units that treatment students experienced, the higher students
would score on the end-of-year summative test, reflecting improved science achieve-
ment. This hypothesis seeks empirical evidence for the effectiveness of incorporating
inter-unit coherence and learning goal principles to develop learning systems.
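
To make the hypothesized cumulative relationship concrete, a minimal sketch of a two-level specification consistent with the hierarchical linear modeling described in the abstract is given below; the symbols and covariate set are illustrative assumptions, not the exact model estimated in this study.

$$\text{Level 1 (student } i \text{ in school } j\text{):}\quad Y_{ij} = \beta_{0j} + \beta_{1}\,\mathrm{PostUnit}_{ij} + \beta_{2}\,\mathrm{PriorUnits}_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^{2})$$

$$\text{Level 2 (school):}\quad \beta_{0j} = \gamma_{00} + u_{0j}, \qquad u_{0j} \sim N(0, \tau_{00})$$

Here $Y_{ij}$ is the standardized end-of-year summative score of student $i$ in school $j$, $\mathrm{PostUnit}_{ij}$ is a post-unit assessment score, and $\mathrm{PriorUnits}_{ij}$ counts the units the student previously experienced; under RQ1, a positive $\beta_{2}$ would correspond to the hypothesized cumulative gain.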

RQ2. After controlling for students' prior unit learning experiences, does student per-
formance on post-unit assessments predict their science achievement on the end-
of-year summative test? By “controlling,” we refer to the predictive effect of each spe-
cific unit on the summative science achievement, excluding students' prior learning
experience. Considering that each unit was designed to adhere to intra-unit and learn-
ing goal coherence design principles, we argue that students' science learning in each
unit, as measured by post-unit assessments, should predict their performance on the
end-of-year summative test. This RQ aims to deepen our understanding of the
application of intra-unit coherence and learning goal coherence principles in the
development of coherent units and assessments aligned with the NGSS.

RQ3. Do the relationships between student performance on post-unit assessments and
the end-of-year summative test vary by gender, race, socio-economic status, or ELL sta-
tus? Given the equity-oriented design principles integrated into the curriculum (Krajcik
& Schneider, 2021), we assume that our learning system offers equitable learning oppor-
tunities. Thus, we further explore whether our learning system effectively supports the
science learning of students from underrepresented demographic groups and examine
how students from these groups respond to the coherent PBL design.

3 | LITERATURE REVIEW AND THEORETICAL PERSPECTIVES

3.1 | Knowledge-in-use as a new vision of science learning

In light of the complex real-world problems confronting humanity, problems which rarely yield
straightforward solutions (Anderson et al., 2018; Organization for Economic Co-operation and
Development [OECD], 2019), the idea of knowledge-in-use has gained significant traction in edu-
cational reform documents globally (He et al., 2021; He et al., 2022; Kulgemeyer & Schecker, 2014;
NRC, 2012; People's Republic of China Ministry of Education, 2014). Rooted in the theories of situ-
ated cognition and social constructivism (Greeno et al., 1996; Vygotsky, 1978), the “knowledge-in-
use” idea redefines scientific proficiency, shifting the focus from what students know to how they
utilize their knowledge. This paradigm shift reflects the evolving consensus among educators,
learning scientists, and policymakers about the necessary competencies for global citizens in the
21st century (OECD, 2019).
The “knowledge-in-use” framework situates knowledge as an outcome of context-specific
activities and experiences, with individuals being active contributors in the knowledge creation
process (Schreiber & Valle, 2013). It is exemplified when learners apply their scientific under-
standing to make sense of phenomena or to solve intricate problems, encapsulating the essence
of adaptive skills, transferable knowledge, and cognitive flexibility (Mensah & Chen, 2022; Spiro
et al., 2019; Ward et al., 2018). However, developing such proficiency is not an instantaneous
event but a gradual process that requires continuous exposure to disciplinary experiences that
involve open-ended, unresolved problems (Esposito & Bauer, 2017). Therefore, effectively fos-
tering “knowledge-in-use” demands a learning environment characterized by coherence, depth,
and integration of critical scientific ideas and practices, ultimately driving a more effective and
relevant educational experience (Fulmer et al., 2018).

3.2 | Three-dimensional learning as an effective approach for knowledge-in-use

The Framework and the NGSS advance a three-dimensional (3D) learning approach as the ave-
nue to explain relevant phenomena and offer solutions to complex problems, proposing perfor-
mance goals that develop knowledge-in-use across Grades K-12. This approach necessitates
students' integration of disciplinary core ideas (DCIs), SEPs, and crosscutting concepts (CCCs)
to comprehend phenomena or resolve problems (NASEM, 2022; NASEM, 2019; NRC, 2012).
These three dimensions of scientific knowledge mandate a holistic integration of each aspect in
teaching and learning, and an assessment of what students can achieve with their knowledge.
Each dimension interacts with the others, fostering knowledge-in-use. Based on the Framework,
the NGSS delineates PEs that integrate all three dimensions: DCIs, SEPs, and CCCs. Despite 3D
learning's recognized value, it presents operational challenges for teachers (Penuel et al., 2015).
Teachers must adapt their teaching practice, conceptualize learning as a trajectory toward gen-
erative ideas, and nurture the use of scientific practices. PBL provides a potential platform to
support students' 3D learning and facilitate knowledge-in-use development.

3.3 | PBL as a supportive platform for knowledge-in-use

PBL is a teaching and learning approach involving students in exploring their own significant real-
world queries, thereby provoking curiosity (Krajcik & Czerniak, 2018). Leveraging student interest,
PBL creates a more equitable learning environment than traditional approaches (Sanchez Tapia
et al., 2018). The theoretical foundations of PBL are active construction, situated learning, social
interactions, and cognitive tools (Bransford et al., 1999; Krajcik & Shin, 2023). Although PBL envi-
ronments vary (e.g., Barron et al., 1998), they share common elements. PBL employs a driving ques-
tion (DQ) that engages learners, motivates ongoing exploration, and results in tangible artifacts
through a culminating learning sequence. Often, the DQ and student-created artifacts connect to
the students' community (Helle et al., 2006). By focusing on students and their interests, PBL adapts
to the intellectual resources and experiences of diverse students, responsive to culture, ethnicity,
language, race, socioeconomic status (hereafter SES), and gender (Boaler, 2002). As such, PBL can
contribute to culturally relevant pedagogy (Ladson-Billings, 1995). There is growing evidence of
PBL's effectiveness in supporting students meeting the NGSS PEs (NASEM, 2022).

3.4 | Coherent curriculum system as a requisite space for knowledge-in-use

Knowledge-in-use develops when students have opportunities to apply evolving knowledge to
meaningful phenomena or problems (NRC, 2012). Curriculum coherence, considered a prominent
predictor of student achievement (Schmidt et al., 2005), refers to the alignment of specific ideas,
their depth, and sequencing within and across grades (Fortus et al., 2015). Through careful coher-
ence design, students should develop the capacity to apply knowledge, progressively building
knowledge-in-use. This involves integrating DCIs, CCCs, and SEPs (Kang et al., 2018), considering
the development of the three dimensions within each unit (intra-unit coherence), and across units
(inter-unit coherence) (Fortus et al., 2015). Inter-unit coherence pertains to larger inquiry
sequences, multiple scientific practices, and different content domains within and across academic
years, facilitating a deeper and more integrated understanding of core ideas and practices.

4 | MULTIPLE LITERACIES IN PBL SYSTEM

We structured the ML-PBL as a coherent educational system that promotes students'
knowledge-in-use. ML-PBL encompasses curriculum materials, professional learning (PL)
opportunities, and 3D assessments to
develop and test coherent elementary learning environments aligned with the NGSS (Krajcik
et al., 2023). The system's constituents work cohesively, providing students the opportunity to
make sense of meaningful science phenomena, apply knowledge to their surrounding commu-
nities, and foster knowledge-in-use through 3D learning (Figure 1). The system's impact is mea-
surable through intermediate outcomes—knowledge-in-use—as measured by post-unit
assessments, leading to distal outcomes reflected in students' science achievement on the end-
of-year summative test.

4.1 | ML-PBL learning system design principles

To foster students' knowledge-in-use development, we employed three types of design
principles in developing the ML-PBL system. Every ML-PBL unit uses the PBL approach to motivate
students to continually address scientific inquiries over several weeks. Grade 3 comprises four
units throughout the academic year (Table 1), addressing a series of PEs (Appendix 1). The
design principles include integrated PBL and 3D learning design principles, coherence princi-
ples, and equity-oriented principles. Table 2 outlines each type of principle.

4.2 | Integrated PBL and 3D learning design principles

The ML-PBL learning system effectively integrates PBL and 3D learning (Miller,
Severance, & Krajcik, 2021). Its structure adheres to several design principles: engagement with a DQ
tied to a natural phenomenon requiring explanation or an engineering problem demanding res-
olution; focus on 3D learning goals aligned with PEs to depict student proficiency through
performance-based assessments; involvement in NGSS scientific practices; engagement of stu-
dents, teachers, and community members in collective sense-making activities to generate
shared knowledge applicable to the resolution of the DQ; scaffolded learning experience to
enable students to leverage teacher cues and peer assistance for tasks beyond their immediate
capability; and iterative production of tangible artifacts addressing the DQ.
ML-PBL units are divided into learning sets (LSs), each framed by specific questions that
build toward a response to the unit DQ. Each LS, a coherent sequence of lessons, spans 1–
2 weeks. Students utilize one or more SEPs, DCIs, and CCCs as they progress through each les-
son and LS. Their comprehension of the unit phenomenon gradually deepens (Table 1). Each
unit is anchored by a DQ and culminates in a final artifact. The DQ introduces students to a
pertinent scientific phenomenon or an engineering problem. The DQ operates on three levels:
the unit, the LS, and the individual lesson, with each tier designed to guide students toward
applying the NGSS's three dimensions to make sense of the unit's anchoring phenomenon. The
nested DQs fortify the coherence of the learning experience. Each lesson-level DQ aims to
answer the broader question of the LS, which, in turn, contributes to addressing the unit-
level DQ.
The ML-PBL also integrates mathematical and English language arts (ELA) practices
throughout the units, providing an interdisciplinary approach to knowledge-in-use. Table 1 also
denotes the number of lessons with guidance for integrating ELA or math. Throughout
the course of each unit, students engage in collaborative learning, working together in whole-
class and small group settings. Their collective goal is to incrementally construct knowledge,
working progressively toward formulating a response to the DQ. To guide students toward key
3D learning performances (LPs), teachers systematically scaffold student learning through a
combination of strategic dialog, guided questioning, thoughtfully structured student groups,
and dedicated time for end-of-lesson reflection. By adopting these approaches, the ML-PBL
system facilitates a coherent and authentic learning experience for students. This approach
reinforces science literacy and provides students with the skills they need to interpret and
communicate scientific concepts.

TABLE 1 ML-PBL unit information.

Unit 1 (Squirrels/Adaptation)
Unit DQ: Why do I see so many squirrels but I can't find any stegosauruses?
Phenomenon: Animals meet their needs to survive in their environment.
PEs with core DCIs (see Appendix 1): 3-LS4-1 Biological evolution: unity and diversity; 3-LS1 From molecules to organisms: structures and processes.
Unit artifacts: A play, story, or model of how the stegosaurus did not survive and the eutheria did survive.
LSs and lessons: 6 LSs, 29 lessons. LS1: Squirrel survival (7 lessons); LS2: Structures (5 lessons); LS3: Squirrels' environment (3 lessons); LS4: Prehistoric organism (5 lessons); LS5: Fossils as evidence (6 lessons); LS6: Survival and extinction stories (3 lessons).
Lessons with multiple literacies support: 19

Unit 2 (Toys/Forces and motion)
Unit DQ: How can we design fun moving toys that other kids can build?
Phenomenon: Different forces make toys move in different ways.
PEs with core DCIs: 3-PS2 Motion and stability: forces and interactions; 3-5-ETS1 Engineering design.
Unit artifacts: Students build a toy according to the specifications of a Kindergartener.
LSs and lessons: 4 LSs, 19 lessons. LS1: Toys that move (7 lessons); LS2: Changes in motion (5 lessons); LS3: Non-contact forces (5 lessons); LS4: Design portfolio (2 lessons).
Lessons with multiple literacies support: 7

Unit 3 (Birds/Biodiversity)
Unit DQ: How can we help the birds near our school grow up and thrive?
Phenomenon: Many birds migrate in flocks.
PEs with core DCIs: 3-LS2 Ecosystems: interactions, energy, and dynamics; 3-LS3 Heredity: inheritance and variation of traits.
Unit artifacts: Students design and build a bird feeder for local bird flocks.
LSs and lessons: 6 LSs, 32 lessons. LS1: Local birds (7 lessons); LS2: Meeting food needs (5 lessons); LS3: Migration (6 lessons); LS4: Feeder placement (5 lessons); LS5: Variation in traits (5 lessons); LS6: Life cycles (6 lessons).
Lessons with multiple literacies support: 16

Unit 4 (Plants/Weather and climate)
Unit DQ: How can we plan gardens for our community to grow plants for food?
Phenomenon: Plants all over the world grow in hazardous weather.
PEs with core DCIs: 3-ESS2 Earth's systems; 3-ESS3-1 Earth and human activity.
Unit artifacts: Students grow plants for their community, and plan for protecting the plants from flooding.
LSs and lessons: 4 LSs, 20 lessons. LS1: Planning a garden (6 lessons); LS2: Deciding where to plant (4 lessons); LS3: Considering climate (6 lessons); LS4: Hazardous weather (4 lessons).
Lessons with multiple literacies support: 10

Abbreviations: DCIs, disciplinary core ideas; DQ, driving question; LSs, learning sets; ML-PBL, multiple literacies in project-based learning; PEs, performance expectations.

TABLE 2 Design principles and the ML-PBL learning system.

Integrated PBL and 3D learning design principle
Drawing on PBL features, our ML-PBL system has expanded and adjusted traditional features of PBL to incorporate the 3D learning goals, resulting in an essential set of design attributes for our learning system: engagement with a DQ tied to a natural phenomenon requiring explanation or an engineering problem demanding resolution; focus on 3D learning goals aligned with PEs to depict student proficiency through performance-based assessments; involvement in NGSS scientific practices; engagement of students, teachers, and community members in collective sense-making activities to generate shared knowledge applicable to the resolution of the DQ; scaffolded learning experience to enable students to leverage teacher cues and peer assistance for tasks beyond their immediate capability; and iterative production of tangible artifacts addressing the DQ.

Coherent design principle
• Intra-unit coherent design: All parts, including the 3D LP statements, Look For statements, and Evidence Statements within each unit, were designed toward achieving a bundle of NGSS PEs. See elaboration in Section 4.3.2.
• Inter-unit coherent design: Each unit of the ML-PBL learning system emphasizes different dimensions of PEs, and the units work together to support students in reaching the unit learning goals across the academic year (NRC, 2006). See elaboration in Section 4.3.3.
• Learning goal coherent design: All learning goals within a unit and across the units build on each other toward addressing the unit driving question and the overall PEs across the academic year. Within each unit, the design is underpinned by articulated 3D learning goals that build toward a bundle of NGSS PEs; these bundles serve as a guide to inform the design of learning goals for learning sets and individual lessons. Beyond the scope of individual units, the units collectively emphasize differing dimensions of the NGSS PEs and collaboratively facilitate students in achieving the overarching learning goals for the academic year, encapsulated by knowledge-in-use (NRC, 2006). See elaboration in Section 4.3.1.

Equity-oriented design principle
• Equity-oriented support: Equity-oriented goals embedded within our curriculum require students to demonstrate an understanding of equity practices when collaboratively using the three dimensions of science understanding to solve problems and explain phenomena.
• SEL-based support: Accompanying the equity-oriented goals, SEL performance-based objectives derived from four constructs (identity development, agency, interest, and belongingness) are also infused throughout the curriculum. Together, these goals encourage the co-construction of cultural practices, fostering students' social and emotional growth as they engage with the curriculum.
• Language support: The language of instruction plays a pivotal role in the successful implementation of these equity and SEL goals. See elaboration in Section 4.4.

Abbreviations: DQ, driving question; LP, learning performance; ML-PBL, multiple literacies in project-based learning; NGSS, Next Generation Science Standards; PBL, project-based learning; PEs, performance expectations; SEL, social and emotional learning.

4.3 | Coherent design principles

We employ intra- and inter-unit coherence design principles, along with learning goal coher-
ence (Fortus et al., 2015), to support students' knowledge-in-use development. Guided by our
framing of coherence, and common goal expectations and norms of the discipline, ML-PBL cre-
ated a system of activity that develops over time. We created coherence through a learning envi-
ronment design in which students collaboratively and incrementally develop and refine
knowledge (Gouvea et al., 2014).

4.3.1 | Learning-goal coherence

Learning-goal coherence is a fundamental principle both within and across the units. Within
each unit, the design is underpinned by articulated 3D learning goals which build toward a
bundle of NGSS PEs. These bundles inform the design of learning goals for LSs and individual
lessons. To devise these learning goals, we first analyzed the PEs, concentrating on the different
facets of the DCIs encompassed in the PE, and then selected suitable SEPs and CCCs that could
address the phenomena or problems at hand. Each LS is accompanied by an evidence state-
ment, articulating the expected learning outcomes of each lesson and aligning with the lesson's
learning goals. All learning goals within a unit build upon each other and contribute collec-
tively toward addressing the unit's DQ.
Beyond the scope of individual units, the units collectively emphasize different dimensions
of the NGSS's PEs and collaboratively assist students in achieving the overarching learning
goals for the academic year (NRC, 2006). Although each unit centers on different PE combina-
tions, some PEs are addressed across multiple units. This iterative approach allows students to
revisit and refine their understanding of questions, ideas, and problems, deepening their com-
prehension and enabling the application of their knowledge to novel scenarios. For example,
the Grade 3 ML-PBL units collectively highlight “structure and function” and “systems and sys-
tem models” as central CCCs, along with “developing models” as a key SEP. Each unit, how-
ever, presents varying levels of learning goals for these SEPs and CCCs. This coordinated
progression across units is orchestrated in a mutually reinforcing manner, supporting the over-
arching goal of developing students' knowledge-in-use throughout the academic year.

4.3.2 | Intra-unit coherence design

Each unit is meticulously designed to ensure intra-unit coherence, predominantly supported by
learning-goal coherence. Each unit is anchored around a DQ that unifies all components. The
DQ is aligned with the unit's performance learning goals and prompts students to make sense
of an anchoring phenomenon through a series of progressively complex lesson-level questions.
Furthermore, the application of SEPs progressively advances throughout the units, ensuring a
coherent learning experience.
The development of 3D LP statements provides coherence within and across lessons and
units. These LP statements, present in each LS and lesson, represent 3D learning goals and are
evaluated both informally (through “Look For” statements indicating what the teacher should
observe) and formally (through Evidence Statements describing the lesson product as an artifact
of the LP). Figuring Out Statements reflect the DCI elements in relation to their interaction
with the phenomenon and the DQ, representing the core cognitive task of sense-making
expected from the students. Concurrently, Look For Statements encourage collaborative prac-
tice and/or the application of CCCs in co-creating cultural practices and achieving lesson-level
LPs. These statements also guide formative assessment practices by indicating what to observe
for and what to prompt. Lastly, Evidence Statements provide a description of the tangible lesson prod-
uct, linking back to the LP and clearly defining what should be considered as evidence of stu-
dents' learning. The curriculum materials also include opportunities for formative assessment
through the Look For and Evidence Statements, ensuring coherent, aligned assessments in rela-
tion to the learning goals within each unit.

4.3.3 | Inter-unit coherence design

With inter-unit coherence, each part of our learning system targets the same NGSS PEs and
works together to support students in reaching the unit learning goals across the academic year
(NRC, 2006). The overarching learning goals for each unit build on each other across time.
Although some of the DCIs across the three units do not build on each other, the SEPs and
CCCs progressively lead to more sophisticated performance, allowing students to develop
knowledge-in-use. The most prominent SEPs in each unit (e.g., developing and using models)
and CCCs (e.g., cause and effect) were selected based on the features of DCIs and phenomena.
In addition, the PBL curriculum supports the application of knowledge throughout the project
as students develop, revise, and present the final artifact. Each time students figure out an addi-
tional piece of knowledge, they enhance their ability to develop explanations and models, or to
design solutions. The experiences shape a narrative that provides an intentional path toward
building understanding.

4.4 | Equity-oriented design principles

The design of our ML-PBL units integrates a strong commitment to fostering an equitable learn-
ing environment and a concurrent emphasis on SEL goals. These principles are woven into the
curriculum fabric, permeating the LSs and lessons, with the intention to be universally accessi-
ble, especially for students from underrepresented demographic groups in STEM (Agarwal &
Sengupta-Irving, 2019; Carr & Steele, 2009; Krajcik et al., 2023). Across the 4 units, 8 SEL and
10 equity-forward lessons have been formulated, wherein the SEL and equity-oriented goals are
stipulated alongside 3D LPs2. As students work with peers and self-reflect, they hone their SEL
and equity-related skills (Miller, Severance, & Krajcik, 2021). An overview of the learning

2
https://sprocket.educurious.org/course/science3/

structure of the four third-grade units is presented in Figure 2. More details can be found on
the website (https://sprocket.lucasedresearch.org/course/science3).3
The embedded equity-oriented goals require students to demonstrate an understanding of equity
practices when collaboratively using the three dimensions of science to solve problems and explain
phenomena (Miller & Krajcik, 2019). These goals, grounded in contexts, encourage critical thinking,
foster cultural sustainability, leverage funds of knowledge, and emphasize place-based experiential
resources and social justice (Adah Miller, Li, et al., 2022; Adah Miller, Makori, et al., 2022; Gonzalez
et al., 2006; Paris, 2012). Parallel to the equity-oriented goals, the curriculum infuses SEL perfor-
mance-based objectives derived from four constructs: identity development, agency, interest, and
belongingness. These goals facilitate the co-construction of cultural practices, fostering students'
social and emotional growth.
Instructional language is pivotal in effectively implementing the equity and SEL goals. Lan-
guage supports adhere to the English Language Proficiency Development framework and
underscore the language of doing science, fostering high expectations for multilingual learners
and enriching opportunities for negotiating meaning within scientific contexts (van Lier, 2001).
The discourse moves implemented in the units enhance the equity and SEL-oriented design.
Utilizing discourse moves from the Wisconsin Center for Educational Research, teachers facili-
tate various facets of scientific practice through communication (MacDonald et al., 2017). This
approach values diverse perspectives and cultivates agency, identity, and belongingness as stu-
dents advance in their scientific explorations. The design elements aim to create an environ-
ment where each student's ideas are seen as potentially valuable and productive, nurturing
their performance as both language and disciplinary content learners.

4.5 | Supportive and sustained professional learning

In conjunction with the teacher-oriented materials, the ML-PBL system incorporates approxi-
mately 7 days of PL opportunities for educators throughout the academic year. The PLs mirror the
curriculum design, highlighting the elements of DQs, artifact creation, and collaboration. Familiar-
izing teachers with these components allows for a more profound understanding and application of
the science learning inherent in our system. Our team provides two forms of PL—in-person and
optional online sessions. At the start of the academic year, a 3-day in-person PL session introduces
teachers to the foundational frameworks of the NGSS and PBL. Furthermore, teachers are exposed
to key lesson activities. Before the commencement of each unit, an additional day-long, in-person
session is provided, focusing on specific unit activities, material usage, and understanding the
coherence of the design. Aspects of our learning system are progressively introduced during these
sessions to deepen teachers' understanding. To supplement these sessions, optional online meet-
ings are organized approximately bi-weekly via video conferencing. These meetings allow teachers
to discuss upcoming lessons and any challenges they face with the research team members.

4.6 | Aligned and coherent assessment system

Our ML-PBL system employs a multilevel, coherent assessment system (Figure 3) that includes
lesson-specific classroom-embedded assessments, four post-unit assessments, and an end-
of-year summative test. These different assessment types serve various purposes: to measure
student learning at different intervals, to align with learning goals within the system, and to
provide immediate, near, and far assessments (Perie et al., 2007; Ruiz-Primo et al., 2002).
Classroom-embedded assessments, integrated into daily lessons, assist teachers in monitoring
student performance as they apply 3D knowledge to problem-solving and explanation tasks.

3 https://sprocket.educurious.org/home/curriculum

FIGURE 3 ML-PBL assessment system. ML-PBL, multiple literacies in project-based learning.
The post-unit assessments in the ML-PBL learning system use a modified design process
that utilizes the principles of assessment theory, incorporating both domain analysis and
domain modeling phases (Harris et al., 2019). This methodology integrates elements of
construct-centered design and evidence-centered design (ECD; Mislevy & Haertel, 2006),
enabling the identification of critical, assessable aspects of knowledge-in-use LP and the collec-
tion of tangible evidence (Harris et al., 2019). The ECD approach emphasizes gathering diverse
evidence from performance to elucidate the target constructs. We pivoted directly from LPs and
evidence statements to task design. These evidence statements can be perceived as articulations
of Knowledge, Skills, and Abilities. There was no requirement to focus on variable task features
since we only crafted one item and had no need for similar items. However, our design incorpo-
rates characteristic features such as engaging phenomena related to the science ideas, and the
practices specified in the LP. Our design also encompasses task features derived from an equity
and fairness framework developed by our team to ensure the fair assessment of students across
diverse socio-cultural groups.
Each post-unit assessment, designed to assess students' 3D learning, presents students with
several engaging scenarios. Within each scenario, a set of items prompts students to produce
written explanations, perform data analysis, or create models. We devised an item-level rubric
to evaluate students' 3D understanding of each item. Post-unit assessments within the ML-PBL
learning system aim to gauge students' learning progress and their capacity to apply knowledge
acquired from the unit to address novel yet similar phenomena or problems. These assessments
provide contexts akin to those encountered during the unit, yet with novel elements and varied
problem types or heuristics needed for their resolution or comprehension. Consequently, this
design allows the assessments to be proximally aligned to the unit content, effectively
measuring near-transfer proficiency. This proficiency is assessed in two main ways: First, stu-
dents need to synthesize and apply the knowledge they gained during the unit to novel but
related phenomena or problems. Second, the context resembles the phenomena students experi-
enced during the unit but includes novel elements that diverge from the unit phenomena.
To highlight the nuanced differences between the post-unit assessment and the unit phenomena, we
illustrate this with the “sq_v3” item from the Unit 1 assessment. During the learning unit, stu-
dents engage in a series of LSs, each addressing different aspects of the overall unit phenome-
non about squirrels' adaptation and survival in diverse environments and evolution over time.
Each LS presents various organisms for students to study, thereby providing a wide array of
examples contributing to their understanding. The post-unit assessment shifts the perspective
slightly: it introduces new phenomena, closely associated with the unit,
to evaluate the students' ability to transfer their knowledge to novel situations. For instance, the
Unit 1 post-unit assessment encompasses four scenarios, each containing multiple items. The
initial three scenarios refer to a chart outlining two types of squirrels—Eastern Grey Squirrels
and Antelope Squirrels, providing critical details such as their physical attributes, dietary prefer-
ences, and shelter types. An assessment item, designated as sq_v3, corresponds to the second
scenario, prompting students to consider the necessities of Eastern Grey Squirrels within a for-
est ecosystem and predict the impacts of deforestation on their survival. Here, students are
required to synthesize their accumulated knowledge from the unit and apply it to this new situ-
ation. What renders it novel? While students have previously explored squirrels, their habitats,
and survival strategies, the post-unit assessment requires them to utilize this knowledge in fresh
contexts involving different squirrel species and scenarios. They might, for example, infer that
deforestation could endanger Eastern Grey Squirrels, reliant on trees for shelter, but not impact
Antelope Squirrels, which burrow in the ground. This task encourages students to integrate
their prior learning with new information, thereby demonstrating their understanding and apt
application of the knowledge acquired.
The end-of-year summative test uses items from the Michigan Department of Education (MDE) state test corresponding to the
Grade 3 NGSS PEs. Unlike the post-unit assessments designed by ML-PBL, the item contexts in
this end-of-year summative test were not explicitly crafted to mirror the phenomena experi-
enced during the units. Any similarities were merely coincidental. Therefore, the test serves as
a means to appraise students' proficiencies in applying knowledge in novel and unfamiliar con-
texts, essentially evaluating students' far-transfer abilities (Perkins & Salomon, 1992).

5 | METHODS

5.1 | Sample

A cluster randomized experimental design was utilized (Raudenbush, 1997), and a power analy-
sis performed (Krajcik et al., 2023), leveraging efficacy study data from the 2018 to 2019 school
year to assess the impact of the ML-PBL intervention. Forty-six schools, representing 2371
third-grade students across 4 regions in Michigan, USA, were randomized into treatment
(23 schools, 1165 students) and control groups (23 schools, 1206 students). These regions,
including Detroit (Region 1), Genesee (Region 2), Kent (Region 3), and other Michigan districts
(Region 4), ensured a diverse sample (e.g., race).
Our study prioritizes treatment students involved in the efficacy study. The analytic sample
was selected from the complete treatment group, including only those with valid pre-
benchmark reading test scores and end-of-year summative test scores. The final analytic sample
comprises 1067 students distributed among 53 classrooms across the 23 treatment schools in
4 regions. Table 3 presents demographic details of the analytic sample participants. Gender dis-
tribution was approximately even (48% female, 6% unspecified), with 58%–60% of participants
classified as economically disadvantaged. Ten percent of the students held ELL status. The
racial composition varied slightly across the three post-unit assessments, with a predomi-
nantly White student representation, followed by Black and Hispanic students.
Among our sample, 470 students completed all three post-unit assessments. We conducted a thorough investigation into the reasons for participant drop-out and missing data, ranging from changes in teacher assignments, class cancellations, and personal emergencies to adverse weather conditions and curriculum mismatch with student needs. Appendix 2 provides the percentage of missing cases for each post-unit assessment relative to the end-of-year summative test. A regional analysis revealed a notable concentration of missing data in Region 1, a trend consistent across all three
post-unit assessments (see Appendix 3). This was attributed to late initiation of ML-PBL teach-
ing, first-time PBL teachers acclimating to new curriculum materials, and adverse winter condi-
tions disrupting teaching schedules in Region 1. To preserve the integrity of our sample, we
generated three “missing flag” variables, one for each post-unit assessment, to capture the
impact of missing data on our subsequent analytic models, facilitating model comparison
between students who completed different numbers of post-unit assessments. The forthcoming
analyses utilize this sample. This study also applies several missing data techniques to address
the impact of post-unit assessment missingness on the end-of-year summative test. Main find-
ings were consistent across various methods; therefore, the current study reports the simplest
results using the missing flags to reduce methodological complexity.

5.2 | Measures and variables

5.2.1 | Outcomes

The outcome of interest was students' scores on the end-of-year summative test. Though the test
was designed to be computer-based, practical constraints in certain classrooms necessitated its
conversion to a paper-and-pencil format. Initially, the test aimed to assess student learning
corresponding to the K-5th grade science standards and was intended to be administered at the
end of fifth grade. As such, the test initially had a multitude of questions, too many for third
graders to complete within a reasonable time frame. Therefore, items aligning specifically with
Grade 3 NGSS PEs were extracted and split randomly into three distinct forms (A, B, and C) to
accommodate the limitations of a third-grade classroom setting. Forms A, B, and C comprised
11, 4, and 7 questions, respectively, without any overlap. All the questions were multiple-
choice, scored based on an MDE-designed rubric, where “0” indicated an incorrect response,
and “1” indicated a correct one. We conducted a series of item response theory (IRT) analyses to
check the psychometric properties of the items (Appendix 4; for details, see the ML-PBL techni-
cal report, https://mlpbl.open3d.science/techreport). The forms were of equivalent difficulty
and were randomly distributed across classrooms (Krajcik et al., 2023). For information on the
distribution of the various forms, refer to Appendix 5. In this study, student raw scores were
standardized by z-scores as a measure of student achievement on the end-of-year summative
test, and the different forms of the end-of-year summative test were included as a covariate in
subsequent analyses.
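A minimal sketch of this standardization step (illustrative names, not the project's actual code):

```python
import pandas as pd

# Raw summative scores are converted to z-scores over the analytic sample;
# the test form later enters the models as a dummy-coded covariate.
def to_z(scores: pd.Series) -> pd.Series:
    """Convert raw scores to z-scores over the analytic sample."""
    return (scores - scores.mean()) / scores.std(ddof=1)

df = pd.DataFrame({"summative_raw": [12, 7, 9, 15], "form": ["A", "B", "C", "A"]})
df["summative_z"] = to_z(df["summative_raw"])
df["form"] = df["form"].astype("category")
```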

T A B L E 3 Descriptive table.

                                      Treatment      Unit 1         Unit 2         Unit 3
N                                     1067           682            896            959
                                      Mean (SD)      Mean (SD)      Mean (SD)      Mean (SD)
End-of-year science achievement test z-score 0.18 0.31* 0.24 0.21
(1.04) (1.03) (1.03) (1.02)
Pretest z-score 0.03 0.15* 0.09 0.06
(0.99) (0.98) (0.99) (0.99)
Unit 1 z-score 0.07 0.06 0.07 0.08
(0.94) (0.95) (0.95) (0.94)
Unit 2 z-score 0.03 0.05 0.03 0.07
(0.98) (0.97) (0.96) (0.94)
Unit 3 z-score 0.03 0.11 0.07 0.03
(0.97) (0.94) (0.94) (0.97)
Average number of students 26.06 26.53* 26.13 26.16
(3.36) (2.89) (2.86) (3.36)
Post-unit assessment missing case 385 171 108
36.08% 16.03% 10.12%
Form A 354 230 308 331
33.18% 33.72% 34.38% 34.52%
Form B 340 228 296 314
31.87% 33.43% 33.04% 32.74%
Form C 337 223 292 313
31.58% 32.70% 32.59% 32.64%
Female 514 333 442 469
48.17% 48.83% 49.33% 48.91%
Female missing 70 49 50 68
6.56% 7.18% 5.58% 7.09%
White 496 388 474 471
46.49% 56.89%* 52.90% 49.11%
Black 290 126 204 252
27.18% 18.48%* 22.77% 26.28%
Hispanic 106 65 97 101
9.93% 9.53% 10.83% 10.53%
Asian 37 32 42 38
3.47% 4.69% 4.69% 3.96%
Other 31 21 28 29
2.91% 3.08% 3.13% 3.02%
Economic disadvantage 626 399 531 575
58.67% 58.50% 59.26% 59.96%

T A B L E 3 (Continued)

                                      Treatment      Unit 1         Unit 2         Unit 3
N                                     1067           682            896            959
                                      Mean (SD)      Mean (SD)      Mean (SD)      Mean (SD)
ELL status 112 69 110 109
10.50% 10.12% 12.28% 11.37%
Focal section 801 591 724 756
75.07% 86.66% 80.80% 78.83%
Region 1 296 101 198 262
27.74% 14.81%* 22.10%* 27.32%
Region 2 151 95 149 143
14.15% 13.93% 16.63% 14.91%
Region 3 304 287 301 309
28.49% 42.08%* 33.59%* 32.22%
Region 4 280 195 248 245
26.24% 28.59% 27.68% 25.55%

Note: The range of the Unit 1 raw score is 0–27, the range of the Unit 2 raw score is 0–45, and the range of the Unit 3 raw score is 0–27.
Abbreviation: ELL, English language learner.
*p < 0.05, two-tailed t-tests comparing each unit assessment group with the treatment sample (n = 1067).

5.2.2 | Independent variables

Students' scores on the three post-unit assessments were the independent variables. The design
of the post-unit assessments allowed for assessing students' knowledge-in-use aligned with the
NGSS PEs, examining phenomena such as squirrel adaptation to habitat change, application of
balanced and unbalanced force concepts to moving toy cars, and observation of bird character-
istics, behavioral patterns, and life cycles (Li, Miller, et al., 2021). All items are 3D and open-ended. Cognitive interviews on the assessment items were conducted several times to ensure that all students would understand the item phrasing, which resulted in revisions of the
items. Table 4 provides the overall item information for each post-unit assessment. The post-
unit assessments for Units 1, 2, and 3 each consist of a varying number of items, scored based
on partial credit rubrics ranging from 0 to 3. To enable the comparability of the post-unit assess-
ments and the end-of-year summative test, a standardization procedure was employed to turn
the raw scores into z-scores. The Unit 1 post-unit assessment includes nine items, with a total
raw score range of 0–27. The average standardized (z) score for this unit is 0.06, accompanied
by a standard deviation of 0.95. In contrast, the Unit 2 post-unit assessment comprises 15 items
and has a wider raw score range of 0–45. The mean z-score for this unit is slightly lower than
Unit 1, standing at 0.03, though the standard deviation is marginally higher at 0.96. Finally, the
Unit 3 post-unit assessment, like Unit 1, consists of nine items and a raw score range of 0–27.
This unit has an average z-score of 0.07, similar to that of Unit 1, and a standard deviation of
0.94. Psychometric properties of the three post-unit assessments are reported in Appendixes 6 and 7.
A sample item with an associated student response is depicted in Figure 4. The item (sq_v3)
from Unit 1 asks students to construct a model to explain the potential impacts on squirrels if

T A B L E 4 Descriptions of post-unit assessments.

Unit ID Scenarios Item ID Main focal SEPs


Unit 1 S1: Two different types of squirrels' features and habitat information and their survival. sq_v1–sq_v2 Obtain information and
(Squirrels/ build claims
Adaptation) S2: If the habitat of the Eastern Grey Squirrels changes, what would happen to the Eastern sq_v3–sq_v5 Develop model and
Grey Squirrels who live in this area? construct explanation
S3: If the habitat of the Eastern Grey Squirrels is flooded, what are the possible changes in sq_v6–sq_v7 Develop model and
Eastern Grey Squirrels that survived in the area after millions of years? construct explanation
S4: How does this unit learning connect to people? sq_v8–sq_v9 Community-oriented
items
Unit 2 T1: What makes a toy car stand still? ty_v1–ty_v3 System and systems
(Toys/Forces and model
motion) T2: When we push two toy cars into each other, what makes toy cars start moving, hit ty_v4–ty_v8 Develop model and
each other, move backward, and then stop? construct explanation
T3: How can we change the car design to prevent two cars from colliding? ty_v9–ty_v12 Design solutions
T4: What forces are acting in the toy car when it fell off the table? ty_v13 Develop model and
construct explanation
T5: How does this unit learning connect to people? ty_v14–ty_v15 Community-oriented
items
Unit 3 (Birds/Biodiversity) B1: How do birds' traits and living environment help them survive? bd_v1–bd_v2 Obtain information
bd_v3–bd_v5 Develop models and construct explanation
bd_v6–bd_v7 Construct explanation
B2: Community-oriented items: How does this unit learning connect to people? bd_v8–bd_v9 Community-oriented
items

F I G U R E 4 Example of a modeling item (sq_v3) and student's response in Unit 1.

all trees in the area were removed. The example response (model) shows the essential compo-
nents in the system that could affect squirrels' survival, including trees, habitat or shelter, food
(acorns), and squirrels. The model also illustrates the interrelationships among these compo-
nents using arrows, showcasing the causal relationship between the squirrel and its survival.
Holistic rubrics were devised for each item to objectively and reliably assess students' level of

T A B L E 5 Rubric of a modeling item (sq_v3) in Unit 1.

Rubric Score
Model clearly shows (labels, words, pictures, symbols) 3
1. What happens to the squirrel's survival and
2. Clearly states or shows the relationship of no trees to what happened to the squirrel.
(The model is interpreted with the text below.)
Model shows (labels, words, pictures, symbols) what happens to the squirrel's survival but the 2
relationship to the event of no trees is unclear or needs to be inferred. (The model is interpreted
with the text below.)
Not clear that the event of no trees caused a change to the squirrel's survival. Or not clear that 1
there was an event.
No model or no explanation, or illegible. 0

knowledge-in-use. The rubric for item sq_v3 is presented in Table 5, awarding up to three
points for the demonstrated response.
Students responded to the post-unit assessments using paper/pencil. The tests were then
scanned and converted into PDFs for scoring. Research assistants with a background in science
education served as raters and scored all post-unit assessments. The scoring procedure consisted
of two phases: a training phase and an official scoring phase. We provided raters with item-
specific scoring rubrics and response examples. Each unit involved training for four to six raters.
Intra-class correlations (ICCs), Cohen's kappa, and percentage of agreement were employed as indexes to gauge the inter-rater reliability (IRR) between pairs of raters (Table 6). An ICC coefficient of 0.70, a widely adopted reliability threshold in IRR work, was used as the baseline criterion to ensure agreement between pairs of raters for each workbook (Koo & Li, 2016).
Raters were trained over several sessions to review and apply the rubrics to practice cases and assigned scoring workbooks until the ICC baseline was attained. In randomly assigned rounds of scoring, each rater received 100–150 cases, including 15 repeated cases. These repeated
cases served to check the IRR across all raters. Table 6 displays the IRRs across all rounds for
the three post-unit assessments. The averages of ICCs across units were either close to or
exceeded 0.70, indicating good reliability of raters' scoring on the post-unit assessment tests
(Fleiss, 1981). An average ICC of 0.71 for the three post-unit assessment tests met the
established criterion, signifying all scores were reliable for use. In addition, Cohen's kappa coef-
ficients and percentage of agreement were reported as supplementary IRR indexes to evaluate
raters' consensus. The overall agreement percentage across units approximated 0.67, with
Cohen's kappa indicating slightly lower reliability at 0.53. A lower than acceptable Cohen's
kappa (Cohen, 1960) can be attributed to the complexity of calculating the average of kappa
values when more than two raters are involved in scoring assessments. Also, the 0–3 scoring scale for each item makes exact agreement inherently harder to reach than dichotomous codes (0 and 1).
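To make the three indexes concrete, here is a small sketch using cohen_kappa_score from scikit-learn and a hand-rolled two-way random-effects ICC; the study does not specify its exact ICC variant, so ICC(2,1) here is an assumption:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Percent agreement and Cohen's kappa on the repeated (shared) cases; a and b
# are equal-length arrays of 0-3 rubric scores from two raters.
def irr_pair(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b)), cohen_kappa_score(a, b)

# Two-way random-effects ICC(2,1) (Shrout & Fleiss, 1979) from an
# n_cases x n_raters score matrix; a simplified stand-in for the study's computation.
def icc_2_1(X):
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ms_rows = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # cases (targets)
    ms_cols = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```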
To ascertain whether all items on each post-unit assessment load on a single factor—measuring knowledge-in-use—we conducted a confirmatory factor analysis to examine the unidimensional construct validity of the post-unit assessments. Cronbach's alpha, measuring the internal consistency of items within each unit, ranged from 0.71 to 0.84 (see Appendix 5); high consistency is indicated by values approaching 1.00 (DeVellis, 2003; Tavakol & Dennick, 2011). In addition, we employed the Graded Response Model, which analyzes ordinal and rating-scale responses for

T A B L E 6 Inter-rater reliability indexes among three post-unit assessments.

Workbook    Number of raters    Inter-rater correlation coefficient    Kappa    Percent of agreement
Unit 1
1 3 0.57 0.49 0.65
2 3 0.51 0.42 0.61
3 2 0.65 0.47 0.68
4 2 0.88 0.62 0.63
5 2 0.75 0.58 0.69
6 2 0.74 0.52 0.69
68% 52% 66%
Unit 2
1 4 0.78 0.69 0.71
2 4 0.76 0.65 0.74
3 4 0.7 0.59 0.75
4 4 0.75 0.6 0.76
75% 63% 74%
Unit 3
1 3 0.84 0.56 0.71
2 4 0.76 0.56 0.7
3 4 0.77 0.55 0.7
4 4 0.75 0.58 0.71
5 4 0.73 0.55 0.7
6 4 0.66 0.43 0.6
7 4 0.59 0.44 0.59
8 4 0.65 0.46 0.62
9 3 0.64 0.41 0.58
10 4 0.67 0.53 0.63
69% 50.70% 64%
Overall average 71% 54% 67%

Note: The workbook, an Excel spreadsheet, contains a set of student responses to post-unit assessments. Each workbook also
includes overlapping cases among different raters to ensure their inter-rater reliability (IRR) can be checked and maintained.

polytomous items using a 2-PL model, to verify the validity of the post-unit assessments
(Samejima, 1969). Based on the results, all post-unit assessment items across the three units
have moderate to high discrimination, ranging from 0.41 to 6.59, indicating all items can distin-
guish students with varying abilities (Baker, 2001). The item difficulty thresholds (B values) of each item increase gradually from low to high, indicating that the rubrics were suitable for measuring students' ability. Appendix 7 shows the marginal reliability for each unit assessment, based on the probability density function of the latent trait distribution for
students who took the post-unit assessments. Overall, the results confirm the reliability and
validity of the three post-unit assessments. As with the summative test, raw scores were standardized into z-scores to enable comparability between the post-unit assessments and the end-of-year summative test. The z-scores evaluate each student's learning relative to the average student in the analytic sample: a positive z-score indicates performance above the sample mean, while a negative z-score indicates performance below it.
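For reference, the Graded Response Model underlying these analyses can be written in its standard form (Samejima, 1969); the notation below is a conventional statement of the model rather than the paper's own:

$$P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\left[-a_i(\theta - b_{ik})\right]}, \qquad P(X_i = k \mid \theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),$$

with the boundary conventions $P^{*}_{i0}(\theta) = 1$ and $P^{*}_{i,4}(\theta) = 0$ for a 0–3 rubric. Here $a_i$ is the item discrimination (the reported 0.41–6.59 range) and $b_{ik}$ are the ordered difficulty thresholds (the B values).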

5.2.3 | Covariates

Considering student-level covariates, we took into account gender, race and ethnicity,
economic status, ELL status, the form of the end-of-year summative test, and the pre-test
score (e.g., reading benchmark). To establish a baseline for student academic achieve-
ment, schools, teachers, or districts were asked to provide students' third-grade fall and
winter reading or math benchmark scores. In Michigan, different elementary schools use
different benchmark tests. The five different benchmark tests include the Northwest Eval-
uation Association Measure of Academic Progress, Star, i-Ready, DIBELS, and Fountas
and Pinnell (F&P). Student raw scores were converted into percentile rankings and com-
pared across various tests by using the most recent national norming guides for each test
(for details, see the ML-PBL Technical Report: https://mlpbl.open3d.science/techreport). In this study, we also transformed the percentile rankings into z-scores to allow comparable interpretation across models in the analyses.
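For illustration, one plausible way to carry out this percentile-to-z transformation is an inverse-normal transform; the text does not specify the exact method used, so treat this as an assumption:

```python
from scipy.stats import norm

# Inverse-normal transform from national percentile rank to a z-score.
def percentile_to_z(pct: float) -> float:
    return float(norm.ppf(pct / 100.0))  # e.g., 50th percentile -> 0.0; 84th -> ~0.99
```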
At the teacher classroom level, we also account for several classroom characteristics,
including the number of students in the classroom, school region, and focal section. Focal sec-
tions refer to teachers who taught more than one section of third-grade science; one of the multiple sections was randomly selected as the focal section (for details, see the ML-PBL Technical Report: https://mlpbl.open3d.science/techreport). Names and descriptions of the above variables are presented in
Table 3.

5.3 | Analytic models

Due to the nested structure of the data, we employed a two-level hierarchical linear model-
ing (HLM) to estimate the associations between post-unit assessments and the end-of-year
summative test while controlling for covariates (Raudenbush & Bryk, 2002). The variation in
the end-of-year summative test can be estimated within and between classrooms. We began
our analysis with the unconditional model (null model), assessing the appropriateness of
using the two-level model to analyze our data. Then, we included the covariates in the final
model and calculated the variance component. Next, we examined the relationship between
three consecutive post-unit assessments and the end-of-year summative test, considering and
controlling for prior unit learning experience. The following equation shows our analytic
models:


$$
\begin{aligned}
Y_{ij}\,(\text{end-of-year summative test z-score}) ={}& \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2j}(\text{Post-unit assessment z-score})_{ij} \\
&+ \beta_{3\text{--}4j}(\text{Form of end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} \\
&+ \beta_{6\text{--}9j}(\text{Race group})_{ij} + \beta_{10j}(\text{N of students in classroom})_{j} \\
&+ \beta_{11j}(\text{Focal section})_{j} + \beta_{12\text{--}14j}(\text{Region})_{j} \\
&+ \beta_{15\text{--}17j}(\text{Missing flags in post-unit assessments})_{ij} \\
&+ \beta_{18\text{--}29j}(\text{Interaction terms})_{ij} + \varepsilon_{ij},
\end{aligned}
\tag{1}
$$

where $Y_{ij}$ represents the outcome of interest (i.e., the end-of-year summative test z-score) for student $i$ in classroom $j$. $(\text{PRETEST})_{ij}$ is a continuous variable, standardized into a z-score, representing an individual student's initial level of reading achievement before participating in ML-PBL. $(\text{Post-unit assessment z-score})_{ij}$ is the key predictor in this study. $(\text{Form of end-of-year summative test})_{ij}$ comprises two dummy variables indicating which form of the end-of-year summative test a student received. $(\text{Female})_{ij}$ is a dummy indicator for female students. $(\text{Race group})_{ij}$ comprises four dummy variables representing racial/ethnic group statuses. $(\text{N of students in classroom})_{j}$ is a continuous variable representing the number of students in the classroom. $(\text{Focal section})_{j}$ is a dummy variable indicating whether the classroom serves as a focal section. $(\text{Region})_{j}$ comprises three dummy variables for the region of the school location. $(\text{Missing flags in post-unit assessments})_{ij}$ comprises three dummy variables representing missingness on each unit assessment among the treatment sample. $(\text{Interaction terms})_{ij}$ are included in the final model when addressing RQ3; each interaction term is constructed by multiplying the post-unit assessment z-score (the predictor of interest) by a student demographic indicator (e.g., gender, SES).
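Although the study fit Equation (1) in HLM 7.0, a minimal random-intercept version can be sketched in Python as follows (illustrative variable names; interaction terms would be added only for RQ3):

```python
import statsmodels.formula.api as smf

# Two-level random-intercept model mirroring Equation (1); not the study's
# actual code, and all column names are assumptions.
def fit_two_level(df):
    model = smf.mixedlm(
        "summative_z ~ pretest_z + unit_z + unit_missing + C(form) + female"
        " + C(race) + n_students + focal_section + C(region)",
        data=df,
        groups=df["classroom_id"],  # level-2 units: teacher classrooms
    )
    return model.fit(reml=True)
```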

5.4 | Analytic strategy

Detailed HLM models applied to answer the three RQs are provided in Appendix 8. To address RQ1,
we investigated the effect of each post-unit assessment on the end-of-year summative test, taking into
account students' prior learning experiences. Predicated on the inter-unit coherence design principles
underpinning the three ML-PBL units, we postulated that exposure to an increased number of ML-
PBL units would enhance the depth of students' scientific understanding in comparison to students
with fewer unit experiences or limited learning opportunities. Consequently, we theorized that, given
prior learning experiences, students' performances on Units 2 and 3 post-unit assessments would
show progressively stronger associations with end-of-year summative test outcomes (see Table 8). We evaluated five regression models to address RQ1, each nested within the next to enable comparative analysis of model improvement. We first fit an unconditional model to validate the suitability of the two-level model for our data. Subsequently, we integrated the aforementioned covariates into Model 1 and computed the variance components. In Models 2–4, we examined the relationships
between post-unit assessment performance and end-of-year scientific achievement, considering the
influence of prior learning experiences associated with the post-units. In addition, we utilized the z-
test to assess the equality of two coefficients (Clogg et al., 1995; Paternoster et al., 1998).
In line with the intra-unit coherence design, each ML-PBL unit contributes uniquely to the
end-of-year summative test. Therefore, we aimed to ascertain the predictive value of each unit
on summative scientific achievement, while controlling for students' prior learning experiences
(RQ2). Based on the equity-oriented design principles integrated into the ML-PBL units, we hypothesized that the units would facilitate learning opportunities for students across diverse demographic backgrounds. Thus, we
sought to determine if the associations between post-unit assessment performances and the end-
of-year summative test varied according to gender, race, economic status, or ELL status (RQ3). Several
interaction terms were incorporated to investigate variations in students' post-unit assessment impacts
on the end-of-year summative test. We conducted all descriptive analyses and correlations using Stata
15, while two-level HLM analysis was performed utilizing the HLM 7.0 software.

6 | R E SUL T S

This section first presents the descriptive statistics (see Table 3) of sample sizes (both treatment
sample and sub-samples that completed post-unit assessment), means, and standard deviations
of the key variables, including the post-unit assessment scores, end-of-year summative test
scores, covariates (pre-test z-score, gender, race and ethnicity, districts, ELLs, form of the end-
of-year summative test, and focal section). Demographic differences between the overall sample
and sub-samples across three units are also detailed in Table 3. Compared to the analytic treat-
ment sample, the Unit 1 sample significantly outperforms in end-of-year science achievement
(M = 0.31, SD = 1.03, p < 0.05) and reading pre-test (M = 0.15, SD = 0.98, p < 0.05). Yet, the
Unit 1 sample shows a significantly lower proportion of Black students and students in Region
1. Units 2 and 3 samples demonstrate similar levels of end-of-year science achievement and
reading pre-test to the final analytic treatment sample. The Unit 2 sample, however, has a sig-
nificantly higher proportion of Black students and residents in Region 1. The demographic dis-
crepancies between the full treatment sample and the three subsamples hint at the potential
impacts of missingness for students who completed different sets of post-unit assessments. To
alleviate this selection bias in the analytical sample, a statistical control function is incorporated
to correct for endogeneity problems related to missingness on the summative test. This includes
incorporating missing flags or some demographic variables (e.g., gender) as covariates.
Before estimation, inter-unit coherence among the three post-unit assessments was checked, with Table 7 displaying the correlations among students' end-of-year summative test, pre-test, and post-unit assessment scores. Significant correlations were found between the three consecutive post-unit assessment z-scores and the end-of-year summative test z-score: 0.21 (Unit 1), 0.29 (Unit 2), and 0.28 (Unit 3).
0.29 (Unit 2), and 0.28 (Unit 3). Assumptions of the HLM on the end-of-year summative test
were tested (Snijders & Bosker, 2011), and our results conformed to the HLM assumptions.
Appendix 9 outlines the HLM assumption checks for the HLM models in Table 8. To account
for the potential impact of non-normal data, we performed an additional multilevel analysis
with robust standard error estimates (Hox & Maas, 2001; see Appendix 10). Based on these cor-
relations, the following results address the RQs.

6.1 | Effects of post-unit assessment performance on student end-of-year summative test performance accounting for prior unit learning experience

Table 8 presents the single-unit estimates of the relationship between students' post-unit assessment performance and their end-of-year summative test performance when considering their prior unit learning

T A B L E 7 Correlation table: post-unit assessments and end-of-year summative test.

(1) (2) (3) (4) (5)


(1) End-of-year summative test z-score 1
(2) Pretest reading z-score 0.346*** 1
(3) Unit 1 post-unit assessment z-score 0.217*** 0.338*** 1
(4) Unit 2 post-unit assessment z-score 0.294*** 0.410*** 0.361*** 1
(5) Unit 3 post-unit assessment z-score 0.280*** 0.392*** 0.344*** 0.426*** 1

*p < 0.10; **p < 0.05; ***p < 0.001.

experience. We first estimated a model without predictors (i.e., unconditional model in Table 8)
to discern if student achievement on the end-of-year summative test varied by classrooms. A
significant result (B = 0.137, p < 0.05) indicated a need for a multilevel model for the end-
of-year summative test. The ICC value in the unconditional model was 0.12, indicating 12% of
science achievement variance on the end-of-year summative test was attributed to the differ-
ences between teacher classrooms. Model 1 revealed that students' pre-test reading z-score was significantly associated with their end-of-year summative test z-score (B = 0.355, p < 0.001) when controlling for the covariates: a one standard deviation increase in pretest reading is associated with a 0.355 standard deviation increase in the end-of-year summative test. Black and Asian students had significantly lower end-of-year summative test z-scores compared to their White counterparts, and students in Regions 3 and 4 had lower end-of-year summative test z-scores compared to Region 1. When comparing variance components between the unconditional model and Model 1, the covariates explained approximately 12% of the student-level variance and approximately 65% of the classroom-level variance in the end-of-year summative test z-score.
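Using the variance components reported in Table 8, these figures can be reproduced directly; the formulas below follow Raudenbush and Bryk (2002), with $\tau_{00}$ the classroom-level and $\sigma^{2}$ the student-level variance:

$$\mathrm{ICC} = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}} = \frac{0.129}{0.129 + 0.952} \approx 0.12,$$

$$\frac{\hat{\sigma}^{2}_{\text{null}} - \hat{\sigma}^{2}_{\text{M1}}}{\hat{\sigma}^{2}_{\text{null}}} = \frac{0.952 - 0.830}{0.952} \approx 0.13, \qquad \frac{\hat{\tau}_{00,\text{null}} - \hat{\tau}_{00,\text{M1}}}{\hat{\tau}_{00,\text{null}}} = \frac{0.129 - 0.044}{0.129} \approx 0.66,$$

roughly matching the approximately 12% student-level and 65% classroom-level variance reductions reported in the text.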
Model 2 reported the relationship between Unit 1 post-unit assessment z-score and end-
of-year summative test z-score. Results showed that students' Unit 1 post-unit assessment performance (B = 0.056, p > 0.05) was not significantly associated with their end-of-year summative test performance when controlling for other covariates. Students who missed or did not participate in the Unit 1 assessment (B = 0.145, p < 0.05) performed significantly lower on the end-of-year summative test compared to Unit 1 assessment test takers. Model 3 shows that students' performance on the Unit 2 post-unit assessment had a significant positive association with their end-of-year summative test performance (B = 0.154, p < 0.001) considering their Unit 1 and Unit 2 learning. This indicates that students who had experienced Units 1 and 2 tended to have higher average end-of-year summative test achievement: a one standard deviation increase in the Unit 2 assessment is associated with a 0.154 standard deviation increase in the end-of-year summative test, considering Unit 1 learning and other covariates. Model 4 shows the Unit 3 post-unit assessment z-score was significantly positively associated with the end-of-year summative test z-score (B = 0.127, p < 0.001) considering Unit 1 and Unit 2 learning and other covariates. That is, for students who had experienced all three units, a one standard deviation increase in the Unit 3 post-unit assessment z-score was associated with a 0.127 standard deviation increase in the end-of-year summative test z-score.
To further test the equality of two regression coefficients, the effect for students with the first two units of learning experience (B = 0.154, p < 0.001) and the effect for students with all three units of learning experience (B = 0.127, p < 0.001) (the Unit 2 and Unit 3 coefficients in Table 8), we employed a Wald test (Clogg et al., 1995; Paternoster et al., 1998) to test the differ-

T A B L E 8 Effect of students' performance on post-unit assessments on end-of-year summative test


considering prior unit learning experience.

Unconditional
model Model 1 Model 2 Model 3 Model 4
b/SE b/SE b/SE b/SE b/SE
Pretest: benchmark 0.355*** 0.340*** 0.310*** 0.319***
reading z-score (0.033) (0.034) (0.034) (0.034)
U1 post-unit assessment 0.056
z-score (0.041)
U1 post-unit assessment 0.145*
z-score missing flag (0.072)
U2 post-unit assessment z-score 0.154***
(0.041)
U2 post-unit assessment z-score 0.140
missing flag (0.095)
U3 post-unit assessment z-score 0.127***
(0.035)
U3 post-unit assessment z-score 0.004
missing flag (0.104)
Form B (ref. Form A) 0.029 0.027 0.021 0.041
(0.068) (0.068) (0.068) (0.068)
Form C 0.024 0.024 0.031 0.022
(0.069) (0.068) (0.068) (0.068)
Female 0.003 0.003 0.003 0.039
(0.059) (0.059) (0.059) (0.060)
Female missing 0.891 0.854 0.922 0.914
(0.935) (0.934) (0.929) (0.936)
Race missing 0.980 0.926 0.968 1.018
(0.927) (0.926) (0.921) (0.928)
Race: Black (ref. White) 0.604 0.571 0.658 0.659
(0.928) (0.926) (0.922) (0.928)
Race: Hispanic 0.970 0.917 0.995 1.010
(0.933) (0.932) (0.927) (0.934)
Race: Asian 0.560 0.480 0.597 0.623
(0.945) (0.943) (0.938) (0.946)
Race: multi-race 1.149 1.116 1.140 1.187
(0.938) (0.937) (0.931) (0.939)
Disadvantaged student 0.072 0.068 0.052 0.055
(0.071) (0.071) (0.071) (0.071)

T A B L E 8 (Continued)

Unconditional
model Model 1 Model 2 Model 3 Model 4
b/SE b/SE b/SE b/SE b/SE
ELL student 0.203 0.189 0.238 0.210
(0.148) (0.147) (0.147) (0.146)
Region 2 (ref. Region 1) 0.028 0.053 0.147 0.043
(0.178) (0.176) (0.181) (0.171)
Region 3 0.215 0.278+ 0.238+ 0.250+
(0.142) (0.144) (0.141) (0.137)
Region 4 0.424** 0.443** 0.400** 0.437**
(0.148) (0.148) (0.147) (0.142)
Focal section 0.243+ 0.222+ 0.267* 0.250*
(0.125) (0.123) (0.126) (0.119)
N of student in classroom 0.021 0.018 0.022+ 0.019
(0.013) (0.013) (0.013) (0.013)
Constant 0.137* 1.186 0.953 1.218 1.146
(0.058) (0.984) (0.985) (0.976) (0.982)
ICC 0.12 0.051 0.048 0.049 0.042
Random component
Classroom level 0.129 0.044 0.042 0.042 0.036
0.036 0.017 0.017 0.017 0.016
Student level 0.952 0.830 0.827 0.819 0.820
0.042 0.037 0.036 0.036 0.036

Note: Coefficients are measured in standard deviations of the summative test. Standard errors are in parentheses. N = 1067.
Abbreviations: ELL, English language learner; ICC, Intra-class correlation.
+p < 0.1.
*p < 0.05; **p < 0.01; ***p < 0.001.

ence between the Units 2 and 3 coefficients. The results show that the effect of Unit 3 post-unit assessment performance on end-of-year summative test performance is similar to the effect of Unit 2 post-unit assessment performance (Wald test = 0.51): although the regression coefficient for Unit 3 is smaller than that for Unit 2, the difference is not significant. In contrast, significant differences in regression coefficients were observed between Units 1 and 2 and between Units 1 and 3, indicating that Units 2 and 3 predict the end-of-year summative test significantly more strongly than Unit 1. In short, when students experience more ML-PBL units, they are more likely to have a higher end-of-year summative test z-score.
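The Clogg et al. (1995) statistic reduces to a simple computation that reproduces the reported value; a sketch using the Table 8 coefficients and standard errors:

```python
import math
from scipy.stats import norm

# Coefficient-equality test: z = (b1 - b2) / sqrt(se1^2 + se2^2).
def coef_diff_z(b1, se1, b2, se2):
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, 2 * norm.sf(abs(z))

z, p = coef_diff_z(0.154, 0.041, 0.127, 0.035)  # Unit 2 vs. Unit 3 in Table 8
# z ~ 0.50 (consistent with the reported Wald statistic of 0.51), p ~ 0.62
```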

6.2 | Effects of single post-unit assessment performance on student end-of-year summative test performance controlling for prior unit learning

In Table 9, Model 1 shows a similar result as previously discussed: students' performance on the Unit 1 post-unit assessment is not significantly associated with their end-of-year summative test

T A B L E 9 Effect of students' performance on post-unit assessments on the end-of-year summative test


controlling prior unit learning experience.

Model 1 Model 2 Model 3


b/SE b/SE b/SE
Pretest: benchmark reading z-score 0.340*** 0.303*** 0.285***
(0.034) (0.035) (0.036)
U1 post-unit assessment z-score 0.056 0.033 0.021
(0.041) (0.041) (0.041)
U1 post-unit assessment missing flag 0.145* 0.137+ 0.141*
(0.072) (0.072) (0.072)
U2 post-unit assessment z-score 0.149*** 0.122**
(0.042) (0.043)
U2 post-unit assessment missing flag 0.114 0.117
(0.095) (0.094)
U3 post-unit assessment z-score 0.101**
(0.036)
U3 post-unit assessment missing flag 0.036
(0.105)
Form B 0.027 0.021 0.032
(0.068) (0.068) (0.068)
Form C 0.024 0.030 0.028
(0.068) (0.068) (0.068)
Female 0.003 0.001 0.034
(0.059) (0.059) (0.060)
Female_missing 0.072 0.038 0.053
(0.139) (0.140) (0.139)
Race_missing 0.355** 0.298* 0.298**
(0.116) (0.117) (0.115)
Race: Black (ref. White) 0.009 0.026 0.021
(0.138) (0.138) (0.137)
Race: Hispanic 0.446* 0.397+ 0.393+
(0.212) (0.212) (0.211)
Race: Asian 0.190 0.191 0.193
(0.174) (0.173) (0.173)
Race: multi-race 0.926 0.937 1.017
(0.926) (0.921) (0.924)
Disadvantaged student 0.068 0.051 0.042
(0.071) (0.071) (0.071)
ELL student 0.189 0.224 0.226
(0.147) (0.147) (0.146)

T A B L E 9 (Continued)

Model 1 Model 2 Model 3


b/SE b/SE b/SE
Region 2 (ref. Region 1) 0.053 0.160 0.162
(0.176) (0.180) (0.174)
Region 3 0.278+ 0.295* 0.321*
(0.144) (0.143) (0.139)
Region 4 0.443** 0.423** 0.448**
(0.148) (0.148) (0.144)
Focal section 0.222+ 0.252* 0.249*
(0.123) (0.125) (0.121)
N of student in classroom 0.018 0.019 0.017
(0.013) (0.013) (0.013)
Constant 0.027 0.080 0.003
(0.351) (0.348) (0.340)
ICC 0.048 0.043 0.041
AIC 2900.662 2891.97 2887.47
BIC 3014.967 3015.80 3021.16
df 23 25 27
Random component
Classroom level 0.042 0.041 0.035
0.017 0.016 0.016
Student level 0.827 0.817 0.814
0.036 0.036 0.036

Note: Coefficients are measured in standard deviations of the summative test. Standard errors are in parentheses. N = 1067.
Abbreviations: AIC, Akaike Information Criterion; BIC, Bayesian Information Criterion; ELL, English language learner; ICC, Intra-class correlation.
+p < 0.1.
*p < 0.05; **p < 0.01; ***p < 0.001.

performance. Model 2 shows that students' Unit 2 assessment z-score had a significantly positive relationship with the end-of-year summative test z-score (B = 0.149, p < 0.001), controlling for prior Unit 1 performance and other covariates. This result indicates that students' performance on the Unit 2 post-unit assessment has an independent impact on their end-of-year summative test z-score: when students increased one standard deviation on their Unit 2 assessment, they gained 0.149 standard deviations on the end-of-year summative test, holding constant their prior Unit 1 performance and the covariates. Model 3 shows students' Unit 3 assessment z-score also has a significantly positive relationship with the end-of-year summative test z-score (B = 0.101, p < 0.01), controlling for prior units' performances and covariates: with a one standard deviation increase on the Unit 3 post-unit assessment, students gain 0.101 standard deviations on the end-of-year summative test, controlling for Units 1 and 2 learning and other covariates. We also employed the likelihood ratio (LR) test to evaluate the difference between nested models and show evidence of model improvement between Models 2 and

3 (LR test = 7.89, p < 0.05). The significant result indicated Model 3, containing all three post-
unit assessments, fit the data significantly better than Model 2.
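For reference, the reported LR statistic can be converted to a p-value against a chi-square distribution with 2 degrees of freedom (the df difference between Models 2 and 3 in Table 9):

```python
from scipy.stats import chi2

# LR test between nested Models 2 and 3; df difference is 27 - 25 = 2.
lr_stat = 7.89                      # reported LR statistic
p_value = chi2.sf(lr_stat, df=2)    # ~0.019 < 0.05, favoring the fuller Model 3
```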
In short, students' performance on the Units 2 and 3 assessments is positively associated with their gains on the summative test, controlling for their prior performances and other covariates. Students' gender, economically disadvantaged status, ELL status, and the three forms of the end-of-year summative test had no effect on the end-of-year summative test after controlling for pretest reading and post-unit assessment z-scores. Only students in focal section classrooms had significantly higher end-of-year summative test performance compared to their counterparts in non-focal sections. Relative to Region 1, students from Regions 3 and 4 had lower end-of-year summative test performance. This tendency is consistent across Models 1–3.

6.3 | Prediction rates and student demographic groups

To investigate whether the relationships between post-unit assessments and the end-of-year
summative test vary by student demographics (e.g., gender, economically disadvantaged, ELL,
race/ethnicity, region), we examine a series of multilevel models with two-way interaction
terms between students' demographics and post-unit assessments. The summary of the hetero-
geneity results is presented in Table 10.
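Continuing the earlier model sketch, one such two-way interaction model might look like this (illustrative variable names; not the study's actual specification):

```python
import statsmodels.formula.api as smf

# Does the Unit 2 prediction differ for Black students? In the formula,
# `unit2_z * black` expands to both main effects plus their interaction.
def fit_interaction(df):
    model = smf.mixedlm(
        "summative_z ~ pretest_z + unit2_z * black + unit2_missing + C(form)"
        " + female + C(region) + n_students + focal_section",
        data=df,
        groups=df["classroom_id"],
    )
    return model.fit(reml=True)
```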
The results in the first panel suggest that there are no student demographic differences in the relationship between the Unit 1 post-unit assessment and the end-of-year summative test; the Unit 1 post-unit assessment did not show a bias in favor of students from certain backgrounds. A similar approach was applied to examine subgroup differences for Unit 2 in the second panel. A race/ethnicity group difference was found in the prediction of the Unit 2 post-unit assessment on the end-of-year summative test: the prediction rate of Unit 2 varies by race and ethnicity, particularly for Black students. The third panel examines subgroup differences for Unit 3; no group differences were detected for Unit 3.

6.4 | Robustness check

To ensure that the findings were robust to missing data, model specifications, and the analytic sample, we conducted several sets of additional analyses. The first set of analyses examines the impact of missingness across post-unit assessment takers on the end-of-year summative test. We conducted the identical analysis using a structural equation model with full information maximum likelihood (FIML), a well-recognized method for handling missing data and producing unbiased parameter estimates (Kline, 2011; Muthén & Muthén, 2017). These results again indicated positive relationships between post-unit assessments and the end-of-year summative test, particularly for Units 2 and 3 (Appendix 11).
The second set of analyses ensures the consistency of the main findings across different sample-selection decisions. We were concerned that the choice of analytic sample could introduce bias because children did not take or complete the post-unit assessments for various reasons. Therefore, we conducted three additional analyses to account for potential bias from the final sample-selection decisions. The first analysis re-ran the final model using only students (N = 470) who completed all three post-unit assessments (Appendix 12). For this subsample, our results show that the positive relationship between post-unit assessments (especially Units
T A B L E 1 0 Summary of student-level heterogeneity.

Female SES ELL White Black Hispanic Asian Others Region 1 Region 2 Region 3 Region 4
Unit 1
Predictor of interest 0.005 0.086 0.082 0.254 0.360** 0.096 0.514* 0.249 0.468** 0.172 0.307* 0.454**
(0.063) (0.077) (0.154) (0.181) (0.124) (0.146) (0.217) (0.182) (0.155) (0.183) (0.149) (0.155)
Interaction 0.096 0.013 0.076 0.033 0.063 0.013 0.225 0.081 0.072 0.101 0.005 0.120
(0.083) (0.094) (0.115) (0.084) (0.092) (0.157) (0.142) (0.235) (0.102) (0.123) (0.088) (0.096)
Covariates Included Included Included Included Included Included Included Included Included Included Included Included
Unit 2
Predictor of interest 0.006 0.046 0.135 0.255 0.329** 0.062 0.444* 0.225 0.417** 0.227 0.253+ 0.422**
(0.062) (0.077) (0.153) (0.180) (0.126) (0.144) (0.216) (0.183) (0.157) (0.195) (0.145) (0.155)
Interaction 0.094 0.097 0.010 0.131 0.215* 0.049 0.202 0.010 0.049 0.084 0.028 0.129
(0.070) (0.081) (0.114) (0.082) (0.093) (0.131) (0.155) (0.185) (0.091) (0.098) (0.093) (0.107)
Covariates Included Included Included Included Included Included Included Included Included Included Included Included
Unit 3
Predictor of interest 0.057 0.062 0.117 0.224 0.355** 0.111 0.475* 0.235 0.462** 0.132 0.272* 0.459**
(0.063) (0.076) (0.152) (0.180) (0.121) (0.144) (0.212) (0.181) (0.147) (0.173) (0.139) (0.146)
Interaction 0.010 0.016 0.078 0.007 0.064 0.179 0.067 0.065 0.026 0.131 0.008 0.048
(0.066) (0.073) (0.106) (0.068) (0.074) (0.124) (0.135) (0.165) (0.074) (0.104) (0.070) (0.091)
Covariates Included Included Included Included Included Included Included Included Included Included Included Included

Abbreviations: ELL, English language learner; SES, socioeconomic status.



2 and 3) and the end-of-year summative test still held. The second analysis used an analytic sample consistent with the Unit 3 sample, resulting in 889 students in the final model (Appendix 13). The third analysis kept the analytic sample consistent with the Unit 1 sample, resulting in 629 students in the final model (Appendix 14). We also report results using single-unit student subsample data in Appendix 15. Together, these analyses confirm that our results are consistent regardless of whether a single-unit assessment sample or the full treatment sample is used.
The last set of additional analyses addresses the concern that unobserved teacher characteristics might drive the prediction of post-unit assessments on the end-of-year summative test. We employed a school fixed-effects model and a teacher fixed-effects model to check the robustness of our HLM results. Appendix 16 shows the teacher fixed-effects results; the significance and direction of the main findings reported above are unchanged.
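A sketch of the teacher fixed-effects variant, under the assumption of a pooled OLS with classroom dummies and clustered standard errors (the paper does not detail its exact specification):

```python
import statsmodels.formula.api as smf

# Teacher fixed effects: classroom dummies replace the random intercept,
# absorbing unobserved time-invariant teacher characteristics.
# Variable names are illustrative.
def fit_teacher_fe(df):
    return smf.ols(
        "summative_z ~ pretest_z + unit2_z + unit3_z + C(form) + female + C(race)"
        " + C(teacher_id)",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["teacher_id"]})
```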

7 | DISCUSSION

This study investigates students' knowledge-in-use development resulting from using a curricu-
lum system designed for coherence and equity. It provides empirical evidence on how elemen-
tary students' developing knowledge-in-use predicts science achievement, as assessed by a
third-party designed assessment. Specifically, this study explores the extent to which post-unit
assessments, designed to measure student knowledge-in-use, predict performance on an end-
of-year summative test designed to align with NGSS PEs. Furthermore, the study probes
whether the predictions changed as students experienced additional units, especially for stu-
dents from demographic groups underrepresented in STEM.

7.1 | Coherent learning system and student knowledge-in-use development

7.1.1 | Inter-unit coherence and cumulative unit effects

This study identifies a key observation: when accounting for students' prior unit learning expe-
riences, the predictive impact of student performance on each post-unit assessment tends to
accumulate across three units. Although we cannot confirm an increased linear gain from Units
2 to 3, our results indicate a significant linear gain from Units 1 to 2 and from Units 1 to 3. This
indicates cumulative learning gains on student post-unit assessments with respect to their
achievement on the end-of-year summative test. As students experience more units, they per-
form better on the final test. This finding suggests that a learning system, structured with a
coherent PBL approach, can effectively facilitate the development of students' knowledge-in-
use. It supports the call for boosting students' science learning (NRC, 2012) and offers a fresh
perspective on the principles of inter-unit coherence design. It specifically extends our under-
standing of these design principles by demonstrating the benefits of designing SEPs and CCCs
coherently and progressively across three units, with a purposeful revisitation of DCIs
(Duschl, 2019; Jin et al., 2019).
In evaluating the guiding principles behind the coherence design of the ML-PBL curriculum
across three consecutive units, it is clear that several SEPs and CCCs were selected based on the
DCIs' characteristics and phenomena within each unit. Despite the learning system aligning to
support student 3D learning (Fulmer et al., 2018), not all DCIs within each unit build progres-
sively across units. However, certain ideas (e.g., adaptations and traits) are revisited periodi-
cally. To understand Unit 2 DCIs (forces and motion), students did not require prior DCIs
explored in Unit 1 (adaptation and survival). Similarly, the DCIs in Unit 3 (traits inherited from
parents and variations of traits) did not build upon ideas from Unit 2, though there were con-
nections to Unit 1 (survival-related DCIs were secondary in Unit 3). Notably, SEPs and CCCs
were structured to build progressively across the units, thereby enabling students to incremen-
tally develop their knowledge-in-use (Miller & Krajcik, 2019). This aspect distinguishes this
study from previous ones, which mainly emphasize the coherent design of science ideas such as
matter and energy (Hadenfeldt et al., 2016). This unique coherence design approach may clarify
why we did not find linear cumulative effects of student post-unit assessment performances on
predicting their end-of-year summative test performance when considering their prior units'
learning. Besides the design rationale, the curriculum implementation could contribute to the
nonlinear cumulative effect. During the spring semester of the 2018–2019 academic year, when
our teachers were implementing the curriculum materials, there were unexpected severe
weather events (such as heavy snow causing many schools to shut down). As a result, science
class time in almost all schools was cut short, resulting in only superficial or partial Unit 3 learn-
ing. However, although the cumulative result is not as predicted, this study extends existing coherent curriculum design work by highlighting the coherent design of CCCs and SEPs, and it provides empirical evidence for designing an NGSS-aligned, 3D-coherent elementary curriculum. As we move forward, research could investigate the effectiveness of emphasizing different dimensions of NGSS PEs when designing coherent curricula, and in doing so, further extend the application of
inter-unit coherence design principles in assessment design and science education research.

7.1.2 | Intra-unit coherence and single unit effects

The second insight gained from this study is that, while controlling for students' prior unit
learning experiences, the performances in Unit 2 and Unit 3 can predict their scores on the
end-of-year summative test. These positive associations across units fortify the principle of
intra-unit coherence and contribute to our understanding of how this principle can be
harnessed to design effective units. The design principles, which ensure that the unit compo-
nents are not only aligned but also coherent with PBL and 3D learning features, are detailed in
Section 4. Each of the two units displays a positive relationship between the student's post-unit
assessment performance and their end-of-year summative test scores. The lack of a significant association for Unit 1 can be attributed to the novelty of the teaching, learning, and testing approaches for both students and teachers in this initial phase. The noticeable
improvement in student performance in subsequent units reinforces the importance of familiar-
ization with a learning system to promote student achievement. Yet, it is necessary to reassess
and refine the design and implementation of Unit 1.
These findings confirm the indispensable role of coherence among various aspects of a suc-
cessful learning system, including curriculum materials, instructions, assessments, and PL
(NRC, 2006). Furthermore, this research-based case offers valuable insights for the development
of a coherent NGSS-aligned learning system to support elementary students' knowledge-in-use
development (NRC, 2006). This extends the current knowledge about high school chemistry
learning system design (He, Chen, et al., 2023). Lastly, this research contributes to understand-
ing students' continuous learning process by using post-unit assessments to delve deeper into
the effects of ML-PBL intervention, expanding upon the existing efficacy studies of interven-
tions that rely solely on pre- and post-test design (Craig & Marshall, 2019; Han et al., 2015).

7.2 | Equity-oriented learning system and student knowledge-in-use development

A notable finding from this study is the consistent predictive power of post-unit assessments
across different demographic groups, barring racial/ethnic disparities in the Unit 2 assessment.
This pattern suggests a potential bias against Black and Hispanic students within the Unit 2 assessment. Although numerous factors could contribute to grading bias, including teachers' expectations and racial attitudes (Bonefeld & Dickhäuser, 2018; Ferguson, 2003), it should be noted
that our post-unit assessments were evaluated by rigorously trained raters rather than the
teachers themselves. This discrepancy suggests a potential bias within some assessment items,
necessitating future research into the quality of post-unit assessments, rubric design, and scor-
ing procedures. This could also be due to the teachers not focusing on the content of force and
motion. Future research should consider strategies to rectify this issue.
Interestingly, no demographic disparities were observed in Units 1 and 3. This suggests that
our curriculum and post-unit assessments provide equitable opportunities for students from
underrepresented groups to learn and excel. This aligns with the proposition that curriculum
and assessment should be designed to meet the needs of diverse students (NRC, 2012). For
instance, our learning system emphasizes the importance of engaging all students by contextu-
alizing phenomena within culturally and locally relevant settings, validating and strengthening
the resource pedagogy on cultural responsiveness and local connection (Gay, 2018;
NASEM, 2019). Our learning system also ensures adequate time for children to engage in 3D
learning, in line with the principles put forth by NASEM (2019).
Furthermore, our curriculum materials cater to students with diverse learning needs by
offering language support and mathematical tools, informing the design of inclusive learning
environments. Our post-unit assessments have been designed using the equity/fairness frame-
work from Harris et al.'s (2019) work, integrating principles from Universal Design for Learning
(Rose et al., 2005; Rose & Meyer, 2006) and research on fair and equitable assessment in science
(e.g., Lee et al., 2013; Wolf & Leon, 2009). These strategies ensure equal learning opportunities
for all students, contributing to our understanding of designing an equitable and inclusive
learning environment (e.g., Bang et al., 2017; Cunningham & Helms, 1998).

7.3 | Coherent assessment system and student knowledge-in-use development

Another significant contribution of our research is to the design of a coherent assessment sys-
tem to track and support student learning, aligned with the NGSS (2013). Our assessment
system plays a vital role in achieving different design goals, including coherence- and equity-oriented design principles. Crucially, designing and administering the three coherent
post-unit assessments allowed us to monitor student science learning over time. This research
is the first to offer empirical evidence on elementary students' knowledge-in-use development
over an academic year, reinforcing the advocacy for an assessment system to monitor and track
their performance against 3D learning goals (NRC, 2014). The three central findings of our
study underscore the value of a coherent assessment system in providing a comprehensive over-
view of student learning to track student knowledge-in-use development. In our system, the
ongoing LP of students, captured by post-unit assessments, equips teachers with insights into
student learning challenges (e.g., difficulties in model construction; NRC, 2014; Ruiz-Primo
et al., 2002). Moreover, the end-of-year summative test establishes a shared vision of science
learning goals across all ML-PBL classrooms and the state, contributing to the understanding of
knowledge-in-use performance. Consequently, real-time information on students' knowledge-
in-use performance can provide administrators and policymakers with immediate feedback,
informing their policies or decisions (NRC, 2006).
Our study highlights the intricate design of assessments for gauging knowledge-in-use (e.g.,
Harris et al., 2019). The results indicate that among the three units, Unit 2 had the most significant
effect on student end-of-year summative test performance. This could be because Unit 2, focusing
on a single DCI across its four PEs, enabled students to master learning goals more easily compared to Units 1 and 3, which spanned at least six PEs and multiple DCIs. Moreover, while force is
often a challenging and abstract concept for learners (Reiner et al., 2000; Vosniadou et al., 2001),
the phenomena explored in Unit 2 were tangible, centered around toy cars and rockets. Interest-
ingly, the contexts of items on the Unit 2 post-unit assessment closely resembled those on the end-
of-year summative test, which could have influenced the transfer of knowledge. This resemblance
might have eased the cognitive load on students when applying their learning to new situations
(Kalyuga, 2011; Sweller, 2011). Consistent with existing studies on transferable knowledge or adap-
tive skills (NRC, 2012b; Spiro et al., 2019), our findings indicate that context similarity can indeed
affect far-transfer (Bransford et al., 1999; Singley & Anderson, 1989). Further robust studies could
explore the optimal number of PEs or DCIs to include in each unit to best support students'
knowledge-in-use development. This could facilitate better monitoring of students' learning processes
over time and allow for the adjustment of instruction and support as needed. Moreover, the design
and selection of item context could be more intentional depending on specific learning goals, offer-
ing students more meaningful learning experiences and better opportunities to apply their knowl-
edge in various scenarios.

8 | LIMITATIONS AND FUTURE DIRECTIONS

Several constraints should be acknowledged in the interpretation of our results. First, our
approach to handling missing data could influence the study's outcomes. Despite performing
multiple robustness checks to mitigate the impact of missing data (see Section 6.4), residual lim-
itations may persist. Future research should explore strategies to further minimize the impact
of missing data. Second, a discrepancy exists between our post-unit assessments, which com-
prise solely constructed-response questions, and the end-of-year summative test, which consists
only of multiple-choice questions. This variation could affect the prediction estimates. Subse-
quent studies should investigate how different assessment formats might influence predictive
validity. Lastly, we recognize that students' performance on the Unit 2 post-unit assessment
is the strongest predictor of their end-of-year summative test performance. We
speculate that this may be due to an unexpected similarity in item design, potentially influenc-
ing the predictive characteristics of the post-unit assessments. Future rigorous studies should
examine the degree to which item context affects student knowledge-in-use performance.
Looking ahead, further exploration is required to understand how students' classroom learning aligns with their post-unit assessment performance. While we have carefully examined the correlation
between student performance on post-unit assessments and their end-of-year summative test,
future research should collect and analyze data on student classroom-embedded assessment
performance to develop a more nuanced understanding of this relationship. In addition, we recommend that future research delve into the dynamics of students' knowledge-in-use proficiency as it develops across consecutive PBL units. Finally, given the relatively small number of participants from underrepresented groups in this study, it is important for future research to place more focus on these populations. This would
not only diversify the sample but also enhance the validity of the findings. Including underrep-
resented groups can provide insightful perspectives into a broader range of students' learning
patterns, promoting more equitable educational practices. As this study was exploratory, a more
targeted approach could further strengthen the overall research field.

9 | CONCLUSION

This study expands existing studies on the positive effect of a PBL curriculum intervention on
student science achievement (Krajcik et al., 2023). It explores student learning trajectories
across three coherent elementary curriculum units. This study advances the field in two signifi-
cant ways. First, the PBL contexts were purposefully designed to be system-wide and grounded in a student-centered, culturally relevant design. This attention to an equity-oriented
design principle permeated all aspects of the system from its inception, emphasizing that con-
siderations of equity must be integrated from the start of curriculum development rather than
as an afterthought. This study thus reinforces the assertion that a consistent application of
equity principles throughout the design process is vital for equitable outcomes. Second, our
research scrutinizes students' learning processes and highlights strategies for monitoring ongoing knowledge-in-use performance. Whereas most existing studies explore the effect of a single unit (e.g., Harris et al., 2015), this research is situated within a long-term, coherent PBL
learning system. This system comprises four interconnected curriculum units, professional
learning, and assessments, with each component contributing to the formation of a cohesive system aligned with the NGSS. Furthermore, the findings from this study provide robust evidence underscoring the importance of designing a coherent and inclusive learning and assessment system
to enhance student science learning (NRC, 2006, 2012). The findings thereby strengthen the
argument for such designs in educational practice.

ACKNOWLEDGMENTS
This study received funding from Lucas Education Research, a sector of the George Lucas Edu-
cational Foundation under APP# 13987. The opinions, findings, conclusions, or recommenda-
tions presented in this content are solely those of the authors and do not necessarily represent
the perspectives of the George Lucas Educational Foundation. Our gratitude extends to Kayla
Bartz for her contributions to the analytical plan of this research. The authors also appreciate
Peng He for his proofreading and valuable feedback. A special acknowledgment goes to the
Michigan teachers who participated in our efficacy study.

ORCID
Tingting Li https://orcid.org/0000-0002-5692-2042
Emily Adah Miller https://orcid.org/0000-0003-3473-5729
REFERENCES
Adah Miller, E., Li, T., Bateman, K., Akgun, S., Makori, H., Codere, S., Richar, S., Simani, M. C., & Krajcik, J.
(2022). Adaptation principles to foster engagement and equity in project-based science learning. In Proceed-
ings of the 16th International Society of the Learning Sciences-ICLS 2022, pp. 1289-1292. International Society
of the Learning Sciences.
Adah Miller, E., Makori, H., Akgun, S., Miller, C., Li, T., & Codere, S. (2022). Including teachers in the social jus-
tice equation of project-based learning: A response to Lee & Grapin. Journal of Research in Science Teaching,
59(9), 1726–1732. https://doi.org/10.1002/tea.21805
Agarwal, P., & Sengupta-Irving, T. (2019). Integrating power to advance the study of connective and productive
disciplinary engagement in mathematics and science. Cognition and Instruction, 37(3), 349–366. https://doi.
org/10.1080/07370008.2019.1624544
Anderson, C. W., de los Santos, E. X., Bodbyl, S., Covitt, B. A., Edwards, K. D., Hancock, J. B., Lin, Q., Thomas,
C. M., Penuel, W. R., & Welch, M. M. (2018). Designing educational systems to support enactment of the
next generation science standards. Journal of Research in Science Teaching, 55(7), 1026–1052. https://doi.org/
10.1002/tea.21484
Baker, F. B. (2001). The basics of item response theory. http://ericae.net/irt/baker
Bang, M., Brown, B., Calabrese Barton, A., Rosebery, A. S., & Warren, B. (2017). Toward more equitable learning
in science. In C. V. Schwarz, C. Passmore, & B. J. Reiser (Eds.), Helping students make sense of the world
using next generation science and engineering practices (pp. 33–58). NSTA Press.
Barron, B. J., Schwartz, D. L., Vye, N. J., Moore, A., Petrosino, A., Zech, L., & Bransford, J. D. (1998). Doing with
understanding: Lessons from research on problem- and project-based learning. Journal of the Learning Sci-
ences, 7(3-4), 271–311.
Baumfalk, B., Bhattacharya, D., Vo, T., Forbes, C., Zangori, L., & Schwarz, C. (2019). Impact of model-based sci-
ence curriculum and instruction on elementary students' explanations for the hydrosphere. Journal of
Research in Science Teaching, 56(5), 570–597. https://doi.org/10.1002/tea.21514
Boaler, J. (2002). Learning from teaching: Exploring the relationship between reform curriculum and equity.
Journal for Research in Mathematics Education, 33(4), 239–258.
Bonefeld, M., & Dickhäuser, O. (2018). (Biased) grading of students' performance: Students' names, performance
level, and implicit attitudes. Frontiers in Psychology, 9, 481. https://doi.org/10.3389/fpsyg.2018.00481
Bransford, J. D., Brown, A. L., & Cocking, R. R. (1999). How people learn: Brain, mind, experience, and school.
National Academies Press.
Carr, P. B., & Steele, C. M. (2009). Stereotype threat and inflexible perseverance in problem solving. Journal of
Experimental Social Psychology, 45(4), 853–859. https://doi.org/10.1016/j.jesp.2009.03.003
Charara, J., Miller, E. A., & Krajcik, J. (2021). Knowledge in use: Designing for play in kindergarten science con-
texts. Journal of Leadership, Equity, and Research, 7(1), n1.
Clogg, C. C., Petkova, E., & Haritou, A. (1995). Statistical methods for comparing regression coefficients between
models. American Journal of Sociology, 100(5), 1261–1293.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement,
20(1), 37–46.
Craig, T. T., & Marshall, J. (2019). Effect of project-based learning on high school students' state-mandated, stan-
dardized math and science exam performance. Journal of Research in Science Teaching, 56(10), 1461–1488.
https://doi.org/10.1002/tea.21582
Cunningham, C. M., & Helms, J. V. (1998). Sociology of science as a means to a more authentic, inclusive science
education. Journal of Research in Science Teaching, 35(5), 483–499.
DeVellis, R. (2003). Scale development: Theory and applications. Sage.
Duschl, R. A. (2019). Learning progressions: Framing and designing coherent sequences for STEM education.
Disciplinary and Interdisciplinary Science Education Research, 1(1), 1–10.
Esposito, A. G., & Bauer, P. J. (2017). Going beyond the lesson: Self-generating new factual knowledge in the
classroom. Journal of Experimental Child Psychology, 153, 110–125.
Ferguson, R. F. (2003). Teachers’ perceptions and expectations and the Black–White test score gap. Urban Edu-
cation, 38(4), 460–507.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). Wiley.
Ford, M. J., & Forman, E. A. (2006). Chapter 1: Redefining disciplinary learning in classroom contexts. Review of
Research in Education, 30(1), 1–32. https://doi.org/10.3102/0091732X030001001
Fortus, D., Sutherland Adams, L. M., Krajcik, J., & Reiser, B. (2015). Assessing the role of curriculum coherence
in student learning about energy. Journal of Research in Science Teaching, 52(10), 1408–1425. https://doi.org/
10.1002/tea.21261
Fulmer, G. W., Tanas, J., & Weiss, K. A. (2018). The challenges of alignment for the next generation science stan-
dards. Journal of Research in Science Teaching, 55(7), 1076–1100. https://doi.org/10.1002/tea.21481
Gay, G. (2018). Culturally responsive teaching: Theory, research, and practice. Teachers College Press.
Gonzalez, N., Moll, L. C., & Amanti, C. (Eds.). (2006). Funds of knowledge: Theorizing practices in households,
communities, and classrooms. Routledge.
Gouvea, J. S., Jamshidi, A., & Passmore, C. (2014). Model-based reasoning: A framework for coordinating authen-
tic scientific practice with science learning. International Society of the Learning Sciences.
Greeno, J. G., Collins, A. M., & Resnick, L. B. (1996). Cognition and learning. In D. Berliner & R. Calfee (Eds.),
Handbook of Educational Psychology (pp. 15–46). Macmillan.
Haas, A., Januszyk, R., Grapin, S. E., Goggins, M., Llosa, L., & Lee, O. (2021). Developing instructional materials
aligned to the next generation science standards for all students, including English learners. Journal of Sci-
ence Teacher Education, 32(7), 735–756. https://doi.org/10.1080/1046560X.2020.1827190
Hadenfeldt, J. C., Neumann, K., Bernholt, S., Liu, X., & Parchmann, I. (2016). Students' progression in under-
standing the matter concept. Journal of Research in Science Teaching, 53(5), 683–708. https://doi.org/10.
1002/tea.21312
Han, S., Capraro, R., & Capraro, M. M. (2015). How science, technology, engineering, and mathematics (STEM)
project-based learning (PBL) affects high, middle, and low achievers differently: The impact of student fac-
tors on achievement. International Journal of Science and Mathematics Education, 13(5), 1089–1113. https://
doi.org/10.1007/s10763-014-9526-0
Hardy, I., Jonen, A., Möller, K., & Stern, E. (2006). Effects of instructional support within constructivist learning
environments for elementary school students' understanding of “floating and sinking”. Journal of Educa-
tional Psychology, 98(2), 307.
Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments
to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53–67. https://doi.org/10.
1111/emip.12253
Harris, C. J., Penuel, W. R., D'Angelo, C. M., DeBarger, A. H., Gallagher, L. P., Kennedy, C. A., Cheng, B. H., &
Krajcik, J. S. (2015). Impact of project-based curriculum materials on student learning in science: Results of
a randomized controlled trial. Journal of Research in Science Teaching, 52(10), 1362–1385. https://doi.org/10.
1002/tea.21263
He, P., Chen, I. C., Touitou, I., Bartz, K., Schneider, B., & Krajcik, J. (2023). Predicting student science achieve-
ment using post-unit assessment performances in a coherent high school chemistry project-based learning
system. Journal of Research in Science Teaching, 60(4), 724–760. https://doi.org/10.1002/tea.21815
He, P., Zhai, X., Shin, N., & Krajcik, J. (2023). Applying Rasch measurement to assess knowledge-in-use in sci-
ence education. In X. Liu & W. J. Boone (Eds.), Advances in applications of rasch measurement in science edu-
cation. contemporary trends and issues in science education (Vol. 57). Springer. https://doi.org/10.1007/978-3-
031-28776-3_13
He, P., Zheng, C., & Li, T. (2021). Development and validation of an instrument for measuring Chinese chemis-
try teachers' perceptions of pedagogical content knowledge for teaching chemistry core competencies. Chem-
istry Education Research and Practice, 22(2), 513–531. https://doi.org/10.1039/C9RP00286C
He, P., Zheng, C., & Li, T. (2022). Development and validation of an instrument for measuring Chinese chemis-
try teachers' perceived self-efficacy towards chemistry core competencies. International Journal of Science
and Mathematics Education, 20(7), 1337–1359. https://doi.org/10.1007/s10763-021-10216-8
Helle, L., Tynjälä, P., & Olkinuora, E. (2006). Project-based learning in post-secondary education–theory, practice
and rubber sling shots. Higher Education, 51(2), 287–314.
Hox, J. J., & Maas, C. J. (2001). The accuracy of multilevel structural equation modeling with pseudobalanced
groups and small samples. Structural Equation Modeling, 8(2), 157–174.
Jin, H., Mikeska, J. N., Hokayem, H., & Mavronikolas, E. (2019). Toward coherence in curriculum, instruction,
and assessment: A review of learning progression literature. Science Education, 103(5), 1206–1234. https://
doi.org/10.1002/sce.21525
Kalyuga, S. (2011). Cognitive load theory: How many types of load does it really need? Educational Psychology
Review, 23(1), 1–19. https://doi.org/10.1007/s10648-010-9150-7
Kang, E. J., Donovan, C., & McCarthy, M. J. (2018). Exploring elementary teachers' pedagogical content knowl-
edge and confidence in implementing the NGSS science and engineering practices. Journal of Science
Teacher Education, 29(1), 9–29. https://doi.org/10.1080/1046560X.2017.1415616
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). Guilford.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reli-
ability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
Krajcik, J. S., & Czerniak, C. M. (2018). Teaching science in elementary and middle school: A project-based learn-
ing approach. Routledge.
Krajcik, J. S., & Shin, N. (2023). Project-based learning. In R. K. Sawyer (Ed.), The Cambridge handbook of the
learning sciences (3rd ed.). Cambridge University Press.
Krajcik, J. S., Miller, E. C., & Chen, I. C. (2022). Using project-based learning to leverage culturally relevant ped-
agogy for science sensemaking in urban elementary classrooms. In International handbook of research on
multicultural science education (pp. 913–932). Springer.
Krajcik, J., & Schneider, B. (2021). Science education through multiple literacies: Project-based learning in elemen-
tary school. Teaching methods and materials. Harvard Education Press.
Krajcik, J., Schneider, B., Miller, E. A., Chen, I. C., Bradford, L., Baker, Q., Bartz, K., Miller, C., Li, T., Codere, S.,
& Peek-Brown, D. (2023). Assessing the effect of project-based learning on science learning in elementary
schools. American Educational Research Journal, 60(1), 70–102.
Kulgemeyer, C., & Schecker, H. (2014). Research on educational standards in German science education—
Towards a model of student competences. EURASIA Journal of Mathematics, Science and Technology Educa-
tion, 10(4), 257–269. https://doi.org/10.12973/eurasia.2014.1081a
Ladson-Billings, G. (1995). Toward a theory of culturally relevant pedagogy. American Educational Research
Journal, 32(3), 465–491. https://doi.org/10.3102/00028312032003465
Lee, O., Quinn, H., & Valdés, G. (2013). Science and language for English language learners in relation to Next
Generation Science Standards and with implications for Common Core State Standards for English language
arts and mathematics. Educational Researcher, 42(4), 223–233. https://doi.org/10.3102/0013189X13480524
Li, T. (2021). Developing deep learning through systems thinking. In J. Krajcik & B. Schneider (Eds.), Science
education through multiple literacies: Project-based learning in elementary school (pp. 79–94). Harvard Educa-
tion Press https://hep.gse.harvard.edu/9781682536629/science-education-through-multiple-literacies/
Li, T., Miller, E. A., & Krajcik, J. S. (2023). Theory into practice: Supporting knowledge-in-use through project-
based learning. In G. Bansal & U. Ramnarain (Eds), Fostering science teaching and learning for the fourth
industrial revolution and beyond (pp. 1–35). IGI Global. https://doi.org/10.4018/978-1-6684-6932-3.ch001
Li, T., Miller, E., Chen, I. C., Bartz, K., Codere, S., & Krajcik, J. (2021). The relationship between teacher's sup-
port of literacy development and elementary students' modelling proficiency in project-based learning sci-
ence classrooms. Education 3-13, 49(3), 302–316. https://doi.org/10.1080/03004279.2020.1854959
MacDonald, R., Miller, E., & Lord, S. (2017). Doing and talking science: Engaging ELs in the discourse of the sci-
ence and engineering practices. In A. Oliveira & M. Weinburgh (Eds.), Science teacher preparation in content-based second language acquisition. ASTE Series in Science Education. Springer, Cham. https://doi.org/10.
1007/978-3-319-43516-9_10
Mensah, F. M., & Chen, J. L. (2022). Elementary multicultural science teacher education. In International hand-
book of research on multicultural science education (pp. 1–39). Springer International Publishing.
Miller, E. C., & Krajcik, J. S. (2019). Promoting deep learning through project-based learning: A design problem.
Disciplinary and Interdisciplinary Science Education Research,1(1), 1–10. https://doi.org/10.1186/s43031-019-
0009-6
Miller, E. C., Reigh, E., Berland, L., & Krajcik, J. (2021). Supporting equity in virtual science instruction through
project-based learning: Opportunities and challenges in the era of COVID-19. Journal of Science Teacher
Education, 32(6), 642–663. https://doi.org/10.1080/1046560X.2021.1873549
Miller, E. C., Severance, S., & Krajcik, J. (2021). Motivating teaching, sustaining change in practice: Design prin-
ciples for teacher learning in project-based learning contexts. Journal of Science Teacher Education, 32(7),
757–779. https://doi.org/10.1080/1046560X.2020.1864099
Miller, E., Li, T., Chen, I., & Codere, S. (2023). Using flexible thinking to assess student sensemaking of phenom-
ena in project-based learning. In R. Tierney, F. Rizvi, & K. Ercikan (Eds.), International encyclopedia of edu-
cation (4th ed., pp. 444–457). Elsevier. https://doi.org/10.1016/B978-0-12-818630-5.13047-7
Mislevy, R., & Haertel, G. (2006). Implications of evidence-centered design for educational testing. Educational
Measurement: Issues and Practice, 25(4), 6–20. https://doi.org/10.1111/j.1745-3992.2006.00075.x
Mupira, P., & Ramnarain, U. (2018). The effect of inquiry-based learning on the achievement goal-orientation of
grade 10 physical sciences learners at township schools in South Africa. Journal of Research in Science Teach-
ing, 55(6), 810–825. https://doi.org/10.1002/tea.21440
Muthén, B., & Muthén, L. (2017). Mplus. In Handbook of item response theory (pp. 507–518). Chapman and
Hall/CRC.
National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6–12:
Investigation and design at the center. National Academies Press.
National Academies of Sciences, Engineering, and Medicine. (2022). Science and engineering in preschool through
elementary grades: The brilliance of children and the strengths of educators. The National Academies Press. https://doi.org/10.17226/26215
National Research Council. (2006). Systems for state science assessment. National Academies Press.
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and
core ideas. National Academies Press.
National Research Council. (2014). Developing assessments for the next generation science standards. National
Academies Press.
NGSS Lead States. (2013). Next generation science standards: For states, by states. National Academies Press.
Nordine, J., Sorge, S., Delen, I., Evans, R., Juuti, K., Lavonen, J., Nilsson, P., Ropohl, M., & Stadler, M. (2021).
Promoting coherent science instruction through coherent science teacher education: A model framework for
program design. Journal of Science Teacher Education, 32(8), 911–933. https://doi.org/10.1080/1046560X.
2021.1902631
Organization for Economic Cooperation and Development. (2019). PISA 2018 assessment and analytical frame-
work. OECD Publishing.
Paris, D. (2012). Culturally sustaining pedagogy: A needed change in stance, terminology, and practice. Educa-
tional Researcher, 41(3), 93–97. https://doi.org/10.3102/0013189X12441244
Paternoster, R., Brame, R., Mazerolle, P., & Piquero, A. (1998). Using the correct statistical test for the equality of
regression coefficients. Criminology, 36(4), 859–866. https://doi.org/10.1111/j.1745-9125.1998.tb01268.x
Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge
and skills in the 21st century. National Academies Press.
Penuel, W. R., Harris, C. J., & DeBarger, A. H. (2015). Implementing the Next Generation Science Standards. Phi
Delta Kappan, 96(6), 45–49. https://doi.org/10.1177/0031721715575299
People's Republic of China Ministry of Education. (2014). Opinions on deepening curriculum reform and
implementing the fundamental tasks of Lide-Shuren. http://www.moe.gov.cn/srcsite/A26/jcj_kcjcgh/201404/
t20140408_167226.html
Perie, M., Marion, S., Gong, B., & Wurtzel, J. (2007). The role of interim assessments in a comprehensive assess-
ment system: A policy brief. Aspen Institute.
Perkins, D. N., & Salomon, G. (1992). Transfer of learning. International Encyclopedia of Education, 2, 6452–
6457.
Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological
Methods, 2(2), 173.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods
(Vol. 1). Sage.
Reiner, M., Slotta, J. D., Chi, M. T., & Resnick, L. B. (2000). Naive physics reasoning: A commitment to
substance-based conceptions. Cognition and Instruction, 18(1), 1–34. https://doi.org/10.1207/
S1532690XCI1801_01
Reiser, B. J. (2014, April). Designing coherent storylines aligned with NGSS for the K-12 classroom. In National
Science Education Leadership Association Meeting, Boston, MA: NSELA Conference.
Rose, D. H., & Meyer, A. (2006). A practical reader in universal design for learning. Harvard Education Press.
Rose, D. H., Meyer, A., & Hitchcock, C. (Eds.). (2005). The universally designed classroom: Accessible curriculum
and digital technologies. Harvard Education Press.
Roseman, J. E., Stern, L., & Koppal, M. (2010). A method for analyzing the coherence of high school biology text-
books. Journal of Research in Science Teaching, 47(1), 47–70. https://doi.org/10.1002/tea.20305
Ruiz-Primo, M. A., Shavelson, R. J., Hamilton, L., & Klein, S. (2002). On the evaluation of systemic science edu-
cation reform: Searching for instructional sensitivity. Journal of Research in Science Teaching, 39(5), 369–
393. https://doi.org/10.1002/tea.10027
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17(4), 2.
Sanchez Tapia, I., Krajcik, J., & Reiser, B. (2018). “We do not know what is the real story anymore”: Curricular contextualization principles that support indigenous students in understanding natural selection. Journal of Research in Science Teaching, 55(3), 348–376. https://doi.org/10.1002/tea.21422
Schmidt, W. H., Wang, H. C., & McKnight, C. C. (2005). Curriculum coherence: An examination of US mathe-
matics and science content standards from an international perspective. Journal of Curriculum Studies, 37
(5), 525–559. https://doi.org/10.1080/0022027042000294682
Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Bradford, L., Chen, I.-C., Baker, Q., Touitou,
I., Peek-Brown, D., Dezendorf, R. M., Maestrales, S., & Bartz, K. (2022). Improving science achievement—Is
it possible? Evaluating the efficacy of a high school chemistry and physics project-based learning interven-
tion. Educational Researcher, 51(2), 109–121. https://doi.org/10.3102/0013189X211067742
Schreiber, L. M., & Valle, B. E. (2013). Social constructivist teaching strategies in the small group classroom.
Small Group Research, 44(4), 395–411.
Singley, M. K., & Anderson, J. R. (1989). The transfer of cognitive skill (Vol. No. 9). Harvard University Press.
Snijders, T. A., & Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel model-
ing. Sage.
Spiro, R. J., Feltovich, P. J., Gaunt, A., Hu, Y., Klautke, H., Cheng, C., Clemente, I., Leahy, S., & Ward, P. (2019).
Cognitive flexibility theory and the accelerated development of adaptive readiness and adaptive response to
novelty. In P. Ward, J. M. Schraagen, J. Gore, & E. M. Rother (Eds.), The Oxford handbook of expertise.
Oxford University Press.
Sweller, J. (2011). Cognitive load theory. In J. P. Mestre & B. H. Ross (Eds.), Psychology of learning and motiva-
tion (Vol. 55, pp. 37–76). Academic Press.
Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach's alpha. International Journal of Medical Educa-
tion, 2, 53–55.
Trilling, B., & Fadel, C. (2009). 21st century skills: Learning for life in our times. John Wiley & Sons.
Van Lier, L. (2001). Constraints and resources in classroom talk: Issues of equality and symmetry. In English language teaching in its social context (pp. 90–107).
Vosniadou, S., Ioannides, C., Dimitrakopoulou, A., & Papademetriou, E. (2001). Designing learning environ-
ments to promote conceptual change in science. Learning and Instruction, 11(4-5), 381–419. https://doi.org/
10.1016/S0959-4752(00)00038-4
Vygotsky, L. (1978). Interaction between learning and development. Readings on the Development of Children, 23
(3), 34–41.
Walker, J. P., & Sampson, V. (2013). Learning to argue and arguing to learn: Argument-driven inquiry as a way
to help undergraduate chemistry students learn how to construct arguments and engage in argumentation
during a laboratory course. Journal of Research in Science Teaching, 50(5), 561–596. https://doi.org/10.1002/
tea.21082
Ward, P., Gore, J., Hutton, R., Conway, G. E., & Hoffman, R. R. (2018). Adaptive skill as the conditio sine qua
non of expertise. Journal of Applied Research in Memory and Cognition, 7(1), 35–50. https://doi.org/10.1016/
j.jarmac.2018.01.009
Wolf, M. K., & Leon, S. (2009). An investigation of the language demands in content assessments for English lan-
guage learners. Educational Assessment, 14(3-4), 139–159. https://doi.org/10.1080/10627190903425883
Zhao, Y., & Wang, L. (2022). A case study of student development across project-based learning units in middle
school chemistry. Disciplinary and Interdisciplinary Science Education Research, 4(1), 1–20. https://doi.org/
10.1186/s43031-021-00045-8

SUPPORTING INFORMATION


Additional supporting information can be found online in the Supporting Information section
at the end of this article.

How to cite this article: Li, T., Chen, I.-C., Adah Miller, E., Miller, C. S., Schneider, B.,
& Krajcik, J. (2023). The relationships between elementary students' knowledge-in-use
performance and their science achievement. Journal of Research in Science Teaching,
1–61. https://doi.org/10.1002/tea.21900
APPENDIX 1: DESCRIPTION OF THE PEs FOR THE ML-PBL UNITS

Primary PEs addressed in Unit 1

3-LS4-1 Analyze and interpret data from fossils to provide evidence of the organisms and the
environments in which they lived long ago.
3-LS4-2 Use evidence to construct an explanation for how the variations in characteristics among
individuals of the same species may provide advantages in surviving, finding mates, and
reproducing.
3-LS4-3 Construct an argument with evidence that in a particular habitat some organisms can survive
well, some survive less well, and some cannot survive at all.
3-LS4-4 Make a claim about the merit of a solution to a problem caused when the environment changes
and the types of plants and animals that live there may change.
3-LS3-2 Use evidence to support the explanation that traits can be influenced by the environment.
3-LS1-1 Develop models to describe that organisms have unique and diverse life cycles, but all have in
common birth, growth, reproduction, and death.

Primary PEs addressed in Unit 2

3-PS2-1 Plan and conduct an investigation to provide evidence of the effects of balanced and unbalanced
forces on the motion of an object.
3-PS2-2 Make observations and/or measurements of an object's motion to provide evidence that a pattern
can be used to predict future motion.
3-PS2-3 Ask questions to determine cause and effect relationships of electric or magnetic interactions
between two objects not in contact with each other.
3-PS2-4 Define a simple design problem that can be solved by applying scientific ideas about magnets.

Primary PEs addressed in Unit 3

3-LS3-2 Use evidence to support the explanation that traits can be influenced by the environment.
3-LS3-1 Analyze and interpret data to provide evidence that plants and animals have traits inherited
from parents and the variation of these traits exists in a group of similar organisms.
3-LS2-1 Construct an argument that some animals form groups that help members survive.
3-5-ETS1-1 Define a simple design problem that can be solved through the development of an object, tool,
process, or system and includes several criteria for success and constraints on materials,
time, or cost.
Primary PEs addressed in Unit 4

3-LS1-1 Develop models to describe that organisms have unique and diverse life cycles but all have in
common birth, growth, reproduction, and death.
3-LS3-1 Analyze and interpret data to provide evidence that plants and animals have traits inherited
from parents and that variation of these traits exists in a group of similar organisms.
3-LS3-2 Use evidence to support the explanation that traits can be influenced by the environment
(https://www.nextgenscience.org/pe/3-ls3-2-heredity-inheritance-and-variation-traits).
3-LS4-3 Construct an argument with evidence that in a particular habitat some organisms can survive
well, some survive less well, and some cannot survive at all (https://www.nextgenscience.
org/pe/3-ls4-4-biological-evolution-unity-and-diversity).
3-LS4-4 Make a claim about the merit of a solution to a problem caused when the environment
changes and the types of plants and animals that live there may change (https://www.
nextgenscience.org/pe/3-ls4-4-biological-evolution-unity-and-diversity).
3-ESS2-1 Represent data in tables and graphical displays to describe typical weather conditions expected
during a particular season (https://www.nextgenscience.org/pe/3-ess2-1-earths-systems).
3-ESS2-2 Obtain and combine information to describe climates in different regions of the world
(https://www.nextgenscience.org/pe/3-ess2-2-earths-systems).
3-ESS3-1 Make a claim about the merit of a design solution that reduces the impacts of a weather-
related hazard (https://www.nextgenscience.org/pe/3-ess3-1-earth-and-human-activity).
3-5-ETS1-1 Define a simple design problem reflecting a need or a want that includes specific criteria for
success and constraints on materials, time, or cost.
3-5-ETS1-2 Generate and compare multiple possible solutions to a problem based on how well each is
likely to meet the criteria and constraints of the problem.

APPENDIX 2: PERCENTAGE OF MISSING CASES ACROSS THREE POST-UNIT ASSESSMENTS

Unit 1 Unit 2 Unit 3


Students with valid post-unit assessment 681 872 957
Missing in post-unit assessment 384 193 108
% of Missing in post-unit assessment 36.06% 18.12% 10.14%
Missing in post-unit assessment (Region 1) 52.34% 55.44% 40.74%
Missing in post-unit assessment (Region 2) 14.58% 1.04% 7.41%
Missing in post-unit assessment (Region 3) 10.94% 26.94% 18.52%
Missing in post-unit assessment (Region 4) 22.14% 16.58% 33.33%

To preserve the integrity of our sample, we created three “missing flag” variables, one for
each post-unit assessment, to identify the impact of the missing data in our subsequent analytic
models. Appendix 2 details the percentage of missing cases for each post-unit assessment with
respect to the end-of-year summative test. It reveals that 36% of cases are missing in Unit 1, 18%
in Unit 2, and 10% in Unit 3.
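A minimal sketch of this missing-flag bookkeeping in pandas is shown below; the column names (unit1_z, unit2_z, unit3_z) are hypothetical stand-ins for the three post-unit assessment z-scores, and the zero-fill convention is an illustrative assumption rather than the study's documented procedure.

```python
# Hypothetical sketch of per-unit "missing flag" construction; column names
# are illustrative, not the study's actual variable names.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit1_z": [0.4, np.nan, -1.2],
    "unit2_z": [np.nan, 0.1, 0.8],
    "unit3_z": [1.0, -0.3, np.nan],
})

for col in ("unit1_z", "unit2_z", "unit3_z"):
    df[f"miss_{col}"] = df[col].isna().astype(int)  # 1 = assessment missing
    # Zero-fill keeps the case in the sample; 0 is the z-score mean, so the
    # flag absorbs mean differences between observed and missing cases.
    # (An assumption for illustration, not necessarily the authors' choice.)
    df[col] = df[col].fillna(0.0)

print(df)
```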
APPENDIX 3: PERCENTAGE OF MISSING CASES PER TEACHER AMONG THREE POST-UNIT ASSESSMENTS

Teacher Region Post unit % of missing


Unit 1 % of missing
Guerra Genesee (Region 2) 15%
McRaith Detroit (Region 1) 9%
Granger Detroit (Region 1) 7%
Valerie Reed Detroit (Region 1) 6%
Byrd Detroit (Region 1) 10%
Unit 2 % of missing
Palma Kent (Region 3) 12%
Byrd Detroit (Region 1) 12%
King Detroit (Region 1) 8%
Valerie Reed Detroit (Region 1) 9%
Chuby Detroit (Region 1) 7%
Unit 3 % of missing
Byrd Detroit (Region 1) 21%
Berkshire MI Other (Region 4) 20%
Guerra Genesee (Region 2) 6%
Chuby Detroit (Region 1) 5%

A regional analysis of the data shows a notable concentration of missingness in Region 1, a


trend consistent across all three post-unit assessments. This can be seen in Appendix 3, which
also indicates a higher incidence of missingness among teachers in this region. There are sev-
eral factors contributing to this regional discrepancy. First, several teachers in Region 1 began
ML-PBL teaching later than their peers in other regions, which led to delays in assessment. Sec-
ond, there were instances of first-time PBL teachers who, while acclimating to the new curricu-
lum materials, overlooked the administration of the post-unit assessments. Lastly, harsh winter
conditions during the first and second weeks of January and February disrupted the teachers'
schedules, leading to student absences, incomplete unit teaching, and thus, missing assess-
ments.
APPENDIX 4: ONE-PARAMETER LOGISTIC (1PL) ITEM RESPONSE ANALYSIS IN END-OF-YEAR SUMMATIVE TEST

Coefficient SE z p-value
Form A
Discriminant 0.706 0.048 14.72 0.000
Difficulty
Question 7 0.145 0.125 1.17 0.244
Question 11 0.363 0.127 2.87 0.004
Question 9 0.363 0.127 2.87 0.004
Question 10 0.810 0.136 5.96 0.000
Question 1 0.904 0.139 6.52 0.000
Question 6 0.936 0.140 6.71 0.000
Question 3 1.322 0.154 8.60 0.000
Question 8 1.787 0.176 10.18 0.000
Question 2 2.944 0.247 11.89 0.000
Question 4 4.383 0.369 11.88 0.000
Form B
Discriminant 1.412 0.097 14.49 0.000
Difficulty
Question 3 0.878 0.093 9.47 0.000
Question 4 0.084 0.079 1.07 0.000
Question 1 0.027 0.078 0.35 0.729
Question 2 0.388 0.081 4.77 0.000
Form C
Discriminant 1.12 0.073 15.26 0.000
Difficulty
Question 1 0.445 0.094 4.72 0.000
Question 4 0.696 0.099 7.00 0.000
Question 2 0.761 0.101 7.52 0.000
Question 6 1.498 0.129 11.63 0.000
Question 5 1.598 0.134 11.96 0.000
Question 3 1.814 0.145 12.51 0.000
Question 7 3.419 0.271 12.60 0.000

Note: The data source comes from the ML-PBL technical report: https://mlpbl.open3d.science/techreport (Table 13, IRT for Each Form).
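To make the mechanics behind such a table concrete, below is a minimal 1PL sketch on simulated data. It is not the authors' estimation code: the shared discrimination and per-item difficulties are fit by maximizing the joint likelihood with the simulated abilities treated as known, a simplification of the marginal maximum likelihood estimation a dedicated IRT package would use.

```python
# Minimal 1PL sketch: P(correct) = 1 / (1 + exp(-a * (theta - b_i))), with a
# common discrimination a and per-item difficulties b_i. Simulated data only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_students, n_items = 500, 10
theta = rng.normal(size=n_students)          # latent abilities (treated as known)
b_true = np.linspace(-1.5, 2.5, n_items)     # item difficulties
a_true = 0.9                                 # shared discrimination (1PL)
p_true = 1 / (1 + np.exp(-a_true * (theta[:, None] - b_true[None, :])))
x = rng.binomial(1, p_true)                  # 0/1 response matrix

def neg_loglik(params):
    a, b = params[0], params[1:]
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b[None, :])))
    p = np.clip(p, 1e-9, 1 - 1e-9)           # guard against log(0)
    return -(x * np.log(p) + (1 - x) * np.log(1 - p)).sum()

fit = minimize(neg_loglik, x0=np.concatenate([[1.0], np.zeros(n_items)]))
print("discrimination ~", round(fit.x[0], 2))
print("difficulties ~", np.round(fit.x[1:], 2))
```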
APPENDIX 5: END-OF-YEAR SUMMATIVE TEST FORMS BY NUMBER OF STUDENTS BY CLASSROOMS

Treatment classroom (n = 53)
Mean SD
Classroom average student took Form B 7.19 1.99
Classroom average student took Form A 7.30 1.56
Classroom average student took Form C 7.09 2.02
Percent of student took Form A 0.34 0.04
Percent of student took Form B 0.33 0.04
Percent of student took Form C 0.33 0.04

APPENDIX 6: GRADED RESPONSE MODEL RESULTS BY POST-UNIT ASSESSMENT

(a) (b1) (b2) (b3)


sq_v1 0.87 3.41 1.48 0.39
sq_v2 1.11 2.27 0.71 1.89
sq_v3 1.17 4.08 1.50 3.40
sq_v4 1.16 3.54 1.74 0.17
sq_v5 0.94 3.13 0.66 0.78
sq_v6 1.49 2.26 0.12 1.00
sq_v7 1.43 1.80 0.86 0.25
sq_v8 1.48 1.84 0.51 0.87
sq_v9 1.49 1.67 1.15 2.35
Range 0.87–1.49 4.08 to 1.67 1.74 to 1.15 0.17–3.40
Unit 1 (Cronbach's alpha = 0.71; marginal reliability = 0.77)
ty_v1 0.41 1.73 0.77 4.86
ty_v2 1.30 0.74 1.93 3.21
ty_v3 1.87 0.46 0.93 1.92
ty_v4 4.72 0.00 2.02 2.34
ty_v5 6.56 0.12 2.05 2.44
ty_v6 5.91 0.10 1.76 2.03
ty_v7 0.81 3.11 1.53 1.12
ty_v8 0.88 0.92 0.72 5.16
ty_v9 1.13 2.60 2.00 0.25
ty_v10 1.22 2.19 1.64 0.26
ty_v11 1.33 2.17 1.03 1.57
ty_v12 1.17 1.88 0.28 1.35
ty_v13 0.99 0.45 1.26 5.03
ty_v14 1.00 2.22 1.30 3.21
ty_v15 1.05 2.05 1.39 3.46
Range 0.41–6.56 3.11 to 0.12 2.00 to 2.05 0.25 to 5.16
Unit 2 (Cronbach's alpha = 0.84; marginal reliability = 0.88)
bd_v1 1.18 1.19 1.18 3.92
bd_v2 1.42 1.25 0.80 3.84
bd_v3 1.99 2.53 1.04 1.76
bd_v4 1.55 2.89 1.16 2.51
bd_v5 1.28 2.49 1.40 0.12
bd_v6 0.87 2.72 0.71 1.16
bd_v7 1.17 1.96 0.42 3.17
bd_v8 1.39 1.91 0.00 2.11
bd_v9 1.40 1.79 0.44 1.20
Range 0.87–1.99 2.89 to 1.19 1.40 to 1.18 0.12 to 3.92
Unit 3 (Cronbach's alpha = 0.76; marginal reliability = 0.79)

Note: (a) is the item discrimination parameter and (b1)–(b3) are the item difficulty parameters (item thresholds). For example, an item with four response categories has three estimated thresholds. The marginal reliability of a test assumes the exact probability density function of the latent trait distribution, g(θ).
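As a concrete illustration of how the (a) and (b1)–(b3) parameters map onto response probabilities, the sketch below computes Samejima-style graded response category probabilities; the parameter values are invented for illustration, not the calibrated item parameters above.

```python
# Sketch of graded response model category probabilities for one item with
# discrimination a and ordered thresholds b1 < b2 < b3 (four score levels).
import numpy as np

def grm_category_probs(theta, a, b):
    # P*(X >= k | theta) at each threshold, with boundary values 1 and 0
    p_star = 1 / (1 + np.exp(-a * (theta[:, None] - np.asarray(b)[None, :])))
    bounds = np.hstack([np.ones((theta.size, 1)), p_star,
                        np.zeros((theta.size, 1))])
    return bounds[:, :-1] - bounds[:, 1:]    # P(X = k) for k = 0..3

theta = np.array([-2.0, 0.0, 2.0])           # low, average, high ability
probs = grm_category_probs(theta, a=1.3, b=[-1.0, 0.2, 1.5])
print(np.round(probs, 3))                    # each row sums to 1
```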

Cronbach's alpha, measuring the internal consistency of items within each unit, ranged
from 0.71 to 0.84. A high consistency is indicated by values approaching 1.00 (DeVellis, 2003;
Tavakol & Dennick, 2011).
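For reference, Cronbach's alpha can be computed directly from an item-score matrix, as in the sketch below; the simulated data illustrate the statistic itself, not the study's scoring pipeline.

```python
# Standard Cronbach's alpha: (k / (k - 1)) * (1 - sum(item vars) / var(total)).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]                            # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
common = rng.normal(size=(200, 1))                 # shared trait
scores = common + 0.8 * rng.normal(size=(200, 9))  # nine correlated items
print(round(cronbach_alpha(scores), 2))
```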

APPENDIX 7: MARGINAL RELIABILITY OF POST-UNIT ASSESSMENT BY UNITS

FIGURE A1 Marginal reliability of Unit 1.


FIGURE A2 Marginal reliability of Unit 2.

FIGURE A3 Marginal reliability of Unit 3.

Figures A1–A3 show the curves for both scale information and conditional standard errors obtained through mathematical transformation. The three units' score estimates are most reliable in the −3 to +3 theta range. Appendix 7 shows the marginal reliability for each of the unit assessments, which reflects the exact probability density function of the latent trait distribution for students who took the post-unit assessments.
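As a hedged illustration of how such a marginal reliability can be computed, the sketch below integrates the conditional error variance over a standard-normal g(θ), using an invented 1PL information function; the item parameters are assumptions, not the paper's calibrated values.

```python
# Marginal reliability sketch under g(theta) = N(0, 1):
# rho = 1 - E_g[SE^2(theta)], with var(theta) = 1 and SE^2(theta) = 1 / I(theta).
import numpy as np
from scipy.stats import norm

def test_information(theta, a, b):
    # Fisher information of a 1PL test: sum over items of a^2 * p * (1 - p)
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b[None, :])))
    return (a**2 * p * (1 - p)).sum(axis=1)

theta = np.linspace(-6, 6, 2001)
dtheta = theta[1] - theta[0]
g = norm.pdf(theta)                                   # standard-normal density
info = test_information(theta, a=1.2, b=np.linspace(-2, 2, 12))
expected_se2 = np.sum((1 / info) * g) * dtheta        # E_g[SE^2(theta)]
print("marginal reliability ~", round(1 - expected_se2, 3))
```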

APPENDIX 8: HLM MODELS

RQ1. Considering students' prior unit learning experiences, does their performance on
post-unit assessments cumulatively predict their science achievement on the end-
of-year summative test? By “considering,” we refer to the predictive effect of unit learning, as measured by the post-unit assessments, on the end-of-year summative test while accounting for prior unit learning. Due to the inter-unit coherence design principles, we hypothesize that the more units treatment students experienced, the higher
students would score on the end-of-year summative test, reflecting improved science
achievement. This hypothesis seeks empirical evidence for the effectiveness of incor-
porating inter-unit and learning goal coherence principles when developing learning
systems.

Table 8 (Model 2):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2j}(\text{post-unit 1 assessment z-score})_{ij} \\
& + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} \\
& + \beta_{10j}(\text{disadvantaged student})_{j} + \beta_{11j}(\text{ELL student})_{j} + \beta_{12j}(\text{N of students in classroom})_{j} \\
& + \beta_{13j}(\text{whether teach a focal section})_{j} + \beta_{14\text{–}16,j}(\text{regions})_{j} \\
& + \beta_{17j}(\text{missing flag in post-unit 1 assessment})_{ij} + \varepsilon_{ij}
\end{aligned}
\tag{1a}
$$
Table 8 (Model 3):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2j}(\text{post-unit 2 assessment z-score})_{ij} \\
& + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} \\
& + \beta_{10j}(\text{disadvantaged student})_{j} + \beta_{11j}(\text{ELL student})_{j} + \beta_{12j}(\text{N of students in classroom})_{j} \\
& + \beta_{13j}(\text{whether teach a focal section})_{j} + \beta_{14\text{–}16,j}(\text{regions})_{j} \\
& + \beta_{17j}(\text{missing flag in post-unit 2 assessment})_{ij} + \varepsilon_{ij}
\end{aligned}
\tag{1b}
$$
Table 8 (Model 4):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2j}(\text{post-unit 3 assessment z-score})_{ij} \\
& + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} \\
& + \beta_{10j}(\text{disadvantaged student})_{j} + \beta_{11j}(\text{ELL student})_{j} + \beta_{12j}(\text{N of students in classroom})_{j} \\
& + \beta_{13j}(\text{whether teach a focal section})_{j} + \beta_{14\text{–}16,j}(\text{regions})_{j} \\
& + \beta_{17j}(\text{missing flag in post-unit 3 assessment})_{ij} + \varepsilon_{ij}
\end{aligned}
\tag{1c}
$$

where $Y_{ij}$ represents the outcome of interest (i.e., the end-of-year summative test z-score) for student $i$ in classroom $j$. $(\text{PRETEST})_{ij}$ is a continuous variable, standardized into a z-score, representing an individual student's initial level of reading achievement before participating in the ML-PBL program. $(\text{post-unit assessment z-score})_{ij}$ is the key predictor in this study. $(\text{form in end-of-year summative test})_{ij}$ comprises two dummy variables indicating which form of the
end-of-year summative test each student received. $(\text{Female})_{ij}$ is a dummy indicator for female students. $(\text{race group})_{ij}$ comprises four dummy variables representing racial/ethnic group statuses. $(\text{N of students in classroom})_{j}$ is a continuous variable representing the average number of students in the classroom. $(\text{whether teach a focal section})_{j}$ is a dummy variable indicating whether the classroom serves as a focal section. $(\text{regions})_{j}$ comprises three dummy variables for the region of the school location. $(\text{missing flags in post-unit assessment})_{ij}$ comprises three dummy variables representing the missingness in each unit assessment among the treatment sample.
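A minimal sketch of such a two-level random-intercept model, fit with statsmodels' MixedLM on simulated data, appears below. The column names (summative_z, pretest_z, unit1_z, female, miss_unit1, classroom_id) and the abbreviated control set are illustrative assumptions rather than the study's actual specification; the Table 9 models would simply add the unit 2 and unit 3 z-scores and their missing flags to the same formula.

```python
# Two-level HLM sketch: students (level 1) nested in classrooms (level 2),
# with a classroom random intercept. Simulated data, abbreviated controls.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({
    "classroom_id": rng.integers(0, 30, n),
    "pretest_z": rng.normal(size=n),
    "unit1_z": rng.normal(size=n),
    "female": rng.integers(0, 2, n),
    "miss_unit1": rng.integers(0, 2, n),
})
u = rng.normal(scale=0.4, size=30)                  # classroom intercepts
df["summative_z"] = (0.3 * df.pretest_z + 0.4 * df.unit1_z
                     + u[df.classroom_id] + rng.normal(scale=0.8, size=n))

model = smf.mixedlm("summative_z ~ pretest_z + unit1_z + female + miss_unit1",
                    data=df, groups=df["classroom_id"])
print(model.fit().summary())
```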

RQ2. After controlling for students' prior unit learning experiences, does student
performance on post-unit assessments predict their science achievement on the end-
of-year summative test? By “controlling,” we refer to the predictive effect of each
specific unit on the summative science achievement, excluding students' prior learn-
ing experience. Considering that each unit was designed to adhere to intra-unit and
learning goal coherence design principles, we argue that students' science learning
in each unit, as measured by post-unit assessments, should predict their perfor-
mance on the end-of-year summative test. This RQ aims to deepen our understand-
ing of the application of intra-unit coherence and learning goal coherence principles
in the development of coherent units and assessments aligned with the NGSS.

Table 9 (Model 3):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2aj}(\text{post-unit 1 assessment z-score})_{ij} \\
& + \beta_{2bj}(\text{post-unit 2 assessment z-score})_{ij} + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} \\
& + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} + \beta_{10j}(\text{disadvantaged student})_{j} \\
& + \beta_{11j}(\text{ELL student})_{j} + \beta_{12j}(\text{N of students in classroom})_{j} \\
& + \beta_{13j}(\text{whether teach a focal section})_{j} + \beta_{14\text{–}16,j}(\text{regions})_{j} \\
& + \beta_{17j}(\text{missing flag in post-unit 1 assessment})_{ij} + \beta_{18j}(\text{missing flag in post-unit 2 assessment})_{ij} + \varepsilon_{ij}
\end{aligned}
$$

Table 9 (Model 4):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2aj}(\text{post-unit 1 assessment z-score})_{ij} \\
& + \beta_{2bj}(\text{post-unit 2 assessment z-score})_{ij} + \beta_{2cj}(\text{post-unit 3 assessment z-score})_{ij} \\
& + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} \\
& + \beta_{10j}(\text{disadvantaged student})_{j} + \beta_{11j}(\text{ELL student})_{j} + \beta_{12j}(\text{N of students in classroom})_{j} \\
& + \beta_{13j}(\text{whether teach a focal section})_{j} + \beta_{14\text{–}16,j}(\text{regions})_{j} \\
& + \beta_{17j}(\text{missing flag in post-unit 1 assessment})_{ij} + \beta_{18j}(\text{missing flag in post-unit 2 assessment})_{ij} \\
& + \beta_{19j}(\text{missing flag in post-unit 3 assessment})_{ij} + \varepsilon_{ij}
\end{aligned}
$$
RQ3. Do the relationships between student performance on post-unit assessments


and the end-of-year summative test vary by gender, race, socio-economic status, or
ELL status? Given the equity-oriented design principles integrated into our curricu-
lum (Krajcik & Schneider, 2021), we assume that our learning system offers equita-
ble learning opportunities for all students. Thus, we further explore whether our
learning system effectively supports the science learning of students from underrep-
resented demographic groups and examine how students from these groups respond
to the coherent PBL design.

Table 10 (Unit 1):

$$
\begin{aligned}
Y_{ij}(\text{end-of-year summative test z-score}) ={} & \beta_{0j} + \beta_{1j}(\text{PRETEST})_{ij} + \beta_{2j}(\text{post-unit 1 assessment z-score})_{ij} \\
& + \beta_{3\text{–}4,j}(\text{form in end-of-year summative test})_{ij} + \beta_{5j}(\text{Female})_{ij} + \beta_{6\text{–}9,j}(\text{race group})_{ij} \\
& + \beta_{10j}(\text{N of students in classroom})_{j} + \beta_{11j}(\text{whether teach a focal section})_{j} + \beta_{12\text{–}14,j}(\text{regions})_{j} \\
& + \beta_{15\text{–}17,j}(\text{missing flags in post-unit 1 assessment})_{ij} \\
& + \beta_{18j}(\text{Female} \times \text{post-unit assessment z-score})_{ij} + \varepsilon_{ij}
\end{aligned}
$$

where $Y_{ij}$, $(\text{PRETEST})_{ij}$, and the remaining variables are defined as in the models above. An interaction term is included in the final model when answering RQ3; it captures the effect of a constructed variable formed by multiplying two variables together: the post-unit assessment z-score (the predictor of interest) and a student demographic indicator (e.g., gender, SES status, ELL status, race/ethnicity, or region).
We examine the heterogeneity effect of the post-unit 1 assessment separately by including each of the variables below in the final model; we then apply the same procedure to the post-unit 2 and post-unit 3 assessments.
\[
\begin{aligned}
&\beta_{19j}\,(\text{Black} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{20j}\,(\text{Hispanic} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{21j}\,(\text{Asian} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{22j}\,(\text{Multiracial} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{23j}\,(\text{ELL} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{24j}\,(\text{SES} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{25j}\,(\text{Region 1} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{26j}\,(\text{Region 2} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{27j}\,(\text{Region 3} \times \text{post-unit assessment } z\text{-score}),\\
&\beta_{28j}\,(\text{Region 4} \times \text{post-unit assessment } z\text{-score}).
\end{aligned}
\]
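As an illustration of this procedure, each interaction term can be appended to the model formula and the model refit, one term at a time. The snippet below (continuing the hypothetical sketch above) shows the gender case for the post-unit 1 assessment; the race, SES, ELL, and region interactions follow the same pattern.

# Heterogeneity check: add one interaction term at a time to the final
# model, here Female x post-unit 1 assessment. Assumes df and formula
# from the earlier sketch; column names remain hypothetical.
interaction_model = smf.mixedlm(
    formula + " + female:unit1_z", data=df, groups=df["classroom_id"]
)
interaction_result = interaction_model.fit()
# A non-significant interaction coefficient suggests that the association
# between the post-unit assessment and the summative test does not differ
# for that demographic group.
print(interaction_result.summary())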

APPENDIX 9: HLM ASSUMPTION CHECK

The outcome variable (i.e., end-of-year summative test z-score) varies linearly with the
explanatory variables (i.e., pretest z-score and the three post-unit assessment z-scores).

We checked the assumptions for all HLM analyses (Snijders & Bosker, 2011): linearity between the outcome variable and the primary independent variables, homoscedasticity of the residual variance, and the distribution of the random intercepts. Overall, our findings support the HLM assumptions; one example of these checks is presented below.
Linearity between the outcome variable and the explanatory variables (end-of-year summative test z-score and post-unit assessment z-scores).
FIGURE A4 The distribution of residuals for the summative test after the HLM model.

Figure A4 shows the residual plot when the summative test is used as the outcome in the hierarchical linear model. We expect the error terms to follow a normal distribution.

FIGURE A5 The standardized normal probability plot after the HLM model.

In Figure A5, the standardized normal probability plot visualizes whether the distribution of the outcome variable is approximately normal.
FIGURE A6 The Q–Q plot of residuals (end-of-year summative test z-scores).

A Q–Q plot is often used to assess whether the residuals in a regression analysis are normally distributed. As shown in Figure A6, the Q–Q plot follows a straight line at a 45° angle, which indicates that the residuals are roughly normally distributed. There is a slight deviation from normality at the bottom tail, but it appears to be minor.
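Diagnostics of this kind can be reproduced directly from the fitted model. The sketch below, which assumes the hypothetical result object from the earlier statsmodels example, produces plots analogous to Figures A4 and A6.

# Residual diagnostics (cf. Figures A4 and A6), assuming `result` is the
# fitted MixedLM from the earlier sketch.
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = result.resid
resid_std = (resid - resid.mean()) / resid.std()  # standardized residuals

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(resid, bins=30)                 # residual distribution (cf. Figure A4)
axes[0].set_title("Residual distribution")
sm.qqplot(resid_std, line="45", ax=axes[1])  # Q-Q plot vs. standard normal (cf. Figure A6)
axes[1].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()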

FIGURE A7 Residuals against fitted values in the multilevel model.

In Figure A7, we look for either a clear deviation from a mean of 0 or increasing/decreasing variability on the y-axis across the predicted estimates. The plot shows the residuals against the fitted values from the multilevel model. As the fitted values increase, the errors spread out, showing a slight tendency toward heteroskedasticity. Therefore, we employ HLM estimates with robust standard errors in the final model in Appendix 6; the robust estimates are reported in Appendix 10. Robust estimation comes at a cost in power, yielding larger standard errors than standard asymptotic estimates, but all HLM results (Tables 7 and 8) still hold after applying the robust estimates.
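Because statsmodels' MixedLM does not expose a cluster-robust covariance option, the sketch below pairs a residuals-versus-fitted plot (analogous to Figure A7) with an OLS refit using standard errors clustered at the classroom level. This is offered only as an illustrative approximation of a robust-standard-error sensitivity check, not the estimator behind the results reported here, and it assumes the hypothetical df, formula, result, and resid objects defined in the earlier sketches.

# Residuals vs. fitted values (cf. Figure A7) to inspect heteroskedasticity.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

plt.scatter(result.fittedvalues, resid, s=8)
plt.axhline(0, color="black")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Illustrative sensitivity check: the same fixed-effects specification fit
# by OLS with standard errors clustered on classrooms. This approximates,
# but is not identical to, HLM with robust standard errors.
ols_result = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["classroom_id"]}
)
print(ols_result.summary())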
APPENDIX 10: FINAL MODEL WITH ROBUST STANDARD ERROR ESTIMATES

                              Model 1a    Model 1b    Model 1c    Model 2a    Model 2b
                              b/SE        b/SE        b/SE        b/SE        b/SE
Pretest: 2018 reading score   0.332***    0.299***    0.307***    0.303***    0.285***
                              (0.046)     (0.043)     (0.046)     (0.045)     (0.045)
Unit 1 z-score                0.064                               0.033       0.021
                              (0.043)                             (0.043)     (0.041)
Unit 1 z-score missing flag   0.138*                              0.137*      0.141*
                              (0.068)                             (0.069)     (0.069)
Unit 2 z-score                            0.168***                0.149**     0.122*
                                          (0.048)                 (0.050)     (0.051)
Unit 2 z-score missing flag               0.173                   0.114       0.117
                                          (0.106)                 (0.114)     (0.111)
Unit 3 z-score                                        0.148***                0.101**
                                                      (0.033)                 (0.035)
Unit 3 z-score missing flag                           0.067                   0.036
                                                      (0.118)                 (0.125)
Covariates                    Included    Included    Included    Included    Included

*p < 0.05; **p < 0.01; ***p < 0.001.

APPENDIX 11: INITIAL RESULTS OF THE STRUCTURAL EQUATION MODEL: EXAMINING THE RELATIONSHIP BETWEEN POST-UNIT ASSESSMENTS AND THE END-OF-YEAR SUMMATIVE TEST

                                             Estimate   SE      t-statistic   p-value
Pretest → U1 → End-of-year summative test    0.029      0.027   1.073         0.283
Pretest → U2 → End-of-year summative test    0.045*     0.022   2.046         0.041
Pretest → U3 → End-of-year summative test    0.053*     0.025   2.120         0.039

Note: We conducted structural equation modeling using Mplus 7.0. Results were reported at the National Association for Research in Science Teaching (NARST) Annual Conference 2022.
*p < 0.05.
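For readers without access to Mplus, a rough analogue of one of these mediation paths can be sketched in Python with the semopy package. This is only an illustration under the same hypothetical column names used earlier; it is not the code behind the estimates above.

# Sketch of the Pretest -> post-unit 2 -> summative mediation path using
# semopy. The estimates in the table above come from Mplus 7.0, not this
# code, and all column names are hypothetical.
import pandas as pd
import semopy

df = pd.read_csv("mlpbl_students.csv")  # hypothetical analytic file

desc = """
unit2_z ~ pretest_z
summative_z ~ unit2_z + pretest_z
"""
model = semopy.Model(desc)
model.fit(df)
# inspect() returns path estimates, standard errors, and p-values; the
# indirect effect is the product of the pretest_z -> unit2_z and
# unit2_z -> summative_z coefficients.
print(model.inspect())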
APPENDIX 12: FINAL MODEL USING RESTRICTED STUDENT SAMPLE WITH THREE POST-UNIT ASSESSMENTS

                           Model 1a    Model 1b    Model 1c    Model 2a    Model 2b
                           b/SE        b/SE        b/SE        b/SE        b/SE
Pretest: reading z-score   0.351***    0.316***    0.330***    0.308***    0.292***
                           (0.045)     (0.045)     (0.045)     (0.047)     (0.048)
Unit 1 z-score             0.064                               0.036       0.022
                           (0.046)                             (0.046)     (0.047)
Unit 2 z-score                         0.169***                0.161**     0.138**
                                       (0.051)                 (0.052)     (0.053)
Unit 3 z-score                                     0.123**                 0.083*
                                                   (0.045)                 (0.041)
Covariates                 Included    Included    Included    Included    Included

Note: We conducted analyses similar to those in Table 7 (Models 2, 3, and 4) and Table 8 (Models 2 and 3) using the student sample who completed all three post-unit assessments (n = 470).
*p < 0.05; **p < 0.01; ***p < 0.001.

APPENDIX 13: FINAL MODEL USING RESTRICTED STUDENT SAMPLE WITH UNIT 3

                           Model 1a    Model 1b    Model 1c    Model 2a    Model 2b
                           b/SE        b/SE        b/SE        b/SE        b/SE
Pretest: reading z-score   0.011***    0.011***    0.011***    0.009***    0.008***
                           (0.001)     (0.001)     (0.001)     (0.001)     (0.001)
Unit 1 z-score             0.087+                              0.057       0.039
                           (0.044)                             (0.045)     (0.045)
Unit 2 z-score                         0.194***                0.185***    0.148**
                                       (0.045)                 (0.046)     (0.047)
Unit 3 z-score                                     0.160***                0.120**
                                                   (0.037)                 (0.039)
Covariates                 Included    Included    Included    Included    Included

Note: We conducted analyses similar to those in Table 7 (Models 2, 3, and 4) and Table 8 (Models 2 and 3) using an analytic sample consistent with the Unit 3 sample (N = 889).
+p < 0.10; *p < 0.05; **p < 0.01; ***p < 0.001.
APPENDIX 14: FINAL MODEL USING RESTRICTED STUDENT SAMPLE WITH UNIT 1

                                     Model 1a    Model 1b    Model 1c    Model 2a    Model 2b
                                     b/SE        b/SE        b/SE        b/SE        b/SE
Pretest: benchmark reading z-score   0.325***    0.304***    0.313***    0.291***    0.276***
                                     (0.041)     (0.042)     (0.041)     (0.043)     (0.043)
SQ z-score                           0.071+                              0.050       0.038
                                     (0.042)                             (0.042)     (0.043)
TY z-score                                       0.137**                 0.126**     0.111*
                                                 (0.047)                 (0.048)     (0.049)
BD z-score                                                   0.120**                 0.091*
                                                             (0.043)                 (0.045)
Covariates                           Included    Included    Included    Included    Included

Note: We conducted analyses similar to those in Table 7 (Models 2, 3, and 4) and Table 8 (Models 2 and 3) using an analytic sample consistent with the Unit 1 sample (N = 629). SQ, TY, and BD denote the post-unit assessments for Units 1, 2, and 3, respectively.
+p < 0.10; *p < 0.05; **p < 0.01; ***p < 0.001.

APPENDIX 15: HIERARCHICAL LINEAR MODEL PER UNIT ASSESSMENT

                              Model 1SQ   Model 2TY   Model 3BD
                              b/SE        b/SE        b/SE
Pretest: 2018 reading score   0.011***    0.010***    0.010***
                              (0.001)     (0.001)     (0.001)
Unit 1 z-score                0.087+
                              (0.044)
Unit 1 z-score missing flag   0.182*
                              (0.079)
Unit 2 z-score                            0.194***
                                          (0.045)
Unit 2 z-score missing flag               0.178
                                          (0.108)
Unit 3 z-score                                        0.160***
                                                      (0.037)
Unit 3 z-score missing flag                           0.168
                                                      (0.102)
All covariates                Yes         Yes         Yes
Constant                      0.459       0.718*      0.546
                              (0.362)     (0.345)     (0.333)
No. of students               629^a       839^a       889^a
No. of classrooms             43          46          51

^a Number of students in each subsample who completed a single unit assessment with all valid covariates.
+p < 0.10; *p < 0.05; **p < 0.01; ***p < 0.001.

APPENDIX 16: CLASSROOM/TEACHER FIXED EFFECT MODEL

                              Model 1a    Model 1b    Model 1c    Model 2a    Model 2b
                              b/SE        b/SE        b/SE        b/SE        b/SE
Pretest: reading z-score      0.339***    0.304***    0.325***    0.296***    0.283***
                              (0.035)     (0.036)     (0.035)     (0.036)     (0.037)
Unit 1 z-score                0.060                               0.040       0.031
                              (0.043)                             (0.043)     (0.043)
Unit 1 z-score missing flag   0.108                               -0.104      -0.106
                              (0.077)                             (0.077)     (0.077)
Unit 2 z-score                            0.175***                0.169***    0.151***
                                          (0.044)                 (0.044)     (0.045)
Unit 2 z-score missing flag               0.073                   0.061       -0.073
                                          (0.115)                 (0.115)     (0.116)
Unit 3 z-score                                        0.103**                 0.073*
                                                      (0.037)                 (0.032)
Unit 3 z-score missing flag                           0.032                   0.038
                                                      (0.116)                 (0.117)

*p < 0.05; **p < 0.01; ***p < 0.001.
