

Test Evaluation
ETR 528
Aynur Aytekin
Part 1: Test Purpose, Population, and Design

1. Assessment name

PARCC Grade 3 Spring 2018 Assessment

2. Assessment type/s: formative (interim/benchmark, diagnostic), summative, norm-referenced,

criterion-referenced, traditional, performance, high-stakes, standardized, etc.

PARCC is a standardized, summative, criterion-referenced, traditional state test that measures 3rd

through 11th-grade mathematics and English language arts/literacy standards.

3. Population for which assessment is designed, including grade/s/ages/s and any special

inclusion or exclusion criteria, including whether test is intended for special populations (e.g.,

ELLs, SWDs)

This test is designed for all 3rd-grade students to measure mastery of grade-level standards. All 3rd-grade students must take the test, including English learners and diverse learners. The only exception is students with severe cognitive disabilities; for these students the state test is not appropriate even with accommodations, and they take the Dynamic Learning Maps (DLM) assessment instead.

4. Content assessed, in terms of general content area (e.g., mathematics), domains and sub-

domains, and specific content standards (e.g., Common Core State Standards for

mathematics) or other content framework.

PARCC assesses mastery of the CCSS in grades 3 through 11 in mathematics and ELA. PARCC is also aligned with the cognitive complexity and rigor of the Common Core State Standards at each grade level and in each subject. PARCC measures:

• Grade-level standards
• Integrated standards
• Mathematical reasoning statements
• Mathematical modeling statements

The CCSS, integrated standards, mathematical reasoning statements, and mathematical modeling statements assessed on the PARCC Grade 3 Spring 2018 math test are listed below:

Grade 3 Math CCSS

Domain 1: Operations and Algebraic Thinking

3.OA.A.2 (CCSS.MATH.CONTENT.3.OA.A.2): Interpret whole-number quotients of whole numbers, e.g., interpret 56 ÷ 8 as the number of objects in each share when 56 objects are partitioned equally into 8 shares, or as a number of shares when 56 objects are partitioned into equal shares of 8 objects each. For example, describe a context in which a number of shares or a number of groups can be expressed as 56 ÷ 8.

3.OA.A.3 (CCSS.MATH.CONTENT.3.OA.A.3): Use multiplication and division within 100 to solve word problems in situations involving equal groups, arrays, and measurement quantities, e.g., by using drawings and equations with a symbol for the unknown number to represent the problem.

3.OA.A.4 (CCSS.MATH.CONTENT.3.OA.A.4): Determine the unknown whole number in a multiplication or division equation relating three whole numbers. For example, determine the unknown number that makes the equation true in each of the equations 8 × ? = 48, 5 = _ ÷ 3, 6 × 6 = ?

Domain 2: Number & Operations in Base Ten

3.NBT.A.1 (CCSS.MATH.CONTENT.3.NBT.A.1): Use place value understanding to round whole numbers to the nearest 10 or 100.

3.NBT.A.2 (CCSS.MATH.CONTENT.3.NBT.A.2): Fluently add and subtract within 1000 using strategies and algorithms based on place value, properties of operations, and/or the relationship between addition and subtraction.

3.NBT.A.3 (CCSS.MATH.CONTENT.3.NBT.A.3): Multiply one-digit whole numbers by multiples of 10 in the range 10-90 (e.g., 9 × 80, 5 × 60) using strategies based on place value and properties of operations.

Domain 3: Number & Operations - Fractions

3.NF.A.1 (CCSS.MATH.CONTENT.3.NF.A.1): Understand a fraction 1/b as the quantity formed by 1 part when a whole is partitioned into b equal parts; understand a fraction a/b as the quantity formed by a parts of size 1/b.

3.NF.A.2 (CCSS.MATH.CONTENT.3.NF.A.2): Understand a fraction as a number on the number line; represent fractions on a number line diagram.

3.NF.A.2.A (CCSS.MATH.CONTENT.3.NF.A.2.A): Represent a fraction 1/b on a number line diagram by defining the interval from 0 to 1 as the whole and partitioning it into b equal parts. Recognize that each part has size 1/b and that the endpoint of the part based at 0 locates the number 1/b on the number line.

3.NF.A.2.B (CCSS.MATH.CONTENT.3.NF.A.2.B): Represent a fraction a/b on a number line diagram by marking off a lengths 1/b from 0. Recognize that the resulting interval has size a/b and that its endpoint locates the number a/b on the number line.

3.NF.A.3 (CCSS.MATH.CONTENT.3.NF.A.3): Explain equivalence of fractions in special cases, and compare fractions by reasoning about their size.

3.NF.A.3.A (CCSS.MATH.CONTENT.3.NF.A.3.A): Understand two fractions as equivalent (equal) if they are the same size, or the same point on a number line.

3.NF.A.3.B (CCSS.MATH.CONTENT.3.NF.A.3.B): Recognize and generate simple equivalent fractions, e.g., 1/2 = 2/4, 4/6 = 2/3. Explain why the fractions are equivalent, e.g., by using a visual fraction model.

3.NF.A.3.C (CCSS.MATH.CONTENT.3.NF.A.3.C): Express whole numbers as fractions, and recognize fractions that are equivalent to whole numbers. Examples: Express 3 in the form 3 = 3/1; recognize that 6/1 = 6; locate 4/4 and 1 at the same point of a number line diagram.

3.NF.A.3.D (CCSS.MATH.CONTENT.3.NF.A.3.D): Compare two fractions with the same numerator or the same denominator by reasoning about their size. Recognize that comparisons are valid only when the two fractions refer to the same whole. Record the results of comparisons with the symbols >, =, or <, and justify the conclusions, e.g., by using a visual fraction model.

Domain 4: Measurement & Data

3.MD.A.1 (CCSS.MATH.CONTENT.3.MD.A.1): Tell and write time to the nearest minute and measure time intervals in minutes. Solve word problems involving addition and subtraction of time intervals in minutes, e.g., by representing the problem on a number line diagram.

3.MD.A.2 (CCSS.MATH.CONTENT.3.MD.A.2): Measure and estimate liquid volumes and masses of objects using standard units of grams (g), kilograms (kg), and liters (l). Add, subtract, multiply, or divide to solve one-step word problems involving masses or volumes that are given in the same units, e.g., by using drawings (such as a beaker with a measurement scale) to represent the problem.

3.MD.B.3 (CCSS.MATH.CONTENT.3.MD.B.3): Draw a scaled picture graph and a scaled bar graph to represent a data set with several categories. Solve one- and two-step "how many more" and "how many less" problems using information presented in scaled bar graphs. For example, draw a bar graph in which each square in the bar graph might represent 5 pets.

3.MD.B.4 (CCSS.MATH.CONTENT.3.MD.B.4): Generate measurement data by measuring lengths using rulers marked with halves and fourths of an inch. Show the data by making a line plot, where the horizontal scale is marked off in appropriate units: whole numbers, halves, or quarters.

3.MD.C.5 (CCSS.MATH.CONTENT.3.MD.C.5): Recognize area as an attribute of plane figures and understand concepts of area measurement.

3.MD.C.5.A (CCSS.MATH.CONTENT.3.MD.C.5.A): A square with side length 1 unit, called "a unit square," is said to have "one square unit" of area, and can be used to measure area.

3.MD.C.5.B (CCSS.MATH.CONTENT.3.MD.C.5.B): A plane figure which can be covered without gaps or overlaps by n unit squares is said to have an area of n square units.

3.MD.C.6 (CCSS.MATH.CONTENT.3.MD.C.6): Measure areas by counting unit squares (square cm, square m, square in, square ft, and improvised units).

3.MD.C.7 (CCSS.MATH.CONTENT.3.MD.C.7): Relate area to the operations of multiplication and addition.

3.MD.C.7.A (CCSS.MATH.CONTENT.3.MD.C.7.A): Find the area of a rectangle with whole-number side lengths by tiling it, and show that the area is the same as would be found by multiplying the side lengths.

3.MD.C.7.B (CCSS.MATH.CONTENT.3.MD.C.7.B): Multiply side lengths to find areas of rectangles with whole-number side lengths in the context of solving real world and mathematical problems, and represent whole-number products as rectangular areas in mathematical reasoning.

3.MD.C.7.C (CCSS.MATH.CONTENT.3.MD.C.7.C): Use tiling to show in a concrete case that the area of a rectangle with whole-number side lengths a and b + c is the sum of a × b and a × c. Use area models to represent the distributive property in mathematical reasoning.

3.MD.C.7.D (CCSS.MATH.CONTENT.3.MD.C.7.D): Recognize area as additive. Find areas of rectilinear figures by decomposing them into non-overlapping rectangles and adding the areas of the non-overlapping parts, applying this technique to solve real world problems.

3.MD.D.8 (CCSS.MATH.CONTENT.3.MD.D.8): Solve real world and mathematical problems involving perimeters of polygons, including finding the perimeter given the side lengths, finding an unknown side length, and exhibiting rectangles with the same perimeter and different areas or with the same area and different perimeters.

Domain 5: Geometry

3.G.A.1 (CCSS.MATH.CONTENT.3.G.A.1): Understand that shapes in different categories (e.g., rhombuses, rectangles, and others) may share attributes (e.g., having four sides), and that the shared attributes can define a larger category (e.g., quadrilaterals). Recognize rhombuses, rectangles, and squares as examples of quadrilaterals, and draw examples of quadrilaterals that do not belong to any of these subcategories.

3.G.A.2 (CCSS.MATH.CONTENT.3.G.A.2): Partition shapes into parts with equal areas. Express the area of each part as a unit fraction of the whole. For example, partition a shape into 4 parts with equal area, and describe the area of each part as 1/4 of the area of the shape.

Integrative Standard Statements

3.Int.1: Given a two-step problem situation with the four operations, round the values in the problem, then use the rounded values to produce an approximate solution. Content Scope: 3.OA.8, 3.NBT.1, 3.NBT.2, 3.NBT.3

3.Int.2: Solve two-step word problems using the four operations requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.OA.8, 3.NBT.2, and 3.NBT.3

3.Int.3: Solve real world and mathematical problems involving perimeters of polygons requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.MD.8, 3.NBT.2, and 3.NBT.3

3.Int.4: Use information presented in a scaled bar graph to solve a two-step "how many more" or "how many less" problem requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.MD.3, 3.NBT.2, and 3.NBT.3

3.Int.5: Add, subtract, or multiply to solve a one-step word problem involving masses or volumes that are given in the same units, where a substantial addition, subtraction, or multiplication step is required, drawing on knowledge and skills articulated in 3.NBT, e.g., by using drawings. Content Scope: 3.MD.2, 3.NBT.2, and 3.NBT.3

Mathematical Reasoning Statements

3.C.1-1: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.OA.5

3.C.1-2: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.OA.9

3.C.1-3: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.MD.7

3.C.2: Base explanations/reasoning on the relationship between multiplication and division. Content Scope: Knowledge and skills articulated in 3.OA.6

3.C.3-1: Base arithmetic explanations/reasoning on concrete referents such as diagrams (whether provided in the prompt or constructed by the student in her response), connecting the diagrams to a written (symbolic) method. Content Scope: Knowledge and skills articulated in 3.NF.3b, 3.NF.3d

3.C.3-2: Base explanations/reasoning on concrete referents such as diagrams (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.MD.5, 3.MD.6, 3.MD.7

3.C.4-1: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.5

3.C.4-2: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.6

3.C.4-3: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.8

3.C.4-4: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.NF.3b, 3.NF.3d

3.C.4-5: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.MD.7

3.C.4-6: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.9

3.C.4-7: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 2.NBT

3.C.5-1: Present solutions to two-step problems in the form of valid chains of reasoning, using symbols such as equals signs appropriately (for example, rubrics award less than full credit for the presence of nonsense statements such as 1 + 4 = 5 + 7 = 12, even if the final answer is correct), or identify or describe errors in solutions to two-step problems and present corrected solutions. Content Scope: Knowledge and skills articulated in 3.OA.8

3.C.5-2: Present solutions to multi-step problems in the form of valid chains of reasoning, using symbols such as equals signs appropriately (for example, rubrics award less than full credit for the presence of nonsense statements such as 1 + 4 = 5 + 7 = 12, even if the final answer is correct), or identify or describe errors in solutions to multi-step problems and present corrected solutions. Content Scope: Knowledge and skills articulated in 3.MD.7b, 3.MD.7d

3.C.6-1: Base explanations/reasoning on a number line diagram (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.NF.2

3.C.6-2: Base explanations/reasoning on a number line diagram (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.MD.1

Mathematical Modeling Statements

3.D.1: Solve multi-step contextual word problems with degree of difficulty appropriate to Grade 3, requiring application of knowledge and skills articulated in Type I, Sub-Claim A Evidence Statements.

3.D.2: Solve multi-step contextual problems with degree of difficulty appropriate to Grade 3, requiring application of knowledge and skills articulated in 2.OA.A, 2.OA.B, 2.NBT, and/or 2.MD.B.

5. Purpose for which the assessment is designed, and decisions made (i.e., for what and by whom

are the collected data to be used). When considering the assessment purpose and assessment

users, cast a wide net to include teachers, administrators, policymakers, parents, etc.

The purpose of PARCC testing is to provide information that schools and teachers can use to improve classroom instruction. The test gives teachers direction about students' current academic levels, the areas in which they need more support, and the knowledge and skills they need for the next grade level.

Schools use the test results to put changes in place that improve the quality of instruction. The PARCC assessment also provides data on subgroup performance and the type of support each subgroup needs.

Parents are informed of their children's strengths and weaknesses and of how to work collaboratively with the teacher and school to close those gaps.

Policymakers review the data and plan accordingly to support schools and districts in raising achievement. PARCC data can also be used to gauge how well curriculum and instructional practices are aligned to the CCSS. In Illinois, the Illinois State Board of Education (ISBE) releases the Illinois Report Card to show how the state, districts, and schools are performing academically. Based on a school's overall data for all accountability indicators, schools are categorized as lowest-performing, underperforming, commendable, or exemplary. Lowest-performing and underperforming schools receive support from commendable and exemplary schools, and they must complete school improvement plans or other requirements depending on the state.

*Illinois administered PARCC testing in 2018, and the PARCC results were part of the report cards. In March 2019, ISBE decided to stop administering PARCC and switched to the Illinois Assessment of Readiness (IAR). The IAR is very similar to PARCC in terms of rigor and test structure.

6. Assessment Design (nature of assessment task): general test design (traditional selected-

response, traditional constructed-response, performance, hybrid) and specific nature of

assessment task/s (e.g., numbers of types of selected- or constructed-response items)

The PARCC assessment contains three task types.

Type I (Sub-Claims A and B)
• Problems involving the major content for the grade level (Sub-Claim A)
• Problems involving the additional and supporting content for the grade level (Sub-Claim B)
• Conceptual understanding, fluency, and application
• Machine-scorable items

Type II (Sub-Claim C)
• Problems involving expressing mathematical reasoning by constructing mathematical arguments and critiques (Sub-Claim C)
• Includes written arguments/justifications and the critique of reasoning
• Machine-scorable and hand-scored items

Type III (Sub-Claim D)
• Modeling and application in a real-world context
• Machine-scorable and hand-scored items

PARCC 3rd-Grade Math Assessment Item Specifications

Number | Task Type | Evidence Statement | Sub-Claim | Question Type | Scoring*
1 | Type I | 3.OA.1 | A | Drag-and-drop | MS
2 | Type I | 3.NBT.2 | B | Fill-in-the-blank | MS
3 | Type I | 3.MD.1-1 | A | Fill-in-the-blank | MS
4 | Type III | 3.D.2 | D | A. Constructed Response; B. Constructed Response | HS; HS
5 | Type I | 3.NF.1 | A | Multiple Choice | MS
6 | Type I | 3.OA.3-1 | A | Multiple Select | MS
7 | Type I | 3.NF.2 | A | Multiple Choice | MS
8 | Type II | 3.C.5-1 | C | A. Constructed Response; B. Constructed Response; C. Fill-in-the-blank | HS; HS; MS
9 | Type I | 3.OA.7-2 | A | Multiple Select | MS
10 | Type I | 3.Int.2 | A | A. Fill-in-the-blank; B. Fill-in-the-blank | MS; MS
11 | Type I | 3.G.2 | B | Fill-in-the-blank | MS
12 | Type I | 3.MD.1-2 | A | Fill-in-the-blank | MS
13 | Type I | 3.NF.2 | A | Multiple Choice | MS
14 | Type I | 3.OA.7-2 | A | Fill-in-the-blank | MS
15 | Type III | 3.D.1 | D | Constructed Response | HS
16 | Type I | 3.OA.2 | A | Multiple Select | MS
17 | Type I | 3.MD.7b-1 | A | Fill-in-the-blank | MS
18 | Type I | 3.OA.3-4 | A | Multiple Choice | MS
19 | Type I | 3.MD.5 | A | Multiple Choice | MS
20 | Type I | 3.G.2 | B | Drag-and-drop | MS
21 | Type I | 3.OA.4 | A | Fill-in-the-blank | MS
22 | Type I | 3.NBT.2 | B | Fill-in-the-blank | MS
23 | Type I | 3.OA.3-3 | A | Multiple Choice | MS
24 | Type II | 3.C.4-6 | C | A. Constructed Response; B. Multiple Choice; C. Constructed Response | HS; MS; HS
25 | Type I | 3.NF.3a-1 | A | Multiple Select | MS
26 | Type I | 3.G.1 | B | Multiple Select | MS
27 | Type I | 3.NF.3d | A | Drag-and-drop | MS
28 | Type I | 3.MD.8 | B | Multiple Select | MS
29 | Type I | 3.OA.1 | A | Multiple Select | MS
30 | Type II | 3.C.3-2 | C | A. Multiple Choice; B. Constructed Response | MS; HS
31 | Type I | 3.NBT.3 | B | Fill-in-the-blank | MS
32 | Type I | 3.OA.3-2 | A | Multiple Choice | MS
33 | Type I | 3.OA.7-1 | A | Multiple Select | MS
34 | Type I | 3.NF.3c | A | Multiple Select | MS
35 | Type I | 3.MD.4 | B | Multiple Select | MS

*MS = Machine Scored; HS = Hand Scored

7. Assessment design (scoring): scoring procedures for assessment items/task (e.g., answer key,

checklists, scoring rubric/s, rating scales); and scores reported by the assessment (e.g., scale

scores, percentile ranks, z-scores, RIT scores)

PARCC 3rd Grade Math Spring 2018 – Scoring Guidelines

There are three different types of prompts on the PARCC 3rd-grade math assessment. The prompt type for each item is listed in the table above.


Machine-Scored Prompts: Multiple choice, multiple select, and fill-in-the-blank questions are machine-scored items. Each machine-scored prompt is scored as either right (1 point) or wrong (0 points).

Equation Editor Prompts: Students enter their response using the equation editor box; only numbers and equations can be entered. The responses in the box are machine-scored by Pearson's Knowledge Technology group. These items are worth 0, 1, or 2 points.

Constructed Response Prompts: Students may enter both text and mathematics in the boxes provided. Pearson's Performance Scoring group scores these items based on a scoring rubric. For the 3rd-grade math assessment, constructed response prompts have a maximum of 3, 4, or 6 points.

The PARCC reports provide two different types of data:

• Overall Scale Score: The scale score summarizes student performance. The overall score ranges from 600 to 850, and scores of 750 and above are considered proficient.

• Performance Level: PARCC also uses performance levels to determine how well students meet grade-level expectations. There are five levels:
  o Level 1: Did not meet expectations
  o Level 2: Partially met expectations
  o Level 3: Approached expectations
  o Level 4: Met expectations
  o Level 5: Exceeded expectations


For the PARCC Grade 3 Math Spring 2018 test, the table of specifications is below:

Items | Number of Tasks | Total Points
Type I, 1 point | 32 | 32
Type I, 2 points | 4 | 8
Type II, 3 points | 2 | 6
Type II, 4 points | 2 | 8
Type III, 3 points | 2 | 6
Type III, 6 points | 1 | 6
TOTAL | 43 | 66

Totals by task type:
Type I | 36 tasks (84%) | 40 points (61%)
Type II | 4 tasks (9%) | 14 points (21%)
Type III | 3 tasks (7%) | 12 points (18%)
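As a quick check, the totals and percentages in this blueprint follow directly from the task counts and point values. A minimal sketch of that arithmetic (illustrative only; the counts and point values are copied from the table above, and the script itself is not part of any PARCC documentation):

```python
# Task counts and point values copied from the table of specifications above.
blueprint = [
    # (task type, points per task, number of tasks)
    ("Type I", 1, 32), ("Type I", 2, 4),
    ("Type II", 3, 2), ("Type II", 4, 2),
    ("Type III", 3, 2), ("Type III", 6, 1),
]

total_tasks = sum(n for _, _, n in blueprint)
total_points = sum(p * n for _, p, n in blueprint)
print(total_tasks, total_points)  # 43 tasks, 66 points

for task_type in ("Type I", "Type II", "Type III"):
    tasks = sum(n for t, _, n in blueprint if t == task_type)
    points = sum(p * n for t, p, n in blueprint if t == task_type)
    print(f"{task_type}: {tasks} tasks ({tasks / total_tasks:.0%}), "
          f"{points} points ({points / total_points:.0%})")
```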

Part 2: Reliability

8. Describe the procedures used to investigate score reliability (e.g., KR-20, construct reliability

from confirmatory factor analysis), and the types of reliability investigated (e.g., internal

consistency, test-retest reliability, parallel-forms reliability, inter-rater reliability).

Reliability focuses on how consistent and stable the test is in assessing what it is intended to assess. For the PARCC test, there are several ways to estimate reliability. Internal consistency evidence, reliability of classification estimates, inter-rater reliability, and the standard error of measurement (SEM) are explained in this part.

Internal Consistency Evidence


Internal consistency evidence describes how consistently each individual student performs across the test items. A high internal consistency coefficient means that, without any change in their academic level or skills, students would most likely obtain similar scores if they were administered the test again. A coefficient above .70 is considered acceptable, and coefficients above .90 indicate very high internal consistency. The average reliability of the 3rd-grade math test is .94, and the average reliability estimates for grades 3 through 8 range from .91 to .94. Such high Cronbach's alpha values tell us that the internal consistency of the PARCC test is very high across all grade levels. The internal consistency data for 3rd-grade math are below.

Test Form* | Max. Possible Score | Avg. Raw Score SEM | Avg. Reliability | Min. Sample Size | Min. Reliability | Max. Sample Size | Max. Reliability
3-CBT | 66 | 3.37 | 0.94 | 2,070 | 0.92 | 33,176 | 0.94
3-PBT | 65 | 3.39 | 0.93 | 4,405 | 0.90 | 37,527 | 0.93

*CBT = Computer-Based Test; PBT = Paper-Based Test
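The reliability coefficients reported above are internal-consistency estimates (Cronbach's alpha). As an illustration only, a coefficient of this kind can be computed from a students-by-items score matrix as in the sketch below; the tiny score matrix is invented for demonstration and is not PARCC data, and the operational estimates follow the procedures described in the technical report.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a students-by-items matrix of item scores."""
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of students' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented example: 6 students x 4 dichotomously scored items
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(round(cronbach_alpha(scores), 2))
```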

Reliability of the Classification Estimates

PARCC divides performance into five levels. The reliability of classification estimates measures how accurately students are placed into those performance levels. Classification estimates have two parts: decision accuracy and decision consistency. Decision accuracy describes the agreement between the observed classifications and the classifications that would be made if scores were perfectly reliable. Decision consistency measures the agreement between classifications based on two independent, parallel forms of the test (Livingston & Lewis, 1995).

The decision accuracy and decision consistency results for 3rd-grade math are below:

Form | Decision Accuracy: Exact Level | Decision Accuracy: Level 4 or higher vs. Level 3 or lower | Decision Consistency: Exact Level | Decision Consistency: Level 4 or higher vs. Level 3 or lower
CBT | 0.77 | 0.93 | 0.68 | 0.90
PBT | 0.76 | 0.93 | 0.67 | 0.90

The "Exact Level" columns describe classification into one of the five performance levels. For the 3rd-grade math test, the table shows that the proportion of students accurately classified across the five performance levels was .77 for CBT takers and .76 for PBT takers. The PARCC 3rd-grade math test classifies students as being at Level 4 or higher versus Level 3 or lower with an accuracy of .93, and the proportion of students consistently classified on that same dichotomy is .90. Pearson uses the computer program BB-Class (Brennan, 2004) to calculate the decision accuracy and decision consistency proportions; the statistical method was developed by Livingston and Lewis (1993, 1995). These decision accuracy and decision consistency values are acceptable in terms of the reliability of the performance-level classifications.
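PARCC's operational estimates come from the Livingston-Lewis procedure implemented in BB-Class, which models true-score distributions rather than simply comparing two observed classifications. The sketch below is only a simplified illustration of the underlying idea of exact-level and dichotomous (Level 4 or higher) agreement between two classifications: the scale scores are invented, and the level cut scores other than the 750 proficiency cut mentioned above are hypothetical placeholders.

```python
import numpy as np

def classify(scale_scores, cuts):
    """Assign performance levels 1-5 given the thresholds for Levels 2-5."""
    return np.searchsorted(cuts, scale_scores, side="right") + 1

def agreement(levels_a, levels_b):
    exact = np.mean(levels_a == levels_b)                      # exact-level agreement
    dichotomous = np.mean((levels_a >= 4) == (levels_b >= 4))  # Level 4+ vs. Level 3 or lower
    return exact, dichotomous

# Hypothetical cut scores; only 750 (Level 4, "met expectations") comes from the text above.
cuts = np.array([700, 725, 750, 790])
# Invented scale scores for the same students on two hypothetical parallel administrations.
form_1 = np.array([660, 710, 748, 752, 800, 731, 725, 760])
form_2 = np.array([655, 715, 751, 749, 805, 728, 722, 765])
print(agreement(classify(form_1, cuts), classify(form_2, cuts)))
```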

Inter-Rater Agreement

Inter-rater agreement is the relationship between the first and second scores assigned to students' extended responses. It is reported as exact, adjacent, and nonadjacent agreement. Decision-makers use the data for rater training and intervention purposes.


The inter-rater agreement expectations and results for all math grade levels are below:

Subject | Score Point Range | Perfect Agreement Expectation | Perfect Agreement Results | Within One Point Expectation | Within One Point Result
Math | 0-1 | 90% | 97% | 96% | 100%
Math | 0-2 | 80% | 94% | 96% | 100%
Math | 0-3 | 70% | 93% | 96% | 99%
Math | 0-4 | 65% | 92% | 95% | 98%
Math | 0-5 | 65% | 90% | 95% | 98%
Math | 0-6 | 65% | 91% | 95% | 97%

The percentages in the "Perfect Agreement Results" and "Within One Point Result" columns show the level of agreement between the scorers. Agreement above 90% is considered high. For the PARCC test, all agreement results are at or above 90%.
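Agreement rates like those in the table can be computed directly from paired first and second scores. A minimal sketch under invented data (the paired scores below are made up for illustration and are not PARCC rater data):

```python
import numpy as np

def rater_agreement(first, second):
    """Perfect and within-one-point agreement rates for paired rater scores."""
    first, second = np.asarray(first), np.asarray(second)
    perfect = np.mean(first == second)
    within_one = np.mean(np.abs(first - second) <= 1)
    return perfect, within_one

# Invented first/second reads on a 0-4 point constructed-response item
first  = [3, 2, 4, 0, 1, 2, 3, 4, 2, 1]
second = [3, 2, 3, 0, 1, 2, 4, 4, 2, 0]
perfect, within_one = rater_agreement(first, second)
print(f"perfect = {perfect:.0%}, within one point = {within_one:.0%}")
```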

9. Describe and interpret the results of procedures used to investigate score reliability.

The constructed response items in the PARCC test are human-scored. All scorers receive training on scoring the items; scorers who successfully complete the training and qualify score the online and paper-based constructed response items using the ePEN (Electronic Performance Evaluation Network, second generation) platform. After all items are scored once, 10% of the items are randomly assigned to scorers again for a second read. The scorers are not told whether an item is being scored for the first or second time. If the first and second scores are nonadjacent, the item is assigned to be scored again until the disagreement is resolved and the item is accurately scored.

The expected agreement percentages are listed in the table above. Scorers have to meet these criteria to continue scoring items. Scorers who cannot reach the expected agreement levels are subjected to a series of interventions and tests to check their scoring abilities. Scorers who still cannot meet the requirements are removed from the group, and all the items they scored are reevaluated.
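As a simplified illustration of the workflow just described (select roughly 10% of responses for a second read, then route nonadjacent score pairs for resolution), here is a hedged sketch; the response IDs, scores, and random selection are all invented, and the operational ePEN routing rules are certainly more involved.

```python
import random

def needs_resolution(score_1, score_2, max_gap=1):
    """Nonadjacent scores (more than one point apart) are routed for another read."""
    return abs(score_1 - score_2) > max_gap

random.seed(0)
response_ids = list(range(1, 201))                                     # 200 scored responses
second_reads = random.sample(response_ids, k=len(response_ids) // 10)  # ~10% re-scored

# Invented paired scores (0-4 points) for the re-read responses
pairs = {rid: (random.randint(0, 4), random.randint(0, 4)) for rid in second_reads}
flagged = [rid for rid, (s1, s2) in pairs.items() if needs_resolution(s1, s2)]
print(f"{len(second_reads)} responses re-scored, {len(flagged)} routed for resolution")
```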



10. Describe and interpret information about the test standard error of measurement (or

conditional standard error of measurement).

The standard error of measurement (SEM) quantifies the amount of error in the scores: it indicates how far a student's observed score is likely to fall from the score the student would receive if the test were perfectly reliable. Higher SEM values indicate greater variability in scores if students were to take the same test repeatedly. The average SEM for the 3rd-grade CBT is 3.37, and for the PBT it is 3.39. For a perfectly reliable test the SEM would be zero; the lower the SEM, the more reliable the test. For this 3rd-grade test, the SEM is slightly lower for the CBT. These SEM values are acceptable.
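In classical test theory the SEM is tied to score variability and reliability by SEM = SD × sqrt(1 − reliability), and it can be used to put an error band around an observed score. The sketch below uses the reported Grade 3 CBT values (reliability .94, average SEM 3.37); the implied standard deviation is only a back-calculation from those two numbers, and the observed score used for the band is invented.

```python
import math

reliability = 0.94   # average reliability, Grade 3 math CBT (reported above)
sem = 3.37           # average raw-score SEM, Grade 3 math CBT (reported above)

# Classical relation SEM = SD * sqrt(1 - reliability), so the implied raw-score SD is:
implied_sd = sem / math.sqrt(1 - reliability)
print(round(implied_sd, 1))          # roughly 13.8 raw-score points

# 95% error band around an observed raw score (the score itself is invented)
observed = 48
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(round(lower, 1), round(upper, 1))   # about 41.4 to 54.6
```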

11. Describe and interpret evidence related to item discriminations (e.g., corrected item-total

correlations), including how the item discrimination values compare to the recommended

range, noting in particular any items with negative or small-positive discriminations.

Item discrimination describes the relationship between students' performance on a specific item and their performance on the total test. Item discrimination values range from -1.00 to 1.00, and values above 0.15 are considered acceptable. Negative values indicate that low-ability students perform better on the item than high-ability students. Pearson investigates any item with a discrimination value lower than 0.15, and items with extremely low or negative values may be excluded from the test after that investigation. For the 3rd-grade math assessment, item-level discrimination values could not be located.
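A common discrimination index of this kind is the corrected item-total correlation: the correlation between each item's score and the total score on the remaining items. A minimal sketch with invented 0/1 item data (not PARCC data); the 0.15 screening threshold is the one mentioned above.

```python
import numpy as np

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the total score of the remaining items."""
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(n_items)
    ])

# Invented 0/1 item scores for 8 students x 4 items
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])
discriminations = corrected_item_total(scores)
print(np.round(discriminations, 2))
print("flag for review:", np.where(discriminations < 0.15)[0])  # items below the 0.15 screen
```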

12. Evaluate the degree to which the overall body of reliability evidence is sufficient to support

test use, noting in particular any outstanding and unaddressed threats to reliability.
For the 3rd-grade PARCC math Spring 2018 assessment, no issues were observed that might threaten the reliability of the test. The average reliability coefficients for the test are high; the value for the 3rd-grade math CBT is 0.94, which indicates high reliability.

Part 3: Test Validity

13. Describe the procedures used to investigate test validity (e.g., factor analysis, expert reviews,

think-alouds), and the specific forms of validity evidence gathered (e.g., based on test content,

based on relations with external variables, based on internal structure, based on response

processes, and based on the consequences of testing).

According to the Standards for Educational and Psychological Testing published by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) (2014), validity is described as:

"The degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations" (p. 11).

Evidence Based on Test Content

Construct validity describes how well a test measures what it claims to measure. PARCC is designed to assess the CCSS for math and ELA in grades 3-8, and the test blueprints and evidence statements are derived from those Common Core State Standards. In addition, the PARCC College and Career Ready Determinations (CCRD) identify the skills, knowledge, and performances that students are expected to demonstrate to be college and career ready. The PARCC Governing Board and the PARCC Advisory Committee on College Readiness reviewed the standards and the test design and concluded that students who earn Level 4 or 5 on the PARCC high school assessments are likely to earn at least a "C" in their college courses without extra help.

When the math and ELA PARCC tests are constructed, experts from the states that administer PARCC meet to review the test forms. The experts work mostly on the alignment between the CCSS and the test blueprints and evidence statements. Necessary item replacements are made during those meetings while maintaining the integrity of the test design.

PARCC items are also field tested before they are included in the operational assessments. Selected schools administer the field test, and the data are used in the development of the test items.

Evidence Based on Internal Structure

Evidence based on internal structure refers to the relationships among the test items, the test components, and the sub-claims. The ELA reports provide data on the overall ELA score, the Reading (RD) and Writing (WR) claim scores, and five sub-claim scores: Reading Literature (RL), Reading Information (RI), Reading Vocabulary (RV), Writing Written Expression (WE), and Writing Knowledge of Language and Conventions (WKL). Math reports provide data for four sub-claims: Major Content (MC), Mathematical Reasoning (MR), Modeling Practice (MP), and Additional and Supporting Content (ASC).

The validity of the internal structure depends on both total-group and subgroup consistency. Another way to check the internal structure is to examine the relationships among the ELA and Math scores and their sub-claim scores.

The intercorrelations for the 3rd-grade Math sub-claims are below:

CBT (lower triangle; sample size 305,106):
      MC    ASC   MR    MP
MC    0.89
ASC   0.81  0.75
MR    0.78  0.72  0.70
MP    0.74  0.68  0.70  0.71

PBT (lower triangle; sample size 42,794):
      MC    ASC   MR    MP
MC    0.87
ASC   0.80  0.72
MR    0.78  0.72  0.67
MP    0.73  0.68  0.69  0.67

The average intercorrelations and the sample sizes for CBT and PBT students taking the 3rd-grade Math test are provided in this table. The intercorrelations between MC and the other sub-claims (ASC, MR, and MP) are higher than the other relationships, and the intercorrelation between MP and ASC is the lowest for both CBT and PBT. Overall, the results show a moderate relationship among the sub-claims.

The correlations between ELA and Math for the 3rd grade are below:

CBT (lower triangle; sample size 295,747):
      ELA   RD    WD    MA
RD    0.95
WD    0.88  0.73
MA    0.79  0.76  0.70

PBT (lower triangle; sample size 42,794):
      ELA   RD    WD    MA
RD    0.95
WD    0.89  0.73
MA    0.77  0.75  0.67

The correlations between Math and ELA are similar for CBT and PBT. The correlations between Math and overall ELA and between Math and the Reading domain are stronger, while the correlation between Math and the Writing domain can be considered moderate.
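Intercorrelations like those in the two tables above are ordinary Pearson correlations among students' claim and sub-claim scores. A minimal sketch of how such a matrix is computed (the sub-claim scores below are simulated, not PARCC data; only the sub-claim labels come from the tables):

```python
import numpy as np

# Simulated sub-claim scores for 500 students (columns: MC, ASC, MR, MP)
rng = np.random.default_rng(7)
mc = rng.normal(size=500)
asc = 0.8 * mc + 0.6 * rng.normal(size=500)
mr = 0.75 * mc + 0.65 * rng.normal(size=500)
mp = 0.7 * mc + 0.7 * rng.normal(size=500)

subclaims = np.column_stack([mc, asc, mr, mp])
labels = ["MC", "ASC", "MR", "MP"]
corr = np.corrcoef(subclaims, rowvar=False)   # 4 x 4 intercorrelation matrix

for label, row in zip(labels, corr):
    print(label, np.round(row, 2))
```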

Evidence Based on Response Processes

Evidence based on response processes indicates whether students follow the expected response processes while answering the questions. The data for this evidence can be collected from test proctors' observations during testing and from item scorers' feedback, mostly about the extended-response and constructed-response items. Pearson has conducted several studies to check the validity of the response processes. For example, a drawing tool is available during the mathematics test, and several studies examined its effectiveness and usability. To check how the drawing tool affects students' math performance, Pearson included the drawing tool in the field tests; students were randomly assigned the drawing tool during the field test. The results showed no statistically significant difference between the performance of students using the drawing tool and students without it.

Pearson also conducts research on topics including, but not limited to, item quality; whether students engage with the items as expected while taking the test; whether the time allotted for each item is sufficient; the accuracy and reliability of the scoring rubrics; the accommodations for English learners and students with disabilities; the efficiency of the test format; and the technology features of the test.

Evidence Based on the Consequences of Testing

Each state and district uses the PARCC results in various ways. In some states PARCC proficiency is required to graduate from high school, while others do not require it for graduation. Some districts use PARCC data to place students in gifted programs or in courses such as Algebra I or Geometry in middle school. Each state and district has different consequences of the test in place.

14. Describe and interpret the results of procedures used to investigate test validity issues.

Pearson conducts several studies during the test item development process, including benchmarking studies, content evaluations, studies of the alignment between the test content and the CCSS, and test mode and device comparability studies. These studies show that PARCC-proficient students demonstrate college and career readiness, or are on track to readiness, with respect to those standards. The reports suggest that the PARCC program would be improved if greater depth of knowledge were built into the overall assessments. Pearson also conducts field tests each year with selected schools to improve the test administration and the rigor of the test.

In 2016, a study was conducted to evaluate the degree of alignment of the ELA and Math assessments for grade levels 5, 8, and 11 with the CCSS (Doorey & Polikoff, 2016; Schultz, Michaels, Dvorak, & Wiley, 2016). Content experts reviewed the assessment items, rubrics, and answers and evaluated how well the items aligned to the CCSS, the critical thinking levels and depth of knowledge of the items, and the accessibility of the items for all students, including English learners and students with disabilities. The experts both rated all items and gave feedback that was used for item improvement. According to the results of the study, the PARCC assessment program was rated as:

"Excellent Match" for ELA content and depth of knowledge
"Good Match" for Math content for grades 5 and 8
"Excellent Match" for 11th-grade math content
"Limited/Uneven Match" for 11th-grade depth of knowledge
"Excellent Match" for high school math content
"Good Match" for high school math depth of knowledge

Some suggested issues regarding the ELA assessment were a lack of emphasis on vocabulary and language skills and a need for more focus on close reading and writing. A strength of the math assessment was its strong alignment with the major standards at each grade level; it was also suggested that questions be included across the full range of cognitive demand.

In 2017, a similar study was done by the Human Resources Research Organization (HumRRO) for grades 3, 4, 6, and 7 to assess the quality and alignment of the tests. The ELA assessment was rated an "Excellent Match" for content and text quality, text complexity, cognitive demand and rigor, and text variety. For the math test, grades 3, 4, and 6 received a "Good Match" rating and grade 7 received an "Excellent Match" rating in terms of the alignment and quality of the items. All grade levels received an "Excellent Match" for depth of knowledge.

Pearson conducts similar studies to assess the alignment, rigor, and quality of the assessment, as well as longitudinal studies that check the relationship between Level 4 and Level 5 students and their projected college achievement. There are also comparability studies of the paper and online forms and of tablet versus non-tablet conditions; few or no items were flagged for device effects or for the form of the assessment.

The results of all the studies conducted on the PARCC assessment are used to improve the quality, alignment, depth of knowledge, content, and rigor of the assessment items.

15. Evaluate the degree to which the overall body of validity evidence is sufficient to support test

use, including noting in particular any outstanding threats to validity of construct-

representation (i.e., the degree to which the test is fully representative of the domain assessed)

or construct irrelevant-variance (i.e., the degree to which the test might measure variables

other than the variable intended) related to either the assessment task or the scoring

procedures.

A longitudinal study by Pearson and the College Board examining the association between ACT and PARCC performance revealed that students receiving Level 4 or higher show similar performance in college. In the future, Pearson aims to examine the association between PARCC scores and students' performance in their first-year college courses.

According to the results of the studies on content, internal structure, item construction, and the form of the test, there were no notable threats to the validity of PARCC testing. During test development, Pearson's Assessment and Information Quality (AIQ) group completes a comprehensive review of all test forms and items to make sure that the items measure the intended content. The scoring team also conducts empirical analyses, reviews all items, and checks the accuracy of the item content, answers, and scoring rules. If any item is flagged, it is investigated further, and the necessary changes are made.

16. Describe procedures related to the development of norms, and interpret the appropriateness of

those norms for students in your local school context (if the test is intended to support norm-

referenced score interpretations).

PARCC student growth percentiles (SGPs) help us understand how a student's performance compares to that of other students in the same grade level and subject, and how the student performs from one year to the next. An SGP cannot be calculated for the 3rd-grade math test because students do not take the test in 2nd grade. The average math SGP for grades 4 through 11 is close to 50. Scores below 30 indicate that students did not demonstrate a year's worth of growth, scores from 30 to 70 indicate about a year's worth of growth, and scores above 70 indicate that students exceeded a year's worth of growth.
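Mapping an SGP onto the growth bands just described is a simple threshold check. A minimal sketch (the band boundaries of 30 and 70 come from the text above; the student SGPs are invented):

```python
def growth_band(sgp: float) -> str:
    """Interpret a student growth percentile using the bands described above."""
    if sgp < 30:
        return "less than a year's worth of growth"
    if sgp <= 70:
        return "about a year's worth of growth"
    return "more than a year's worth of growth"

# Invented SGPs for a few students
for sgp in (12, 45, 68, 83):
    print(sgp, "->", growth_band(sgp))
```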

Part 4: Test Fairness

17. Describe the procedures (judgmental and/or empirical) used to investigate test fairness and

bias issues, and the subgroups emphasized in such procedures.

During the test development process, Pearson provides training to the item writers, and the test items are reviewed for content, alignment, rigor, structure, and bias. The review committee includes experts from the state, educators from the secondary and college levels, and community members, and it works to make sure that the questions are aligned to the CCSS, rigorous and of high quality, and fair for all subgroups and student populations. The committee members review all items in all subjects and grade levels to confirm that there are no bias or sensitivity issues that would interfere with students' ability to demonstrate their knowledge and performance.

The table of test reliability estimates for the 3rd-grade math subgroups is below.

Subgroup | CBT Max. Raw Score | CBT Avg. SEM | CBT Avg. Reliability | CBT Min. Sample Size | CBT Max. Sample Size | PBT Max. Raw Score | PBT Avg. SEM | PBT Avg. Reliability | PBT Min. Sample Size | PBT Max. Sample Size
Total Group | 66 | 3.37 | 0.94 | 2,070 | 33,176 | 65 | 3.39 | 0.93 | 4,405 | 37,527
Male | 66 | 3.34 | 0.94 | 1,285 | 16,168 | 65 | 3.37 | 0.93 | 2,686 | 18,455
Female | 66 | 3.40 | 0.94 | 785 | 14,921 | 65 | 3.41 | 0.93 | 1,719 | 19,072
White | 66 | 3.43 | 0.93 | 705 | 9,510 | 65 | 3.48 | 0.92 | 905 | 10,291
African American | 66 | 3.25 | 0.94 | 528 | 23,135 | 65 | 3.27 | 0.92 | 1,078 | 10,967
Asian/Pacific Islander | 66 | 3.35 | 0.93 | 7,583 | 1,522 | 65 | 3.44 | 0.93 | 123 | 1,792
American Indian/Alaska Native | 66 | 3.23 | 0.93 | 593 | 1,378 | 65 | 3.28 | 0.91 | 1,081 | 1,081
Hispanic | 66 | 3.33 | 0.93 | 655 | 30,453 | 65 | 3.38 | 0.92 | 2,166 | 12,458
Multiple | 66 | 3.38 | 0.94 | 3,815 | 779 | 65 | 3.45 | 0.94 | 841 | 841
Economically Disadvantaged | 66 | 3.29 | 0.93 | 1,246 | 53,165 | 65 | 3.33 | 0.92 | 3,611 | 25,866
Not Economically Disadvantaged | 66 | 3.42 | 0.93 | 803 | 11,588 | 65 | 3.49 | 0.92 | 765 | 11,591
English Learner | 66 | 3.25 | 0.92 | 593 | 8,124 | 65 | 3.34 | 0.91 | 1,872 | 8,147
Not English Learner | 66 | 3.39 | 0.94 | 1,477 | 21,070 | 65 | 3.40 | 0.93 | 2,532 | 29,346
Students with Disabilities | 66 | 3.18 | 0.94 | 1,544 | 12,802 | 65 | 3.22 | 0.91 | 3,270 | 2,431
Students without Disabilities | 66 | 3.40 | 0.94 | 526 | 21,778 | 65 | 3.41 | 0.93 | 1,114 | 34,937

Based on these results, the average reliability estimates for the subgroups range from 0.91 to 0.94, and the averages for the total population are 0.94 for the CBT and 0.93 for the PBT. The subgroup reliability estimates do not show any concerning results; they indicate high reliability for all subgroups.

18. Describe and interpret the results of procedures used to investigate test fairness and bias

issues.

Pearson and the Text Review Committee work cooperatively to confirm that all texts and questions are appropriate for all students and do not raise any concerns. This committee has members from both the Content Item Review and the Bias and Sensitivity Review Committees. Any question or text raising bias or sensitivity concerns is sent to the PARCC Priority Alert Task Force for further evaluation. After that process, the item is either fixed and included in the test or excluded from the test.

19. Indicate what test accommodations are acceptable during the test, per the test manual.

PARCC test accommodations include a screen reader, assistive technology, braille reader and writer, large print, a paper-based edition, text-to-speech, American Sign Language, closed captioning, tactile graphics, a human reader, a human signer, use of a calculation device, speech-to-text, monitoring of test responses, a word prediction external device, extended time, a word-to-word dictionary, an online transadaptation of the mathematics test in Spanish, text-to-speech in Spanish in mathematics, a human reader in mathematics in Spanish, and a paper-based mathematics assessment in Spanish.

20. Evaluate 1) the degree to which the overall evidence for test fairness is sufficient to support

test use, 2), the degree to which procedures to evaluate test fairness sufficiently address the

equality of all relevant subgroups’ opportunities to demonstrate their knowledge (i.e., are

there other potential fairness issues that were not addressed), 3) and the degree to which

allowed accommodations address the equality of all relevant subgroups’ opportunities to

demonstrate their knowledge, especially in your specific school context (and whether

additional accommodations might be necessary).

Pearson designed the testing platform so that test takers can access features and accommodations when needed. The system has accessibility features available to all students, along with accommodations for students with disabilities, English learners, and English learners with disabilities, to create a fair testing environment for all students. The purpose of the accommodations is to reduce the effect of a student's disability on the demonstration of academic skills, not to lower expectations or reduce the rigor or complexity of the test. These accommodations and accessibility features allow students to show their abilities and demonstrate their knowledge more fully and fairly.

There is also an alternative assessment, Dynamic Learning Maps (DLM), available for students with severe cognitive disabilities.

According to Batel and Sargrad (2016), the "PARCC exam moved beyond fill-in-the-bubble tests to not only measure critical thinking skills but also to better accommodate the needs of students with disabilities and English language learners. The computer-based systems offer advancements in universal design principles as applied to assessments that provide access for a wider range of student needs, reducing the number of students required to take exams in separate small-group or one-on-one settings" (p. 4). The accessibility features and accommodations give students with disabilities and English learners a more dynamic, user-friendly, and fair test compared to the previous test forms.

As shown in the table above, all subgroups, including English learners and students with disabilities, have very high average reliability estimates for both the CBT and PBT forms.

There is also an ongoing quality-control process for the whole testing program carried out by Pearson. During item creation, the reviewers vote on each item in the test item banking system: committee members review each question and passage, report whether they accept or reject the item, and add comments under each test item. Reports are generated with all of the committee members' feedback and comments. Questions raising any concerns are sent to the Priority Alert Task Force for further evaluation. For the PARCC 3rd-grade math test, no test item was flagged for further evaluation.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Batel, S., & Sargrad, S. (2016). Better Tests, Fewer Barriers: Advances in Accessibility through
PARCC and Smarter Balanced. Center for American Progress.

Bowman, T., Wiener, D., & Branson, D. (2017). PARCC Accessibility Features and
Accommodations Manual: Guidance for Districts and Decision-Making Teams to Ensure
That PARCC Summative Assessments Produce Valid Results for All Students. Partnership
for Assessment of Readiness for College and Careers.

Brennan, R. L. (2011). Using generalizability theory to address reliability issues for PARCC
assessments: A white paper. Center for Advanced Studies in Measurement and Assessment
(CASMA), University of Iowa.

Common Core State Standards. (2019). Retrieved from


http://www.corestandards.org/Math/Content/3/introduction/

Dogan, E., Hauger, J. B., & Maliszewski, C. (2015). Empirical and procedural validity evidence in development and implementation of PARCC assessments. The Next Generation of Testing: Common Core Standards, Smarter Balanced, PARCC, and the Nationwide Testing Movement, 273.

Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications
based on test scores. Journal of Educational Measurement, 32, 179–197.

Buzick, H. M. (2013, October 2). PARCC accessibility and fairness technical memorandum. Educational Testing Service.

PARCC Final Technical Report for 2018 Administration. (2018). Retrieved from https://parcc-
assessment.org/wp-content/uploads/2019/05/PARCC-2018-Technical-
Report_Final_02282019_FORWEB.pdf.

PARCC High Level Blueprints-Mathematics. (2018). Retrieved from https://parcc-


assessment.org/content/uploads/2017/11/PARCCHighLevelBlueprints-
Mathematics_08.25.15.pdf

PARCC Math Grade-3 Alignment Document. (2018). Retrieved from https://parcc-


assessment.org/wpcontent/uploads/2018/08/Math_2018_Released_Items/Grade03/PARCC-
Math-Sp-2018-G3-Released-Answer-Key_modified-final_20181029.pdf

PARCC Math Grade-3 Released Answer Key. (2018). Retrieved from https://parcc-
assessment.org/wp-content/uploads/2019/05/PARCC-2018-Technical-
Report_Final_02282019_FORWEB.pdf.

PARCC Math Grade -3 Released Items. (2018). Retrieved from https://parcc-assessment.org/wp-


content/uploads/2018/08/Math_2018_Released_Items/Grade03/Grade-3-Math-Item-Set-
2018_20181029.pdf

PARCC Math Scoring Rules. (2018). Retrieved from https://parcc-


assessment.org/content/uploads/released_materials/06/PARCC_Math_-
_Scoring_Rules_V6_Approved.pdf

Sinclair, A., Deatz, R., & Johnston-Fisher, J. Findings from the PARCC Quality of Test
Administration Investigations: Year.
