

Test Evaluation
ETR 528
Aynur Aytekin
Part 1: Test Purpose, Population, and Design

1. Assessment name

PARCC Grade 3 Spring 2018 Assessment

2. Assessment type/s: formative (interim/benchmark, diagnostic), summative, norm-referenced,

criterion-referenced, traditional, performance, high-stakes, standardized, etc.

PARCC is a standardized, summative, criterion-referenced, traditional state test that measures 3rd

through 11th-grade mathematics and English language arts/literacy standards.

3. Population for which assessment is designed, including grade/s/ages/s and any special

inclusion or exclusion criteria, including whether test is intended for special populations (e.g.,

ELLs, SWDs)

This test is designed for all 3rd-grade students to measure mastery of grade-level standards. All 3rd-grade students must take the test, including English learners and diverse learners. The only exception is students with severe cognitive disabilities; for these students the state test is not appropriate even with accommodations, and they take the Dynamic Learning Maps (DLM) assessment instead.

4. Content assessed, in terms of general content area (e.g., mathematics), domains and sub-

domains, and specific content standards (e.g., Common Core State Standards for

mathematics) or other content framework.

PARCC assesses mastery of the CCSS in grades 3 through 11 in mathematics and ELA. PARCC is also aligned with the cognitive complexity and rigor of the Common Core State Standards at each grade level and in each subject. PARCC measures:

• Grade-level standards
• Integrated standards
• Mathematical reasoning statements
• Mathematical modeling statements

The CCSS, integrated standards, mathematical reasoning statements, and mathematical modeling statements assessed on the PARCC Grade 3 Spring 2018 math test are listed below:

Grade 3 Math CCSS

Domain 1: Operations and Algebraic Thinking

3.OA.A.2 (CCSS.MATH.CONTENT.3.OA.A.2): Interpret whole-number quotients of whole numbers, e.g., interpret 56 ÷ 8 as the number of objects in each share when 56 objects are partitioned equally into 8 shares, or as a number of shares when 56 objects are partitioned into equal shares of 8 objects each. For example, describe a context in which a number of shares or a number of groups can be expressed as 56 ÷ 8.

3.OA.A.3 (CCSS.MATH.CONTENT.3.OA.A.3): Use multiplication and division within 100 to solve word problems in situations involving equal groups, arrays, and measurement quantities, e.g., by using drawings and equations with a symbol for the unknown number to represent the problem.

3.OA.A.4 (CCSS.MATH.CONTENT.3.OA.A.4): Determine the unknown whole number in a multiplication or division equation relating three whole numbers. For example, determine the unknown number that makes the equation true in each of the equations 8 × ? = 48, 5 = _ ÷ 3, 6 × 6 = ?

Domain 2: Number & Operations in Base Ten

3.NBT.A.1 (CCSS.MATH.CONTENT.3.NBT.A.1): Use place value understanding to round whole numbers to the nearest 10 or 100.

3.NBT.A.2 (CCSS.MATH.CONTENT.3.NBT.A.2): Fluently add and subtract within 1000 using strategies and algorithms based on place value, properties of operations, and/or the relationship between addition and subtraction.

3.NBT.A.3 (CCSS.MATH.CONTENT.3.NBT.A.3): Multiply one-digit whole numbers by multiples of 10 in the range 10-90 (e.g., 9 × 80, 5 × 60) using strategies based on place value and properties of operations.

Domain 3: Number & Operations - Fractions

3.NF.A.1 (CCSS.MATH.CONTENT.3.NF.A.1): Understand a fraction 1/b as the quantity formed by 1 part when a whole is partitioned into b equal parts; understand a fraction a/b as the quantity formed by a parts of size 1/b.

3.NF.A.2 (CCSS.MATH.CONTENT.3.NF.A.2): Understand a fraction as a number on the number line; represent fractions on a number line diagram.

3.NF.A.2.A (CCSS.MATH.CONTENT.3.NF.A.2.A): Represent a fraction 1/b on a number line diagram by defining the interval from 0 to 1 as the whole and partitioning it into b equal parts. Recognize that each part has size 1/b and that the endpoint of the part based at 0 locates the number 1/b on the number line.

3.NF.A.2.B (CCSS.MATH.CONTENT.3.NF.A.2.B): Represent a fraction a/b on a number line diagram by marking off a lengths 1/b from 0. Recognize that the resulting interval has size a/b and that its endpoint locates the number a/b on the number line.

3.NF.A.3 (CCSS.MATH.CONTENT.3.NF.A.3): Explain equivalence of fractions in special cases, and compare fractions by reasoning about their size.

3.NF.A.3.A (CCSS.MATH.CONTENT.3.NF.A.3.A): Understand two fractions as equivalent (equal) if they are the same size, or the same point on a number line.

3.NF.A.3.B (CCSS.MATH.CONTENT.3.NF.A.3.B): Recognize and generate simple equivalent fractions, e.g., 1/2 = 2/4, 4/6 = 2/3. Explain why the fractions are equivalent, e.g., by using a visual fraction model.

3.NF.A.3.C (CCSS.MATH.CONTENT.3.NF.A.3.C): Express whole numbers as fractions, and recognize fractions that are equivalent to whole numbers. Examples: Express 3 in the form 3 = 3/1; recognize that 6/1 = 6; locate 4/4 and 1 at the same point of a number line diagram.

3.NF.A.3.D (CCSS.MATH.CONTENT.3.NF.A.3.D): Compare two fractions with the same numerator or the same denominator by reasoning about their size. Recognize that comparisons are valid only when the two fractions refer to the same whole. Record the results of comparisons with the symbols >, =, or <, and justify the conclusions, e.g., by using a visual fraction model.

Domain 4: Measurement & Data

3.MD.A.1 (CCSS.MATH.CONTENT.3.MD.A.1): Tell and write time to the nearest minute and measure time intervals in minutes. Solve word problems involving addition and subtraction of time intervals in minutes, e.g., by representing the problem on a number line diagram.

3.MD.A.2 (CCSS.MATH.CONTENT.3.MD.A.2): Measure and estimate liquid volumes and masses of objects using standard units of grams (g), kilograms (kg), and liters (l). Add, subtract, multiply, or divide to solve one-step word problems involving masses or volumes that are given in the same units, e.g., by using drawings (such as a beaker with a measurement scale) to represent the problem.

3.MD.B.3 (CCSS.MATH.CONTENT.3.MD.B.3): Draw a scaled picture graph and a scaled bar graph to represent a data set with several categories. Solve one- and two-step "how many more" and "how many less" problems using information presented in scaled bar graphs. For example, draw a bar graph in which each square in the bar graph might represent 5 pets.

3.MD.B.4 (CCSS.MATH.CONTENT.3.MD.B.4): Generate measurement data by measuring lengths using rulers marked with halves and fourths of an inch. Show the data by making a line plot, where the horizontal scale is marked off in appropriate units: whole numbers, halves, or quarters.

3.MD.C.5 (CCSS.MATH.CONTENT.3.MD.C.5): Recognize area as an attribute of plane figures and understand concepts of area measurement.

3.MD.C.5.A (CCSS.MATH.CONTENT.3.MD.C.5.A): A square with side length 1 unit, called "a unit square," is said to have "one square unit" of area, and can be used to measure area.

3.MD.C.5.B (CCSS.MATH.CONTENT.3.MD.C.5.B): A plane figure which can be covered without gaps or overlaps by n unit squares is said to have an area of n square units.

3.MD.C.6 (CCSS.MATH.CONTENT.3.MD.C.6): Measure areas by counting unit squares (square cm, square m, square in, square ft, and improvised units).

3.MD.C.7 (CCSS.MATH.CONTENT.3.MD.C.7): Relate area to the operations of multiplication and addition.

3.MD.C.7.A (CCSS.MATH.CONTENT.3.MD.C.7.A): Find the area of a rectangle with whole-number side lengths by tiling it, and show that the area is the same as would be found by multiplying the side lengths.

3.MD.C.7.B (CCSS.MATH.CONTENT.3.MD.C.7.B): Multiply side lengths to find areas of rectangles with whole-number side lengths in the context of solving real world and mathematical problems, and represent whole-number products as rectangular areas in mathematical reasoning.

3.MD.C.7.C (CCSS.MATH.CONTENT.3.MD.C.7.C): Use tiling to show in a concrete case that the area of a rectangle with whole-number side lengths a and b + c is the sum of a × b and a × c. Use area models to represent the distributive property in mathematical reasoning.

3.MD.C.7.D (CCSS.MATH.CONTENT.3.MD.C.7.D): Recognize area as additive. Find areas of rectilinear figures by decomposing them into non-overlapping rectangles and adding the areas of the non-overlapping parts, applying this technique to solve real world problems.

3.MD.D.8 (CCSS.MATH.CONTENT.3.MD.D.8): Solve real world and mathematical problems involving perimeters of polygons, including finding the perimeter given the side lengths, finding an unknown side length, and exhibiting rectangles with the same perimeter and different areas or with the same area and different perimeters.

Domain 5: Geometry

3.G.A.1 (CCSS.MATH.CONTENT.3.G.A.1): Understand that shapes in different categories (e.g., rhombuses, rectangles, and others) may share attributes (e.g., having four sides), and that the shared attributes can define a larger category (e.g., quadrilaterals). Recognize rhombuses, rectangles, and squares as examples of quadrilaterals, and draw examples of quadrilaterals that do not belong to any of these subcategories.

3.G.A.2 (CCSS.MATH.CONTENT.3.G.A.2): Partition shapes into parts with equal areas. Express the area of each part as a unit fraction of the whole. For example, partition a shape into 4 parts with equal area, and describe the area of each part as 1/4 of the area of the shape.

Integrative Standard Statements

3.Int.1: Given a two-step problem situation with the four operations, round the values in the problem, then use the rounded values to produce an approximate solution. Content Scope: 3.OA.8, 3.NBT.1, 3.NBT.2, 3.NBT.3

3.Int.2: Solve two-step word problems using the four operations requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.OA.8, 3.NBT.2, and 3.NBT.3

3.Int.3: Solve real world and mathematical problems involving perimeters of polygons requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.MD.8, 3.NBT.2, and 3.NBT.3

3.Int.4: Use information presented in a scaled bar graph to solve a two-step "how many more" or "how many less" problem requiring a substantial addition, subtraction, or multiplication step, drawing on knowledge and skills articulated in 3.NBT. Content Scope: 3.MD.3, 3.NBT.2, and 3.NBT.3

3.Int.5: Add, subtract, or multiply to solve a one-step word problem involving masses or volumes that are given in the same units, where a substantial addition, subtraction, or multiplication step is required, drawing on knowledge and skills articulated in 3.NBT, e.g., by using drawings. Content Scope: 3.MD.2, 3.NBT.2, and 3.NBT.3

Mathematical Reasoning Statements

3.C.1-1: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.OA.5

3.C.1-2: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.OA.9

3.C.1-3: Base explanations/reasoning on the properties of operations. Content Scope: Knowledge and skills articulated in 3.MD.7

3.C.2: Base explanations/reasoning on the relationship between multiplication and division. Content Scope: Knowledge and skills articulated in 3.OA.6

3.C.3-1: Base arithmetic explanations/reasoning on concrete referents such as diagrams (whether provided in the prompt or constructed by the student in her response), connecting the diagrams to a written (symbolic) method. Content Scope: Knowledge and skills articulated in 3.NF.3b, 3.NF.3d

3.C.3-2: Base explanations/reasoning on concrete referents such as diagrams (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.MD.5, 3.MD.6, 3.MD.7

3.C.4-1: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.5

3.C.4-2: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.6

3.C.4-3: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.8

3.C.4-4: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.NF.3b, 3.NF.3d

3.C.4-5: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.MD.7

3.C.4-6: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 3.OA.9

3.C.4-7: Distinguish correct explanation/reasoning from that which is flawed, and – if there is a flaw in the argument – present corrected reasoning. (For example, some flawed 'student' reasoning is presented and the task is to correct and improve it.) Content Scope: Knowledge and skills articulated in 2.NBT

3.C.5-1: Present solutions to two-step problems in the form of valid chains of reasoning, using symbols such as equals signs appropriately (for example, rubrics award less than full credit for the presence of nonsense statements such as 1 + 4 = 5 + 7 = 12, even if the final answer is correct), or identify or describe errors in solutions to two-step problems and present corrected solutions. Content Scope: Knowledge and skills articulated in 3.OA.8

3.C.5-2: Present solutions to multi-step problems in the form of valid chains of reasoning, using symbols such as equals signs appropriately (for example, rubrics award less than full credit for the presence of nonsense statements such as 1 + 4 = 5 + 7 = 12, even if the final answer is correct), or identify or describe errors in solutions to multi-step problems and present corrected solutions. Content Scope: Knowledge and skills articulated in 3.MD.7b, 3.MD.7d

3.C.6-1: Base explanations/reasoning on a number line diagram (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.NF.2

3.C.6-2: Base explanations/reasoning on a number line diagram (whether provided in the prompt or constructed by the student in her response). Content Scope: Knowledge and skills articulated in 3.MD.1

Mathematical Modeling Statements

3.D.1: Solve multi-step contextual word problems with degree of difficulty appropriate to Grade 3, requiring application of knowledge and skills articulated in Type I, Sub-Claim A Evidence Statements.

3.D.2: Solve multi-step contextual problems with degree of difficulty appropriate to Grade 3, requiring application of knowledge and skills articulated in 2.OA.A, 2.OA.B, 2.NBT, and/or 2.MD.B.

5. Purpose for which the assessment is designed, and decisions made (i.e., for what and by whom

are the collected data to be used). When considering the assessment purpose and assessment

users, cast a wide net to include teachers, administrators, policymakers, parents, etc.

The purpose of PARCC testing is to provide information that schools and teachers can use to improve classroom instruction. The test gives teachers direction about students' current academic levels, the areas in which they need more support, and the knowledge and skills they need for the next grade level.

Schools use the test results to put changes in place that improve the quality of instruction. The PARCC assessment also provides data on subgroup performance and the type of support each subgroup needs.

Parents are informed of their children's strengths and weaknesses and of how to work collaboratively with the teacher and school to close those gaps.

Policymakers review the data and plan accordingly to support schools and districts in raising achievement. PARCC data can also be used to gauge how well curriculum and instructional practices are aligned to the CCSS. In Illinois, the Illinois State Board of Education (ISBE) releases the Illinois Report Card to show how the state, districts, and schools are performing academically. Based on a school's overall data for all accountability indicators, schools are categorized as lowest-performing, underperforming, commendable, or exemplary. Lowest-performing and underperforming schools receive support from commendable and exemplary schools, and they must complete school improvement plans or other requirements depending on the state.

*Illinois administered PARCC testing in 2018, and the PARCC results were part of the report cards. In March 2019, ISBE decided to stop administering PARCC and switched to the Illinois Assessment of Readiness (IAR). The IAR is very similar to PARCC in terms of rigor and test structure.

6. Assessment Design (nature of assessment task): general test design (traditional selected-

response, traditional constructed-response, performance, hybrid) and specific nature of

assessment task/s (e.g., numbers of types of selected- or constructed-response items)

The PARCC assessment contains three task types.

Type I (Sub-Claims A and B)
• Problems involving the major content for the grade level (Sub-Claim A)
• Problems involving the additional and supporting content for the grade level (Sub-Claim B)
• Conceptual understanding, fluency, and application
• Machine-scorable items

Type II (Sub-Claim C)
• Problems involving expressing mathematical reasoning by constructing mathematical arguments and critiques (Sub-Claim C)
• Includes written arguments/justifications and the critique of reasoning
• Machine-scorable and hand-scored items

Type III (Sub-Claim D)
• Modeling and application in a real-world context
• Machine-scorable and hand-scored items

PARCC 3rd-Grade Math Assessment Item Specifications

Number | Task Type | Evidence Statement | Sub-Claim | Question Type | Scoring*
1 | Type I | 3.OA.1 | A | Drag-and-drop | MS
2 | Type I | 3.NBT.2 | B | Fill-in-the-blank | MS
3 | Type I | 3.MD.1-1 | A | Fill-in-the-blank | MS
4 | Type III | 3.D.2 | D | A. Constructed Response; B. Constructed Response | HS; HS
5 | Type I | 3.NF.1 | A | Multiple Choice | MS
6 | Type I | 3.OA.3-1 | A | Multiple Select | MS
7 | Type I | 3.NF.2 | A | Multiple Choice | MS
8 | Type II | 3.C.5-1 | C | A. Constructed Response; B. Constructed Response; C. Fill-in-the-blank | HS; HS; MS
9 | Type I | 3.OA.7-2 | A | Multiple Select | MS
10 | Type I | 3.Int.2 | A | A. Fill-in-the-blank; B. Fill-in-the-blank | MS; MS
11 | Type I | 3.G.2 | B | Fill-in-the-blank | MS
12 | Type I | 3.MD.1-2 | A | Fill-in-the-blank | MS
13 | Type I | 3.NF.2 | A | Multiple Choice | MS
14 | Type I | 3.OA.7-2 | A | Fill-in-the-blank | MS
15 | Type III | 3.D.1 | D | Constructed Response | HS
16 | Type I | 3.OA.2 | A | Multiple Select | MS
17 | Type I | 3.MD.7b-1 | A | Fill-in-the-blank | MS
18 | Type I | 3.OA.3-4 | A | Multiple Choice | MS
19 | Type I | 3.MD.5 | A | Multiple Choice | MS
20 | Type I | 3.G.2 | B | Drag-and-drop | MS
21 | Type I | 3.OA.4 | A | Fill-in-the-blank | MS
22 | Type I | 3.NBT.2 | B | Fill-in-the-blank | MS
23 | Type I | 3.OA.3-3 | A | Multiple Choice | MS
24 | Type II | 3.C.4-6 | C | A. Constructed Response; B. Multiple Choice; C. Constructed Response | HS; MS; HS
25 | Type I | 3.NF.3a-1 | A | Multiple Select | MS
26 | Type I | 3.G.1 | B | Multiple Select | MS
27 | Type I | 3.NF.3d | A | Drag-and-drop | MS
28 | Type I | 3.MD.8 | B | Multiple Select | MS
29 | Type I | 3.OA.1 | A | Multiple Select | MS
30 | Type II | 3.C.3-2 | C | A. Multiple Choice; B. Constructed Response | MS; HS
31 | Type I | 3.NBT.3 | B | Fill-in-the-blank | MS
32 | Type I | 3.OA.3-2 | A | Multiple Choice | MS
33 | Type I | 3.OA.7-1 | A | Multiple Select | MS
34 | Type I | 3.NF.3c | A | Multiple Select | MS
35 | Type I | 3.MD.4 | B | Multiple Select | MS

*MS = Machine Scored; HS = Hand Scored

7. Assessment design (scoring): scoring procedures for assessment items/task (e.g., answer key,

checklists, scoring rubric/s, rating scales); and scores reported by the assessment (e.g., scale

scores, percentile ranks, z-scores, RIT scores)

PARCC 3rd Grade Math Spring 2018 – Scoring Guidelines

There are three different types of prompts on the PARCC 3rd-grade math assessment. The prompt type for each item is listed in the table above.


Machine-Scored Prompts: Multiple choice, multiple select, and fill-in-the-blank questions are machine-scored items. Each machine-scored prompt is scored as either right (1 point) or wrong (0 points).

Equation Editor Prompts: Students enter their response using the equation editor box; only numbers and equations can be entered. The responses in the box are machine-scored by Pearson's Knowledge Technology group. These items are worth 0, 1, or 2 points.

Constructed Response Prompts: Students may enter both text and mathematics in the boxes provided. Pearson's Performance Scoring group scores these items based on a scoring rubric. For the 3rd-grade math assessment, constructed response prompts have a maximum of 3, 4, or 6 points.

The PARCC reports provide two different types of data:

• Overall Scale Score: The scale score summarizes student performance. The overall score ranges from 600 to 850, and scores of 750 and above are considered proficient.

• Performance Level: PARCC also uses performance levels to determine how well students meet grade-level expectations. There are five levels:
  o Level 1: Did not meet expectations
  o Level 2: Partially met expectations
  o Level 3: Approached expectations
  o Level 4: Met expectations
  o Level 5: Exceeded expectations


For the PARCC Grade 3 Math Spring 2018 test, the table of specifications is below:

Items | Number of Tasks | Total Points
Type I, 1 point | 32 | 32
Type I, 2 points | 4 | 8
Type II, 3 points | 2 | 6
Type II, 4 points | 2 | 8
Type III, 3 points | 2 | 6
Type III, 6 points | 1 | 6
TOTAL | 43 | 66

Totals by task type:
Type I | 36 tasks (84%) | 40 points (61%)
Type II | 4 tasks (9%) | 14 points (21%)
Type III | 3 tasks (7%) | 12 points (18%)
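As a quick check, the totals and percentages in this blueprint follow directly from the task counts and point values. A minimal sketch of that arithmetic (illustrative only; the counts and point values are copied from the table above, and the script itself is not part of any PARCC documentation):

```python
# Task counts and point values copied from the table of specifications above.
blueprint = [
    # (task type, points per task, number of tasks)
    ("Type I", 1, 32), ("Type I", 2, 4),
    ("Type II", 3, 2), ("Type II", 4, 2),
    ("Type III", 3, 2), ("Type III", 6, 1),
]

total_tasks = sum(n for _, _, n in blueprint)
total_points = sum(p * n for _, p, n in blueprint)
print(total_tasks, total_points)  # 43 tasks, 66 points

for task_type in ("Type I", "Type II", "Type III"):
    tasks = sum(n for t, _, n in blueprint if t == task_type)
    points = sum(p * n for t, p, n in blueprint if t == task_type)
    print(f"{task_type}: {tasks} tasks ({tasks / total_tasks:.0%}), "
          f"{points} points ({points / total_points:.0%})")
```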

Part 2: Reliability

8. Describe the procedures used to investigate score reliability (e.g., KR-20, construct reliability

from confirmatory factor analysis), and the types of reliability investigated (e.g., internal

consistency, test-retest reliability, parallel-forms reliability, inter-rater reliability).

Reliability focuses on how consistent and stable the test is in assessing what it is intended to assess. For the PARCC test, there are several ways to estimate reliability. Internal consistency evidence, reliability of classification estimates, inter-rater reliability, and the standard error of measurement (SEM) are explained in this part.

Internal Consistency Evidence


Internal consistency evidence describes how consistently each individual student performs across the test items. A high internal consistency coefficient means that, without any change in their academic level or skills, students would most likely obtain similar scores if they were administered the test again. A coefficient above .70 is considered acceptable, and coefficients above .90 indicate very high internal consistency. The average reliability of the 3rd-grade math test is .94, and the average reliability estimates for grades 3 through 8 range from .91 to .94. Such high Cronbach's alpha values tell us that the internal consistency of the PARCC test is very high across all grade levels. The internal consistency data for 3rd-grade math are below.

Test Form* | Max. Possible Score | Avg. Raw Score SEM | Avg. Reliability | Min. Sample Size | Min. Reliability | Max. Sample Size | Max. Reliability
3-CBT | 66 | 3.37 | 0.94 | 2,070 | 0.92 | 33,176 | 0.94
3-PBT | 65 | 3.39 | 0.93 | 4,405 | 0.90 | 37,527 | 0.93

*CBT = Computer-Based Test; PBT = Paper-Based Test
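The reliability coefficients reported above are internal-consistency estimates (Cronbach's alpha). As an illustration only, a coefficient of this kind can be computed from a students-by-items score matrix as in the sketch below; the tiny score matrix is invented for demonstration and is not PARCC data, and the operational estimates follow the procedures described in the technical report.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a students-by-items matrix of item scores."""
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of students' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented example: 6 students x 4 dichotomously scored items
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(round(cronbach_alpha(scores), 2))
```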

Reliability of the Classification Estimates

PARCC divides performance into five levels. The reliability of classification estimates measures how accurately students are placed into those performance levels. Classification estimates have two parts: decision accuracy and decision consistency. Decision accuracy describes the agreement between the observed classifications and the classifications that would be made if scores were perfectly reliable. Decision consistency measures the agreement between classifications based on two independent, parallel forms of the test (Livingston & Lewis, 1995).

The decision accuracy and decision consistency results for 3rd-grade math are below:

Form | Decision Accuracy: Exact Level | Decision Accuracy: Level 4 or higher vs. Level 3 or lower | Decision Consistency: Exact Level | Decision Consistency: Level 4 or higher vs. Level 3 or lower
CBT | 0.77 | 0.93 | 0.68 | 0.90
PBT | 0.76 | 0.93 | 0.67 | 0.90

The "Exact Level" columns describe classification into one of the five performance levels. For the 3rd-grade math test, the table shows that the proportion of students accurately classified across the five performance levels was .77 for CBT takers and .76 for PBT takers. The PARCC 3rd-grade math test classifies students as being at Level 4 or higher versus Level 3 or lower with an accuracy of .93, and the proportion of students consistently classified on that same dichotomy is .90. Pearson uses the computer program BB-Class (Brennan, 2004) to calculate the decision accuracy and decision consistency proportions; the statistical method was developed by Livingston and Lewis (1993, 1995). These decision accuracy and decision consistency values are acceptable in terms of the reliability of the performance-level classifications.
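PARCC's operational estimates come from the Livingston-Lewis procedure implemented in BB-Class, which models true-score distributions rather than simply comparing two observed classifications. The sketch below is only a simplified illustration of the underlying idea of exact-level and dichotomous (Level 4 or higher) agreement between two classifications: the scale scores are invented, and the level cut scores other than the 750 proficiency cut mentioned above are hypothetical placeholders.

```python
import numpy as np

def classify(scale_scores, cuts):
    """Assign performance levels 1-5 given the thresholds for Levels 2-5."""
    return np.searchsorted(cuts, scale_scores, side="right") + 1

def agreement(levels_a, levels_b):
    exact = np.mean(levels_a == levels_b)                      # exact-level agreement
    dichotomous = np.mean((levels_a >= 4) == (levels_b >= 4))  # Level 4+ vs. Level 3 or lower
    return exact, dichotomous

# Hypothetical cut scores; only 750 (Level 4, "met expectations") comes from the text above.
cuts = np.array([700, 725, 750, 790])
# Invented scale scores for the same students on two hypothetical parallel administrations.
form_1 = np.array([660, 710, 748, 752, 800, 731, 725, 760])
form_2 = np.array([655, 715, 751, 749, 805, 728, 722, 765])
print(agreement(classify(form_1, cuts), classify(form_2, cuts)))
```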

Inter-Rater Agreement

Inter-rater agreement is the relationship between the first and second scores assigned to students' extended responses. It is reported as exact, adjacent, and nonadjacent agreement. Decision-makers use the data for rater training and intervention purposes.


The inter-rater agreement expectations and results for all math grade levels are below:

Subject | Score Point Range | Perfect Agreement Expectation | Perfect Agreement Results | Within One Point Expectation | Within One Point Result
Math | 0-1 | 90% | 97% | 96% | 100%
Math | 0-2 | 80% | 94% | 96% | 100%
Math | 0-3 | 70% | 93% | 96% | 99%
Math | 0-4 | 65% | 92% | 95% | 98%
Math | 0-5 | 65% | 90% | 95% | 98%
Math | 0-6 | 65% | 91% | 95% | 97%

The percentages in the "Perfect Agreement Results" and "Within One Point Result" columns show the level of agreement between the scorers. Agreement above 90% is considered high. For the PARCC test, all agreement results are at or above 90%.
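Agreement rates like those in the table can be computed directly from paired first and second scores. A minimal sketch under invented data (the paired scores below are made up for illustration and are not PARCC rater data):

```python
import numpy as np

def rater_agreement(first, second):
    """Perfect and within-one-point agreement rates for paired rater scores."""
    first, second = np.asarray(first), np.asarray(second)
    perfect = np.mean(first == second)
    within_one = np.mean(np.abs(first - second) <= 1)
    return perfect, within_one

# Invented first/second reads on a 0-4 point constructed-response item
first  = [3, 2, 4, 0, 1, 2, 3, 4, 2, 1]
second = [3, 2, 3, 0, 1, 2, 4, 4, 2, 0]
perfect, within_one = rater_agreement(first, second)
print(f"perfect = {perfect:.0%}, within one point = {within_one:.0%}")
```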

9. Describe and interpret the results of procedures used to investigate score reliability.

The constructed response items in the PARCC test are human-scored. All scorers receive training on scoring the items; scorers who successfully complete the training and qualify score the online and paper-based constructed response items using the ePEN (Electronic Performance Evaluation Network, second generation) platform. After all items are scored once, 10% of the items are randomly assigned to scorers again for a second read. The scorers are not told whether an item is being scored for the first or second time. If the first and second scores are nonadjacent, the item is assigned to be scored again until the disagreement is resolved and the item is accurately scored.

The expected agreement percentages are listed in the table above. Scorers have to meet these criteria to continue scoring items. Scorers who cannot reach the expected agreement levels are subjected to a series of interventions and tests to check their scoring abilities. Scorers who still cannot meet the requirements are removed from the group, and all the items they scored are reevaluated.
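As a simplified illustration of the workflow just described (select roughly 10% of responses for a second read, then route nonadjacent score pairs for resolution), here is a hedged sketch; the response IDs, scores, and random selection are all invented, and the operational ePEN routing rules are certainly more involved.

```python
import random

def needs_resolution(score_1, score_2, max_gap=1):
    """Nonadjacent scores (more than one point apart) are routed for another read."""
    return abs(score_1 - score_2) > max_gap

random.seed(0)
response_ids = list(range(1, 201))                                     # 200 scored responses
second_reads = random.sample(response_ids, k=len(response_ids) // 10)  # ~10% re-scored

# Invented paired scores (0-4 points) for the re-read responses
pairs = {rid: (random.randint(0, 4), random.randint(0, 4)) for rid in second_reads}
flagged = [rid for rid, (s1, s2) in pairs.items() if needs_resolution(s1, s2)]
print(f"{len(second_reads)} responses re-scored, {len(flagged)} routed for resolution")
```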



10. Describe and interpret information about the test standard error of measurement (or

conditional standard error of measurement).

The standard error of measurement (SEM) quantifies the amount of error in the scores: it indicates how far a student's observed score is likely to fall from the score the student would receive if the test were perfectly reliable. Higher SEM values indicate greater variability in scores if students were to take the same test repeatedly. The average SEM for the 3rd-grade CBT is 3.37, and for the PBT it is 3.39. For a perfectly reliable test the SEM would be zero; the lower the SEM, the more reliable the test. For this 3rd-grade test, the SEM is slightly lower for the CBT. These SEM values are acceptable.
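In classical test theory the SEM is tied to score variability and reliability by SEM = SD × sqrt(1 − reliability), and it can be used to put an error band around an observed score. The sketch below uses the reported Grade 3 CBT values (reliability .94, average SEM 3.37); the implied standard deviation is only a back-calculation from those two numbers, and the observed score used for the band is invented.

```python
import math

reliability = 0.94   # average reliability, Grade 3 math CBT (reported above)
sem = 3.37           # average raw-score SEM, Grade 3 math CBT (reported above)

# Classical relation SEM = SD * sqrt(1 - reliability), so the implied raw-score SD is:
implied_sd = sem / math.sqrt(1 - reliability)
print(round(implied_sd, 1))          # roughly 13.8 raw-score points

# 95% error band around an observed raw score (the score itself is invented)
observed = 48
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(round(lower, 1), round(upper, 1))   # about 41.4 to 54.6
```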

11. Describe and interpret evidence related to item discriminations (e.g., corrected item-total

correlations), including how the item discrimination values compare to the recommended

range, noting in particular any items with negative or small-positive discriminations.

Item discrimination describes the relationship between students' performance on a specific item and their performance on the total test. Item discrimination values range from -1.00 to 1.00, and values above 0.15 are considered acceptable. Negative values indicate that low-ability students perform better on the item than high-ability students. Pearson investigates any item with a discrimination value lower than 0.15, and items with extremely low or negative values may be excluded from the test after that investigation. For the 3rd-grade math assessment, item-level discrimination values could not be located.
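A common discrimination index of this kind is the corrected item-total correlation: the correlation between each item's score and the total score on the remaining items. A minimal sketch with invented 0/1 item data (not PARCC data); the 0.15 screening threshold is the one mentioned above.

```python
import numpy as np

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the total score of the remaining items."""
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(n_items)
    ])

# Invented 0/1 item scores for 8 students x 4 items
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])
discriminations = corrected_item_total(scores)
print(np.round(discriminations, 2))
print("flag for review:", np.where(discriminations < 0.15)[0])  # items below the 0.15 screen
```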

12. Evaluate the degree to which the overall body of reliability evidence is sufficient to support

test use, noting in particular any outstanding and unaddressed threats to reliability.
For the 3rd-grade PARCC math Spring 2018 assessment, no issues were observed that might threaten the reliability of the test. The average reliability coefficients for the test are high; the value for the 3rd-grade math CBT is 0.94, which indicates high reliability.

Part 3: Test Validity

13. Describe the procedures used to investigate test validity (e.g., factor analysis, expert reviews,

think-alouds), and the specific forms of validity evidence gathered (e.g., based on test content,

based on relations with external variables, based on internal structure, based on response

processes, and based on the consequences of testing).

According to the Standards for Educational and Psychological Testing published by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) (2014), validity is described as:

"The degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations" (p. 11).

Evidence Based on Test Content

Construct validity describes how well a test measures what it claims to measure. PARCC is designed to assess the CCSS for math and ELA in grades 3-8, and the test blueprints and evidence statements are derived from those Common Core State Standards. In addition, the PARCC College and Career Ready Determinations (CCRD) identify the skills, knowledge, and performances that students are expected to demonstrate to be college and career ready. The PARCC Governing Board and the PARCC Advisory Committee on College Readiness reviewed the standards and the test design and concluded that students who earn Level 4 or 5 on the PARCC high school assessments are likely to earn at least a "C" in their college courses without extra help.

When the math and ELA PARCC tests are constructed, experts from the states that administer PARCC meet to review the test forms. The experts work mostly on the alignment between the CCSS and the test blueprints and evidence statements. Necessary item replacements are made during those meetings while maintaining the integrity of the test design.

PARCC items are also field tested before they are included in the operational assessments. Selected schools administer the field test, and the data are used in the development of the test items.

Evidence Based on Internal Structure

Evidence based on internal structure refers to the relationships among the test items, the test components, and the sub-claims. The ELA reports provide data on the overall ELA score, the Reading (RD) and Writing (WR) claim scores, and five sub-claim scores: Reading Literature (RL), Reading Information (RI), Reading Vocabulary (RV), Writing Written Expression (WE), and Writing Knowledge of Language and Conventions (WKL). Math reports provide data for four sub-claims: Major Content (MC), Mathematical Reasoning (MR), Modeling Practice (MP), and Additional and Supporting Content (ASC).

The validity of the internal structure depends on both total-group and subgroup consistency. Another way to check the internal structure is to examine the relationships among the ELA and Math scores and their sub-claim scores.

The intercorrelations for the 3rd-grade Math sub-claims are below:

CBT (lower triangle; sample size 305,106):
      MC    ASC   MR    MP
MC    0.89
ASC   0.81  0.75
MR    0.78  0.72  0.70
MP    0.74  0.68  0.70  0.71

PBT (lower triangle; sample size 42,794):
      MC    ASC   MR    MP
MC    0.87
ASC   0.80  0.72
MR    0.78  0.72  0.67
MP    0.73  0.68  0.69  0.67

The average intercorrelations and the sample sizes for CBT and PBT students taking the 3rd-grade Math test are provided in this table. The intercorrelations between MC and the other sub-claims (ASC, MR, and MP) are higher than the other relationships, and the intercorrelation between MP and ASC is the lowest for both CBT and PBT. Overall, the results show a moderate relationship among the sub-claims.

The correlations between ELA and Math for the 3rd grade are below:

CBT (lower triangle; sample size 295,747):
      ELA   RD    WD    MA
RD    0.95
WD    0.88  0.73
MA    0.79  0.76  0.70

PBT (lower triangle; sample size 42,794):
      ELA   RD    WD    MA
RD    0.95
WD    0.89  0.73
MA    0.77  0.75  0.67

The correlations between Math and ELA are similar for CBT and PBT. The correlations between Math and overall ELA and between Math and the Reading domain are stronger, while the correlation between Math and the Writing domain can be considered moderate.
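Intercorrelations like those in the two tables above are ordinary Pearson correlations among students' claim and sub-claim scores. A minimal sketch of how such a matrix is computed (the sub-claim scores below are simulated, not PARCC data; only the sub-claim labels come from the tables):

```python
import numpy as np

# Simulated sub-claim scores for 500 students (columns: MC, ASC, MR, MP)
rng = np.random.default_rng(7)
mc = rng.normal(size=500)
asc = 0.8 * mc + 0.6 * rng.normal(size=500)
mr = 0.75 * mc + 0.65 * rng.normal(size=500)
mp = 0.7 * mc + 0.7 * rng.normal(size=500)

subclaims = np.column_stack([mc, asc, mr, mp])
labels = ["MC", "ASC", "MR", "MP"]
corr = np.corrcoef(subclaims, rowvar=False)   # 4 x 4 intercorrelation matrix

for label, row in zip(labels, corr):
    print(label, np.round(row, 2))
```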

Evidence Based on Response Processes

Evidence based on response processes indicates whether students follow the expected response processes while answering the questions. The data for this evidence can be collected from test proctors' observations during testing and from item scorers' feedback, mostly about the extended-response and constructed-response items. Pearson has conducted several studies to check the validity of the response processes. For example, a drawing tool is available during the mathematics test, and several studies examined its effectiveness and usability. To check how the drawing tool affects students' math performance, Pearson included the drawing tool in the field tests; students were randomly assigned the drawing tool during the field test. The results showed no statistically significant difference between the performance of students using the drawing tool and students without it.

Pearson also conducts research on topics including, but not limited to, item quality; whether students engage with the items as expected while taking the test; whether the time allotted for each item is sufficient; the accuracy and reliability of the scoring rubrics; the accommodations for English learners and students with disabilities; the efficiency of the test format; and the technology features of the test.

Evidence Based on the Consequences of Testing

Each state and district uses the PARCC results in various ways. In some states PARCC proficiency is required to graduate from high school, while others do not require it for graduation. Some districts use PARCC data to place students in gifted programs or in courses such as Algebra I or Geometry in middle school. Each state and district has different consequences of the test in place.

14. Describe and interpret the results of procedures used to investigate test validity issues.

Pearson conducts several studies during the test item development process, including benchmarking studies, content evaluations, studies of the alignment between the test content and the CCSS, and test mode and device comparability studies. These studies show that PARCC-proficient students demonstrate college and career readiness, or are on track to readiness, with respect to those standards. The reports suggest that the PARCC program would be improved if greater depth of knowledge were built into the overall assessments. Pearson also conducts field tests each year with selected schools to improve the test administration and the rigor of the test.

In 2016, a study was conducted to evaluate the degree of alignment of the ELA and Math assessments for grade levels 5, 8, and 11 with the CCSS (Doorey & Polikoff, 2016; Schultz, Michaels, Dvorak, & Wiley, 2016). Content experts reviewed the assessment items, rubrics, and answers and evaluated how well the items aligned to the CCSS, the critical thinking levels and depth of knowledge of the items, and the accessibility of the items for all students, including English learners and students with disabilities. The experts both rated all items and gave feedback that was used for item improvement. According to the results of the study, the PARCC assessment program was rated as:

"Excellent Match" for ELA content and depth of knowledge
"Good Match" for Math content for grades 5 and 8
"Excellent Match" for 11th-grade math content
"Limited/Uneven Match" for 11th-grade depth of knowledge
"Excellent Match" for high school math content
"Good Match" for high school math depth of knowledge

Some suggested issues regarding the ELA assessment were a lack of emphasis on vocabulary and language skills and a need for more focus on close reading and writing. A strength of the math assessment was its strong alignment with the major standards at each grade level; it was also suggested that questions be included across the full range of cognitive demand.

In 2017, a similar study was done by the Human Resources Research Organization (HumRRO) for grades 3, 4, 6, and 7 to assess the quality and alignment of the tests. The ELA assessment was rated an "Excellent Match" for content and text quality, text complexity, cognitive demand and rigor, and text variety. For the math test, grades 3, 4, and 6 received a "Good Match" rating and grade 7 received an "Excellent Match" rating in terms of the alignment and quality of the items. All grade levels received an "Excellent Match" for depth of knowledge.

Pearson conducts similar studies to assess the alignment, rigor, and quality of the assessment, as well as longitudinal studies that check the relationship between Level 4 and Level 5 students and their projected college achievement. There are also comparability studies of the paper and online forms and of tablet versus non-tablet conditions; few or no items were flagged for device effects or for the form of the assessment.

The results of all the studies conducted on the PARCC assessment are used to improve the quality, alignment, depth of knowledge, content, and rigor of the assessment items.

15. Evaluate the degree to which the overall body of validity evidence is sufficient to support test

use, including noting in particular any outstanding threats to validity of construct-

representation (i.e., the degree to which the test is fully representative of the domain assessed)

or construct irrelevant-variance (i.e., the degree to which the test might measure variables

other than the variable intended) related to either the assessment task or the scoring

procedures.

A longitudinal study by Pearson and the College Board examining the association between ACT and PARCC performance revealed that students receiving Level 4 or higher show similar performance in college. In the future, Pearson aims to examine the association between PARCC scores and students' performance in their first-year college courses.

According to the results of the studies on content, internal structure, item construction, and the form of the test, there were no notable threats to the validity of PARCC testing. During test development, Pearson's Assessment and Information Quality (AIQ) group completes a comprehensive review of all test forms and items to make sure that the items measure the intended content. The scoring team also conducts empirical analyses, reviews all items, and checks the accuracy of the item content, answers, and scoring rules. If any item is flagged, it is investigated further, and the necessary changes are made.

16. Describe procedures related to the development of norms, and interpret the appropriateness of

those norms for students in your local school context (if the test is intended to support norm-

referenced score interpretations).

PARCC student growth percentiles (SGPs) help us understand how a student's performance compares to that of other students in the same grade level and subject, and how the student performs from one year to the next. An SGP cannot be calculated for the 3rd-grade math test because students do not take the test in 2nd grade. The average math SGP for grades 4 through 11 is close to 50. Scores below 30 indicate that students did not demonstrate a year's worth of growth, scores from 30 to 70 indicate about a year's worth of growth, and scores above 70 indicate that students exceeded a year's worth of growth.
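Mapping an SGP onto the growth bands just described is a simple threshold check. A minimal sketch (the band boundaries of 30 and 70 come from the text above; the student SGPs are invented):

```python
def growth_band(sgp: float) -> str:
    """Interpret a student growth percentile using the bands described above."""
    if sgp < 30:
        return "less than a year's worth of growth"
    if sgp <= 70:
        return "about a year's worth of growth"
    return "more than a year's worth of growth"

# Invented SGPs for a few students
for sgp in (12, 45, 68, 83):
    print(sgp, "->", growth_band(sgp))
```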

Part 4: Test Fairness

17. Describe the procedures (judgmental and/or empirical) used to investigate test fairness and

bias issues, and the subgroups emphasized in such procedures.

During the test development process, Pearson provides training to the item writers, and the test items are reviewed for content, alignment, rigor, structure, and bias. The review committee includes experts from the state, educators from the secondary and college levels, and community members, and it works to make sure that the questions are aligned to the CCSS, rigorous and of high quality, and fair for all subgroups and student populations. The committee members review all items in all subjects and grade levels to confirm that there are no bias or sensitivity issues that would interfere with students' ability to demonstrate their knowledge and performance.

The table of test reliability estimates for the 3rd-grade math subgroups is below.

Subgroup | CBT Max. Raw Score | CBT Avg. SEM | CBT Avg. Reliability | CBT Min. Sample Size | CBT Max. Sample Size | PBT Max. Raw Score | PBT Avg. SEM | PBT Avg. Reliability | PBT Min. Sample Size | PBT Max. Sample Size
Total Group | 66 | 3.37 | 0.94 | 2,070 | 33,176 | 65 | 3.39 | 0.93 | 4,405 | 37,527
Male | 66 | 3.34 | 0.94 | 1,285 | 16,168 | 65 | 3.37 | 0.93 | 2,686 | 18,455
Female | 66 | 3.40 | 0.94 | 785 | 14,921 | 65 | 3.41 | 0.93 | 1,719 | 19,072
White | 66 | 3.43 | 0.93 | 705 | 9,510 | 65 | 3.48 | 0.92 | 905 | 10,291
African American | 66 | 3.25 | 0.94 | 528 | 23,135 | 65 | 3.27 | 0.92 | 1,078 | 10,967
Asian/Pacific Islander | 66 | 3.35 | 0.93 | 7,583 | 1,522 | 65 | 3.44 | 0.93 | 123 | 1,792
American Indian/Alaska Native | 66 | 3.23 | 0.93 | 593 | 1,378 | 65 | 3.28 | 0.91 | 1,081 | 1,081
Hispanic | 66 | 3.33 | 0.93 | 655 | 30,453 | 65 | 3.38 | 0.92 | 2,166 | 12,458
Multiple | 66 | 3.38 | 0.94 | 3,815 | 779 | 65 | 3.45 | 0.94 | 841 | 841
Economically Disadvantaged | 66 | 3.29 | 0.93 | 1,246 | 53,165 | 65 | 3.33 | 0.92 | 3,611 | 25,866
Not Economically Disadvantaged | 66 | 3.42 | 0.93 | 803 | 11,588 | 65 | 3.49 | 0.92 | 765 | 11,591
English Learner | 66 | 3.25 | 0.92 | 593 | 8,124 | 65 | 3.34 | 0.91 | 1,872 | 8,147
Not English Learner | 66 | 3.39 | 0.94 | 1,477 | 21,070 | 65 | 3.40 | 0.93 | 2,532 | 29,346
Students with Disabilities | 66 | 3.18 | 0.94 | 1,544 | 12,802 | 65 | 3.22 | 0.91 | 3,270 | 2,431
Students without Disabilities | 66 | 3.40 | 0.94 | 526 | 21,778 | 65 | 3.41 | 0.93 | 1,114 | 34,937

Based on these results, the average reliability estimates for the subgroups range from 0.91 to 0.94, and the averages for the total population are 0.94 for the CBT and 0.93 for the PBT. The subgroup reliability estimates do not show any concerning results; they indicate high reliability for all subgroups.

18. Describe and interpret the results of procedures used to investigate test fairness and bias

issues.

Pearson and the Text Review Committee work cooperatively to confirm that all texts and questions are appropriate for all students and do not raise any concerns. This committee has members from both the Content Item Review and the Bias and Sensitivity Review Committees. Any question or text raising bias or sensitivity concerns is sent to the PARCC Priority Alert Task Force for further evaluation. After that process, the item is either fixed and included in the test or excluded from the test.

19. Indicate what test accommodations are acceptable during the test, per the test manual.

PARCC test accommodations include a screen reader, assistive technology, braille reader and writer, large print, a paper-based edition, text-to-speech, American Sign Language, closed captioning, tactile graphics, a human reader, a human signer, use of a calculation device, speech-to-text, monitoring of test responses, a word prediction external device, extended time, a word-to-word dictionary, an online transadaptation of the mathematics test in Spanish, text-to-speech in Spanish in mathematics, a human reader in mathematics in Spanish, and a paper-based mathematics assessment in Spanish.

20. Evaluate 1) the degree to which the overall evidence for test fairness is sufficient to support

test use, 2), the degree to which procedures to evaluate test fairness sufficiently address the

equality of all relevant subgroups’ opportunities to demonstrate their knowledge (i.e., are

there other potential fairness issues that were not addressed), 3) and the degree to which

allowed accommodations address the equality of all relevant subgroups’ opportunities to

demonstrate their knowledge, especially in your specific school context (and whether

additional accommodations might be necessary).

Pearson designed the testing platform so that test takers can access features and accommodations when needed. The system has accessibility features available to all students, along with accommodations for students with disabilities, English learners, and English learners with disabilities, to create a fair testing environment for all students. The purpose of the accommodations is to reduce the effect of a student's disability on the demonstration of academic skills, not to lower expectations or reduce the rigor or complexity of the test. These accommodations and accessibility features allow students to show their abilities and demonstrate their knowledge more fully and fairly.

There is also an alternative assessment, Dynamic Learning Maps (DLM), available for students with severe cognitive disabilities.

According to Batel and Sargrad (2016), the "PARCC exam moved beyond fill-in-the-bubble tests to not only measure critical thinking skills but also to better accommodate the needs of students with disabilities and English language learners. The computer-based systems offer advancements in universal design principles as applied to assessments that provide access for a wider range of student needs, reducing the number of students required to take exams in separate small-group or one-on-one settings" (p. 4). The accessibility features and accommodations give students with disabilities and English learners a more dynamic, user-friendly, and fair test compared to the previous test forms.

As shown in the table above, all subgroups, including English learners and students with disabilities, have very high average reliability estimates for both the CBT and PBT forms.

There is also an ongoing quality-control process for the whole testing program carried out by Pearson. During item creation, the reviewers vote on each item in the test item banking system: committee members review each question and passage, report whether they accept or reject the item, and add comments under each test item. Reports are generated with all of the committee members' feedback and comments. Questions raising any concerns are sent to the Priority Alert Task Force for further evaluation. For the PARCC 3rd-grade math test, no test item was flagged for further evaluation.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Batel, S., & Sargrad, S. (2016). Better Tests, Fewer Barriers: Advances in Accessibility through
PARCC and Smarter Balanced. Center for American Progress.

Bowman, T., Wiener, D., & Branson, D. (2017). PARCC Accessibility Features and
Accommodations Manual: Guidance for Districts and Decision-Making Teams to Ensure
That PARCC Summative Assessments Produce Valid Results for All Students. Partnership
for Assessment of Readiness for College and Careers.

Brennan, R. L. (2011). Using generalizability theory to address reliability issues for PARCC
assessments: A white paper. Center for Advanced Studies in Measurement and Assessment
(CASMA), University of Iowa.

Common Core State Standards. (2019). Retrieved from


http://www.corestandards.org/Math/Content/3/introduction/

Dogan, E., Hauger, J. B., & Maliszewski, C. (2015). Empirical and procedural validity evidence in development and implementation of PARCC assessments. The Next Generation of Testing: Common Core Standards, Smarter Balanced, PARCC, and the Nationwide Testing Movement, 273.

Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications
based on test scores. Journal of Educational Measurement, 32, 179–197.

Buzick, H. M. (2013, October 2). PARCC accessibility and fairness technical memorandum. Educational Testing Service.

PARCC Final Technical Report for 2018 Administration. (2018). Retrieved from https://parcc-
assessment.org/wp-content/uploads/2019/05/PARCC-2018-Technical-
Report_Final_02282019_FORWEB.pdf.

PARCC High Level Blueprints-Mathematics. (2018). Retrieved from https://parcc-


assessment.org/content/uploads/2017/11/PARCCHighLevelBlueprints-
Mathematics_08.25.15.pdf

PARCC Math Grade-3 Alignment Document. (2018). Retrieved from https://parcc-


assessment.org/wpcontent/uploads/2018/08/Math_2018_Released_Items/Grade03/PARCC-
Math-Sp-2018-G3-Released-Answer-Key_modified-final_20181029.pdf

PARCC Math Grade-3 Released Answer Key. (2018). Retrieved from https://parcc-
assessment.org/wp-content/uploads/2019/05/PARCC-2018-Technical-
Report_Final_02282019_FORWEB.pdf.

PARCC Math Grade -3 Released Items. (2018). Retrieved from https://parcc-assessment.org/wp-


content/uploads/2018/08/Math_2018_Released_Items/Grade03/Grade-3-Math-Item-Set-
2018_20181029.pdf

PARCC Math Scoring Rules. (2018). Retrieved from https://parcc-


assessment.org/content/uploads/released_materials/06/PARCC_Math_-
_Scoring_Rules_V6_Approved.pdf

Sinclair, A., Deatz, R., & Johnston-Fisher, J. Findings from the PARCC Quality of Test
Administration Investigations: Year.
