

RUNNING HEAD: Body of Work with Construct Maps

A Body of Work Standard-Setting Method with Construct Maps

Adam E. Wyse

Michigan Department of Education

Michael B. Bunch

Measurement Incorporated

Craig Deville

Measurement Incorporated

Steven G. Viger

Michigan Department of Education

Accepted for publication in Educational and Psychological Measurement.

Please cite as:

Wyse, A. E., Bunch, M. B., Deville, C., & Viger, S. G. (2014). A body of work standard-setting

method with construct maps. Educational and Psychological Measurement, 74(2), 236-

262.

Version of Record available at: http://epm.sagepub.com/content/74/2/236.full

DOI: 10.1177/0013164413502037

Abstract

This article describes a novel variation of the Body of Work method that utilizes

construct maps to overcome challenges of transparency, rater inconsistency, and score gaps that commonly occur with the Body of Work method. The Body of Work method with construct

maps was implemented to set cut-scores for two separate K-12 assessment programs in a large

Midwestern state. Data from the standard settings were used to investigate the procedural,

internal, and external validity of the Body of Work method with construct maps. Results

suggested that the method had strong procedural, internal, and external validity evidence to

support its application. The article concludes with discussion and some areas for future research.

Keywords: Standard Setting, Body of Work, Construct Maps, Validity Evidence, Large-scale

Assessment

Introduction

The Body of Work (BoW) method (Kingston, Kahl, Sweeney, & Bay, 2001) is a holistic

standard-setting method that is gaining popularity for setting cut-scores on large-scale

assessments, especially when the assessment contains many open-ended items. In the typical

procedure, panelists review samples of student work and sort those samples into piles to indicate

cut-scores. The procedure usually involves multiple rounds of ratings with discussion and

feedback between rounds.

Despite the popularity of the BoW method, there are several challenges with the method

in practice. These include (1) rater inconsistency concerns in which samples with the same score

are placed into different piles or samples with higher scores are placed into lower performance

categories than samples with lower scores or vice versa, (2) score gap concerns in which some

score locations on the underlying score scale do not have work samples, and (3) challenges with

panelists understanding how their recommendations translate into cut-scores and are related to

other standard-setting data.

The purpose of this article is to describe a new variation of the BoW method that can be

used to overcome many of these challenges. The variation uses construct maps (Wyse, 2013;

Luecht, 2008; Wilson, 2005) to integrate and visually display the multiple pieces of data received

during standard setting. Panelists use the construct maps at each stage of standard setting to

receive feedback and record their judgments.

In the next section, the typical BoW procedure is described, along with how it can result in rater inconsistency, score gaps, and transparency issues.

The new BoW procedure with construct maps is then outlined. The following section outlines

data from two standard settings that used the new procedure: a standard setting on a large Midwestern state's general assessment and a standard setting on the same state's

alternate assessments based on modified achievement standards. Then, results on the procedural,

internal, and external validity evidence (Kane, 1994, 2001) based on these data are provided. The

article concludes with discussion and some areas for future research.

Typical Procedure

The typical BoW procedure (Kingston et al., 2001) consists of a practice round followed

by three operational rounds of ratings. The panelists’ task in the BoW procedure is to consider

samples of student work and to sort the samples into different categories so that numerical cut-

scores can be determined to separate the examinees into the performance levels in the

performance level descriptors (PLDs).

In the practice round, panelists are given a small number of work samples. The samples

are often selected such that there is one high scoring sample, one low scoring sample, and a

couple of samples with scores in the middle of the score range. Depending on how the BoW

method is being implemented, the samples can be shown with or without the scores and the

samples may or may not be shown in order based on their scores (Cizek & Bunch, 2007;

Kingston et al., 2001; Zieky, Perie, & Livingston, 2008). Panelists are asked to look at the

samples and provide ratings for each sample. The goal in the practice round is not to precisely identify

cut-scores, but to ensure that panelists have a solid understanding of the materials and processes.

Following the practice round, panelists participate in the three rounds of operational

ratings. The first round is called the range finding round (Kingston et al., 2001). In this round,

panelists consider work samples that span the range of scores that examinees can obtain on the

assessment. The panelists’ task is very similar to the practice round where they sort the samples

into performance categories to indicate their cut-scores.



It is important to point out that even though the first round samples are designed to span

the range of possible scores, there are often gaps along the score scale in which no work samples

are shown. This occurs because the score scale underlying the assessment is usually created by

scaling the test using a psychometric model, such as an item response theory (IRT) model or

cognitive diagnostic model. As a result, this often leads to a situation in which not every point on

the scale is represented with a work sample. This creates a challenge because it is difficult to

precisely set cut-scores at some locations on the scale; yet these represent viable potential cut-

scores. Another challenge that can make it hard to estimate cut-scores is the presence of rater

inconsistency, where panelists rate samples with similar scores into different categories.

This can occur because panelists are often not forced to provide ratings in which higher scoring

samples need to have ratings greater than or equal to ratings for samples with lower scores.

Following the range finding round, cut-score ranges may be determined or actual cut-scores may be estimated. If cut-score ranges are employed, the ranges are usually

identified by looking at the panelists’ ratings and identifying the locations where panelists begin

to move papers from one category to the next (see Kingston et al., 2001). If the cut-scores are

reported as single numbers, the cut-scores are typically estimated using a logistic regression

model. In this case, the panelists’ ratings are input into the model and the score associated with a

fifty percent chance of moving from one category to the next is the cut-score. The feedback

information given to the panelists following the first round is often in the form of frequencies of

the ratings assigned to each sample and the estimated cut-scores for each category (Zieky et al.,

2008). Panelists discuss these data and then move to the second round.
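As a concrete illustration of the computation described above, the sketch below (not the authors' implementation) fits a logistic regression to hypothetical round 1 ratings and solves for the score at which the probability of being placed at or above a category is fifty percent. All sample scores and ratings are invented, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical round 1 data: one row per (panelist, work sample) rating.
sample_scores = np.array([110, 120, 130, 140, 150, 160, 170, 180, 190, 200])
# 1 = sample placed at or above the category of interest, 0 = below it.
at_or_above = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()  # note: applies mild L2 regularization by default
model.fit(sample_scores.reshape(-1, 1), at_or_above)

# The fitted logit is b0 + b1 * score; the cut-score is where it equals zero,
# i.e., where the predicted probability of "at or above" is 0.50.
b0, b1 = model.intercept_[0], model.coef_[0][0]
print(f"Estimated cut-score: {-b0 / b1:.1f}")
```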

Herein lies another set of challenges, which is that panelists often have a hard time seeing

how their ratings translate into the cut-scores and relate to various standard-setting data. This

occurs because the relationships between ratings, estimated cut-scores, and other data are often

not fully transparent. For example, there may be three samples at one score, two at another score,

one sample at a few different scores and so on. The question becomes: how should a panelist distribute their ratings to produce an appropriate cut-score? For most panelists, this is not a straightforward task. In all

fairness, one of the goals of the facilitator in the typical procedure is to try and help panelists

understand this process. The facilitator may explain, for example, that to lower their cut-score

they should give more ratings to some of the lower scoring papers. However, even with the aid

of a skilled facilitator, knowing how to redistribute ratings to produce a specific cut-score is not always clear. Panelists may also want to make adjustments in response to feedback, and doing this can also be a challenge because it is hard to see the relationships.

The second round of the BoW procedure is aptly named the pinpointing round (Kingston

et al., 2001). The process that panelists use to indicate their cut-scores is very similar; they look

at each sample and sort them into piles to indicate their cut-scores. However, the range of work

samples is narrowed in comparison to round 1. Typically, the range is adjusted based on the cut-

scores estimated in round 1. Many of the concerns with score gaps, rater inconsistency, and

transparency persist in round 2. Following the second round, the cut-scores are estimated using

logistic regression, panelists are given feedback information, and they discuss these data. One

new piece of feedback information often included is the impact data for the round 2 cut-scores.
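Impact data of this kind are simply the percentages of examinees who would land in each performance level under a given set of cut-scores. The sketch below illustrates the idea with simulated examinee scores and hypothetical cut-scores; it is not the operational calculation used by any state program.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated scale scores standing in for the real ability estimates.
examinee_scores = rng.normal(loc=200, scale=35, size=50_000).clip(100, 300)

# Hypothetical cut-scores and the levels they separate.
cuts = [175, 195, 231]
labels = ["Not Proficient", "Partially Proficient", "Proficient", "Advanced"]

# np.digitize assigns 0..3 using the convention score >= cut -> next level up.
levels = np.digitize(examinee_scores, cuts)
for i, label in enumerate(labels):
    print(f"{label:20s} {100 * (levels == i).mean():5.1f}%")
```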

The final round allows the panelists the opportunity to reexamine the samples from round

2 in light of the impact data and feedback information. Panelists can make their final judgments

by either sorting the work samples into piles again or, if the scores assigned to the work samples are shown, by simply indicating numbers for their cut-scores. Again, many of

the same challenges that existed in rounds 1 and 2 are still present in the final round.

Even though the typical BoW method is a legitimate and defensible approach to

determining cut-scores, there are challenges with its application that can impact cut-score

estimates. In the next section, we describe a novel approach to the BoW method that uses

construct maps to help mitigate some of these previous challenges.

New Procedure

Construct Maps

The new BoW procedure hinges on the use of construct maps. Construct maps are visual

displays that show how examinee and/or item data are related to an underlying achievement

construct and corresponding score scale (Wyse, 2013; Luecht, 2008; Wilson, 2005). Construct

maps have three critical components: (1) the content that defines the construct is well defined,

(2) the construct can be represented by an underlying continuum in which items and examinees

can be ordered along that continuum, and (3) a measurement model can be applied to show the

relationships between the scoring continuum and the item and examinee data from the

assessment in a chart or graphic (Wyse, 2013; Luecht, 2008; Wilson, 2005). Many commonly

used psychometric models can be applied to data to create empirical construct maps.

An abbreviated example of an empirical construct map based on data collected from a

grade 5 mathematics assessment that applied IRT models is shown in Table 1. In the construct

map in Table 1, the score scale values from 220 to 230 are shown in the score scale column and

different quantities computed from item data, examinee data, and various ratings are shown in

separate columns. The construct map clearly shows that there are specific relationships between

the potential scores on the underlying achievement continuum and various data elements. Many

of these same data elements are what panelists focus on in different standard-setting procedures.

For example, the item scores in each row in Table 1 are the values of the item characteristic

curves from IRT models and they correspond to the data that panelists consider when providing

ratings in the Angoff procedure (Angoff, 1971). Wyse (2013) explains how data in the general

construct mapping framework relate to several common standard-setting methods.

[Insert Table 1 about here]
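To make the construction of an empirical construct map concrete, the sketch below computes the kind of item score columns shown in Table 1 from a Rasch model. The item difficulties and the linear transformation from scale scores back to the ability metric are invented for illustration; they are not the values underlying Table 1.

```python
import numpy as np
import pandas as pd

# Hypothetical Rasch difficulties (theta metric) for two dichotomous items.
difficulties = {"Item 1": -1.6, "Item 35": -0.3}

scale_scores = np.arange(220, 231)
# Assumed linear transformation from scale scores back to theta.
theta = (scale_scores - 200) / 10.0

columns = {"Score Scale": scale_scores}
for item, b in difficulties.items():
    # Rasch ICC: P(correct | theta) = 1 / (1 + exp(-(theta - b)))
    columns[item] = np.round(1.0 / (1.0 + np.exp(-(theta - b))), 3)

construct_map = pd.DataFrame(columns).sort_values("Score Scale", ascending=False)
print(construct_map.to_string(index=False))
```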

Construct maps have important potential applications in standard setting because they can

help panelists to better understand the relationship between the examinee and item data and how

they translate into cut-scores; all panelists need to do is look across a row in the construct map to

see the relationship between that potential cut-score and the data that they are reviewing. Further,

construct maps can help panelists to provide improved judgments that can reduce concerns of

rater inconsistency, score gaps, and transparency. The next section describes how construct maps

can be used for these purposes in the context of the BoW procedure.

Body of Work with Construct Maps

In the BoW method, panelists are asked to consider student work samples in each round

of standard setting. Additionally, panelists are often given impact data and feedback information

displaying their cut-scores in relation to those of other panelists. Therefore, these data are

contained in the construct map for the BoW method (Table 2).

For the sake of this article, it is assumed that the underlying scoring continuum is created

using an IRT model, specifically the Rasch model, since that was the model used by the state that conducted the standard settings to create scale scores. However, the use of the Rasch model

is not a necessity to apply the BoW Method with construct maps.

[Insert Table 2 about here]

In the construct map (Table 2) the score scale, which is a linear transformation of the IRT

ability scale, is shown in the column labeled “score scale.” The columns labeled “Round 1 Work

Samples” and “Round 2 Work Samples” show the scale locations of work samples in the first

and second round. These locations are determined from IRT ability estimates assigned to each

sample after the test is administered. Since we are applying the Rasch model, examinees with the

same raw score will have the same scale score, and work samples with the same raw score will

appear in the same row. The “PAC” column shows data on the percent at or above cut-score

(PAC) and it is determined from the distribution of ability estimates on the assessment. The

“Partially Proficient”, “Proficient”, and “Advanced” cut-score columns are the places where the

panelists record their ratings in each round and get to see rater feedback information on cut-

scores in between rounds. The number and names of these columns can be changed based on the

standard-setting application. By reading across each row in the construct map, panelists can see

how their cut-scores are related to the other data used in standard setting.
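The PAC column can be computed directly from the distribution of examinee ability estimates. The sketch below illustrates the calculation with simulated scores; an operational construct map would use the actual ability estimates from the assessment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated scale scores standing in for the operational score distribution.
examinee_scores = rng.normal(loc=200, scale=35, size=100_000).clip(100, 300)

# Percent of examinees at or above each potential cut-score on the scale.
for cut in range(100, 301, 10):
    pac = 100 * (examinee_scores >= cut).mean()
    print(f"Cut-score {cut:3d}: PAC = {pac:5.1f}%")
```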


Similar to the typical BoW method, the BoW method with construct maps begins with

training, the opportunity to take the assessment and discuss the test items and rubrics, and in

depth consideration of the PLDs. The main difference in these initial steps is in the training,

which introduces construct maps and explains how the construct maps show the work samples

and standard-setting data that panelists will see, and outlines how the panelists use these

materials to provide their standard-setting judgments.

Prior to the operational rounds, there is a practice round to orient the panelists to the

materials that are used in the procedure. For the applications in this article, panelists were given

five work samples, one of which was a high scoring sample, one of which was a low scoring

sample, and three work samples in the middle of the score range. Each work sample had the

scale score on a cover sheet, the total score on the multiple-choice items, the total score on the

open-ended items, the answer choices chosen by the examinees on the multiple-choice items and

the item p-values, and the writing responses with scores on the rubrics and annotations as to why

each response was given the scores that it received. Panelists also received a construct map that

showed the five work samples. The work samples were shown in order from lowest to highest

score. Panelists were instructed to indicate their cut-score by looking at the construct map,

marking a score on the construct map, and then transferring this score to a rating sheet.

In the first round, panelists received the round 1 work samples and a construct map that

showed the samples, the score scale, and the columns where they recommended their cut-scores.

Panelists did not see the PAC or round 2 work samples; those were shown in later rounds. Fifty

work samples spread out across the range of possible scores were given in round 1. The work

samples had various combinations of scores including higher multiple-choice scores and lower

open-ended scores, moderate scores on both sections, and higher open-ended scores and lower

multiple-choice scores. It was explained that samples that had the same number of total points

had to be classified into the same category and that this would be reflected in the cut-score

selected in the construct map. We assumed the underlying score scale ranged from 100 to 300,

and this scale, in one point increments, was displayed in the construct map.

The number of work samples and score scale can change based on the specific

implementation. The samples were shown to the panelists in order from the lowest score (e.g.,

sample 1 in Table 2) to the highest score (e.g., sample 20 in Table 2). Some of the samples had

the same score, and panelists were told that they could find these locations in the rows with more

than one sample (e.g., a score of 220 in Table 2). Panelists were also told that there were

locations where there were no samples (e.g., a score of 150 in Table 2). These locations

represented score gaps.



To use the construct map to set cut-scores, panelists were instructed to start with the first

cut-score and consider the examinee that would just barely be considered at that level of

performance. In our applications, this was the just barely partially proficient student for the

general assessments and the just barely proficient student for the alternate assessments based on

modified achievement standards. For the sake of illustration, the explanation in this section is for

the general assessments. In this case, panelists were instructed to look at the first sample and ask

themselves whether or not the sample represented just barely partially proficient performance. If

they believed it was not indicative of just barely partially proficient performance, then they were

instructed to move to the next sample. This sample had a score that would be greater than or

equal to the previous sample that they just examined. As they moved from sample to sample they

were told to examine the construct map to see the location of the sample and relationship of the

score assigned to the sample to the potential cut-scores that could be set. When they reached a

point where they thought they had found a sample that represented just barely partially proficient

performance, they were instructed to examine that sample and a few samples before and after it

to select their cut-score. Their cut-score could be at the location of a sample or in between

several samples. Panelists were told to select a cut-score at a sample if they believed the sample

was indicative of work right on the border of what the just barely partially proficient student

should be able to do. They were told to select a cut-score in between two samples if one sample

was clearly above and the other sample was clearly below what the just barely partially

proficient student should be able to do. Panelists were encouraged to write the word “Cut-Score” in

the row representing their cut-score, draw a line through that score, and write the words “Not

Proficient” above the line and “Partially Proficient” below the line. Writing the words was

designed to remind panelists of the meaning of their cut-score. Panelists used a similar process to

recommend the proficient and advanced cut-scores.

This process has two main benefits. First, it helps reduce score gap concerns because panelists can select cut-scores in between samples, and there is a sound rationale for doing so: one sample is clearly too high and one is clearly too low in relation to the PLD. Second, it virtually removes rater inconsistency because the construct map clearly shows when a cut-

score will result in samples with higher scores being placed into lower performance categories

and vice versa. The process for choosing their cut-scores in the construct map in essence forces

the panelists to make a decision of the best location for the cut-score and does not allow them to

place samples with similar scores into different performance categories. The construct map also

clearly shows the relationships between the standard-setting data and cut-scores. After round 1, a

completed construct map will appear similar to the example in Figure 1.

[Insert Figure 1 about here]

The cut-scores were computed by taking the median of the panelists’ ratings. It was

possible to use the median, instead of the more complicated logistic regression procedure, since

each panelist indicated a single scale score representing their best judgment for a cut-score. This

approach greatly simplifies the computational aspects of the procedure and makes it easier for

panelists to understand how the cut-scores were calculated. Most panelists are familiar with

medians and how medians are computed. It is also possible to use the mean to compute cut-

scores, but we used the median since the median is not as impacted by extreme judgments.
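The cut-score computation in this variant reduces to a simple summary of the panelists' marked scale scores, as in the minimal sketch below (the panelist marks are hypothetical).

```python
import statistics

# Hypothetical marks (scale scores) from ten panelists for one cut-score.
partially_proficient_marks = [173, 175, 175, 176, 177, 178, 178, 180, 182, 182]

group_cut = statistics.median(partially_proficient_marks)
print(f"Partially Proficient cut-score (median of panelist marks): {group_cut}")
```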

Panelists then discussed feedback data received in construct maps between rounds 1 and

2. The process for recommending cut-scores in round 2 was very similar to round 1 except that

panelists considered targeted work samples and could see these samples and how they related

to the samples from round 1. Thirty new work samples were shown in round 2. These again included samples with different combinations of scores. Each set of samples was

represented with specific columns in the construct map as is shown in Table 2.

Following the second round, panelists received additional feedback (Table 3). The group

cut-scores were shown with horizontal shaded lines and the individual panelist cut-scores were

shown by rater numbers. This construct map was very similar to the one that the panelists used to

provide their ratings in the rounds of standard setting and saw as feedback between other rounds.

This was intentional and was designed to help integrate the process and make it clear how each

piece of data related to their ratings and estimated cut-scores. In the last round, panelists saw

information on the samples and the impact data. The panelists were instructed to make their final recommendations by considering all of the data in the construct maps and its relationship to the

PLDs. Panelists indicated their final cut-scores by again selecting a score in the construct map.

[Insert Table 3 about here]

Data

The new BoW procedure with construct maps was used to set cut-scores in two different

writing assessment programs in a large Midwestern state. The first program was a state general

writing assessment program. In this program, assessments were given in grades 4 and 7 to over

100,000 examinees. Both tests consisted of: one narrative writing prompt that was analytically

scored and worth 15 points, one informational writing prompt that was analytically scored and

worth 15 points, one holistically scored item that focused on providing feedback and editing a

piece of writing that was worth 4 points, and 16 multiple-choice questions each worth one point.

Each 50 point test was scaled using the Rasch model and put onto a scale ranging from 100 to

300. The choice of the scale of 100 to 300 was arbitrary, but was made so as not to overlap with

the scales of the state's other general assessments. Typical scales span a 200 to 250 point range and have the

proficient cut-score anchored at a specific number to aid score interpretations. It was made clear

that the scale shown was designed to give panelists an idea of the relationship between the

samples and possible cut-scores and that the final scale would be based on the final cut-scores.

Twenty-two panelists participated in the standard setting at grade 4 and twenty-three

panelists participated at grade 7. The panelists were selected through a nomination and

recruitment process. Panelists represented a range of geographical regions, school districts, and levels of educational experience. Most of the panelists were educators, but there were three

business and community members in grade 4 and four business and community members in

grade 7. In grade 4, 82% of the panelists were white and 77% of the panelists were female. In

grade 7, 87% of the panelists were white and 87% of the panelists were female.

The BoW method with construct maps consisted of the three rounds of judgments as

described above with impact data shown between the second and third rounds. Throughout the

standard-setting process, process evaluation forms were given to ensure that the procedure was

functioning as intended. The process evaluations had a series of open-ended questions that asked

about various aspects of the procedure and the approaches panelists used (or thought they would

use) to determine their cut-scores. Many of the questions were identical across rounds except for

changing the wording to reference the specific round in question. Data from the process

evaluations on the question that asked panelists “Please describe how you used (or think you will

use) the standard-setting materials provided by the facilitator to provide your standard-setting

judgments in Round __,” are used to help provide an idea of the processes and materials that

panelists were focusing on when providing their standard-setting judgments. A final evaluation

form was given at the end of the standard setting, which contained a mixture of open-ended

and rating scale questions. The rating scale questions ranged from one to four with a rating of

one indicating strongly disagree, two indicating disagree, three indicating agree, and four

indicating strongly agree. We also used data from the rating scale questions on the final

evaluations that dealt with various aspects of the procedure as another set of data. In addition, we

examined the three cut-scores, their associated impact data, and whether ratings were placed on papers or in score gaps in the construct maps across rounds.

The second set of data comes from a state alternate assessment writing program based on

modified achievement standards. This standard setting took place approximately one year after

the standard setting for the general assessments. Similar to the general assessments, the alternate

assessments based on modified achievement standards were given in grades 4 and 7.

Approximately 2,000 examinees took the assessment in each grade. The assessments were also

scaled using the Rasch model and placed on a scale from 100 to 300. The tests consisted of: one

15 point narrative prompt that was analytically scored, one 15 point informational prompt that

was analytically scored, and 10 multiple-choice questions which were each worth one point.

Eleven panelists participated in the grade 4 standard setting and twelve panelists

participated in the grade 7 standard setting. The panelists were recruited using the same

nomination and recruitment process as was used with the general assessment. Fewer panelists

were nominated and available to participate for the alternate assessment standard setting.

Panelists again represented a range of geographical regions, school districts, and levels of

educational experience. All of the participants were educators. In grade 4, 81% of the panelists

were white and 81% of the panelists were female. In grade 7, 83% of the panelists were white

and 100% of the panelists were female.



The standard-setting procedure used for the alternate assessments was similar in many

respects to the approach used with the general assessment. In particular, panelists again received

similar training, used construct maps to complete three rounds of ratings, and were asked many

of the same evaluation questions as in the general assessment standard setting. However, there

were several key differences. First, panelists only had to give two cut-score ratings instead of

three. Second, impact data was shown in the construct maps after round 1. The inclusion of these

data earlier in the process was a policy consideration and decision made by the state since the

state had recently conducted a study to reset cut-scores on several of the general assessments for

other content areas. The reset cut-scores produced dramatic changes in the percentage of students classified as proficient, and the state felt that showing these data earlier in the process would be instructive.

A third difference was that panelists were given results from a separate Contrasting

Groups (Livingston & Zieky, 1982) study to help benchmark their judgments in between round 1

and 2. These data were collected from teachers when the alternate assessments were given to the

examinees. In both grades, data were filled in on roughly 30% of the 2,000 answer documents.

Teachers gave predictions of the categories that they thought each examinee would obtain after

considering generic descriptions of what it meant to score in each category. These statements

were very similar to the policy statements in the PLDs. The predictions were used to calculate

ranges of cut-scores for the proficient cut-score only. Ranges were not found for the advanced

cut-score since very few examinees were rated into the advanced category and finding stable

estimates proved difficult. The ranges of cut-scores were determined by analyzing the data using

the approaches suggested for calculating cut-scores with Contrasting Groups in Cizek and Bunch

(2007) (i.e., logistic regression, midpoints between means, and midpoints between medians).

Panelists were presented with these data and were told how they compared to the ranges of cut-

scores they recommended in round 1. Panelists could place marks on their construct maps to

indicate these ranges, but none of the panelists chose to do this.
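For readers unfamiliar with the Contrasting Groups summaries cited from Cizek and Bunch (2007), the sketch below illustrates two of them, the midpoint between group medians and the midpoint between group means, using invented teacher judgments; the logistic regression variant would follow the same pattern as the earlier sketch.

```python
import numpy as np

# Hypothetical scale scores for examinees teachers judged below proficient
# versus at or above proficient.
below = np.array([148, 155, 160, 167, 171, 176, 180, 184])
at_or_above = np.array([178, 183, 189, 194, 199, 206, 214, 221])

print(f"Midpoint between medians: {(np.median(below) + np.median(at_or_above)) / 2:.1f}")
print(f"Midpoint between means:   {(below.mean() + at_or_above.mean()) / 2:.1f}")
```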

A final difference between the two implementations was that after the last round, a

subcommittee of panelists was asked to serve on an articulation committee to review the

recommendations from both grades. This subcommittee consisted of four panelists from each

grade level. Participants in the articulation committee got to see the cut-scores and the impact

data for both grades. Panelists were asked to describe how the process had worked in each committee. The

committee was told they could adjust the cut-scores if they so desired after discussion. No

changes were made in the articulation meeting. Separate evaluation forms were administered

after the articulation meeting that asked open-ended and rating scale questions. To investigate the

method, the same data were used with this assessment as the general assessment with the

addition of data from the Contrasting Groups study and articulation committee evaluations.

Results

The results section is broken down into three separate subsections, each of which focuses on one of the types of validity evidence suggested by Kane (1994, 2001) for evaluating standard setting.

These types of validity are procedural, internal, and external validity evidence. Procedural

validity deals with information about procedural aspects of the method, internal validity deals

with data about the consistency of the cut-scores, and external validity deals with data about the

relationship of cut-scores from the procedure with outside data or criteria.

Procedural Validity Evidence

Common procedural challenges with the traditional BoW method were major drivers of

the development of BoW with construct maps. These concerns include: score gaps, rater

inconsistency, and transparency. In the new procedure, the challenge of rater inconsistency was

essentially removed because panelists were asked to give a single cut-score judgment that places samples with the same score into the same category and ensures that higher scoring samples fall into a performance category equal to or greater than that of lower scoring samples. Score gaps were still present in the new procedure, but panelists could mitigate their impact because they could recommend cut-scores between samples in score gaps if they so chose.

The frequency of ratings on papers and in between papers in score gaps for the general

assessment (Table 4) and the alternate assessment (Table 5) show that there were several ratings

that were placed into score gaps. For the general assessments (Table 4) most of the ratings were

still on papers, especially for the fourth grade committee, and there were several cut-scores with no

ratings in score gaps. For the alternate assessments (Table 5), there were a greater percentage of

ratings placed in the score gaps, and in some cases the number of ratings placed into score gaps

exceeded the number of ratings on papers. The differences could be a function of the group

dynamics in each committee or the unique aspects of the papers reviewed by the various

committees. The differences could also be a function of the fact that with fewer items on the

alternate assessments there were more score gaps, which creates the potential for increased score

gap ratings. Nonetheless, the ratings provide strong evidence in support of the BoW method with construct maps and of the need to allow ratings in score gaps.

[Insert Table 4 and 5 about here]

A second piece of procedural evidence in support of BoW with construct maps was

qualitative data in response to the question “Please describe how you used (or think you will use)

the standard-setting materials provided by the facilitator to provide your standard-setting

judgments in Round __.” The responses were coded by a trained rater and categorized based on

the materials or information that the panelist described using or anticipated using when providing

their judgments. Responses were coded into the following categories: PLDs, construct maps,

work samples, feedback information, test content (i.e., focused on content of specific items),

item stats, discussion, scoring rubrics, professional judgment, and blank/off-topic. Each time a

panelist specifically referenced one of these materials or described a process linked to one of

these materials it was coded in that category for that panelist. Responses that were hard to

categorize were flagged and reviewed by the first author in conjunction with the coder.
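The percentages plotted in Figure 2 come from tallying these category codes within each round. A minimal sketch of that tally, with invented coded responses, is shown below.

```python
from collections import Counter

# Hypothetical coded responses: the set of materials each panelist referenced
# in one round's process evaluation.
round_codes = [
    {"PLDs", "work samples"},
    {"PLDs", "construct maps", "work samples"},
    {"construct maps", "discussion"},
    {"PLDs", "test content"},
]

counts = Counter(code for codes in round_codes for code in codes)
for category, count in counts.most_common():
    print(f"{category:15s} referenced by {100 * count / len(round_codes):5.1f}% of panelists")
```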

The percentages of ratings in each of the coding categories across panelists for the four

standard settings are shown in Figure 2. The top right panel shows the results for the general

assessment at grade 4, the top left panel shows the results for the general assessment at grade 7,

the bottom right panel shows the results for the alternate assessment at grade 4, and the bottom left panel shows the results for the alternate assessment at grade 7. The responses from each process

questionnaire for different rounds are shown with different colored shaded bars.

[Insert Figure 2 about here]

Figure 2 shows some interesting patterns. Across all standard settings it appeared that the

processes that people described using focused mainly on the three critical elements of the PLDs,

construct maps, and work samples as the bars for these categories were, in most cases, the

highest bars. The actual percentage of responses in each category did differ slightly across

groups and across rounds. One would expect, if the procedure were working effectively, that these would be the three most frequently referenced elements. Additionally, it is apparent that feedback information

and discussion were not initially important to panelists, but as the rounds increased these areas

received a greater emphasis. This is also as expected, since after the first round panelists get to engage in discussion with their colleagues and receive feedback information. Few panelists

described processes incorporating the scoring rubrics, their own outside professional judgment,

or gave blank or off-topic responses. This is also as one would expect, as panelists were instructed not to rescore papers and to limit their decisions to the materials in the standard setting.

Also apparent in Figure 2 is the impact various facilitators have on the process. For

example, in the grade 4 general assessments there seemed to be a higher emphasis on test content

and less of a focus on the construct maps prior to round 1. In this case, a large number of

panelists said they would focus on the content of specific items and few panelists described a

process using the construct map. As the rounds went on, however, the focus on the content of

specific questions was less emphasized and greater emphasis was placed on the construct maps.

No panelists mentioned content of individual items as being important for the grade 7 general

assessments and few panelists mentioned this for the alternate assessments. There was also some

variation in the use of item statistics across panels. However, in general there seemed to be limited

focus on the p-values of particular items. The patterns in Figure 2 lend support to the claim that

most of the panelists took the standard-setting process seriously and focused on the most critical

elements (e.g., PLDs, student work samples, and construct maps) when providing their ratings.

Another piece of procedural validity evidence comes from the responses to rating scale

questions for the panelists and articulation committees. The average ratings for the final

evaluations are shown in Table 6 and the average ratings for the articulation committee questions

are shown in Table 7. Admittedly, some of the questions might fall into the internal and external

validity evidence sections, but they are included here for continuity.

[Insert Table 6 and 7 here]

The average ratings from both the final evaluation questions and the articulation

committee questions are uniformly higher than a rating of three for all four standard settings,

indicating that most panelists agreed or strongly agreed with each statement. This is as expected

if the procedure and process worked as intended. If the standard-setting method worked effectively, one would expect panelists to think that the presentation of the PLDs was clear, to feel that their confidence in their judgments increased, to feel that the construct maps assisted them, to understand how to use the feedback information, and so on. Again, these responses lend support to the procedural validity of the method.

Internal Validity Evidence

In terms of internal validity, one can look at the cut-score ratings and whether the

standard deviations of the panelists’ ratings decreased across rounds. Decreased standard

deviations indicate that panelists were converging on similar cut-scores. One would expect to

find this pattern in these data, especially given that the question on consensus on the process

evaluation form suggested that panelists felt they had reached some form of consensus (Table 6).

The final cut-scores for the general assessment and alternate assessment are displayed in

Tables 8 and 9, respectively. Each table displays the median, minimum, and maximum cut-

scores as well as the standard deviation of the cut-scores.

[Insert Table 8 and Table 9 about]

The cut-scores, with the exception of the advanced cut-score for grade 4, which went down by 8 scale score points (a change of 4% of the scale score range), remained fairly similar across rounds for the general assessments (Table 8). In addition, the standard

deviations of the cut-scores were less in rounds 2 and 3 compared to round 1. Some of the

standard deviations for the grade 4 panel did slightly increase from round 2 to round 3, although

not in a manner that would suggest panelists had widely different opinions. The range of the cut-

scores in the third round was also fairly narrow as the biggest difference between the minimum

and maximum ratings for any cut-score was only 13 scale score points.

There was a slightly different pattern for the alternate assessments in terms of the median

cut-scores. In particular, Table 9 shows that there were some more dramatic changes in cut-scores from round 1 to round 2. For example, in grade 4 the proficient median cut-score dropped by 10 points from round 1 to round 2 and the advanced cut-score dropped by 28 points. These

bigger changes for the alternate assessments in comparison to the general assessment could have

been a function of the introduction of the impact data earlier in the process or the consideration

of the Contrasting Groups data. These data were not presented after round 1 for the general

assessments. The cut-scores were more similar between rounds 2 and 3. Again, the standard

deviations decreased in rounds 2 and 3 as compared to round 1. Similar to the general assessments, a few of the standard deviations slightly increased from round 2 to round 3. There did appear to be more disagreement and a larger range of cut-scores for grade 7 compared to grade 4 in round 3, although again the standard deviations were not that large in comparison to

the range of possible cut-scores.

The results in Tables 8 and 9 are similar to results that are commonly observed in other

standard-setting processes and seem to provide fairly strong internal validity evidence that by the

last round panelists were able to converge on a set of cut-scores. In the case of the alternate

assessment standard setting, the articulation committee had a chance to review the cut-scores and

endorsed the cut-scores from round 3 with no changes. This provides some additional support for

the cut-scores that were recommended and the process used to arrive at the cut-scores.

External Validity Evidence

The last set of validity evidence is external validity evidence. One piece of external

validity evidence came from Contrasting Groups data collected for the alternate assessments.

These Contrasting Groups data were used to calculate a range of cut-scores to serve as

benchmarks for the round 1 proficient cut-scores. The range of proficient cut-scores for grade 4

was 175 to 213 and the range of proficient cut-scores for grade 7 was 182 to 211. The group

median cut-scores from round 1 for the new BoW procedure were 195 and 184 for grades 4 and

7, respectively. The ranges for the panelist recommendations were 175 to 209 for grade 4 and

163 to 214 for grade 7. The cut-scores fell within the ranges from the Contrasting Groups and the

ranges for the BoW with construct maps and Contrasting Groups were fairly similar. These data

provide some external validity evidence in terms of how cut-scores from the new BoW method

compare with that of another procedure. Although it is not strong additional evidence because

the panelists saw the Contrasting Groups data after round 1, the final cut-scores for the alternate

assessment standard setting also fell within the ranges of the Contrasting Groups data.

Another piece of external validity evidence was the similarity of percentages for the

grade 4 and grade 7 students that committees said would be proficient or above in each

assessment. In the general assessments, the recommended cut-scores placed 75 percent at

proficient or above for grade 4 and 74 percent at proficient or above for grade 7. For the alternate

assessments, the recommended cut-scores produced 38 percent at or above at grade 4 and 40 percent at or above at grade 7. From a policy and accountability standpoint, the similarity of the

passing percentages is desirable since there will not be disparate impacts in different grades. The

fact that the two grades arrived at similar percentages of students that would be proficient or above is

another piece of important external validity evidence for the procedure.

Discussion and Conclusion

Developing data-driven standard-setting approaches that make the process more transparent and that also ameliorate factors that can contaminate cut-score estimates is a pressing need in the field of educational measurement. This article makes an important contribution by

illustrating how construct maps can be used to modify the BoW method in a way that allows

panelists with limited training in statistics and assessment to better understand the process and

provide judgments that mitigate rater inconsistency and score gap concerns.

Applications from two separate writing standard-setting processes, one of which came

from a large-scale state general assessment and one of which came from a large-scale state

alternate assessment based on modified achievement standards, illustrated how the method could

be used to effectively set cut-scores in K-12 state testing programs. Data collected provided

procedural, internal, and external validity evidence in support of the new approach. In terms of

procedural validity, data suggested that panelists do in fact provide ratings in score gaps, that for

the most part the processes that panelists described using to determine cut-scores were focused

on the critical elements they should be using with the approach, and that their responses to feedback

and evaluation questions demonstrated that the standard settings and various aspects of the

procedure were viewed positively. Data providing internal validity evidence showed decreased

standard deviations over the rounds of standard setting and convergence on a narrow range of

cut-scores. Data providing external validity evidence demonstrated that cut-scores from round 1

of the alternate assessment standard setting were similar to cut-scores determined from a set of

Contrasting Groups data collected on the same assessment. In addition, the percentage of

students at or above the proficient cut-score was very similar across grades for both the general

and alternate assessments, which is desirable from a policy and accountability perspective.

Although these data provide strong validity evidence in support of the new procedure,

there were other validity data that we were not able to collect that might provide additional

insight about the procedure. For example, data comparing cut-scores and perceived differences in

ease of implementation and understanding of standard-setting relationships that directly



compared the BoW with construct maps with the traditional BoW procedure would be valuable.

Ideally, these data should be collected from a group of panelists that implemented both the

typical and new approach. We suspect that if these data were collected, the new procedure

would receive higher ratings in terms of transparency and that panelists would demonstrate

greater understanding of relationships between data. However, without the evidence to support

this claim, this is just a conjecture at this point. Further method comparison studies that compare

the new BoW approach with other standard-setting procedures would also be valuable.

Additional research that probed panelists on their use and understanding of the construct

maps would also be beneficial. Several of the panelists in their written descriptions made

comments demonstrating that they were using the construct maps appropriately and that they found the construct maps to be a valuable visual aid in standard setting. However, deeper

interviews and think-aloud type studies might provide additional insight into the processes that

panelists were using beyond what we were able to capture on the feedback forms.

Future research could also explore and isolate the impact of showing specific data

elements in the construct maps. For example, our study did not directly investigate the impact of

showing something like impact data at different points in the process. Our data did show greater

changes for the alternate assessment between round 1 and round 2 when these data were shown

earlier in the process. However, this was not a pure experimental study in which one group had

these data and another did not. Finally, additional research should look at and investigate

applications of the BoW Method with construct maps in other contexts. Our applications only

focused on two writing assessments in a single state. The method may work differently in other

contexts.

References

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-597). Washington, DC: American Council on Education.

Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage Publications.

Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425-461.

Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53-81). Mahwah, NJ: Lawrence Erlbaum.

Kingston, N. M., Kahl, S. R., Sweeney, K., & Bay, L. (2001). Setting performance standards using the Body of Work method. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 219-248). Mahwah, NJ: Lawrence Erlbaum.

Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: ETS.

Luecht, R. M. (2008, February). Assessment engineering. In Assessment engineering: Moving from theory to practice. Symposium presented at the annual meeting of the Association of Test Publishers, Dallas, TX.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.

Wyse, A. E. (2013). Construct maps as a foundation for standard setting. Manuscript submitted for publication.

Zieky, M. J., Perie, M. J., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: ETS.



Figure 1: Example of Filled in Panelist Construct Map from Round 1

Figure Note: The figure shows three cut-score placements for the panelist, with the word “Cut-Score” and a line through each row that signifies a cut-score. The performance level category labels are written above and below each line to remind the panelist of the meaning of their cut-scores.



Figure 2: Results from Qualitative Responses on Processes Used by Panelists in Modified Body of Work Method

Table 1: Example of Empirically Derived Construct Map for a Mathematics Assessment

PAC | Teacher's Students | Work Samples | Raters' Cut-Scores | Score Scale | RP67 Locations | Item Score: Item 1 | … | Item Score: Item 35 | Domain Score: Number Sense | … | Domain Score: Geometry
… | … | … | … | … | … | … | … | … | … | … | …
6.41% | | | R8 | 230 | Item 26 | .844 | … | .725 | .739 | … | .752
6.70% | Student A | AA, BB, CC | | 229 | | .841 | … | .720 | .735 | … | .748
7.09% | | | R2 | 228 | Item 24, Item 25 | .837 | … | .714 | .731 | … | .744
7.47% | | | R4 | 227 | | .834 | … | .709 | .726 | … | .740
7.86% | | | | 226 | | .830 | … | .704 | .722 | … | .735
8.24% | | | R12 | 225 | | .827 | … | .699 | .718 | … | .731
8.63% | | | R7, R10, R11, R13 | 224 | Item 23 | .823 | … | .694 | .714 | … | .726
9.01% | | | R14 | 223 | | .820 | … | .688 | .709 | … | .722
9.40% | Student B | DD, EE, FF | R1, R6, R9 | 222 | Item 22 | .816 | … | .683 | .705 | … | .717
10.1% | | | | 221 | | .812 | … | .677 | .700 | … | .713
10.80% | | | | 220 | Item 21 | .808 | … | .672 | .696 | … | .708
… | … | … | … | … | … | … | … | … | … | … | …

Table Note: The PAC column shows the percent at or above cut-score. The Teacher’s Students column shows the location of students

in the teacher’s classroom. The Work Samples column shows the location of student work samples. The Raters’ Cut-Scores column

shows the location of the standard-setting judges’ ratings. The Score Scale shows the underlying mathematics achievement construct.

The RP67 locations represent Bookmark locations where the response probability is equal to 0.67. The Item Scores are the expected

item scores and the Domain Scores are the expected proportion correct scores in the domains.

Table 2: Example of a General Construct Map for Body of Work Method with Construct Maps

PAC | Round 1 Work Samples | Round 2 Work Samples | Score Scale | Partially Proficient Cut-Score | Proficient Cut-Score | Advanced Cut-Score
96% | | | 100 | | |
92% | Sample 1, Sample 2 | | 110 | | |
88% | Sample 3 | Sample 21 | 120 | | |
85% | Sample 4 | Sample 22, Sample 23 | 130 | | |
83% | Sample 5 | Sample 24 | 140 | | |
79% | | | 150 | | |
75% | Sample 6, Sample 7 | | 160 | | |
70% | Sample 8 | | 170 | | |
64% | | | 180 | | |
57% | Sample 9 | Sample 25 | 190 | | |
54% | Sample 10 | Sample 26, Sample 27 | 200 | | |
49% | | | 210 | | |
42% | Sample 11, Sample 12 | Sample 28, Sample 29 | 220 | | |
38% | Sample 13, Sample 14 | Sample 30 | 230 | | |
31% | | | 240 | | |
26% | Sample 15 | | 250 | | |
20% | Sample 16 | Sample 31, Sample 32 | 260 | | |
16% | | | 270 | | |
10% | Sample 17, Sample 18 | Sample 33, Sample 34 | 280 | | |
7% | Sample 19 | Sample 35 | 290 | | |
3% | Sample 20 | | 300 | | |

Table Note: The rows in the construct map show the meaning of each possible cut-score in terms of the PAC (percent at or above cut-

score) and the work samples reviewed in round 1 and 2. Cut-scores and rater information for the partially proficient, proficient, and

advanced cut-scores are recorded and shown in those columns in the construct map during the rounds of standard setting.

Table 3: Example of feedback information for Modified Body of Work between Rounds 2 and 3

PAC | Round 1 Work Samples | Round 2 Work Samples | Score Scale | Partially Proficient Cut-Score | Proficient Cut-Score | Advanced Cut-Score
96% | | | 100 | | |
92% | Sample 1, Sample 2 | | 110 | R6 | |
88% | Sample 3 | Sample 21 | 120 | R2, R3 | |
85% | Sample 4 | Sample 22, Sample 23 | 130* | R1, R4, R10 | |
83% | Sample 5 | Sample 24 | 140 | R5, R8, R9 | |
79% | | | 150 | R7 | |
75% | Sample 6, Sample 7 | | 160 | | |
70% | Sample 8 | | 170 | | |
64% | | | 180 | | |
57% | Sample 9 | Sample 25 | 190 | | R8 |
54% | Sample 10 | Sample 26, Sample 27 | 200 | | R2, R5, R7 |
49% | | | 210* | | R1, R3, R6, R9 |
42% | Sample 11, Sample 12 | Sample 28, Sample 29 | 220 | | R4 |
38% | Sample 13, Sample 14 | Sample 30 | 230 | | R10 |
31% | | | 240 | | |
26% | Sample 15 | | 250 | | | R3, R9
20% | Sample 16 | Sample 31, Sample 32 | 260 | | | R5, R7
16% | | | 270* | | | R2, R4, R10
10% | Sample 17, Sample 18 | Sample 33, Sample 34 | 280 | | | R6
7% | Sample 19 | Sample 35 | 290 | | | R8
3% | Sample 20 | | 300 | | | R1

Table Note: The group median cut-scores (shown as shaded rows in the original construct map) are marked here with asterisks on the score scale. Data on the PAC and the Round 1 and Round 2 Work Samples are shown in those columns. Each rater is represented with an individual rater number.

Table 4: Types of Ratings Provided by the Standard-Setting Panelists for General Assessments

Grade Level | Type of Rating | Round 1 Partially Proficient | Round 1 Proficient | Round 1 Advanced | Round 2 Partially Proficient | Round 2 Proficient | Round 2 Advanced | Round 3 Partially Proficient | Round 3 Proficient | Round 3 Advanced
4 | On Papers | 22 | 21 | 17 | 22 | 22 | 18 | 22 | 22 | 20
4 | In Gaps | 0 | 1 | 5 | 0 | 0 | 4 | 0 | 0 | 2
7 | On Papers | 16 | 16 | 14 | 21 | 19 | 15 | 18 | 18 | 17
7 | In Gaps | 7 | 7 | 9 | 2 | 4 | 8 | 5 | 5 | 6

Table 5: Types of Ratings Provided by the Standard-Setting Panelists for Alternate Assessment

based on Modified Achievement Standards

Grade Level | Type of Rating | Round 1 Proficient | Round 1 Advanced | Round 2 Proficient | Round 2 Advanced | Round 3 Proficient | Round 3 Advanced
4 | On Papers | 10 | 4 | 5 | 4 | 2 | 8
4 | In Gaps | 1 | 7 | 6 | 7 | 9 | 3
7 | On Papers | 5 | 3 | 8 | 5 | 5 | 5
7 | In Gaps | 7 | 9 | 4 | 7 | 7 | 7

Table 6: Mean Scores on Panelist Final Feedback Evaluation Forms

Statement | Grade 4 General (N = 22) | Grade 7 General (N = 23) | Grade 4 Alternate (N = 11) | Grade 7 Alternate (N = 12)
The presentation regarding the standard-setting purpose was clear. | 4.00 | 3.52 | 3.91 | 3.50
The presentation of the performance level descriptors (PLDs) was clear. | 3.89 | 3.39 | 3.64 | 3.58
The training in the standard-setting methods was clear. | 3.89 | 3.57 | 3.82 | 3.58
I am confident that I was able to apply the standard-setting method appropriately. | 3.89 | 3.50 | 3.82 | 3.75
I feel that my confidence in my standard-setting judgments increased as the process played out. | 3.95 | 3.74 | 3.91 | 3.67
The standard-setting procedures allowed me to use my experience and expertise to recommend cut-scores for the Writing Assessment. | 3.53 | 3.78 | 4.00 | 3.58
The facilitators helped to ensure that everyone was able to contribute to the group discussions. | 3.89 | 3.78 | 4.00 | 3.83
I felt that the facilitators were knowledgeable and able to answer my questions. | 3.74 | 3.70 | 4.00 | 3.75
The construct maps assisted me in understanding the relationship between actual student work and potential scale scores. | 3.84 | 3.74 | 3.82 | 3.67
I was able to understand and use the feedback provided (e.g., other participants' ratings, impact data). | 3.79 | 3.78 | 3.55 | 3.58
The final cut-scores represent reasonable estimates of what students should be able to do as defined by the performance level descriptors. | 3.79 | 3.48 | 3.73 | 3.33
The final cut-scores represent group consensus taking into account all opinions and knowledge of the collective group of panelists. | 3.63 | 3.48 | 3.73 | 3.50

Table Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.



Table 7: Mean Scores on Articulation Committee Final Feedback Evaluations

Statement | Grade 4 Alternate (N = 4) | Grade 7 Alternate (N = 4)
The standard-setting panelists who provided cut-score recommendations were cognizant of the need to articulate standards across grades. | 4.00 | 3.50
The final cut-scores recommended by this panel reflect the statements in the performance level descriptors (PLDs). | 4.00 | 3.50
The final cut-scores recommended by this panel were appropriate given the population who takes the alternate assessment based on modified achievement standards. | 4.00 | 3.50
The cut-scores recommended by this panel represent performance expectations consistent with the attainment of requisite knowledge. | 3.50 | 3.50
The standard-setting panelists were able to carry out the standard-setting method as it was described in training. | 4.00 | 3.75
The standard-setting panelists understood the relationship between test performance and potential scale scores. | 3.25 | 3.50
The standard-setting panelists understood the impact of moving their cut-score selection up or down in the construct maps. | 4.00 | 3.75
The standard-setting panelists understood how the data in the construct maps related to the booklets they were reviewing. | 3.75 | 3.75
The standard-setting panelists made their own decisions for the cut-scores they recommended and were not overly influenced by the facilitator or other panelists. | 3.75 | 3.75

Table Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.

Table 8: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the General Assessments

Grade Level | Cut-Score | Round 1 Median | Round 1 SD | Round 1 Min | Round 1 Max | Round 2 Median | Round 2 SD | Round 2 Min | Round 2 Max | Round 3 Median | Round 3 SD | Round 3 Min | Round 3 Max
4 | Partially Proficient | 175 | 3.54 | 173 | 182 | 175 | 2.22 | 173 | 182 | 177 | 2.24 | 173 | 182
4 | Proficient | 195 | 3.88 | 194 | 208 | 194 | 1.12 | 192 | 196 | 194 | 2.52 | 192 | 204
4 | Advanced | 239 | 8.94 | 230 | 264 | 231 | 3.84 | 230 | 241 | 231 | 2.77 | 230 | 241
7 | Partially Proficient | 181 | 5.23 | 169 | 194 | 182 | 3.04 | 178 | 192 | 182 | 1.92 | 179 | 188
7 | Proficient | 196 | 7.11 | 189 | 218 | 196 | 2.58 | 192 | 200 | 196 | 2.33 | 193 | 201
7 | Advanced | 232 | 11.96 | 220 | 277 | 231 | 3.69 | 225 | 239 | 232 | 2.84 | 226 | 239

Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,

Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.

Table 9: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the Alternate Assessments based on Modified

Achievement Standards

Grade Level | Cut-Score | Round 1 Median | Round 1 SD | Round 1 Min | Round 1 Max | Round 2 Median | Round 2 SD | Round 2 Min | Round 2 Max | Round 3 Median | Round 3 SD | Round 3 Min | Round 3 Max
4 | Proficient | 195 | 11.66 | 175 | 209 | 185 | 3.69 | 175 | 188 | 186 | 1.51 | 183 | 188
4 | Advanced | 273 | 14.11 | 238 | 280 | 245 | 6.23 | 230 | 249 | 247 | 4.04 | 236 | 249
7 | Proficient | 184 | 14.66 | 163 | 214 | 190 | 8.05 | 184 | 207 | 190 | 8.98 | 184 | 210
7 | Advanced | 246 | 16.64 | 223 | 274 | 244 | 15.92 | 234 | 285 | 248 | 9.15 | 241 | 274

Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,

Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.
