Wyse, A. E., Bunch, M. B., Deville, C., & Viger, S. G. (2014). A body of work standard-setting method with construct maps. Educational and Psychological Measurement, 74(2), 236-262. DOI: 10.1177/0013164413502037
Abstract
This article describes a novel variation of the Body of Work method that utilizes
construct maps to overcome challenges of transparency, rater inconsistency, and score gaps
commonly occurring with the Body of Work method. The Body of Work method with construct
maps was implemented to set cut-scores for two separate K-12 assessment programs in a large
Midwestern state. Data from the standard settings were used to investigate the procedural,
internal, and external validity of the Body of Work method with construct maps. Results
suggested that the method had strong procedural, internal, and external validity evidence to
support its application. The article concludes with discussion and some areas for future research.
Keywords: Standard Setting, Body of Work, Construct Maps, Validity Evidence, Large-scale
Assessment
Introduction
The Body of Work (BoW) method (Kingston, Kahl, Sweeney, & Bay, 2001) is a holistic standard-setting method commonly applied to large-scale assessments, especially when the assessment contains many open-ended items. In the typical
procedure, panelists review samples of student work and sort those samples into piles to indicate
cut-scores. The procedure usually involves multiple rounds of ratings with discussion and feedback between rounds.
Despite the popularity of the BoW method, there are several challenges with the method
in practice. These include (1) rater inconsistency concerns in which samples with the same score
are placed into different piles or samples with higher scores are placed into lower performance
categories than samples with lower scores or vice versa, (2) score gap concerns in which some
score locations on the underlying score scale do not have work samples, and (3) challenges with
panelists understanding how their recommendations translate into cut-scores and are related to other standard-setting data (i.e., transparency concerns).
The purpose of this article is to describe a new variation of the BoW method that can be
used to overcome many of these challenges. The variation uses construct maps (Wyse, 2013;
Luecht, 2008; Wilson, 2005) to integrate and visually display the multiple pieces of data received
during standard setting. Panelists use the construct maps at each stage of standard setting to record their judgments and to see how those judgments relate to the standard-setting data.
In the next section, the typical BoW procedure is described. Discussion is given as to
how the typical procedure can result in rater inconsistency, score gaps, and transparency issues.
The new BoW procedure with construct maps is then outlined. The following section outlines
data from two standard settings that used the new procedure: a standard setting on a large
Midwestern state’s general assessment and a standard setting on a large Midwestern state’s
alternate assessments based on modified achievement standards. Then, results on the procedural,
internal, and external validity evidence (Kane, 1994; 2001) based on these data are provided. The
article concludes with discussion and some areas for future research.
Typical Procedure
The typical BoW procedure (Kingston et al., 2001) consists of a practice round followed
by three operational rounds of ratings. The panelists’ task in the BoW procedure is to consider
samples of student work and to sort the samples into different categories so that numerical cut-
scores can be determined to separate the examinees into the performance levels described in the performance level descriptors (PLDs).
In the practice round, panelists are given a small number of work samples. The samples
are often selected such that there is one high scoring sample, one low scoring sample, and a
couple of samples with scores in the middle of the score range. Depending on how the BoW
method is being implemented, the samples can be shown with or without the scores and the
samples may or may not be shown in order based on their scores (Cizek & Bunch, 2007;
Kingston et al., 2001; Zieky, Perie, & Livingston, 2008). Panelists are asked to look at the
samples and provide ratings for each sample. The goal in the practice round is not to precisely identify
cut-scores, but to ensure that panelists have a solid understanding of the materials and processes.
Following the practice round, panelists participate in the three rounds of operational
ratings. The first round is called the range finding round (Kingston et al., 2001). In this round,
panelists consider work samples that span the range of scores that examinees can obtain on the
assessment. The panelists' task is very similar to the practice round in that they sort the samples into piles to indicate their cut-scores.
It is important to point out that even though the first round samples are designed to span
the range of possible scores, there are often gaps along the score scale in which no work samples
are shown. This occurs because the score scale underlying the assessment is usually created by
scaling the test using a psychometric model, such as an item response theory (IRT) model or
cognitive diagnostic model. As a result, not every point on
the scale is represented with a work sample. This creates a challenge because it is difficult to
precisely set cut-scores at some locations on the scale; yet these represent viable potential cut-
scores. Another challenge that can make it hard to estimate cut-scores is the presence of rater
inconsistency, where panelists place samples with similar scores into different categories.
This can occur because panelists are often not forced to provide ratings in which higher scoring
samples need to have ratings greater than or equal to ratings for samples with lower scores.
Following the range finding round, cut-score ranges may be determined or actual cut-scores may be estimated. If cut-score ranges are being employed, the ranges are usually
identified by looking at the panelists’ ratings and identifying the locations where panelists begin
to move papers from one category to the next (see Kingston et al., 2001). If the cut-scores are
reported as single numbers, they are typically estimated using a logistic regression
model. In this case, the panelists’ ratings are input into the model and the score associated with a
fifty percent chance of moving from one category to the next is the cut-score. The feedback
information given to the panelists following the first round is often in the form of frequencies of
the ratings assigned to each sample and the estimated cut-scores for each category (Zieky et al.,
2008). Panelists discuss these data and then move to the second round.
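The logistic regression step described above can be sketched in a few lines of code. The example below is a minimal illustration with invented scores and ratings, not the authors' implementation: it fits a binary logistic model to panelists' placements (1 = sample placed at or above the level in question) and solves for the score with a 50 percent chance of being placed at or above the level.

```python
# Minimal sketch (hypothetical data, not the authors' code) of the logistic
# regression cut-score estimate used in the typical BoW procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Work-sample scale scores and whether a panelist placed each sample at or
# above the performance level of interest (1 = yes, 0 = no).
scores = np.array([205, 210, 215, 218, 220, 222, 225, 230, 235, 240]).reshape(-1, 1)
above = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# A large C approximates an unpenalized maximum-likelihood fit.
model = LogisticRegression(C=1e6, max_iter=1000).fit(scores, above)

# The cut-score is where the predicted probability equals .50,
# i.e., where intercept + slope * score = 0.
cut_score = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated cut-score: {cut_score:.1f}")
```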
Herein lies another set of challenges, which is that panelists often have a hard time seeing
how their ratings translate into the cut-scores and relate to various standard-setting data. This
occurs because the relationships between ratings, estimated cut-scores, and other data are often
not fully transparent. For example, there may be three samples at one score, two at another score,
one sample at a few different scores and so on. The question becomes, how does one distribute
their ratings to produce appropriate cut-scores? For most panelists, this is not a straightforward task. In all
fairness, one of the goals of the facilitator in the typical procedure is to try to help panelists
understand this process. The facilitator may explain, for example, that to lower their cut-score
they should give more ratings to some of the lower scoring papers. However, even with the aid
of a skilled facilitator, knowing how to redistribute ratings to produce any specific cut-score is not always clear. Panelists may also want to make adjustments in response to feedback, and doing so is not always straightforward.
The second round of the BoW procedure is aptly named the pinpointing round (Kingston
et al., 2001). The process that panelists use to indicate their cut-scores is very similar; they look
at each sample and sort them into piles to indicate their cut-scores. However, the range of work
samples is narrowed in comparison to round 1. Typically, the range is adjusted based on the cut-
scores estimated in round 1. Many of the concerns with score gaps, rater inconsistency, and
transparency persist in round 2. Following the second round, the cut-scores are estimated using
logistic regression, panelists are given feedback information, and they discuss these data. One
new piece of feedback information often included is the impact data for the round 2 cut-scores.
The final round allows the panelists the opportunity to reexamine the samples from round
2 in light of the impact data and feedback information. Panelists can make their final judgments
by either sorting the work samples into piles again or, if the scores assigned to the work samples are shown, by simply indicating numbers for their cut-scores. Again, many of
the same challenges that existed in rounds 1 and 2 are still present in the final round.
Even though the typical BoW method is a legitimate and defensible approach to
determining cut-scores, there are challenges with its application that can impact cut-score
estimates. In the next section, we describe a novel approach to the BoW method that uses construct maps to address these challenges.
New Procedure
Construct Maps
The new BoW procedure hinges on the use of construct maps. Construct maps are visual
displays that show how examinee and/or item data are related to an underlying achievement
construct and corresponding score scale (Authors, 2012; Luecht, 2008; Wilson, 2005). Construct
maps have three critical components: (1) the content that defines the construct is well defined,
(2) the construct can be represented by an underlying continuum in which items and examinees
can be ordered along that continuum, and (3) a measurement model can be applied to show the
relationships between the scoring continuum and the item and examinee data from the
assessment in a chart or graphic (Authors, 2012; Luecht, 2008; Wilson, 2005). Many commonly
used psychometric models can be applied to data to create empirical construct maps.
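As a rough sketch of how such an empirical construct map might be assembled (the item difficulties, ability distribution, and scale transformation below are invented for illustration, and this is not the authors' code), one can tabulate, for each point on the reporting scale, the percent of examinees at or above that point and the Rasch expected item scores:

```python
# Hedged sketch: building a small empirical construct map from a Rasch model.
import numpy as np
import pandas as pd

item_difficulties = {"Item A": 0.4, "Item B": 0.6}                  # hypothetical logits
examinee_thetas = np.random.default_rng(1).normal(0.2, 1.0, 5000)   # hypothetical ability estimates

def to_scale(theta, slope=20.0, intercept=225.0):
    """Hypothetical linear transformation of the logit scale to a reporting scale."""
    return np.clip(slope * theta + intercept, 100, 300)

def rasch_expected_score(theta, b):
    """Expected score (probability correct) for a dichotomous Rasch item."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

scale_points = np.arange(218, 226)                # a slice of the 100-300 scale
thetas = (scale_points - 225.0) / 20.0            # invert the transformation

construct_map = pd.DataFrame({"Score Scale": scale_points})
construct_map["PAC (%)"] = [
    round(100 * np.mean(to_scale(examinee_thetas) >= s), 1) for s in scale_points
]
for name, b in item_difficulties.items():
    construct_map[name] = rasch_expected_score(thetas, b).round(3)

print(construct_map.to_string(index=False))
```

Reading across a row of the resulting table shows how a potential cut-score relates to the examinee and item data, which is the core idea behind the construct map displays described in this article.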
An example of an empirical construct map for a grade 5 mathematics assessment that applied IRT models is shown in Table 1. In the construct map in Table 1, the score scale values from 220 to 230 are shown in the score scale column and
different quantities computed from item data, examinee data, and various ratings are shown in
separate columns. The construct map clearly shows that there are specific relationships between
the potential scores on the underlying achievement continuum and various data elements. Many
of these same data elements are what panelists focus on in different standard-setting procedures.
For example, the item scores in each row in Table 1 are the values of the item characteristic
curves from IRT models and they correspond to the data that panelists consider when providing
ratings in the Angoff procedure (Angoff, 1971). Authors (2012) explain how data in the general construct map can be connected to several other standard-setting procedures.
Construct maps have important potential applications in standard setting because they can
help panelists to better understand the relationship between the examinee and item data and how
they translate into cut-scores; all panelists need to do is look across a row in the construct map to
see the relationship between that potential cut-score and the data that they are reviewing. Further,
construct maps can help panelists to provide improved judgments that can reduce concerns of
rater inconsistency, score gaps, and transparency. The next section describes how construct maps
can be used for these purposes in the context of the BoW procedure.
In the BoW method, panelists are asked to consider student work samples in each round
of standard setting. Additionally, panelists are often given impact data and feedback information
displaying their cut-scores in relation to those of other panelists. Therefore, these data are
contained in the construct map for the BoW method (Table 2).
For the sake of this article, it is assumed that the underlying scoring continuum is created
using an IRT model, specifically the Rasch model, since that was the model the state that
conducted the standard settings used to create scale scores. However, the use of the Rasch model is not a requirement of the method; other psychometric models could be used to create the construct maps.
In the construct map (Table 2) the score scale, which is a linear transformation of the IRT
ability scale, is shown in the column labeled “score scale.” The columns labeled “Round 1 Work
Samples” and “Round 2 Work Samples” show the scale locations of work samples in the first
and second round. These locations are determined from IRT ability estimates assigned to each
sample after the test is administered. Since we are applying the Rasch model, examinees with the
same raw score will have the same scale score, and work samples with the same raw score will
appear in the same row. The “PAC” column shows data on the percent at or above cut-score
(PAC) and it is determined from the distribution of ability estimates on the assessment. The
“Partially Proficient”, “Proficient”, and “Advanced” cut-score columns are the places where the
panelists record their ratings in each round and get to see rater feedback information on cut-
scores in between rounds. The number and names of these columns can be changed based on the
standard-setting application. By reading across each row in the construct map, panelists can see
how their cut-scores are related to the other data used in standard setting.
Similar to the typical BoW method, the BoW method with construct maps begins with
training, the opportunity to take the assessment and discuss the test items and rubrics, and in
depth consideration of the PLDs. The main difference in these initial steps is in the training,
which introduces construct maps and explains how the construct maps show the work samples
and standard-setting data that panelists will see, and outlines how the panelists use these materials to make their cut-score judgments.
Prior to the operational rounds, there is a practice round to orient the panelists to the
materials that are used in the procedure. For the applications in this article, panelists were given
five work samples, one of which was a high scoring sample, one of which was a low scoring
sample, and three work samples in the middle of the score range. Each work sample had the
scale score on a cover sheet, the total score on the multiple-choice items, the total score on the
open-ended items, the answer choices chosen by the examinees on the multiple-choice items and
the item p-values, and the writing responses with scores on the rubrics and annotations as to why
each response was given the scores that it received. Panelists also received a construct map that
showed the five work samples. The work samples were shown in order from lowest to highest
score. Panelists were instructed to indicate their cut-score by looking at the construct map,
marking a score on the construct map, and then transferring this score to a rating sheet.
In the first round, panelists received the round 1 work samples and a construct map that
showed the samples, the score scale, and the columns where they recommended their cut-scores.
Panelists did not see the PAC or round 2 work samples; those were shown in later rounds. Fifty
work samples spread out across the range of possible scores were given in round 1. The work
samples had various combinations of scores including higher multiple-choice scores and lower
open-ended scores, moderate scores on both sections, and higher open-ended scores and lower
multiple-choice scores. It was explained that samples that had the same number of total points
had to be classified into the same category and that this would be reflected in the cut-score
selected in the construct map. We assumed the underlying score scale ranged from 100 to 300,
and this scale, in one point increments, was displayed in the construct map.
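A short sketch of how work samples might be placed on such a scale is shown below; the raw-score-to-theta table, slope, and intercept are hypothetical, and the code is only an illustration of the idea that, under the Rasch model, samples with the same raw score share a construct-map row while unused scale points remain as score gaps.

```python
# Hedged sketch: placing work samples on a 100-300 reporting scale under the Rasch model.
raw_to_theta = {10: -1.20, 15: -0.55, 20: -0.05, 25: 0.40, 30: 0.90}   # hypothetical calibration table

def scale_score(theta, slope=20.0, intercept=225.0, lo=100, hi=300):
    """Hypothetical linear transformation, rounded to one-point increments."""
    return int(round(min(max(slope * theta + intercept, lo), hi)))

samples = {"Sample 1": 10, "Sample 2": 15, "Sample 3": 15, "Sample 4": 25}   # raw scores

rows = {}                                    # construct-map row -> samples at that row
for name, raw in samples.items():
    rows.setdefault(scale_score(raw_to_theta[raw]), []).append(name)

for point, names in sorted(rows.items()):
    print(point, ", ".join(names))           # samples with equal raw scores share a row

gaps = [p for p in range(min(rows), max(rows) + 1) if p not in rows]
print("Scale points with no sample (score gaps):", len(gaps))
```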
The number of work samples and score scale can change based on the specific
implementation. The samples were shown to the panelists in order from the lowest score (e.g.,
sample 1 in Table 2) to the highest score (e.g., sample 20 in Table 2). Some of the samples had
the same score and panelists were told that they can find these locations in the rows with more
than one sample (e.g., a score of 220 in Table 2). Panelists were also told that there were
locations where there were no samples (e.g., a score of 150 in Table 2). These locations could still be selected as cut-scores.
To use the construct map to set cut-scores, panelists were instructed to start with the first
cut-score and consider the examinee that would just barely be considered at that level of
performance. In our applications, this was the just barely partially proficient student for the
general assessments and the just barely proficient student for the alternate assessments based on
modified achievement standards. For the sake of illustration, the explanation in this section is for
the general assessments. In this case, panelists were instructed to look at the first sample and ask
themselves whether or not the sample represented just barely partially proficient performance. If
they believed it was not indicative of just barely partially proficient performance, then they were
instructed to move to the next sample. This sample had a score that would be greater than or
equal to the previous sample that they just examined. As they moved from sample to sample they
were told to examine the construct map to see the location of the sample and relationship of the
score assigned to the sample to the potential cut-scores that could be set. When they reached a
point where they thought they had found a sample that represented just barely partially proficient
performance, they were instructed to examine that sample and a few samples before and after it
to select their cut-score. Their cut-score could be at the location of a sample or in between
several samples. Panelists were told to select a cut-score at a sample if they believed the sample
was indicative of work right on the border of what the just barely partially proficient student
should be able to do. They were told to select a cut-score in between two samples if one sample
was clearly above and the other sample was clearly below what the just barely partially
proficient student should be able do. Panelists were encouraged to write the word “Cut-Score” in
the row representing their cut-score, draw a line through that score, and write the words “Not
Proficient” above the line and “Partially Proficient” below the line. Writing the words was
designed to remind panelists of the meaning of their cut-score. Panelists used a similar process to set the remaining cut-scores.
The benefits to this process are that it helps reduce score gap concerns because panelists
can select cut-scores in between samples (and there is a sound rationale for doing this, one
sample is clearly too high and one sample is clearly too low in relationship to the PLD) and that
it virtually removes rater inconsistency because the construct map clearly shows when a cut-
score will result in samples with higher scores being placed into lower performance categories
and vice versa. The process for choosing their cut-scores in the construct map in essence forces
the panelists to make a decision of the best location for the cut-score and does not allow them to
place samples with similar scores into different performance categories. The construct map also
clearly shows the relationships between the standard-setting data and cut-scores. After round 1, a construct map containing feedback information was prepared for the panelists.
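The consistency argument can be made concrete with a tiny sketch (hypothetical cut-scores and sample scores, not the study's data): once each judgment is a single point on the scale, a sample's category is a monotone function of its score, so samples with equal scores always share a category and higher-scoring samples can never be placed below lower-scoring ones.

```python
# Hedged sketch: classification with a single set of numeric cut-scores is monotone.
import bisect

def classify(score, cut_scores, labels):
    """cut_scores ascending; a sample at or above a cut-score reaches that level."""
    return labels[bisect.bisect_right(cut_scores, score)]

cuts = [212, 228, 246]                                            # hypothetical cut-scores
labels = ["Not Proficient", "Partially Proficient", "Proficient", "Advanced"]

for s in [205, 212, 212, 220, 230]:                               # hypothetical sample scores
    print(s, classify(s, cuts, labels))
```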
The cut-scores were computed by taking the median of the panelists’ ratings. It was
possible to use the median, instead of the more complicated logistic regression procedure, since
each panelist indicated a single scale score representing their best judgment for a cut-score. This
approach greatly simplifies the computational aspects of the procedure and makes it easier for
panelists to understand how the cut-scores were calculated. Most panelists are familiar with
medians and how medians are computed. It is also possible to use the mean to compute cut-
scores, but we used the median since the median is not as impacted by extreme judgments.
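A small illustration of this design choice, with invented ratings: a single extreme recommendation shifts the mean noticeably but leaves the median unchanged.

```python
# Hedged sketch: median versus mean cut-score with one outlying panelist rating.
import statistics

round_ratings = [218, 220, 221, 222, 223, 224, 260]          # hypothetical panelist cut-scores
print("median:", statistics.median(round_ratings))           # 222
print("mean:  ", round(statistics.mean(round_ratings), 1))   # pulled upward by the 260
```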
Panelists then discussed feedback data received in construct maps between rounds 1 and
2. The process for recommending cut-scores in round 2 was very similar to round 1 except that
panelists considered targeted work samples and they could see these samples and how they relate
to the samples from round 1. Thirty new work samples were shown in round 2. These samples again included different combinations of scores. Each set of samples was targeted around the cut-scores estimated in round 1.
Following the second round, panelists received additional feedback (Table 3). The group
cut-scores were shown with horizontal shaded lines and the individual panelist cut-scores were
shown by rater numbers. This construct map was very similar to the one that the panelists used to
provide their ratings in the rounds of standard setting and saw as feedback between other rounds.
This was intentional and was designed to help integrate the process and make it clear how each
piece of data related to their ratings and estimated cut-scores. In the last round, panelists saw
information on the samples and the impact data. The panelists were instructed to make their final recommendations by considering all of the data in the construct maps and their relationship to the
PLDs. Panelists indicated their final cut-scores by again selecting a score in the construct map.
Data
The new BoW procedure with construct maps was used to set cut-scores in two different
writing assessment programs in a large Midwestern state. The first program was a state general
writing assessment program. In this program, assessments were given in grades 4 and 7 to over
100,000 examinees. Both tests consisted of: one narrative writing prompt that was analytically
scored and worth 15 points, one informational writing prompt that was analytically scored and
worth 15 points, one holistically scored item that focused on providing feedback and editing a
piece of writing that was worth 4 points, and 16 multiple-choice questions each worth one point.
Each 50 point test was scaled using the Rasch model and put onto a scale ranging from 100 to
300. The choice of the scale of 100 to 300 was arbitrary, but was chosen so as not to overlap with
scales for the general assessments. Typical scales span a 200 to 250 point range and have the
proficient cut-score anchored at a specific number to aid score interpretations. It was made clear
that the scale shown was designed to give panelists an idea of the relationship between the
samples and possible cut-scores and that the final scale would be based on the final cut-scores.
panelists participated at grade 7. The panelists were selected through a nomination and
recruitment process. Panelists represented a range of geographical regions and school districts and varied in their educational experience. Most of the panelists were educators, but there were three
business and community members in grade 4 and four business and community members in
grade 7. In grade 4, 82% of the panelists were white and 77% of the panelists were female. In
grade 7, 87% of the panelists were white and 87% of the panelists were female.
The BoW method with construct maps consisted of the three rounds of judgments as
described above with impact data shown between the second and third rounds. Throughout the
standard-setting process, process evaluation forms were given to ensure that the procedure was
functioning as intended. The process evaluations had a series of open-ended questions that asked
about various aspects of the procedure and the approaches panelists used (or thought they would
use) to determine their cut-scores. Many of the questions were identical across rounds except for
changing the wording to reference the specific round in question. Data from the process
evaluations on the question that asked panelists “Please describe how you used (or think you will
use) the standard-setting materials provided by the facilitator to provide your standard-setting
judgments in Round __” are used to help provide an idea of the processes and materials that
panelists were focusing on when providing their standard-setting judgments. A final evaluation
form was given at the end of the standard setting, which contained a mixture of open-ended
and rating scale questions. The rating scale questions ranged from one to four with a rating of
one indicating strongly disagree, two indicating disagree, three indicating agree, and four
indicating strongly agree. We also used data from the rating scale questions on the final
evaluations that dealt with various aspects of the procedure as another set of data. In addition, we examined the three cut-scores, their associated impact data, and whether ratings were placed on papers or in score gaps in the construct maps across rounds.
The second set of data comes from a state alternate assessment writing program based on
modified achievement standards. This standard setting took place approximately one year after
the standard setting for the general assessments. Similar to the general assessments, the alternate assessments were given in grades 4 and 7.
Approximately 2,000 examinees took the assessment in each grade. The assessments were also
scaled using the Rasch model and placed on a scale from 100 to 300. The tests consisted of: one
15 point narrative prompt that was analytically scored, one 15 point informational prompt that
was analytically scored, and 10 multiple-choice questions which were each worth one point.
Eleven panelists participated in the grade 4 standard setting and twelve panelists
participated in the grade 7 standard setting. The panelists were recruited using the same
nomination and recruitment process as was used with the general assessment. Fewer panelists
were nominated and available to participate for the alternate assessment standard setting.
Panelists again represented a range of geographical regions, school districts, and levels of
educational experience. All of the participants were educators. In grade 4, 81% of the panelists
were white and 81% of the panelists were female. In grade 7, 83% of the panelists were white
The standard-setting procedure used for the alternate assessments was similar in many
respects to the approach used with the general assessment. In particular, panelists again received
similar training, used construct maps to complete three rounds of ratings, and were asked many
of the same evaluation questions as in the general assessment standard setting. However, there
were several key differences. First, panelists only had to give two cut-score ratings instead of
three. Second, impact data were shown in the construct maps after round 1. The inclusion of these
data earlier in the process was a policy consideration and decision made by the state since the
state had recently conducted a study to reset cut-scores on several of the general assessments for
other content areas. The reset cut-scores produced dramatic changes in the percentage of students classified as proficient, and the state felt that showing these data earlier in the process would be instructive.
A third difference was that panelists were given results from a separate Contrasting
Groups (Livingston & Zieky, 1982) study to help benchmark their judgments in between round 1
and 2. These data were collected from teachers when the alternate assessments were given to the
examinees. In both grades, data were filled in on roughly 30% of the 2,000 answer documents.
Teachers gave predictions of the categories that they thought each examinee would obtain after
considering generic descriptions of what it meant to score in each category. These statements
were very similar to the policy statements in the PLDs. The predictions were used to calculate
ranges of cut-scores for the proficient cut-score only. Ranges were not found for the advanced
cut-score since very few examinees were rated into the advanced category and finding stable
estimates proved difficult. The ranges of cut-scores were determined by analyzing the data using
the approaches suggested for calculating cut-scores with Contrasting Groups in Cizek and Bunch
(2007) (i.e., logistic regression, midpoints between means, and midpoints between medians).
Panelists were presented with these data and were told how they compared to the ranges of cut-
scores they recommended in round 1. Panelists could place marks on their construct maps to indicate where these benchmark ranges fell relative to their own cut-scores.
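For readers who want to see how such benchmark ranges might be computed, the sketch below applies the three Contrasting Groups summaries mentioned above (following Cizek & Bunch, 2007) to hypothetical teacher judgments; it is an illustration under assumed data, not the study's analysis code.

```python
# Hedged sketch: three Contrasting Groups cut-score estimates from hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
below = rng.normal(175, 12, 300)        # scores of students judged below proficient
above = rng.normal(205, 12, 300)        # scores of students judged proficient or above

scores = np.concatenate([below, above]).reshape(-1, 1)
judged = np.concatenate([np.zeros(300), np.ones(300)])

model = LogisticRegression(C=1e6, max_iter=1000).fit(scores, judged)
logit_cut = -model.intercept_[0] / model.coef_[0][0]      # score where P(proficient) = .50
mean_cut = (below.mean() + above.mean()) / 2              # midpoint between group means
median_cut = (np.median(below) + np.median(above)) / 2    # midpoint between group medians

print(f"logistic regression: {logit_cut:.0f}")
print(f"midpoint of means:   {mean_cut:.0f}")
print(f"midpoint of medians: {median_cut:.0f}")
```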
A final difference between the two implementations was that after the last round, an articulation subcommittee was convened to review the cut-score recommendations from both grades. This subcommittee consisted of four panelists from each
grade level. Participants in the articulation committee got to see the cut-scores and the impact
data for both grades. Panelists were asked how the process worked in each committee. The
committee was told they could adjust the cut-scores if they so desired after discussion. No
changes were made in the articulation meeting. Separate evaluation forms were administered
after the articulation meeting that asked open-ended and rating scale questions. To investigate the
method, the same data were used with this assessment as the general assessment with the
addition of data from the Contrasting Groups study and articulation committee evaluations.
Results
The results section is broken down into three separate subsections, each of which focuses on one of the types of validity evidence suggested by Kane (1994, 2001) for evaluating standard setting.
These types of validity are procedural, internal, and external validity evidence. Procedural
validity deals with information about procedural aspects of the method, internal validity deals
with data about the consistency of the cut-scores, and external validity deals with data about the relationship of the cut-scores to other sources of information, such as results from other standard-setting methods.
Common procedural challenges with the traditional BoW method were major drivers of
the development of BoW with construct maps. These concerns include: score gaps, rater
inconsistency, and transparency. In the new procedure, the challenge of rater inconsistency was
essentially removed because panelists were asked to give judgments for cut-scores that place samples with the same score into the same category and in which higher scoring samples fall into a performance category equal to or greater than that of lower scoring samples. Score gaps were still present in the new procedure, but panelists could mitigate their impact because they could recommend cut-scores between samples in score gaps if they so chose.
The frequencies of ratings on papers and in score gaps for the general assessments (Table 4) and the alternate assessments (Table 5) show that several ratings were placed into score gaps. For the general assessments (Table 4), most of the ratings were still on papers, especially for the fourth grade committee, and there were several cut-scores with no
ratings in score gaps. For the alternate assessments (Table 5), there were a greater percentage of
ratings placed in the score gaps, and in some cases the number of ratings placed into score gaps
exceeded the number of ratings on papers. The differences could be a function of the group
dynamics in each committee or the unique aspects of the papers reviewed by the various
committees. The differences could also be a function of the fact that with fewer items on the
alternate assessments there were more score gaps, which creates the potential for increased score
gap ratings. Nonetheless, the ratings produced provide strong evidence in support of the BoW
method with construct maps and the need for the ability to give ratings in score gaps.
A second piece of procedural evidence in support of BoW with construct maps was
qualitative data in response to the question “Please describe how you used (or think you will use) the standard-setting materials provided by the facilitator to provide your standard-setting judgments in Round __.” The responses were coded by a trained rater and categorized based on
the materials or information that the panelist described using or anticipated using when providing
their judgments. Responses were coded into the following categories: PLDs, construct maps,
work samples, feedback information, test content (i.e., focused on content of specific items),
item stats, discussion, scoring rubrics, professional judgment, and blank/off-topic. Each time a
panelist specifically referenced one of these materials or described a process linked to one of
these materials it was coded in that category for that panelist. Responses that were hard to
categorize were flagged and reviewed by the first author in conjunction with the coder.
The percentages of ratings in each of the coding categories across panelists for the four
standard settings are shown in Figure 2. The top right panel shows the results for the general
assessment at grade 4, the top left panel shows the results for the general assessment at grade 7,
the bottom right panel shows the results for the alternate assessment at grade 4, and the bottom left panel shows the results for the alternate assessment at grade 7. The responses from each process
questionnaire for different rounds are shown with different colored shaded bars.
Figure 2 shows some interesting patterns. Across all standard settings it appeared that the
processes that people described using focused mainly on the three critical elements of the PLDs,
construct maps, and work samples as the bars for these categories were, in most cases, the
highest bars. The actual percentage of responses in each category did differ slightly across
groups and across rounds. One would expect, if the procedure were working effectively, that these
would be the three highest rated elements. Additionally, it is apparent that feedback information
and discussion were not initially important to panelists, but as the rounds increased these areas
received a greater emphasis. This is also expected because after the first round panelists get to
engage in discussion with their colleagues and receive feedback information. Few panelists
described processes incorporating the scoring rubrics, their own outside professional judgment,
or gave blank or off-topic responses. This is also as one would expect, as panelists were instructed not to rescore papers and to limit their decisions to the materials provided in the standard setting.
Also apparent in Figure 2 is the impact various facilitators have on the process. For
example, in the grade 4 general assessments there seemed to be a higher emphasis on test content
and less of a focus on the construct maps prior to round 1. In this case, a large number of
panelists said they would focus on the content of specific items and few panelists described a
process using the construct map. As the rounds went on, however, the focus on the content of
specific questions was less emphasized and greater emphasis was placed on the construct maps.
No panelists mentioned content of individual items as being important for the grade 7 general
assessments and few panelists mentioned this for the alternate assessments. There was also some
variation in the use of item statistics across panels. However, in general there seemed to be limited
focus on the p-values of particular items. The patterns in Figure 2 lend support to the claim that
most of the panelists took the standard-setting process seriously and focused on the most critical
elements (e.g., PLDs, student work samples, and construct maps) when providing their ratings.
Another piece of procedural validity evidence comes from the responses to rating scale
questions for the panelists and articulation committees. The average ratings for the final
evaluations are shown in Table 6 and the average ratings for the articulation committee questions
are shown in Table 7. Admittedly, some of the questions might fall into the internal and external
validity evidence sections, but they are included here for continuity.
The average ratings from both the final evaluation questions and the articulation
committee questions are uniformly higher than a rating of three for all four standard settings,
indicating that most panelists agreed or strongly agreed with each statement. This is as expected
if the procedure and process worked as intended. If the standard-setting method worked effectively, one would expect panelists to report that the presentation of the PLDs was clear, that their confidence in their judgments increased, that the construct maps assisted them, and that they understood how to use the feedback information. Again, these responses lend support to the procedural validity of the method.
In terms of internal validity, one can look at the cut-score ratings and whether the
standard deviations of the panelists’ ratings decreased across rounds. Decreased standard
deviations indicate that panelists were converging on similar cut-scores. One would expect to
find this pattern in these data, especially given that the question on consensus on the process
evaluation form suggested that panelists felt they had reached some form of consensus (Table 6).
The final cut-scores for the general assessment and alternate assessment are displayed in
Tables 8 and 9, respectively. Each table displays the median, minimum, maximum, and standard deviation of the cut-score recommendations for each round.
The cut-scores, with the exception of the advanced cut-score for grade 4, which went down by 8 scale score points (a change of 4% of the scale score range), remained fairly similar across rounds for the general assessments (Table 8). In addition, the standard deviations of the cut-scores were smaller in rounds 2 and 3 than in round 1. Some of the
standard deviations for the grade 4 panel did slightly increase from round 2 to round 3, although
not in a manner that would suggest panelists had widely different opinions. The range of the cut-
scores in the third round was also fairly narrow as the biggest difference between the minimum
and maximum ratings for any cut-score was only 13 scale score points.
There was a slightly different pattern for the alternate assessments in terms of the median
cut-scores. In particular, Table 9 shows that there were some more dramatic changes in cut-scores from round 1 to round 2. For example, in grade 4 the proficient median cut-score dropped by 10 points from round 1 to round 2 and the advanced cut-score dropped by 28 points. These
bigger changes for the alternate assessments in comparison to the general assessment could have
been a function of the introduction of the impact data earlier in the process or the consideration
of the Contrasting Groups data. These data were not presented after round 1 for the general
assessments. The cut-scores were more similar between rounds 2 and 3. Again, the standard
deviations were smaller in rounds 2 and 3 than in round 1. Similar to the general
assessments, a few of the standard deviations slightly increased from round 2 to round 3. There
did appear to be more disagreement and a larger range of cut-scores for grade 7 than for grade 4 in round 3, although again the standard deviations were not that large in comparison to the scale score range.
The results in Tables 8 and 9 are similar to results that are commonly observed in other
standard-setting processes and seem to provide fairly strong internal validity evidence that by the
last round panelists were able to converge on a set of cut-scores. In the case of the alternate
assessment standard setting, the articulation committee had a chance to review the cut-scores and
endorsed the cut-scores from round 3 with no changes. This provides some additional support for
the cut-scores that were recommended and the process used to arrive at the cut-scores.
The last set of validity evidence is external validity evidence. One piece of external
validity evidence came from Contrasting Groups data collected for the alternate assessments.
These Contrasting Groups data were used to calculate a range of cut-scores to serve as
benchmarks for the round 1 proficient cut-scores. The range of proficient cut-scores for grade 4
was 175 to 213 and the range of proficient cut-scores for grade 7 was 182 to 211. The group
median cut-scores from round 1 for the new BoW procedure were 195 and 184 for grades 4 and
7, respectively. The ranges for the panelist recommendations were 175 to 209 for grade 4 and
163 to 214 for grade 7. The cut-scores fell within the ranges from the Contrasting Groups and the
ranges for the BoW with construct maps and Contrasting Groups were fairly similar. These data
provide some external validity evidence in terms of how cut-scores from the new BoW method
compare with that of another procedure. Although it is not strong additional evidence because
the panelists saw the Contrasting Groups data after round 1, the final cut-scores for the alternate
assessment standard setting also fell within the ranges of the Contrasting Groups data.
Another piece of external validity evidence was the similarity of the percentages of grade 4 and grade 7 students that the committees' recommended cut-scores would place at proficient or above in each assessment program. For the general assessments, the percentages at proficient or above were similar across grades, with 74 percent at proficient or above for grade 7. For the alternate assessments, the percentages of students at or above proficient were likewise similar for grades 4 and 7. From a policy and accountability standpoint, the similarity of the passing percentages is desirable since there will not be disparate impacts in different grades. The fact that the two grades arrived at similar percentages of students at proficient or above is therefore another piece of external validity evidence in support of the recommended cut-scores.
Discussion
Developing data-driven standard-setting approaches that make the process more transparent and that ameliorate factors that can contaminate cut-score estimates is a pressing need in the field of educational measurement. This article makes an important contribution by
illustrating how construct maps can be used to modify the BoW method in a way that allows
panelists with limited training in statistics and assessment to better understand the process and
provide judgments that mitigate rater inconsistency and score gap concerns.
Applications from two separate writing standard-setting processes, one of which came
from a large-scale state general assessment and one of which came from a large-scale state
alternate assessment based on modified achievement standards, illustrated how the method could
be used to effectively set cut-scores in K-12 state testing programs. Data collected provided
procedural, internal, and external validity evidence in support of the new approach. In terms of
procedural validity, data suggested that panelists do in fact provide ratings in score gaps, that for
the most part the processes that panelists described using to determine cut-scores were focused
on the critical elements they should be using with the approach, and their responses to feedback
and evaluation questions demonstrated that the standard settings and various aspects of the
procedure were viewed positively. Data providing internal validity evidence showed decreased
standard deviations over the rounds of standard setting and convergence on a narrow range of
cut-scores. Data providing external validity evidence demonstrated that cut-scores from round 1
of the alternate assessment standard setting were similar to cut-scores determined from a set of
Contrasting Groups data collected on the same assessment. In addition, the percentage of
students at or above the proficient cut-score was very similar across grades for both the general
and alternate assessments, which is desirable from a policy and accountability perspective.
Although these data provide strong validity evidence in support of the new procedure,
there were other validity data that we were not able to collect that might provide additional
insight about the procedure. For example, data comparing cut-scores and perceived differences in transparency from a study that compared the BoW method with construct maps with the traditional BoW procedure would be valuable.
Ideally, these data should be collected from a group of panelists that implemented both the
typical and new approach. We suspect that if these data were collected, the new procedure
would receive higher ratings in terms of transparency and that panelists would demonstrate
greater understanding of relationships between data. However, without the evidence to support
this claim, this is just a conjecture at this point. Further method comparison studies that compare
the new BoW approach with other standard-setting procedures would also be valuable.
Additional research that probed panelists on their use and understanding of the construct
maps would also be beneficial. Several of the panelists in their written descriptions made
comments demonstrating that they were using the construct maps appropriately and that they found the construct maps to be a valuable visual aid in standard setting. However, deeper
interviews and think-aloud type studies might provide additional insight into the processes that
panelists were using beyond what we were able to capture on the feedback forms.
Future research could also explore and isolate the impact of showing specific data
elements in the construct maps. For example, our study did not directly investigate the impact of
showing something like impact data at different points in the process. Our data did show greater
changes for the alternate assessment between round 1 and round 2 when these data were shown
earlier in the process. However, this was not a pure experimental study in which one group had
these data and another did not. Finally, additional research should investigate applications of the BoW method with construct maps in other contexts. Our applications only
focused on two writing assessments in a single state. The method may work differently in other
contexts.
References
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-597). Washington, DC: American Council on Education.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425-461.
Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Kingston, N. M., Kahl, S. R., Sweeney, K., & Bay, L. (2001). Setting performance standards using the Body of Work method. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Luecht, R. M. (2008). Assessment engineering: Moving from theory to practice. Paper presented at the annual meeting of the National Council on Measurement in Education.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.
Zieky, M. J., Perie, M. J., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Figure 1 Note: The figure shows three cut-score placements for the panelist, with the word “Cut-Score” and a line drawn through each row that signifies a cut-score. The performance level category labels are written above and below each line to remind the panelist of the meaning of each cut-score.
Figure 2: Results from Qualitative Responses on Processes Used by Panelists in Modified Body of Work Method
Table 1: Example of a Construct Map for a Grade 5 Mathematics Assessment (excerpt)
PAC      Teacher's Students   Work Samples   Raters' Cut-Scores   Score Scale   RP67      Item and Domain Scores
9.40%    Student B            DD, EE, FF     R1, R6, R9           222           Item 22   .816  .683  .705  .717
10.1%                                                             221                     .812  .677  .700  .713
10.80%                                                            220           Item 21   .808  .672  .696  .708
…
Table Note: The PAC column shows the percent at or above cut-score. The Teacher’s Students column shows the location of students
in the teacher’s classroom. The Work Samples column shows the location of student work samples. The Raters’ Cut-Scores column
shows the location of the standard-setting judges’ ratings. The Score Scale shows the underlying mathematics achievement construct.
The RP67 locations represent Bookmark locations where the response probability is equal to 0.67. The Item Scores are the expected
item scores and the Domain Scores are the expected proportion correct scores in the domains.
Table 2: Example of a General Construct Map for Body of Work Method with Construct Maps
Table Note: The rows in the construct map show the meaning of each possible cut-score in terms of the PAC (percent at or above cut-
score) and the work samples reviewed in round 1 and 2. Cut-scores and rater information for the partially proficient, proficient, and
advanced cut-scores are recorded and shown in those columns in the construct map during the rounds of standard setting.
Table 3: Example of Feedback Information for the Modified Body of Work Method between Rounds 2 and 3
Table Note: The group median cut-scores are shown with the horizontal shaded rows in the construct map. Data on the PAC and the
Round 1 and Round 2 Work Samples are shown in those columns. Each rater is represented with an individual rater number.
Table 4: Types of Ratings Provided by the Standard-Setting Panelists for General Assessments
Table 5: Types of Ratings Provided by the Standard-Setting Panelists for Alternate Assessment
Table 6 Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.
Table 7 Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.
Table 8: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the General Assessments
Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,
Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.
Table 9: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the Alternate Assessments based on Modified
Achievement Standards
Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,
Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.