Wyse, A. E., Bunch, M. B., Deville, C., & Viger, S. G. (2014). A body of work standard-setting method with construct maps. Educational and Psychological Measurement, 74(2), 236-262. DOI: 10.1177/0013164413502037
Abstract
This article describes a novel variation of the Body of Work method that utilizes
construct maps to overcome challenges of transparency, rater inconsistency, and score gaps
commonly occurring with the Body of Work method. The Body of Work method with construct
maps was implemented to set cut-scores for two separate K-12 assessment programs in a large
Midwestern state. Data from the standard settings were used to investigate the procedural,
internal, and external validity of the Body of Work method with construct maps. Results
suggested that the method had strong procedural, internal, and external validity evidence to
support its application. The article concludes with discussion and some areas for future research.
Keywords: Standard Setting, Body of Work, Construct Maps, Validity Evidence, Large-scale
Assessment
Introduction
The Body of Work (BoW) method (Kingston, Kahl, Sweeney, & Bay, 2001) is a holistic standard-setting method commonly applied to large-scale assessments, especially when the assessment contains many open-ended items. In the typical
procedure, panelists review samples of student work and sort those samples into piles to indicate
cut-scores. The procedure usually involves multiple rounds of ratings with discussion and feedback between rounds.
Despite the popularity of the BoW method, there are several challenges with the method
in practice. These include (1) rater inconsistency concerns in which samples with the same score
are placed into different piles or samples with higher scores are placed into lower performance
categories than samples with lower scores or vice versa, (2) score gap concerns in which some
score locations on the underlying score scale do not have work samples, and (3) challenges with
panelists understanding how their recommendations translate into cut-scores and are related to other standard-setting data (i.e., transparency concerns).
The purpose of this article is to describe a new variation of the BoW method that can be
used to overcome many of these challenges. The variation uses construct maps (Wyse, 2013;
Luecht, 2008; Wilson, 2005) to integrate and visually display the multiple pieces of data received
during standard setting. Panelists use the construct maps at each stage of standard setting to record their judgments and to see how those judgments relate to the standard-setting data.
In the next section, the typical BoW procedure is described. Discussion is given as to
how the typical procedure can result in rater inconsistency, score gaps, and transparency issues.
The new BoW procedure with construct maps is then outlined. The following section outlines
data from two standard settings that used the new procedure: a standard setting on a large
Midwestern state’s general assessment and a standard setting on a large Midwestern state’s
alternate assessments based on modified achievement standards. Then, results on the procedural,
internal, and external validity evidence (Kane, 1994; 2001) based on these data are provided. The
article concludes with discussion and some areas for future research.
Typical Procedure
The typical BoW procedure (Kingston et al., 2001) consists of a practice round followed
by three operational rounds of ratings. The panelists’ task in the BoW procedure is to consider
samples of student work and to sort the samples into different categories so that numerical cut-
scores can be determined to separate the examinees into the performance levels described in the performance level descriptors (PLDs).
In the practice round, panelists are given a small number of work samples. The samples
are often selected such that there is one high scoring sample, one low scoring sample, and a
couple of samples with scores in the middle of the score range. Depending on how the BoW
method is being implemented, the samples can be shown with or without the scores and the
samples may or may not be shown in order based on their scores (Cizek & Bunch, 2007;
Kingston et al., 2001; Zieky, Perie, & Livingston, 2008). Panelists are asked to look at the
samples and provide ratings for each sample. The goal in the practice round is not to precisely identify
cut-scores, but to ensure that panelists have a solid understanding of the materials and processes.
Following the practice round, panelists participate in the three rounds of operational
ratings. The first round is called the range finding round (Kingston et al., 2001). In this round,
panelists consider work samples that span the range of scores that examinees can obtain on the
assessment. The panelists' task is very similar to the practice round in that they sort the samples into piles to indicate their cut-scores.
It is important to point out that even though the first round samples are designed to span
the range of possible scores, there are often gaps along the score scale in which no work samples
are shown. This occurs because the score scale underlying the assessment is usually created by
scaling the test using a psychometric model, such as an item response theory (IRT) model or
cognitive diagnostic model. As a result, not every point on
the scale is represented with a work sample. This creates a challenge because it is difficult to
precisely set cut-scores at some locations on the scale; yet these represent viable potential cut-
scores. Another challenge that can make it hard to estimate cut-scores is the presence of rater
inconsistency, where panelists place samples with similar scores into different categories.
This can occur because panelists are often not forced to provide ratings in which higher scoring
samples need to have ratings greater than or equal to ratings for samples with lower scores.
Following the range finding round, cut-score ranges may be determined or actual cut-scores may be estimated. If cut-score ranges are being employed, the ranges are usually
identified by looking at the panelists’ ratings and identifying the locations where panelists begin
to move papers from one category to the next (see Kingston et al., 2001). If the cut-scores are
reported as single numbers, they are typically estimated using a logistic regression
model. In this case, the panelists’ ratings are input into the model and the score associated with a
fifty percent chance of moving from one category to the next is the cut-score. The feedback
information given to the panelists following the first round is often in the form of frequencies of
the ratings assigned to each sample and the estimated cut-scores for each category (Zieky et al.,
2008). Panelists discuss these data and then move to the second round.
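The logistic regression step described above can be sketched in a few lines of code. The example below is a minimal illustration with invented scores and ratings, not the authors' implementation: it fits a binary logistic model to panelists' placements (1 = sample placed at or above the level in question) and solves for the score with a 50 percent chance of being placed at or above the level.

```python
# Minimal sketch (hypothetical data, not the authors' code) of the logistic
# regression cut-score estimate used in the typical BoW procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Work-sample scale scores and whether a panelist placed each sample at or
# above the performance level of interest (1 = yes, 0 = no).
scores = np.array([205, 210, 215, 218, 220, 222, 225, 230, 235, 240]).reshape(-1, 1)
above = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# A large C approximates an unpenalized maximum-likelihood fit.
model = LogisticRegression(C=1e6, max_iter=1000).fit(scores, above)

# The cut-score is where the predicted probability equals .50,
# i.e., where intercept + slope * score = 0.
cut_score = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated cut-score: {cut_score:.1f}")
```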
Herein lies another set of challenges, which is that panelists often have a hard time seeing
how their ratings translate into the cut-scores and relate to various standard-setting data. This
occurs because the relationships between ratings, estimated cut-scores, and other data are often
not fully transparent. For example, there may be three samples at one score, two at another score,
one sample at a few different scores and so on. The question becomes, how does one distribute
their ratings to produce appropriate cut-scores? For most panelists, this is not a straightforward task. In all
fairness, one of the goals of the facilitator in the typical procedure is to try to help panelists
understand this process. The facilitator may explain, for example, that to lower their cut-score
they should give more ratings to some of the lower scoring papers. However, even with the aid
of a skilled facilitator, knowing how to redistribute ratings to produce any specific cut-score is not always clear. Panelists may also want to make adjustments in response to feedback, and doing so is not always straightforward.
The second round of the BoW procedure is aptly named the pinpointing round (Kingston
et al., 2001). The process that panelists use to indicate their cut-scores is very similar; they look
at each sample and sort them into piles to indicate their cut-scores. However, the range of work
samples is narrowed in comparison to round 1. Typically, the range is adjusted based on the cut-
scores estimated in round 1. Many of the concerns with score gaps, rater inconsistency, and
transparency persist in round 2. Following the second round, the cut-scores are estimated using
logistic regression, panelists are given feedback information, and they discuss these data. One
new piece of feedback information often included is the impact data for the round 2 cut-scores.
The final round allows the panelists the opportunity to reexamine the samples from round
2 in light of the impact data and feedback information. Panelists can make their final judgments
by either sorting the work samples into piles again or, if the scores assigned to the work samples are shown, by simply indicating numbers for their cut-scores. Again, many of
the same challenges that existed in rounds 1 and 2 are still present in the final round.
Even though the typical BoW method is a legitimate and defensible approach to
determining cut-scores, there are challenges with its application that can impact cut-score
estimates. In the next section, we describe a novel approach to the BoW method that uses construct maps to address these challenges.
New Procedure
Construct Maps
The new BoW procedure hinges on the use of construct maps. Construct maps are visual
displays that show how examinee and/or item data are related to an underlying achievement
construct and corresponding score scale (Authors, 2012; Luecht, 2008; Wilson, 2005). Construct
maps have three critical components: (1) the content that defines the construct is well defined,
(2) the construct can be represented by an underlying continuum in which items and examinees
can be ordered along that continuum, and (3) a measurement model can be applied to show the
relationships between the scoring continuum and the item and examinee data from the
assessment in a chart or graphic (Authors, 2012; Luecht, 2008; Wilson, 2005). Many commonly
used psychometric models can be applied to data to create empirical construct maps.
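As a rough sketch of how such an empirical construct map might be assembled (the item difficulties, ability distribution, and scale transformation below are invented for illustration, and this is not the authors' code), one can tabulate, for each point on the reporting scale, the percent of examinees at or above that point and the Rasch expected item scores:

```python
# Hedged sketch: building a small empirical construct map from a Rasch model.
import numpy as np
import pandas as pd

item_difficulties = {"Item A": 0.4, "Item B": 0.6}                  # hypothetical logits
examinee_thetas = np.random.default_rng(1).normal(0.2, 1.0, 5000)   # hypothetical ability estimates

def to_scale(theta, slope=20.0, intercept=225.0):
    """Hypothetical linear transformation of the logit scale to a reporting scale."""
    return np.clip(slope * theta + intercept, 100, 300)

def rasch_expected_score(theta, b):
    """Expected score (probability correct) for a dichotomous Rasch item."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

scale_points = np.arange(218, 226)                # a slice of the 100-300 scale
thetas = (scale_points - 225.0) / 20.0            # invert the transformation

construct_map = pd.DataFrame({"Score Scale": scale_points})
construct_map["PAC (%)"] = [
    round(100 * np.mean(to_scale(examinee_thetas) >= s), 1) for s in scale_points
]
for name, b in item_difficulties.items():
    construct_map[name] = rasch_expected_score(thetas, b).round(3)

print(construct_map.to_string(index=False))
```

Reading across a row of the resulting table shows how a potential cut-score relates to the examinee and item data, which is the core idea behind the construct map displays described in this article.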
An example of an empirical construct map for a grade 5 mathematics assessment that applied IRT models is shown in Table 1. In the construct map in Table 1, the score scale values from 220 to 230 are shown in the score scale column and
different quantities computed from item data, examinee data, and various ratings are shown in
separate columns. The construct map clearly shows that there are specific relationships between
the potential scores on the underlying achievement continuum and various data elements. Many
of these same data elements are what panelists focus on in different standard-setting procedures.
For example, the item scores in each row in Table 1 are the values of the item characteristic
curves from IRT models and they correspond to the data that panelists consider when providing
ratings in the Angoff procedure (Angoff, 1971). Authors (2012) explain how data in the general construct map can be connected to several other standard-setting procedures.
Construct maps have important potential applications in standard setting because they can
help panelists to better understand the relationship between the examinee and item data and how
they translate into cut-scores; all panelists need to do is look across a row in the construct map to
see the relationship between that potential cut-score and the data that they are reviewing. Further,
construct maps can help panelists to provide improved judgments that can reduce concerns of
rater inconsistency, score gaps, and transparency. The next section describes how construct maps
can be used for these purposes in the context of the BoW procedure.
In the BoW method, panelists are asked to consider student work samples in each round
of standard setting. Additionally, panelists are often given impact data and feedback information
displaying their cut-scores in relation to those of other panelists. Therefore, these data are
contained in the construct map for the BoW method (Table 2).
For the sake of this article, it is assumed that the underlying scoring continuum is created
using an IRT model, specifically the Rasch model, since that was the model the state that
conducted the standard settings used to create scale scores. However, the use of the Rasch model is not a requirement of the method; other psychometric models could be used to create the construct maps.
In the construct map (Table 2) the score scale, which is a linear transformation of the IRT
ability scale, is shown in the column labeled “score scale.” The columns labeled “Round 1 Work
Samples” and “Round 2 Work Samples” show the scale locations of work samples in the first
and second round. These locations are determined from IRT ability estimates assigned to each
sample after the test is administered. Since we are applying the Rasch model, examinees with the
same raw score will have the same scale score, and work samples with the same raw score will
appear in the same row. The “PAC” column shows data on the percent at or above cut-score
(PAC) and it is determined from the distribution of ability estimates on the assessment. The
“Partially Proficient”, “Proficient”, and “Advanced” cut-score columns are the places where the
panelists record their ratings in each round and get to see rater feedback information on cut-
scores in between rounds. The number and names of these columns can be changed based on the
standard-setting application. By reading across each row in the construct map, panelists can see
how their cut-scores are related to the other data used in standard setting.
Similar to the typical BoW method, the BoW method with construct maps begins with
training, the opportunity to take the assessment and discuss the test items and rubrics, and in
depth consideration of the PLDs. The main difference in these initial steps is in the training,
which introduces construct maps and explains how the construct maps show the work samples
and standard-setting data that panelists will see, and outlines how the panelists use these materials to make their cut-score judgments.
Prior to the operational rounds, there is a practice round to orient the panelists to the
materials that are used in the procedure. For the applications in this article, panelists were given
five work samples, one of which was a high scoring sample, one of which was a low scoring
sample, and three work samples in the middle of the score range. Each work sample had the
scale score on a cover sheet, the total score on the multiple-choice items, the total score on the
open-ended items, the answer choices chosen by the examinees on the multiple-choice items and
the item p-values, and the writing responses with scores on the rubrics and annotations as to why
each response was given the scores that it received. Panelists also received a construct map that
showed the five work samples. The work samples were shown in order from lowest to highest
score. Panelists were instructed to indicate their cut-score by looking at the construct map,
marking a score on the construct map, and then transferring this score to a rating sheet.
In the first round, panelists received the round 1 work samples and a construct map that
showed the samples, the score scale, and the columns where they recommended their cut-scores.
Panelists did not see the PAC or round 2 work samples; those were shown in later rounds. Fifty
work samples spread out across the range of possible scores were given in round 1. The work
samples had various combinations of scores including higher multiple-choice scores and lower
open-ended scores, moderate scores on both sections, and higher open-ended scores and lower
multiple-choice scores. It was explained that samples that had the same number of total points
had to be classified into the same category and that this would be reflected in the cut-score
selected in the construct map. We assumed the underlying score scale ranged from 100 to 300,
and this scale, in one point increments, was displayed in the construct map.
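A short sketch of how work samples might be placed on such a scale is shown below; the raw-score-to-theta table, slope, and intercept are hypothetical, and the code is only an illustration of the idea that, under the Rasch model, samples with the same raw score share a construct-map row while unused scale points remain as score gaps.

```python
# Hedged sketch: placing work samples on a 100-300 reporting scale under the Rasch model.
raw_to_theta = {10: -1.20, 15: -0.55, 20: -0.05, 25: 0.40, 30: 0.90}   # hypothetical calibration table

def scale_score(theta, slope=20.0, intercept=225.0, lo=100, hi=300):
    """Hypothetical linear transformation, rounded to one-point increments."""
    return int(round(min(max(slope * theta + intercept, lo), hi)))

samples = {"Sample 1": 10, "Sample 2": 15, "Sample 3": 15, "Sample 4": 25}   # raw scores

rows = {}                                    # construct-map row -> samples at that row
for name, raw in samples.items():
    rows.setdefault(scale_score(raw_to_theta[raw]), []).append(name)

for point, names in sorted(rows.items()):
    print(point, ", ".join(names))           # samples with equal raw scores share a row

gaps = [p for p in range(min(rows), max(rows) + 1) if p not in rows]
print("Scale points with no sample (score gaps):", len(gaps))
```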
The number of work samples and score scale can change based on the specific
implementation. The samples were shown to the panelists in order from the lowest score (e.g.,
sample 1 in Table 2) to the highest score (e.g., sample 20 in Table 2). Some of the samples had
the same score and panelists were told that they can find these locations in the rows with more
than one sample (e.g., a score of 220 in Table 2). Panelists were also told that there were
locations where there were no samples (e.g., a score of 150 in Table 2). These locations could still be selected as cut-scores.
To use the construct map to set cut-scores, panelists were instructed to start with the first
cut-score and consider the examinee that would just barely be considered at that level of
performance. In our applications, this was the just barely partially proficient student for the
general assessments and the just barely proficient student for the alternate assessments based on
modified achievement standards. For the sake of illustration, the explanation in this section is for
the general assessments. In this case, panelists were instructed to look at the first sample and ask
themselves whether or not the sample represented just barely partially proficient performance. If
they believed it was not indicative of just barely partially proficient performance, then they were
instructed to move to the next sample. This sample had a score that would be greater than or
equal to the previous sample that they just examined. As they moved from sample to sample they
were told to examine the construct map to see the location of the sample and relationship of the
score assigned to the sample to the potential cut-scores that could be set. When they reached a
point where they thought they had found a sample that represented just barely partially proficient
performance, they were instructed to examine that sample and a few samples before and after it
to select their cut-score. Their cut-score could be at the location of a sample or in between
several samples. Panelists were told to select a cut-score at a sample if they believed the sample
was indicative of work right on the border of what the just barely partially proficient student
should be able to do. They were told to select a cut-score in between two samples if one sample
was clearly above and the other sample was clearly below what the just barely partially
proficient student should be able do. Panelists were encouraged to write the word “Cut-Score” in
the row representing their cut-score, draw a line through that score, and write the words “Not
Proficient” above the line and “Partially Proficient” below the line. Writing the words was
designed to remind panelists of the meaning of their cut-score. Panelists used a similar process to set the remaining cut-scores.
The benefits to this process are that it helps reduce score gap concerns because panelists
can select cut-scores in between samples (and there is a sound rationale for doing this, one
sample is clearly too high and one sample is clearly too low in relationship to the PLD) and that
it virtually removes rater inconsistency because the construct map clearly shows when a cut-
score will result in samples with higher scores being placed into lower performance categories
and vice versa. The process for choosing their cut-scores in the construct map in essence forces
the panelists to make a decision of the best location for the cut-score and does not allow them to
place samples with similar scores into different performance categories. The construct map also
clearly shows the relationships between the standard-setting data and cut-scores. After round 1, a construct map containing feedback information was prepared for the panelists.
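The consistency argument can be made concrete with a tiny sketch (hypothetical cut-scores and sample scores, not the study's data): once each judgment is a single point on the scale, a sample's category is a monotone function of its score, so samples with equal scores always share a category and higher-scoring samples can never be placed below lower-scoring ones.

```python
# Hedged sketch: classification with a single set of numeric cut-scores is monotone.
import bisect

def classify(score, cut_scores, labels):
    """cut_scores ascending; a sample at or above a cut-score reaches that level."""
    return labels[bisect.bisect_right(cut_scores, score)]

cuts = [212, 228, 246]                                            # hypothetical cut-scores
labels = ["Not Proficient", "Partially Proficient", "Proficient", "Advanced"]

for s in [205, 212, 212, 220, 230]:                               # hypothetical sample scores
    print(s, classify(s, cuts, labels))
```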
The cut-scores were computed by taking the median of the panelists’ ratings. It was
possible to use the median, instead of the more complicated logistic regression procedure, since
each panelist indicated a single scale score representing their best judgment for a cut-score. This
approach greatly simplifies the computational aspects of the procedure and makes it easier for
panelists to understand how the cut-scores were calculated. Most panelists are familiar with
medians and how medians are computed. It is also possible to use the mean to compute cut-
scores, but we used the median since the median is not as impacted by extreme judgments.
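A small illustration of this design choice, with invented ratings: a single extreme recommendation shifts the mean noticeably but leaves the median unchanged.

```python
# Hedged sketch: median versus mean cut-score with one outlying panelist rating.
import statistics

round_ratings = [218, 220, 221, 222, 223, 224, 260]          # hypothetical panelist cut-scores
print("median:", statistics.median(round_ratings))           # 222
print("mean:  ", round(statistics.mean(round_ratings), 1))   # pulled upward by the 260
```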
Panelists then discussed feedback data received in construct maps between rounds 1 and
2. The process for recommending cut-scores in round 2 was very similar to round 1 except that
panelists considered targeted work samples and they could see these samples and how they relate
to the samples from round 1. Thirty new work samples were shown in round 2. These samples again included different combinations of scores. Each set of samples was targeted around the cut-scores estimated in round 1.
Following the second round, panelists received additional feedback (Table 3). The group
cut-scores were shown with horizontal shaded lines and the individual panelist cut-scores were
shown by rater numbers. This construct map was very similar to the one that the panelists used to
provide their ratings in the rounds of standard setting and saw as feedback between other rounds.
This was intentional and was designed to help integrate the process and make it clear how each
piece of data related to their ratings and estimated cut-scores. In the last round, panelists saw
information on the samples and the impact data. The panelists were instructed to make their final recommendations by considering all of the data in the construct maps and their relationship to the
PLDs. Panelists indicated their final cut-scores by again selecting a score in the construct map.
Data
The new BoW procedure with construct maps was used to set cut-scores in two different
writing assessment programs in a large Midwestern state. The first program was a state general
writing assessment program. In this program, assessments were given in grades 4 and 7 to over
100,000 examinees. Both tests consisted of: one narrative writing prompt that was analytically
scored and worth 15 points, one informational writing prompt that was analytically scored and
worth 15 points, one holistically scored item that focused on providing feedback and editing a
piece of writing that was worth 4 points, and 16 multiple-choice questions each worth one point.
Each 50 point test was scaled using the Rasch model and put onto a scale ranging from 100 to
300. The choice of the scale of 100 to 300 was arbitrary, but was chosen so as not to overlap with
scales for the general assessments. Typical scales span a 200 to 250 point range and have the
proficient cut-score anchored at a specific number to aid score interpretations. It was made clear
that the scale shown was designed to give panelists an idea of the relationship between the
samples and possible cut-scores and that the final scale would be based on the final cut-scores.
panelists participated at grade 7. The panelists were selected through a nomination and
recruitment process. Panelists represented a range of geographical regions and school districts and varied in their educational experience. Most of the panelists were educators, but there were three
business and community members in grade 4 and four business and community members in
grade 7. In grade 4, 82% of the panelists were white and 77% of the panelists were female. In
grade 7, 87% of the panelists were white and 87% of the panelists were female.
The BoW method with construct maps consisted of the three rounds of judgments as
described above with impact data shown between the second and third rounds. Throughout the
standard-setting process, process evaluation forms were given to ensure that the procedure was
functioning as intended. The process evaluations had a series of open-ended questions that asked
about various aspects of the procedure and the approaches panelists used (or thought they would
use) to determine their cut-scores. Many of the questions were identical across rounds except for
changing the wording to reference the specific round in question. Data from the process
evaluations on the question that asked panelists “Please describe how you used (or think you will
use) the standard-setting materials provided by the facilitator to provide your standard-setting
judgments in Round __” are used to help provide an idea of the processes and materials that
panelists were focusing on when providing their standard-setting judgments. A final evaluation
form was given at the end of the standard setting, which contained a mixture of open-ended
and rating scale questions. The rating scale questions ranged from one to four with a rating of
one indicating strongly disagree, two indicating disagree, three indicating agree, and four
indicating strongly agree. We also used data from the rating scale questions on the final
evaluations that dealt with various aspects of the procedure as another set of data. In addition, we examined the three cut-scores, their associated impact data, and whether ratings were placed on papers or in score gaps in the construct maps across rounds.
The second set of data comes from a state alternate assessment writing program based on
modified achievement standards. This standard setting took place approximately one year after
the standard setting for the general assessments. Similar to the general assessments, the alternate assessments were given in grades 4 and 7.
Approximately 2,000 examinees took the assessment in each grade. The assessments were also
scaled using the Rasch model and placed on a scale from 100 to 300. The tests consisted of: one
15 point narrative prompt that was analytically scored, one 15 point informational prompt that
was analytically scored, and 10 multiple-choice questions which were each worth one point.
Eleven panelists participated in the grade 4 standard setting and twelve panelists
participated in the grade 7 standard setting. The panelists were recruited using the same
nomination and recruitment process as was used with the general assessment. Fewer panelists
were nominated and available to participate for the alternate assessment standard setting.
Panelists again represented a range of geographical regions, school districts, and levels of
educational experience. All of the participants were educators. In grade 4, 81% of the panelists
were white and 81% of the panelists were female. In grade 7, 83% of the panelists were white
The standard-setting procedure used for the alternate assessments was similar in many
respects to the approach used with the general assessment. In particular, panelists again received
similar training, used construct maps to complete three rounds of ratings, and were asked many
of the same evaluation questions as in the general assessment standard setting. However, there
were several key differences. First, panelists only had to give two cut-score ratings instead of
three. Second, impact data were shown in the construct maps after round 1. The inclusion of these
data earlier in the process was a policy consideration and decision made by the state since the
state had recently conducted a study to reset cut-scores on several of the general assessments for
other content areas. The reset cut-scores produced dramatic changes in the percentage of students classified as proficient, and the state felt that showing these data earlier in the process would be instructive.
A third difference was that panelists were given results from a separate Contrasting
Groups (Livingston & Zieky, 1982) study to help benchmark their judgments in between round 1
and 2. These data were collected from teachers when the alternate assessments were given to the
examinees. In both grades, data were filled in on roughly 30% of the 2,000 answer documents.
Teachers gave predictions of the categories that they thought each examinee would obtain after
considering generic descriptions of what it meant to score in each category. These statements
were very similar to the policy statements in the PLDs. The predictions were used to calculate
ranges of cut-scores for the proficient cut-score only. Ranges were not found for the advanced
cut-score since very few examinees were rated into the advanced category and finding stable
estimates proved difficult. The ranges of cut-scores were determined by analyzing the data using
the approaches suggested for calculating cut-scores with Contrasting Groups in Cizek and Bunch
(2007) (i.e., logistic regression, midpoints between means, and midpoints between medians).
Panelists were presented with these data and were told how they compared to the ranges of cut-
scores they recommended in round 1. Panelists could place marks on their construct maps to indicate where these benchmark ranges fell relative to their own cut-scores.
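For readers who want to see how such benchmark ranges might be computed, the sketch below applies the three Contrasting Groups summaries mentioned above (following Cizek & Bunch, 2007) to hypothetical teacher judgments; it is an illustration under assumed data, not the study's analysis code.

```python
# Hedged sketch: three Contrasting Groups cut-score estimates from hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
below = rng.normal(175, 12, 300)        # scores of students judged below proficient
above = rng.normal(205, 12, 300)        # scores of students judged proficient or above

scores = np.concatenate([below, above]).reshape(-1, 1)
judged = np.concatenate([np.zeros(300), np.ones(300)])

model = LogisticRegression(C=1e6, max_iter=1000).fit(scores, judged)
logit_cut = -model.intercept_[0] / model.coef_[0][0]      # score where P(proficient) = .50
mean_cut = (below.mean() + above.mean()) / 2              # midpoint between group means
median_cut = (np.median(below) + np.median(above)) / 2    # midpoint between group medians

print(f"logistic regression: {logit_cut:.0f}")
print(f"midpoint of means:   {mean_cut:.0f}")
print(f"midpoint of medians: {median_cut:.0f}")
```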
A final difference between the two implementations was that after the last round, an articulation subcommittee was convened to review the cut-score recommendations from both grades. This subcommittee consisted of four panelists from each
grade level. Participants in the articulation committee got to see the cut-scores and the impact
data for both grades. Panelists were asked how the process worked in each committee. The
committee was told they could adjust the cut-scores if they so desired after discussion. No
changes were made in the articulation meeting. Separate evaluation forms were administered
after the articulation meeting that asked open-ended and rating scale questions. To investigate the
method, the same data were used with this assessment as the general assessment with the
addition of data from the Contrasting Groups study and articulation committee evaluations.
Results
The results section is broken down into three separate subsections, each of which focuses on one of the types of validity evidence suggested by Kane (1994, 2001) for evaluating standard setting.
These types of validity are procedural, internal, and external validity evidence. Procedural
validity deals with information about procedural aspects of the method, internal validity deals
with data about the consistency of the cut-scores, and external validity deals with data about the relationship of the cut-scores to other sources of information, such as results from other standard-setting methods.
Common procedural challenges with the traditional BoW method were major drivers of
the development of BoW with construct maps. These concerns include: score gaps, rater
inconsistency, and transparency. In the new procedure, the challenge of rater inconsistency was
essentially removed because panelists were asked to give judgments for cut-scores that place samples with the same score into the same category and in which higher scoring samples fall into a performance category equal to or greater than that of lower scoring samples. Score gaps were still present in the new procedure, but panelists could mitigate their impact because they could recommend cut-scores between samples in score gaps if they so chose.
The frequencies of ratings on papers and in score gaps for the general assessments (Table 4) and the alternate assessments (Table 5) show that several ratings were placed into score gaps. For the general assessments (Table 4), most of the ratings were still on papers, especially for the fourth grade committee, and there were several cut-scores with no
ratings in score gaps. For the alternate assessments (Table 5), there were a greater percentage of
ratings placed in the score gaps, and in some cases the number of ratings placed into score gaps
exceeded the number of ratings on papers. The differences could be a function of the group
dynamics in each committee or the unique aspects of the papers reviewed by the various
committees. The differences could also be a function of the fact that with fewer items on the
alternate assessments there were more score gaps, which creates the potential for increased score
gap ratings. Nonetheless, the ratings produced provide strong evidence in support of the BoW
method with construct maps and the need for the ability to give ratings in score gaps.
A second piece of procedural evidence in support of BoW with construct maps was
qualitative data in response to the question “Please describe how you used (or think you will use) the standard-setting materials provided by the facilitator to provide your standard-setting judgments in Round __.” The responses were coded by a trained rater and categorized based on
the materials or information that the panelist described using or anticipated using when providing
their judgments. Responses were coded into the following categories: PLDs, construct maps,
work samples, feedback information, test content (i.e., focused on content of specific items),
item stats, discussion, scoring rubrics, professional judgment, and blank/off-topic. Each time a
panelist specifically referenced one of these materials or described a process linked to one of
these materials it was coded in that category for that panelist. Responses that were hard to
categorize were flagged and reviewed by the first author in conjunction with the coder.
The percentages of ratings in each of the coding categories across panelists for the four
standard settings are shown in Figure 2. The top right panel shows the results for the general
assessment at grade 4, the top left panel shows the results for the general assessment at grade 7,
the bottom right panel shows the results for the alternate assessment at grade 4, and the bottom left panel shows the results for the alternate assessment at grade 7. The responses from each process
questionnaire for different rounds are shown with different colored shaded bars.
Figure 2 shows some interesting patterns. Across all standard settings it appeared that the
processes that people described using focused mainly on the three critical elements of the PLDs,
construct maps, and work samples as the bars for these categories were, in most cases, the
highest bars. The actual percentage of responses in each category did differ slightly across
groups and across rounds. One would expect, if the procedure were working effectively, that these
would be the three highest rated elements. Additionally, it is apparent that feedback information
and discussion were not initially important to panelists, but as the rounds increased these areas
received a greater emphasis. This is also expected because after the first round panelists get to
engage in discussion with their colleagues and receive feedback information. Few panelists
described processes incorporating the scoring rubrics, their own outside professional judgment,
or gave blank or off-topic responses. This is also as one would expect, as panelists were instructed not to rescore papers and to limit their decisions to the materials provided in the standard setting.
Also apparent in Figure 2 is the impact various facilitators have on the process. For
example, in the grade 4 general assessments there seemed to be a higher emphasis on test content
and less of a focus on the construct maps prior to round 1. In this case, a large number of
panelists said they would focus on the content of specific items and few panelists described a
process using the construct map. As the rounds went on, however, the focus on the content of
specific questions was less emphasized and greater emphasis was placed on the construct maps.
No panelists mentioned content of individual items as being important for the grade 7 general
assessments and few panelists mentioned this for the alternate assessments. There was also some
variation in the use of item statistics across panels. However, in general there seemed to be limited
focus on the p-values of particular items. The patterns in Figure 2 lend support to the claim that
most of the panelists took the standard-setting process seriously and focused on the most critical
elements (e.g., PLDs, student work samples, and construct maps) when providing their ratings.
Another piece of procedural validity evidence comes from the responses to rating scale
questions for the panelists and articulation committees. The average ratings for the final
evaluations are shown in Table 6 and the average ratings for the articulation committee questions
are shown in Table 7. Admittedly, some of the questions might fall into the internal and external
validity evidence sections, but they are included here for continuity.
The average ratings from both the final evaluation questions and the articulation
committee questions are uniformly higher than a rating of three for all four standard settings,
indicating that most panelists agreed or strongly agreed with each statement. This is as expected
if the procedure and process worked as intended. If the standard-setting method worked effectively, one would expect panelists to report that the presentation of the PLDs was clear, that their confidence in their judgments increased, that the construct maps assisted them, and that they understood how to use the feedback information. Again, these responses lend support to the procedural validity of the method.
In terms of internal validity, one can look at the cut-score ratings and whether the
standard deviations of the panelists’ ratings decreased across rounds. Decreased standard
deviations indicate that panelists were converging on similar cut-scores. One would expect to
find this pattern in these data, especially given that the question on consensus on the process
evaluation form suggested that panelists felt they had reached some form of consensus (Table 6).
The final cut-scores for the general assessment and alternate assessment are displayed in
Tables 8 and 9, respectively. Each table displays the median, minimum, maximum, and standard deviation of the cut-score recommendations for each round.
The cut-scores, with the exception of the advanced cut-score for grade 4, which went down by 8 scale score points (a change of 4% of the scale score range), remained fairly similar across rounds for the general assessments (Table 8). In addition, the standard deviations of the cut-scores were smaller in rounds 2 and 3 than in round 1. Some of the
standard deviations for the grade 4 panel did slightly increase from round 2 to round 3, although
not in a manner that would suggest panelists had widely different opinions. The range of the cut-
scores in the third round was also fairly narrow as the biggest difference between the minimum
and maximum ratings for any cut-score was only 13 scale score points.
There was a slightly different pattern for the alternate assessments in terms of the median
cut-scores. In particular, Table 9 shows that there were some more dramatic changes in cut-scores from round 1 to round 2. For example, in grade 4 the proficient median cut-score dropped by 10 points from round 1 to round 2 and the advanced cut-score dropped by 28 points. These
bigger changes for the alternate assessments in comparison to the general assessment could have
been a function of the introduction of the impact data earlier in the process or the consideration
of the Contrasting Groups data. These data were not presented after round 1 for the general
assessments. The cut-scores were more similar between rounds 2 and 3. Again, the standard
deviations were smaller in rounds 2 and 3 than in round 1. Similar to the general
assessments, a few of the standard deviations slightly increased from round 2 to round 3. There
did appear to be more disagreement and a larger range of cut-scores for grade 7 than for grade 4 in round 3, although again the standard deviations were not that large in comparison to the scale score range.
The results in Tables 8 and 9 are similar to results that are commonly observed in other
standard-setting processes and seem to provide fairly strong internal validity evidence that by the
last round panelists were able to converge on a set of cut-scores. In the case of the alternate
assessment standard setting, the articulation committee had a chance to review the cut-scores and
endorsed the cut-scores from round 3 with no changes. This provides some additional support for
the cut-scores that were recommended and the process used to arrive at the cut-scores.
The last set of validity evidence is external validity evidence. One piece of external
validity evidence came from Contrasting Groups data collected for the alternate assessments.
These Contrasting Groups data were used to calculate a range of cut-scores to serve as
benchmarks for the round 1 proficient cut-scores. The range of proficient cut-scores for grade 4
was 175 to 213 and the range of proficient cut-scores for grade 7 was 182 to 211. The group
median cut-scores from round 1 for the new BoW procedure were 195 and 184 for grades 4 and
7, respectively. The ranges for the panelist recommendations were 175 to 209 for grade 4 and
163 to 214 for grade 7. The cut-scores fell within the ranges from the Contrasting Groups and the
ranges for the BoW with construct maps and Contrasting Groups were fairly similar. These data
provide some external validity evidence in terms of how cut-scores from the new BoW method
compare with that of another procedure. Although it is not strong additional evidence because
the panelists saw the Contrasting Groups data after round 1, the final cut-scores for the alternate
assessment standard setting also fell within the ranges of the Contrasting Groups data.
Another piece of external validity evidence was the similarity of the percentages of grade 4 and grade 7 students that the committees' recommended cut-scores would place at proficient or above in each assessment program. For the general assessments, the percentages at proficient or above were similar across grades, with 74 percent at proficient or above for grade 7. For the alternate assessments, the percentages of students at or above proficient were likewise similar for grades 4 and 7. From a policy and accountability standpoint, the similarity of the passing percentages is desirable since there will not be disparate impacts in different grades. The fact that the two grades arrived at similar percentages of students at proficient or above is therefore another piece of external validity evidence in support of the recommended cut-scores.
Discussion
Developing data-driven standard-setting approaches that make the process more transparent and that ameliorate factors that can contaminate cut-score estimates is a pressing need in the field of educational measurement. This article makes an important contribution by
illustrating how construct maps can be used to modify the BoW method in a way that allows
panelists with limited training in statistics and assessment to better understand the process and
provide judgments that mitigate rater inconsistency and score gap concerns.
Applications from two separate writing standard-setting processes, one of which came
from a large-scale state general assessment and one of which came from a large-scale state
alternate assessment based on modified achievement standards, illustrated how the method could
be used to effectively set cut-scores in K-12 state testing programs. Data collected provided
procedural, internal, and external validity evidence in support of the new approach. In terms of
procedural validity, data suggested that panelists do in fact provide ratings in score gaps, that for
the most part the processes that panelists described using to determine cut-scores were focused
on the critical elements they should be using with the approach, and their responses to feedback
and evaluation questions demonstrated that the standard settings and various aspects of the
procedure were viewed positively. Data providing internal validity evidence showed decreased
standard deviations over the rounds of standard setting and convergence on a narrow range of
cut-scores. Data providing external validity evidence demonstrated that cut-scores from round 1
of the alternate assessment standard setting were similar to cut-scores determined from a set of
Contrasting Groups data collected on the same assessment. In addition, the percentage of
students at or above the proficient cut-score was very similar across grades for both the general
and alternate assessments, which is desirable from a policy and accountability perspective.
Although these data provide strong validity evidence in support of the new procedure,
there were other validity data that we were not able to collect that might provide additional
insight about the procedure. For example, data comparing cut-scores and perceived differences in transparency from a study that compared the BoW method with construct maps with the traditional BoW procedure would be valuable.
Ideally, these data should be collected from a group of panelists that implemented both the
typical and new approach. We suspect that if these data were collected, the new procedure
would receive higher ratings in terms of transparency and that panelists would demonstrate
greater understanding of relationships between data. However, without the evidence to support
this claim, this is just a conjecture at this point. Further method comparison studies that compare
the new BoW approach with other standard-setting procedures would also be valuable.
Additional research that probed panelists on their use and understanding of the construct
maps would also be beneficial. Several of the panelists in their written descriptions made
comments demonstrating that they were using the construct maps appropriately and that they found the construct maps to be a valuable visual aid in standard setting. However, deeper
interviews and think-aloud type studies might provide additional insight into the processes that
panelists were using beyond what we were able to capture on the feedback forms.
Future research could also explore and isolate the impact of showing specific data
elements in the construct maps. For example, our study did not directly investigate the impact of
showing something like impact data at different points in the process. Our data did show greater
changes for the alternate assessment between round 1 and round 2 when these data were shown
earlier in the process. However, this was not a pure experimental study in which one group had
these data and another did not. Finally, additional research should investigate applications of the BoW method with construct maps in other contexts. Our applications only
focused on two writing assessments in a single state. The method may work differently in other
contexts.
References
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-597). Washington, DC: American Council on Education.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425-461.
Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Kingston, N. M., Kahl, S. R., Sweeney, K., & Bay, L. (2001). Setting performance standards using the Body of Work method. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Luecht, R. M. (2008). Assessment engineering: Moving from theory to practice. Paper presented at the annual meeting of the National Council on Measurement in Education.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.
Zieky, M. J., Perie, M. J., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Figure 1 Note: The figure shows three cut-score placements for the panelist, with the word “Cut-Score” and a line drawn through each row that signifies a cut-score. The performance level category labels are written above and below each line to remind the panelist of the meaning of each cut-score.
Figure 2: Results from Qualitative Responses on Processes Used by Panelists in Modified Body of Work Method
Table 1: Example of a Construct Map for a Grade 5 Mathematics Assessment (excerpt)
PAC      Teacher's Students   Work Samples   Raters' Cut-Scores   Score Scale   RP67      Item and Domain Scores
9.40%    Student B            DD, EE, FF     R1, R6, R9           222           Item 22   .816  .683  .705  .717
10.1%                                                             221                     .812  .677  .700  .713
10.80%                                                            220           Item 21   .808  .672  .696  .708
…
Table Note: The PAC column shows the percent at or above cut-score. The Teacher’s Students column shows the location of students
in the teacher’s classroom. The Work Samples column shows the location of student work samples. The Raters’ Cut-Scores column
shows the location of the standard-setting judges’ ratings. The Score Scale shows the underlying mathematics achievement construct.
The RP67 locations represent Bookmark locations where the response probability is equal to 0.67. The Item Scores are the expected
item scores and the Domain Scores are the expected proportion correct scores in the domains.
Table 2: Example of a General Construct Map for Body of Work Method with Construct Maps
Table Note: The rows in the construct map show the meaning of each possible cut-score in terms of the PAC (percent at or above cut-
score) and the work samples reviewed in round 1 and 2. Cut-scores and rater information for the partially proficient, proficient, and
advanced cut-scores are recorded and shown in those columns in the construct map during the rounds of standard setting.
Table 3: Example of Feedback Information for the Modified Body of Work Method between Rounds 2 and 3
Table Note: The group median cut-scores are shown with the horizontal shaded rows in the construct map. Data on the PAC and the
Round 1 and Round 2 Work Samples are shown in those columns. Each rater is represented with an individual rater number.
Table 4: Types of Ratings Provided by the Standard-Setting Panelists for General Assessments
Table 5: Types of Ratings Provided by the Standard-Setting Panelists for Alternate Assessment
Table 6 Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.
Table 7 Note: For each question, 4 equals strongly agree, 3 equals agree, 2 equals disagree, and 1 equals strongly disagree.
Table 8: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the General Assessments
Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,
Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.
Table 9: Cut-Scores for Three Rounds of Body of Work with Construct Maps for the Alternate Assessments based on Modified
Achievement Standards
Table Note: Median is the median of the cut-score recommendations, SD is the standard deviation of the recommended cut-scores,
Min is the minimum cut-score recommendation, and Max is the maximum cut-score recommendation.