NAEP Scoring

NAEP assessments include multiple-choice items, which are machine-scored by optical mark reflex scanning, and constructed-response items, which are scored by trained scoring staff. These trained scorers ("raters") use an image-based scoring system that routes student responses directly to each rater. Focused, explicit scoring guides are developed to match the criteria emphasized in the assessment frameworks. Consistency of scoring between raters is monitored throughout the process through ongoing reliability checks and frequent backreading. Three types of personnel make up individual scoring teams:
• Raters are professional scorers who are hired to rate individual student responses to the items.
• Supervisors lead teams of raters throughout the scoring process on a daily basis.
• Trainers provide training for the raters and continually monitor the progress of each scoring team.

Team members are required to have, at a minimum, a baccalaureate degree from a four-year college or university. An advanced degree, scoring experience, and/or teaching experience is preferred. Scoring teams use the training process to determine whether each individual rater is sufficiently prepared to score. Following training, each rater is given a pre-scored "qualification set" and is expected to attain 80 percent correct in order to proceed. All scoring is carried out via image processing. To assign a score, raters click a button displayed in a scoring window. Since buttons are included only for valid scores, there is no editing for out-of-range scores.

Two significant advantages of the image-scoring system are the ease of regulating the flow of work to raters and the ease of monitoring scoring. The image system provides scoring supervisors with tools to determine rater qualification, to backread raters, to determine rater calibration, to reset trend rescore items, to monitor trend rescore items through t-statistics reports, to monitor interrater reliability, and to gauge the rate at which scoring is being completed. The scoring supervisors monitor work flow for each item using a status tool that displays the number of responses scored, the number of responses first-scored that still need to be second-scored, the number of responses remaining to be first-scored, and the total number of responses remaining to be scored. This allows the scoring directors and project leads to accurately monitor the rate of scoring and to estimate the time needed to complete the various phases of scoring.
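
The qualification step described above reduces to a simple percent-exact-agreement check against the pre-scored key. A minimal sketch in Python (the 80 percent threshold comes from the text above; the function name and data layout are illustrative, not part of the NAEP system):

```python
def qualifies(rater_scores, key_scores, threshold=0.80):
    """Return True if the rater's exact agreement with the pre-scored
    qualification set meets the required threshold (80 percent)."""
    if len(rater_scores) != len(key_scores) or not key_scores:
        raise ValueError("rater scores and key must be the same, nonzero length")
    exact = sum(r == k for r, k in zip(rater_scores, key_scores))
    return exact / len(key_scores) >= threshold

# Example: 17 of 20 qualification responses match the key (85%), so the rater proceeds.
print(qualifies([3, 2, 1, 3] * 5, [3, 2, 1, 3] * 4 + [3, 3, 3, 1]))
```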


Backreading
After scoring begins, NAEP scoring supervisors review each rater's progress using a backreading utility that allows the scoring supervisor to review papers scored by each rater on the team. Scoring supervisors make certain to note the score the rater awards each response as well as the score a second rater gives that same paper. This is done as an interrater reliability check. Typically, a scoring supervisor reviews approximately 10 percent of all item responses scored by each rater. Alternatively, a scoring supervisor can choose to review all responses given a particular score to determine if the team as a whole is scoring consistently. Both of these review methods use the same display screen, showing the ID number of the rater and the scores awarded. If the scoring supervisor disagrees with the score given an item, he or she discusses it with the rater for possible correction. Whether or not the scoring supervisor agrees with the score, he or she assigns a scoring supervisor score in backreading. If this score agrees with the first score, the score is recorded only for statistical purposes. If the scores disagree, the scoring supervisor score overrides the first score as the reported score. Replacement of scores by the scoring supervisor is done only with the knowledge and approval of the rater, thereby serving as a learning experience for the rater. Changing the score does not change the measurement of interrater reliability.
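
The backreading rule described above can be summarized in a few lines of code. The sketch below (Python; function and field names are illustrative) records the supervisor score in every case but replaces the reported score only on disagreement, leaving the interrater reliability measurement untouched:

```python
def resolve_backread(first_score, supervisor_score):
    """Apply the backreading rule: the supervisor score is always recorded,
    but it becomes the reported score only when it disagrees with the
    rater's score. The first score is kept for reliability statistics."""
    reported = first_score if supervisor_score == first_score else supervisor_score
    return {
        "reported_score": reported,
        "first_score": first_score,            # still used for interrater reliability
        "supervisor_score": supervisor_score,  # recorded in every backread case
    }

print(resolve_backread(3, 3))  # agreement: reported score stays 3
print(resolve_backread(3, 2))  # disagreement: supervisor's 2 becomes the reported score
```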


Calibration of Scoring Raters
For new assessment items, the scoring supervisor of each team invokes calibration as needed from the tool used in backreading. During backreading, the scoring supervisor has a pool of 300 responses for each item to use in the calibration process. The scoring supervisor views samples of these responses together with the scores assigned by the first and, if applicable, second rater. From these, the scoring supervisor chooses responses to place in the calibration pool—responses that have been scored correctly and serve as a good measure to keep scoring on track. From this pool, the supervisor builds sets with the desired number of responses, usually between five and ten. These sets are then released on the image-based scoring system for raters to score. When raters invoke the calibration window, all raters receive the same responses and score them. After all raters have finished scoring this set, the scoring supervisor can look at reliability reports that include only the data from the calibration set just run. This process serves to refresh training and avoid drift in scoring. If pre-scored paper calibration sets already exist, these can be used to calibrate raters instead of the image-based sets created by the scoring supervisor. In general, each team scores calibration sets whenever the raters take a break longer than fifteen minutes, such as when returning from lunch or at the beginning of a shift. Raters scoring trend rescore items are calibrated using trend rescore responses that are already loaded into the system with the scores given during prior year scoring.
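
As a rough illustration of this workflow, the following Python sketch builds a small calibration set from a supervisor's pool of correctly scored responses and computes each rater's exact agreement on that set alone. All names and the pool contents are hypothetical; only the pool size (300) and the set size (five to ten responses) come from the text above:

```python
import random

def build_calibration_set(pool, set_size=10, seed=None):
    """Draw a calibration set from the supervisor's pool of responses whose
    scores are treated as correct. `pool` maps response_id -> approved score."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(pool), k=min(set_size, len(pool)))
    return {rid: pool[rid] for rid in ids}

def calibration_report(cal_set, rater_scores):
    """Percent exact agreement with the approved scores, per rater,
    computed over the calibration set only."""
    report = {}
    for rater, scores in rater_scores.items():
        exact = sum(scores.get(rid) == key for rid, key in cal_set.items())
        report[rater] = 100.0 * exact / len(cal_set)
    return report

# Hypothetical pool of 300 pre-approved responses scored on a 3-point guide.
pool = {f"resp{i:03d}": random.choice([1, 2, 3]) for i in range(300)}
cal = build_calibration_set(pool, set_size=5, seed=1)
raters = {"r1": dict(cal), "r2": {rid: 3 for rid in cal}}  # r1 matches every key
print(calibration_report(cal, raters))
```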


t statistics
A goal in scoring is consistency in the scores assigned to the same responses by different raters within the same year or by different raters across different assessment years. Statistical flags are used to identify items for which scoring is not consistent. A system has been implemented that allows t statistics to be calculated comparing the scores for each item response that has been rescored at different points in the scoring process. To calculate a t statistic, the scoring supervisor executes a command in the report window. The scoring supervisor is then prompted for the item, the application (the purpose for which the t statistic is being computed), and the scoring group to which the item is assigned. The system then displays the results, which are printable. The test results are based only on responses for which there are two scores and for which both scores are on task. The display shows the number of scores compared, the number of scores with exact agreement, the percent of scores with exact agreement, the mean of the scores assigned during scoring for previous assessment years, the mean of the currently assigned scores, the mean difference, the variance of the mean difference, the standard error of the mean difference, and the estimate of the t statistic. The formulas used are as follows:

• Dbar = Mean Score 2 – Mean Score 1, where Dbar is the mean difference, Mean Score 1 is the mean of all scores assigned by the first rater, and Mean Score 2 is the mean of all scores assigned by the second rater.

• DiffDbarsq = ((Score 2 – Score 1) – Dbar)², where DiffDbarsq is calculated for each score comparison.

• VarDbar = (sum(DiffDbarsq))/(N – 1), where VarDbar is the variance of the mean difference.

• SEDbar = SQRT(VarDbar/N), where SEDbar is the standard error of the mean difference, and N is the number of responses with two scores assigned by two different raters.

• Percent Exact Agreement = (number of responses with identical scores)/(total number of double-scored responses being compared), where exact agreement means a response received identical scores from two different raters.

• T = Dbar/SEDbar
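
The formulas above translate directly into code. The following Python sketch (function name and return layout are illustrative) computes Dbar, VarDbar, SEDbar, the t estimate, and percent exact agreement for a set of double-scored responses, and flags whether the t estimate falls in the acceptable range of -1.5 to 1.5 described below:

```python
from math import sqrt

def trend_t_statistic(first_scores, second_scores):
    """Compute the monitoring statistics defined above for responses that
    received two on-task scores (first rater vs. second rater)."""
    n = len(first_scores)
    if n != len(second_scores) or n < 2:
        raise ValueError("need at least two double-scored responses")
    dbar = sum(second_scores) / n - sum(first_scores) / n
    diffs_sq = [((s2 - s1) - dbar) ** 2 for s1, s2 in zip(first_scores, second_scores)]
    var_dbar = sum(diffs_sq) / (n - 1)
    se_dbar = sqrt(var_dbar / n)
    exact = sum(s1 == s2 for s1, s2 in zip(first_scores, second_scores))
    t = dbar / se_dbar if se_dbar else 0.0  # all differences identical -> t treated as 0
    return {
        "Dbar": dbar,
        "VarDbar": var_dbar,
        "SEDbar": se_dbar,
        "t": t,
        "percent_exact_agreement": 100.0 * exact / n,
        "acceptable": abs(t) <= 1.5,
    }

# Prior-year (first) scores vs. current (second) scores for ten responses,
# using 0-based ordered score categories.
prior   = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2]
current = [2, 1, 1, 2, 1, 0, 2, 0, 1, 2]
print(trend_t_statistic(prior, current))
```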

For purposes of calculation, the possible scores for a response to an item are ordered categories beginning with 0 and ending with the number of response categories for the item minus one. The estimate of the t statistic is acceptable if it falls within the range from -1.5 to 1.5. The range of + or - 1.5 was selected because a single criterion was required for all items, regardless of the number of responses with scores being compared. As the number of responses being compared gets large, 1.5 as the criterion means that about 15 percent of the differences are judged not acceptable according to the test when they should have been acceptable. If the estimate of the t statistic falls outside that range, raters are asked to stop scoring so the situation can be assessed by the trainer and scoring supervisor. Scoring resumes only after the trainer and scoring supervisor have determined a plan of action that will rectify the differences in scores. Usually, individual responses to the item are discussed with the raters, or raters are retrained, before scoring continues.


Training for the Scoring Raters
Training of NAEP scoring raters is conducted by subject-area specialists from Educational Testing Service (ETS) and Pearson. All assessments are scored item-by-item so that each rater works from only one scoring guide at a time. After scoring all available responses for an item, a team then proceeds with training and scoring of the next item. Training for current assessment scoring involves explaining the item and its scoring guide to the team and discussing responses that represent the various score points in the guide. The trainer provides three or four student responses to "anchor" each score point. When review of the anchor responses is completed, the raters score 10 to 20 pre-scored "practice papers" that represent the entire range of score points the item could receive. The trainer then leads the team in a discussion of the practice papers to focus the raters on the interpretation of the scoring guide. After the trainer and supervisor determine that the team has reached consensus, the supervisor releases work on the image-based scoring system for the raters. The raters initially gather around a PC terminal to group-score items to ensure further consensus. Following group-scoring, raters work in pairs as a final check before beginning work individually. Once the practice session is completed, the formal scoring process begins. During training, raters and the supervisor keep notes of scoring decisions made by the trainer. The scoring supervisor is then responsible for compiling these notes and ensuring that all raters are in alignment; this process is referred to as calibration in NAEP. Teams vary greatly in the amount of time spent scoring as a group before working individually. Training for trend scoring is only slightly different in that prior-year trend papers must be reviewed to understand scoring decisions made in prior years before raters can commence further scoring.


Online Training
Online training is used because of the large number of raters needed to accomplish NAEP scoring. Training online reduces the number of trainers that are required. Another benefit is that it allows flexibility of location to make better use of scoring personnel. This proved particularly helpful during the 2000 assessment, the assessment in which online scoring was first introduced: science items originally scheduled to be scored in Iowa City, Iowa, were actually scored in Tucson, Arizona. Educational Testing Service (ETS) and Pearson jointly agreed to identify some items from each subject area that would be conducive to online training. ETS specialists from each subject area identified 37 items—13 mathematics items, 4 reading items, and 20 science items—for online training. ETS specialists in science and reading prepared keys and annotations for the science and reading items, respectively, while the mathematics scoring director from Pearson prepared the keys and annotations for the mathematics items. Pearson then photocopied and scanned the training sets, enhanced and cropped the scanned images, created instruction screens, edited all images and annotations, and edited the beta CD prior to duplicating the master CD. NAEP online training replicated the traditional NAEP training/scoring process except that the training was delivered via CD, at a pace controlled by the individual rater. At the end of the training process, the raters scored a qualification set (or a practice set if a qualification set was not available). These scores were printed out to determine whether any of the raters needed additional help or could proceed with scoring. Scoring supervisors still monitored scoring using the scoring supervisor tools and consulted with an assigned trainer or the ETS specialist if problems occurred.


Selection of Training Papers
In January of each assessment cycle, clerical staff and a professional printing company begin the process of preparing sets of training papers by copying all sets for items that are to be replicated from prior years for trend scoring. These papers are sorted by item and numbered. A photocopy of each set of papers is then sent to the NAEP instrument development staff for rangefinding. The original is kept at the NAEP materials headquarters in Iowa City, Iowa, so that the sets can be compiled according to instructions from the instrument developers. After review by each subject area's coordinator, the instrument development staff send the keys and/or the training sets to the materials processing staff, who label them according to a standard format and reproduce the sets of papers using the original copies located at the materials headquarters. Correct scores are written on all anchor papers, while only the scoring supervisors and trainers have keys for the practice, calibration, and qualification sets. Trainers also keep annotations explaining the thought process behind each score assigned. If any of these scores changes during training or scoring, the scoring supervisor keeps notes explaining why. For the 2000 assessment, training papers were selected for 126 trend rescore items from mathematics, 41 from reading, and 200 from science. NAEP clerical staff photocopied sample responses for new items that had been field tested in spring of 1999. This included 37 new items from mathematics, 5 from reading, and 46 from science. The number of sample responses photocopied for each new item ranged from 50 to 100 depending on the difficulty of the item. For the 2001 assessment, training papers were selected for 64 trend rescore items from geography and 79 from U.S. history. This process was also implemented for the 8 writing and 38 reading field test items. The number of sample responses photocopied for each new item ranged from 100 to 250 depending on the subject area.


Trend Scoring
To measure comparability of current-year scoring to the scoring of the same items scored in prior assessment years, a certain number of student responses per item from prior years are retrieved from image archives or rescanned from prior assessment year booklets and loaded into the system with their scores from prior years as the first score. These are loaded into a separate application to keep the data separate from current year scoring. At staggered intervals during the scoring process, the scoring supervisor releases items from prior assessment years for raters to score. Since prior year scores are pre-loaded as first scores, the current year's teams are 100 percent second-scoring the prior year papers. Following scoring of trend rescore items from prior years, scoring supervisors and trainers look at reliability reports, t-statistic reports, and backreading to gauge consistency with prior year scoring and make adjustments in scoring where appropriate. The score given to each response is captured, retained, and provided for data analysis at the end of scoring. For each item, one of the following decisions is made based on these data:
• continue scoring the current year responses without changing course;
• stop scoring and retrain the current group of raters; or
• stop scoring, delete the current scores, and train a different group of raters.

For the 2000 and 2001 NAEP assessments, the initial release of trend item responses on the image-based scoring system took place very soon after training was completed. Scoring supervisors controlled the size of each release by asking raters to score a set amount that totaled the number required. Immediately upon completion, the scoring supervisor accessed the t-statistic report. The acceptable range for the t value was within + or - 1.5 of zero. If the t value was outside that range, raters were not allowed to begin scoring current year responses; scoring resumed only after the trainer and scoring supervisor had determined a plan of action, usually by studying scored papers from prior assessment years to find trends in scoring and to determine what needed to be communicated to the raters before scoring could begin again. Usually the next group of trend responses was then scored successfully, and scoring of current year responses began only after a successful t-test. These trend items were also released after every break longer than fifteen minutes (first thing in the morning and after lunch) to calibrate raters. The t-statistic report was printed at the end of every trend release. An interrater agreement (IRA) matrix was also viewed after every trend release and used as a tool to determine whether the team was scoring too harshly or too leniently. IRAs were required to be within + or - 7 of the trend year reliability, and trainers and scoring supervisors had access to the reliabilities for each item from prior years. This "trend scoring" is not related to the long-term trend assessment. Trend scoring looks at changes over time using main NAEP item responses (e.g., 2000 reading assessment scores for an item compared to the 1998 reading assessment scores for that item). A separate table lists the differences between the main NAEP assessment and the long-term trend NAEP assessment.
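
The two release criteria described above, a t value within + or - 1.5 of zero and an interrater agreement within 7 points of the trend year reliability, can be combined into a single check. A minimal sketch in Python (the function name and return messages are illustrative):

```python
def trend_release_check(t_value, current_ira, trend_year_ira):
    """Decide whether a team may move on to current-year responses after a
    trend release: the t estimate must fall within +/-1.5 of zero, and the
    team's exact-agreement rate must be within 7 percentage points of the
    prior-year (trend) reliability."""
    t_ok = abs(t_value) <= 1.5
    ira_ok = abs(current_ira - trend_year_ira) <= 7
    if t_ok and ira_ok:
        return "begin or resume current-year scoring"
    return "stop; trainer and scoring supervisor review before scoring resumes"

print(trend_release_check(t_value=0.8, current_ira=91, trend_year_ira=95))
print(trend_release_check(t_value=2.1, current_ira=84, trend_year_ira=95))
```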


Scoring NAEP Geography Assessments
The NAEP geography items that are not scored by machine are constructed-response items—those for which the student must write in a response rather than selecting from a printed list of multiple choices. Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item. To measure longitudinal trends in geography, NAEP requires trend scoring—replication of scoring from prior assessment years—to demonstrate statistically that scoring is comparable across years. Students' constructed responses are scored on computer workstations using an image-based scoring system. This allows for item-by-item scoring and online, real-time monitoring of geography interrater reliabilities, as well as the performance of each individual rater. A subset of these items—those that appear in large-print booklets—requires scoring by hand. The 2001 geography assessment included 57 discrete constructed-response items. The total number of constructed responses scored was 381,477. The number of raters working on the geography assessment and the location of the scoring are listed here:

Scoring activities, geography assessment: 2001
Scoring location: Iowa City, Iowa
Number of raters: 81
Number of scoring supervisors: 9
Start date: 5/7/2001
End date: 5/23/2001

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item and defines the criteria to be used in evaluating student responses. During the course of the project, each team scores the items using a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items
2 = Complete
1 = Inappropriate

Short Three-Point Items
3 = Complete
2 = Partial
1 = Inappropriate

Extended Four-Point Items
4 = Complete
3 = Essential
2 = Partial
1 = Inappropriate

In some cases student responses do not fit into any of the categories listed on the scoring guide. Special coding categories for the unscorable responses are assigned to these types of responses. These unscorable categories are only assigned if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted prior to the assignment of any special coding category to an item. The unscorable categories used for geography are outlined as follows.

Categories for unscorable responses, geography assessment: 2001
B: Blank responses, random marks on paper, word underlined in prompt but response area completely blank, mark on item number but response area completely blank
X: Completely crossed out, completely erased
IL: Completely illegible response
OT: Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?: "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"

NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D." SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.
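
Because only single-character codes are stored, downstream processing needs a lookup from the labels used by scoring staff to the database codes given in the note above. A small illustrative mapping in Python (the dictionary and function names are hypothetical):

```python
# Mapping from the unscorable-response labels used by scoring staff to the
# single-character codes stored in the scoring database (per the note above).
UNSCORABLE_DB_CODES = {
    "B": "B",    # blank response
    "X": "X",    # crossed out / erased
    "IL": "I",   # completely illegible
    "OT": "T",   # off task / off topic
    "?": "D",    # "I don't know" and similar
}

def db_code(label):
    """Return the one-character database code for an unscorable label."""
    return UNSCORABLE_DB_CODES[label]

print(db_code("OT"))  # stored as "T"
```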

Number of constructed-response items, by score-point level and grade, geography national main assessment: 2001
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total     53               22                          16                       15
4         27               17                           5                        5
8          9                0                           7                        2
8/12       4                0                           0                        4
12        13                5                           4                        4

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.


Number of 1994 constructed-response items rescored in 2001, by score-point level and grade, geography national main assessment: 2001
Grade    Total    Short 3-point items    Extended 4-point items
Total     53              39                       14
4          9               6                        3
4/8        5               4                        1
8         15              13                        2
8/12       6               5                        1
12        18              11                        7

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.


Geography Interrater Reliability
A subsample of the geography responses for each constructed-response item is scored by a second rater to obtain statistics on interrater reliability. In general, geography items receive 25 percent second scoring. This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject area coordinator. Printed copies are reviewed daily by lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of efficiency.

Interrater reliability ranges, by assessment year, geography national main assessment: 2001

Assessment year    Number of unique items    Number of items between 70% and 79%    Number of items between 80% and 89%    Number of items above 90%
2001 geography              57                              8                                      41                                    8
1994 geography              64                              1                                      15                                   48

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:
• to display information on all first readings versus all second readings; or
• to display all readings of an individual that were also scored by another rater versus the scores assigned by the other raters.

The information is displayed as a matrix with scores awarded during first readings displayed in rows and scores awarded during second readings displayed in columns (for mode one), or the individual's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number and percentage of cases of agreement (or disagreement). The display also contains information on the total number of second readings and the overall percentage of reliability on the item. Since the interrater reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and compared to previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.
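
The first mode of this display amounts to a cross-tabulation of first readings against second readings. A minimal Python sketch (names are illustrative) builds that matrix and computes the overall percent of exact agreement:

```python
from collections import Counter

def interrater_matrix(first_scores, second_scores, categories):
    """Cross-tabulate first readings (rows) against second readings (columns).
    Exact agreement falls on the diagonal; overall reliability is the
    percentage of double-scored responses with identical scores."""
    pairs = Counter(zip(first_scores, second_scores))
    total = len(first_scores)
    matrix = [[pairs.get((r, c), 0) for c in categories] for r in categories]
    exact = sum(pairs.get((cat, cat), 0) for cat in categories)
    return matrix, 100.0 * exact / total

first  = [1, 2, 3, 3, 2, 1, 3, 2]
second = [1, 2, 3, 2, 2, 1, 3, 3]
matrix, reliability = interrater_matrix(first, second, categories=[1, 2, 3])
for row in matrix:
    print(row)
print(f"exact agreement: {reliability:.1f}%")
```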


Item-by-item rater reliability, by grade, geography national main assessment: 2001
Grade Total 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Item † G027201 G027301 G027401 G027501 G027601 G027701 G027801 G028201 G028401 G028501 G028701 G008001 G008001 G008201 G008201 G008503 G008503 G008701 G008701 G009001 G009001 G009201 G009201 G009402 G009402 G009403 G009403 G009601 G009601 G029301 G029401 G029501 G029601 G029701 G030101 G030501 G030801 G030901 G031001 G031101 G031401 G031801 G031901 G032401 G032601 Score points † 1,3 1,3 1,3 1,3 1,3 1,3 1,3 4 3 3 4 3 3 4 4 3 3 3 3 4 4 3 3 3 3 3 3 4 4 1,3 1,3 1,3 1,3 1,3 4 4 1,3 1,3 1,3 1,3 4 3 3 3 1,3 Number scored (1st and 2nd) 2001 reliability 1994 reliability 420,157 3,331 3,331 3,331 3,332 3,331 3,331 3,331 3,331 3,332 3,332 3,331 639 3,247 628 3,247 637 3,248 612 3,248 582 3,247 603 3,236 625 3,236 626 3,236 612 3,236 3,277 3,278 3,277 3,277 3,278 3,277 3,278 3,278 3,277 3,277 3,277 3,277 3,220 3,220 3,220 3,220 † 99 99 97 98 98 98 96 86 86 92 91 99 99 91 93 94 93 94 93 83 83 88 88 94 92 94 93 91 91 96 97 98 99 97 82 78 99 98 98 98 94 92 91 99 98 † § § § § § § § § § § § 98 § 93 § 94 § 92 § 86 § 87 § 93 § 90 § 91 § § § § § § § § § § § § § § § § §

See notes at end of table.


Item-by-item rater reliability, by grade, geography national main assessment: 2001 (continued)
Grade 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Item G012201 G012201 G012503 G012503 G012902 G012902 G013001 G013001 G013201 G013201 G013402 G013402 G014001 G014001 G014201 G014201 G014301 G014301 G014401 G014401 G032901 G033301 G033501 G033801 G034101 G016201 G016201 G016302 G016302 G016401 G016401 G016502 G016502 G016701 G016701 G017101 G017101 G034801 G035001 G035301 G035501 G036201 G036501 G036801 G037101 G012201 Score points 3 3 3 3 3 3 4 4 3 3 3 3 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 3 4 3 3 3 Number scored (1st and 2nd) 638 3,225 642 3,225 644 3,225 619 3,225 623 3,225 638 3,159 648 3,159 620 3,159 640 3,159 646 3,159 3,160 3,160 3,160 3,160 3,160 641 3,099 636 3,099 622 3,099 623 3,099 616 3,099 609 3,099 3,137 3,138 3,137 3,138 3,093 3,094 3,094 3,094 639 2001 reliability 97 97 99 99 98 98 94 94 91 93 85 86 98 98 88 89 86 90 95 94 90 89 77 93 81 97 99 97 99 95 95 92 93 91 93 79 85 95 81 85 87 97 82 90 92 98 1994 reliability 97 § 98 § 97 § 95 § 92 § 93 § 99 § 87 § 91 § 97 § § § § § § 99 § 97 § 92 § 91 § 88 § 85 § § § § § § § § § 98

See notes at end of table.

Item-by-item rater reliability, by grade, geography national main assessment: 2001 (continued)
Grade 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item G012201 G012503 G012902 G012902 G013001 G013001 G013201 G013201 G019002 G019003 G019003 G019102 G019102 G019202 G019202 G019302 G019302 G019402 G019402 G019901 G019901 G020001 G020001 G020201 G020201 G020302 G020302 G020701 G020701 G021001 G021001 G021401 G021401 G021601 G021601 G021602 G021602 G037501 G037601 G037701 G037801 G037901 G038101 G038401 G038801 Score points 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 4 4 4 4 4 4 3 3 1,3 1,3 1,3 1,3 1,3 3 3 4 Number scored (1st and 2nd) 3,119 646 638 3,119 601 3,119 610 3,119 3,210 624 3,210 618 3,210 621 3,210 602 3,210 623 3,210 639 3,210 642 3,210 633 3,210 607 3,210 584 3,049 614 3,049 620 3,049 594 3,049 606 3,048 3,072 3,073 3,072 3,072 3,072 3,073 3,072 3,073 2001 reliability 98 100 98 98 92 93 79 85 99 96 93 97 95 91 89 94 95 94 95 93 91 97 97 94 91 92 93 88 88 79 79 85 87 82 92 94 93 98 99 99 99 99 83 83 90 1994 reliability § 99 96 § 95 § 86 § 98 94 § 94 § 93 § 95 § 93 § 95 § 97 § 95 § 92 § 89 § 85 § 90 § 92 § 95 § § § § § § § § §

See notes at end of table.


Item-by-item rater reliability, by grade, geography national main assessment: 2001 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item G039201 G016201 G016201 G016302 G016401 G016401 G016502 G016502 G016701 G016701 G017101 G017101 G034801 G035001 G035301 G035501 G039801 G040001 G040301 G040701 G025001 G025202 G025202 G025301 G025301 G025601 G025601 G025801 G025801 G026101 G026101 G026204 G026204 G026301 G026301 G026502 G026502 G026503 G026601 G026601 G026901 G026901 Score points 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 3 4 4 3 3 3 3 4 4 3 3 3 3 3 3 3 3 4 4 3 3 4 3 3 3 3 Number scored (1st and 2nd) 2001 reliability 1994 reliability 3,073 639 3,005 641 625 3,005 610 3,005 616 3,005 630 3,005 3,003 3,002 3,002 3,002 3,026 3,026 3,026 3,026 630 616 3,062 623 3,062 628 3,062 630 3,063 627 3,013 611 3,014 621 3,014 635 3,014 3,014 641 3,014 631 3,014 94 98 100 96 94 92 92 92 91 92 80 84 93 82 85 86 88 78 85 94 88 83 86 92 94 93 93 90 92 94 91 91 93 90 88 97 98 92 96 97 92 94 § 99 § 97 91 § 88 § 87 § 87 § § § § § § § § § 88 86 § 95 § 93 § 91 § 94 § 92 § 88 § 96 § 84 98 § 94 §

† Not applicable. § Item had not been created for the 1994 assessment. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.

Scoring of the 2001 Geography Assessment Large-Print Booklets
A subset of the total items scored were those from large-print booklets. These booklets were administered to students with disabilities who had met the criteria for participation with accommodations. Since these booklets were non-scannable, they were transported to the scoring center after processing. A log and score sheet were created to account for these booklets. As a rater scored an item, he or she marked the score for that response, his or her rater ID, and the date on which the item was scored. Once all items in each booklet for a given subject were scored, the geography scoring director returned the sheets to NAEP clerical staff to enter those scores manually into the records for these booklets. In the 2001 assessment, there were five large-print geography booklets.


Item-by-item rater reliability for items in large-print booklets, by grade, geography national main assessment: 2001
Score points † 1,3 1,3 1,3 1,3 1,3 1,3 1,3 4 3 3 4 3 3 4 4 3 3 3 3 4 4 3 3 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 4 4 4 4 4 4 3 Number scored (1st and 2nd) 2001 reliability 1994 reliability 136,680 3,331 3,331 3,331 3,332 3,331 3,331 3,331 3,331 3,332 3,332 3,331 639 3,247 628 3,247 637 3,248 612 3,248 582 3,247 638 3,159 648 3,159 620 3,159 640 3,159 646 3,159 3,160 3,160 3,160 3,160 3,160 584 3,049 614 3,049 620 3,049 594 3,049 606 † 99 99 97 98 98 98 96 86 86 92 91 99 99 91 93 94 93 94 93 83 83 85 86 98 98 88 89 86 90 95 94 90 89 77 93 81 88 88 79 79 85 87 82 92 94 † § § § § § § § § § § § 98 § 93 § 94 § 92 § 86 § 93 § 99 § 87 § 91 § 97 § § § § § § 89 § 85 § 90 § 92 § 95

Grade Total 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12

Item † X1G3_01A X1G3_01B X1G3_01C X1G3_01D X1G3_01E X1G3_01F X1G3_01G X1G3_05 X1G3_07 X1G3_08 X1G3_10 Q1G401 Q1G401 Q1G404 Q1G404 Q1G409 Q1G409 Q1G411 Q1G411 Q1G415 Q1G415 Q2G304 Q2G304 Q2G312 Q2G312 Q2G314 Q2G314 Q2G315 Q2G315 Q2G316 Q2G316 X2G4_01 X2G4_05 X2G4_07 X2G4_10 X2G4_13 Q3G304 Q3G304 Q3G308 Q3G308 Q3G312 Q3G312 Q3G314 Q3G314 Q3G315

See notes at end of table.

Item-by-item rater reliability for items in large-print booklets, by grade, geography national main assessment: 2001 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 Item Q3G315 X3G4_01A X3G4_01B X3G4_01C X3G4_01D X3G4_01E X3G4_03 X3G4_06 X3G4_10 X3G4_14 Score Number scored points (1st and 2nd) 2001 reliability 1994 reliability 3 1,3 1,3 1,3 1,3 1,3 3 3 4 3 3,048 3,072 3,073 3,072 3,072 3,072 3,073 3,072 3,073 3,073 93 98 99 99 99 99 83 83 90 94 § § § § § § § § § §

† Not applicable. § Item had not been created for the 1994 assessment. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.


Scoring NAEP Mathematics Assessments
The NAEP mathematics items that are not scored by machine are constructed-response items—those for which the student must write in a response rather than selecting from a printed list of multiple choices. Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item. To measure longitudinal trends in mathematics, NAEP requires trend scoring—replication of scoring from prior assessment years—to demonstrate statistically that scoring is comparable across years. Students' constructed responses are scored on computer workstations using an image-based scoring system. This allows for item-by-item scoring and online, real-time monitoring of mathematics interrater reliabilities, as well as the performance of each individual rater. All responses from large-print booklets are transcribed into the appropriate regular-sized booklet and scanned with other booklets. Image scoring of these responses takes place with regular scoring. The 2000 mathematics assessment included 199 discrete constructed-response items. The total number of constructed responses scored was 3,856,211. The number of raters working on the mathematics assessment and the location of the scoring are listed here:

Scoring activities, mathematics assessment: 2000
Scoring location: Tucson, Arizona
Number of raters: 177
Number of scoring supervisors: 16
Start date: 3/13/2000
End date: 4/29/2000

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item and defines the criteria to be used in evaluating student responses. During the course of the project, each team scores constructed-response items using a 2-, 3-, or 5-point scale as outlined below:

Dichotomous Items
2 = Correct
1 = Incorrect

Short Three-Point Items
3 = Correct
2 = Partial
1 = Incorrect

Extended Five-Point Items
5 = Extended
4 = Satisfactory
3 = Partial
2 = Minimal
1 = Incorrect

Early (1990) mathematics constructed-response items used a rating scale in which 1 = incorrect and 7 = correct. Several of these items also tracked how a student approached the problem by expanding the rating 1 to [1, 2, and 3] or by expanding the rating 7 to [6 and 7]. For example, if the student was asked to draw a figure with four 90-degree angles, a response rated 6 or 7 was correct: 6 tracked the 'square' response while 7 tracked the 'rectangle' response. A response in which the student renamed incorrectly in a subtraction problem, and therefore arrived at an incorrect answer, might be tracked as a 2.

In some cases, student responses do not fit into any of the categories listed on the scoring guide. Special coding categories for the unscorable responses are assigned to these types of responses. These categories are only assigned if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted prior to the assignment of any of the special coding categories. The unscorable categories for mathematics are outlined in the following table.


Categories for unscorable responses, mathematics assessments: 2000
B: Blank responses, random marks on paper
X: Completely crossed out, completely erased
IL: Completely illegible response
OT: Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?: "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"

NOTE: Because the NAEP scoring database recognizes only alphanumeric characters and sets a single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D." SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

Special studies are also included in the mathematics assessment. When the special study item is the same as the operational item, the responses are scored together within one team.


Number of constructed-response items, by score-point level and grade, national main and state assessments: 2000
Grade     Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 5-point items    Extended 6-point items
Total      163              41                          78                       13                        29                         2
4           52              11                          24                        6                        10                         1
4/8         17               3                           9                        2                         3                         0
8           30               7                          14                        0                         9                         0
8/12        11               4                           5                        1                         0                         1
12          49              13                          26                        3                         7                         0
4/8/12       4               3                           0                        1                         0                         0

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Number of 1996 constructed-response items rescored in 2000, by score-point level and grade, mathematics national main and state assessments: 2000
Grade     Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 5-point items    Extended 6-point items
Total      126              35                          63                        8                        19                         1
4           32               5                          19                        3                         5                         0
4/8         15               3                           7                        2                         3                         0
8           25               7                          12                        0                         6                         0
8/12         9               4                           3                        1                         0                         1
12          41              13                          22                        1                         5                         0
4/8/12       4               3                           0                        1                         0                         0

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Number of 1992 constructed-response items rescored in 2000, by score-point level and grade, mathematics national main and state assessments: 2000
Grade     Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 5-point items    Extended 6-point items
Total       65              34                          15                        8                         7                         1
4           13               4                           5                        3                         1                         0
4/8         10               3                           2                        2                         3                         0
8           11               7                           2                        0                         2                         0
8/12         8               4                           2                        1                         0                         1
12          19              13                           4                        1                         1                         0
4/8/12       4               3                           0                        1                         0                         0

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Number of 1990 constructed-response items rescored in 2000, by score-point level and grade, national main and state assessments: 2000
Grade     Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 6-point items
Total       31              20                           6                        4                         1
4            3               1                           1                        1                         0
4/8          5               3                           2                        0                         0
8            2               2                           0                        0                         0
8/12         7               3                           2                        1                         1
12          10               8                           1                        1                         0
4/8/12       4               3                           0                        1                         0

NOTE: No extended 5-point items from the 1990 assessment were rescored in the 2000 assessment. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Mathematics Interrater Reliability
A subsample of the mathematics responses for each constructed-response item is scored by a second rater to obtain statistics on interrater reliability. In general, items administered only to the national main sample receive 25 percent second scoring, while those given in state samples receive 6 percent. This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or mathematics item development coordinator. Printed copies are reviewed daily by the lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of efficiency.

Interrater reliability ranges, by assessment year, mathematics national main and state assessments: 2000

Assessment year      Number of unique items    Number of items between 60% and 69%    Number of items between 70% and 79%    Number of items between 80% and 89%    Number of items above 90%
2000 assessment              199                              †                                      1                                     12                                   186
1996 assessment              158                              †                                      1                                      2                                   155
1992 assessment               91                              2                                      †                                     12                                    77
1990 assessment               51                              3                                      †                                      2                                    47

† No items fell within this interrater reliability range. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:
• to display information on all first readings versus all second readings; or
• to display all readings of an individual that were also scored by another rater versus the scores assigned by the other raters.

The information is displayed as a matrix with scores awarded during first readings displayed in rows and scores awarded during second readings displayed in columns (for mode one), or the individual's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number and percentage of cases of agreement (or disagreement). The display also contains information on the total number of second readings and the overall percentage of reliability on the item. Since the interrater reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and compared to previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.


Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000
Number scored Score points (1st and 2nd) † 2 3 3 2 3 3 3 3 3 5 2 3 3 2 2 2 4 2 2 2 4 3 3 3 3 5 3 4 5 6 3 3 5 5 4 5 5 4 5 3 5 2 3 4 4 3,856,211 28,343 28,342 28,343 28,346 28,502 28,505 28,503 28,500 28,499 28,503 28,559 28,556 28,559 28,557 28,557 28,554 28,557 28,553 28,555 28,556 28,560 28,470 28,464 28,468 28,469 28,466 29,306 29,186 29,184 29,185 28,470 28,474 28,469 29,237 29,236 29,238 29,238 29,238 28,355 28,356 28,356 29,107 29,108 29,107 29,116 2000 reliability † 99 100 98 93 97 96 95 96 95 89 99 96 99 99 98 97 98 99 98 99 85 97 98 98 98 91 98 97 91 91 96 96 94 99 97 99 99 98 97 97 94 98 99 96 96 1996 reliability † 99 100 98 96 98 98 92 97 97 92 98 94 98 99 99 95 99 99 99 100 88 98 99 98 98 94 99 § § § § § § 99 97 99 100 99 § § § 98 99 97 98 1992 reliability † 99 99 95 92 § § § § § § 97 91 96 97 98 91 96 98 96 99 85 § § § § § 96 § § § § § § 97 94 96 97 96 § § § 97 96 91 91 1990 reliability † § § § § § § § § § § 99 93 97 97 98 95 97 98 98 99 84 § § § § § 97 § § § § § § § § § § § § § § § § § §

Grade 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Item

Total † M039201 M039301 M040001 M040201 M066301 M066501 M066601 M066701 M066801 M066901 M019701 M019801 M019901 M020001 M020101 M020201 M020301 M020401 M020501 N277903 M020701 M067901 M068001 M068002 M068003 M068004 M010631 M091101 M091201 M091401 M085701 M085901 M085401 M046001 M046601 M046801 M046901 M047301 M086601 M087001 M087301 M043201 M043301 M043401 M043402

See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000 (continued)
Number scored Score points (1st and 2nd) 3 5 3 3 3 3 3 5 2 3 3 3 3 3 5 2 2 3 4 2 2 2 4 2 3 3 3 5 3 3 3 5 2 3 3 3 5 2 2 5 3 2 2 2 3 29,108 29,112 29,183 29,177 29,179 29,176 29,127 29,177 28,405 28,404 28,405 28,403 28,406 28,406 28,404 5,059 5,058 5,056 5,058 5,010 5,011 5,009 5,010 5,009 5,008 5,014 5,016 5,016 3,448 3,448 3,449 3,449 3,459 3,460 3,457 3,459 3,460 3,446 3,448 3,450 3,450 3,449 513 514 512 2000 reliability 99 92 100 98 95 94 99 93 99 93 98 99 95 99 90 99 98 98 96 99 98 99 98 99 96 96 98 90 99 98 97 89 99 99 91 100 88 98 98 99 98 93 98 100 96 1996 reliability 98 91 99 98 95 94 99 96 100 92 98 99 95 99 91 § § § § § § § § § § § § § § § § § § § § § § § § § § § 98 98 99 1992 reliability 91 87 § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § 97 97 96 1990 reliability § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § §

Grade Item 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 M043403 M043501 M072201 M072202 M072401 M072501 M072601 M072701 M074301 M074501 M074701 M074801 M074901 M075001 M075101 M087501 M087601 M088001 M088301 M088501 M088601 M088701 M088801 M089101 M089401 M090201 M090301 M090401 M019901 M067901 M066701 M075101 M074301 M074801 M074501 M043403 M043501 M043501 M043201 M046801 M072601 M040201 M043201 M043201 M043301

See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000 (continued)
Number scored Score points (1st and 2nd) 3 4 4 4 4 3 3 5 5 3 3 3 3 3 3 3 3 3 3 5 5 3 3 5 3 3 3 3 5 2 3 3 2 2 2 4 2 2 6 2 2 3 3 2 2 514 515 514 515 514 513 513 513 513 513 514 514 514 513 515 513 514 515 514 513 513 28,075 28,075 28,074 28,120 28,121 28,121 28,123 28,119 28,092 28,095 28,092 28,093 28,092 28,092 28,094 28,096 28,095 28,090 28,092 28,092 28,096 28,092 28,095 28,098 2000 reliability 96 98 98 98 100 100 100 100 94 100 96 98 98 94 96 100 96 100 100 98 96 93 99 94 98 95 97 91 89 99 98 99 100 98 99 99 99 99 96 93 100 95 98 98 97 1996 reliability 99 97 97 98 98 98 98 91 91 99 99 98 98 95 95 94 94 99 99 96 96 § § § 99 96 94 91 94 100 98 99 100 100 97 100 99 100 97 94 100 95 98 98 98 1992 reliability 96 91 91 91 91 91 91 87 87 § § § § § § § § § § § § § § § § § § § § 99 96 97 98 99 96 98 100 99 93 90 98 91 96 96 94 1990 reliability § § § § § § § § § § § § § § § § § § § § § § § § § § § § § 100 96 98 98 99 97 98 99 99 93 69 99 92 97 95 95

Grade Item 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 M043301 M043401 M043401 M043402 M043402 M043403 M043403 M043501 M043501 M072201 M072201 M072202 M072202 M072401 M072401 M072501 M072501 M072601 M072601 M072701 M072701 M093501 M093601 M093801 M066301 M066501 M066601 M067201 M067501 M019701 M019801 M019901 M020001 M020101 M020201 M020301 M020401 M020501 M020801 M020901 M021001 M021101 M021201 M021301 M021302

See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000 (continued)
Number scored Score points (1st and 2nd) 3 3 3 3 3 5 4 2 2 2 2 5 3 3 5 5 4 5 5 4 3 3 3 5 2 2 2 3 5 3 3 3 5 3 3 3 3 5 2 2 2 2 2 2 3 27,976 27,978 27,985 27,975 27,981 27,977 28,094 28,091 28,095 28,094 28,095 28,089 28,192 28,192 28,194 28,192 28,189 28,192 28,192 28,192 28,190 27,961 27,964 27,966 28,098 28,102 28,100 28,106 28,101 28,109 28,104 28,106 28,108 28,078 28,073 28,074 28,077 28,076 189 187 188 189 188 189 189 2000 reliability 98 98 95 98 95 93 99 98 94 92 95 89 98 98 94 99 98 99 99 99 99 95 99 88 100 100 99 100 90 96 96 98 93 97 95 95 91 91 100 100 100 100 100 100 100 1996 reliability 98 99 94 98 93 93 99 98 96 95 94 90 § § § 99 98 99 100 100 99 § § § 100 100 99 97 93 93 97 98 90 98 94 95 91 92 § § § § § § § 1992 reliability § § § § § § 96 96 89 84 90 86 § § § 98 96 98 98 98 97 § § § 99 99 97 96 88 § § § § § § § § § § § § § § § § 1990 reliability § § § § § § 97 95 § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § § §

Grade Item 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 M067901 M068003 M068006 M068005 M068008 M068201 M013031 M013131 M052401 M052901 M053001 M053101 M085701 M085901 M086301 M046001 M046601 M046801 M046901 M047301 M047901 M092401 M092601 M092001 M051201 M051301 M051601 M052101 M052201 M072901 M073401 M073501 M073601 M075301 M075401 M075601 M075801 M076001 M051201 M051201 M051301 M051301 M051601 M051601 M052101

See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000 (continued)
Number Score scored points (1st and 2nd) 3 5 5 3 3 3 3 3 3 5 5 2 3 2 2 3 3 3 3 3 5 2 2 2 4 2 2 4 2 2 6 2 2 3 3 2 2 2 3 3 3 3 3 5 4 2 189 188 188 188 189 188 188 188 188 188 188 4,129 4,126 4,125 4,124 4,124 4,123 4,124 4,127 4,125 4,126 4,106 4,109 4,107 4,108 4,107 4,104 4,104 4,106 4,108 4,102 4,108 4,109 4,107 4,107 4,106 4,108 4,105 4,038 4,040 4,040 4,040 4,040 4,041 4,076 4,078 2000 reliability 100 100 100 100 89 100 100 100 94 100 94 99 96 99 97 89 96 98 96 99 97 100 98 99 97 97 98 99 99 99 96 95 99 94 97 100 99 99 94 92 90 95 95 95 99 98 1996 reliability § § § § § § § § § § § 99 98 99 98 87 95 97 94 96 98 99 97 99 98 96 99 100 99 100 98 95 100 95 98 99 98 99 93 92 92 94 96 96 99 99 1992 reliability § § § § § § § § § § § 98 94 96 94 § § § § § § 98 97 99 89 92 97 98 99 99 92 93 100 92 95 97 95 97 § § § § § § 97 96 1990 reliability § § § § § § § § § § § § § § § § § § § § § 99 97 99 90 94 98 99 99 99 95 63 99 94 96 97 95 96 § § § § § § 98 94

Grade Item 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 M052101 M052201 M052201 M072901 M072901 M073401 M073401 M073501 M073501 M073601 M073601 M056801 M056901 M057001 M057101 M070801 M071001 M071101 M071201 M071301 M071401 M021401 M021501 M021502 M021601 M021602 M020201 M020301 M020401 M020501 M020801 M020901 M021001 M021101 M021201 M021701 M021702 M021801 M071502 M071602 M071603 M071604 M071701 M071801 M013031 M013131

See notes at end of table.


Item-by-item rater reliability, by grade, mathematics national main and state assessment: 2000 (continued)
Grade Item 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 M011931 M012031 M052401 M053301 M053401 M094201 M094301 M094701 M058901 M059702 M059801 M092301 M092401 M092601 M092901 M095001 M095301 M095401 M073801 M073401 M073901 M074001 M074101 M076101 M076601 M076701 M076801 M076901 M077001 Score points 2 3 2 2 5 4 3 5 3 3 2 3 3 3 5 3 3 4 3 3 3 3 5 3 3 3 3 3 5 Number scored (1st and 2nd) 4,078 4,078 4,098 4,095 4,095 4,063 4,063 4,063 4,048 4,049 4,047 4,125 4,124 4,123 4,123 4,100 4,099 4,099 4,100 4,097 4,100 4,099 4,099 4,114 4,116 4,113 4,115 4,116 4,116 2000 reliability 98 99 92 92 72 95 94 88 99 94 98 96 95 97 94 94 92 97 98 93 96 94 95 98 98 98 98 98 90 1996 reliability 100 99 94 92 75 § § § 99 92 98 § § § § § § § 97 94 97 97 96 99 99 98 98 97 93 1992 reliability 97 94 89 89 70 § § § 98 68 96 § § § § § § § § § § § § § § § § § § 1990 reliability 94 96 § § § § § § § § § § § § § § § § § § § § § § § § § § §

† Not applicable. § Item had not been created at the time of the assessment noted in this column heading. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Scoring of the 2000 Mathematics Assessment Large-Print Booklets
A subset of the total items scored were those from large-print booklets. These booklets were administered to students with disabilities who had met the criteria for participation with accommodations. Since these booklets were non-scannable, they were transported to the scoring center after processing. A log and score sheet were created to account for these booklets. As a rater scored an item, he or she marked the score for that response, his or her rater ID, and the date on which the item was scored. Once all items in each booklet for a given subject were scored, the mathematics scoring director returned the sheets to Pearson clerical staff to enter those scores manually into the records for these booklets. In the 2000 assessment, there were 32 large-print mathematics booklets.


Item-by-item rater reliability for items in large-print booklets, by grade, mathematics national main and state assessments: 2000
Grade 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12 12 Item Score Number scored points (1st and 2nd) † 2 3 3 2 4 5 6 5 4 5 5 4 3 3 5 2 2 2 5 5 4 5 5 4 3 2 3 2 2 2 2 5 3 3 2 753,796 28,343 28,342 28,343 28,346 29,186 29,184 29,185 29,237 29,236 29,238 29,238 29,238 28,075 28,075 28,074 28,095 28,094 28,095 28,089 28,192 28,189 28,192 28,192 28,192 28,190 4,129 4,126 4,125 4,124 4,098 4,095 4,095 4,048 4,049 4,047 2000 reliability † 99 100 98 93 97 91 91 99 97 99 99 98 93 99 94 94 92 95 89 99 98 99 99 99 99 99 96 99 97 92 92 72 99 94 98 1996 reliability † 99 100 98 96 § § § 99 97 99 100 99 § § § 96 95 94 90 99 98 99 100 100 99 99 98 99 98 94 92 75 99 92 98 1992 reliability † 99 99 95 92 § § § 97 94 96 97 96 § § § 89 84 90 86 98 96 98 98 98 97 98 94 96 94 89 89 70 98 68 96

Total † W1M3_03 W1M3_04 W1M3_11 W1M3_13 W1M9_07 W1M9_08 W1M9_10 W12M11A_01 W12M11A_07 W12M11A_09 W12M11A_10 W12M11A_14 W2M3_06 W2M3_07 W2M3_09 W23M9B_02 W23M9B_07 W23M9B_08 W23M9B_09 W12M11B_01 W12M11B_07 W12M11B_09 W12M11B_10 W12M11B_14 W12M11B_18 S3M3_10 S3M3_12 S3M3_13 S3M3_14 S23M9C_02 S23M9C_08 S23M9C_09 S3M11_04 S3M11_13 S3M11_14

† Not applicable. § Item had not been created at the time of the assessment noted in this column heading. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.


Scoring NAEP Reading Assessments
The reading items scored include short constructed responses and extended constructed responses. Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item. To measure longitudinal trends in reading, NAEP requires trend scoring—replication of scoring from prior assessment years—to demonstrate statistically that scoring is comparable across years. Students' constructed responses are scored on computer workstations using an image-based scoring system. This allows for item-by-item scoring, online, real-time monitoring of reading interrater reliabilities, and monitoring of the performance of each individual rater. A subset of these items—those that appear in large-print booklets—requires scoring by hand. The 2000 reading assessment included 46 discrete constructed-response items. The total number of constructed responses scored was 123,100. The number of raters working on the reading assessment and the location of the scoring are listed here:

Scoring activities, reading assessment: 2000
Scoring location: Tucson, Arizona
Number of raters: 40
Number of supervisors: 4
Start date: 4/3/2000
End date: 4/3/2000

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item and defines the criteria to be used in evaluating student responses. During the course of the project, each team scores the items using a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items
1 = unacceptable
2, 3, or 4 = acceptable

Short Three-Point Items
1 = evidence of little or no comprehension
2 = evidence of partial or surface comprehension
3 = evidence of full comprehension

Extended Four-Point Items
1 = unsatisfactory
2 = partial
3 = essential
4 = extensive

In some cases, student responses do not fit into any of the categories listed on the scoring guide. Special coding categories for the unscorable responses are assigned to these types of responses. These categories are only assigned if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted prior to the assignment of any of the special coding categories. The unscorable categories used for reading are outlined as follows.

Categories for unscorable responses, reading assessment: 2000
Label    Description
B        Blank responses, random marks on paper, word underlined in prompt but response area completely blank, mark on item number but response area completely blank
X        Completely crossed out, completely erased
IL       Completely illegible response
OT       Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?        "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"

NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D." SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.
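
Because the database stores a single character per score, the label translation described in the note above can be pictured as a small lookup table. The sketch below is illustrative only; the function name is hypothetical, but the mappings are those given in the note.

# Illustrative lookup (not NAEP's database code) mapping unscorable-response
# labels to the single-character values described in the note above.
DB_CODES = {"B": "B", "X": "X", "IL": "I", "OT": "T", "?": "D"}

def to_db_code(label: str) -> str:
    return DB_CODES[label]

assert to_db_code("OT") == "T"
assert to_db_code("?") == "D"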

Special studies are also included in the reading assessment. When a special study uses the same item as the operational assessment, the responses are scored together within one team.


Number of constructed-response items, by score-point level, grade 4 reading national main assessment: 2000
Assessment                     Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total                          122      67                           34                     21
2000 reading items scored      46       24                           14                     8
1998 reading items rescored    41       23                           11                     7
1994 reading items rescored    35       20                           9                      6

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.


Number of 1998 constructed-response items rescored in 2000, by score-point level, grade 4 reading national main assessment: 2000
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total    41       23                           11                     7
4        31       15                           11                     5
4/8      10       8                            0                      2

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.


Number of 1994 constructed-response items rescored in 2000, by score-point level, grade 4 reading national main assessment: 2000
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total    35       20                           9                      6
4        25       12                           9                      4
4/8      10       8                            0                      2

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.


Reading Interrater Reliability
A subsample of the reading responses for each constructed-response item is scored by a second rater to obtain statistics on interrater reliability. Reading item responses in the 2000 assessment received 25 percent second scoring. This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject area coordinator. Printed copies are reviewed daily by both Pearson and lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of efficiency. Interrater reliability ranges, by assessment year, reading national main assessment: 2000
Assessment      Number of unique items    Number of items between 70% and 79%    Number of items between 80% and 89%    Number of items above 90%
2000 reading    46                        3                                      17                                     26
1998 reading    41                        2                                      16                                     23
1994 reading    35                        †                                      13                                     22

† In the 1994 reading assessment, interrater reliability exceeded 79%. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.
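
The reliability ranges in the table above are, in effect, counts of items whose percent exact agreement falls in each band. The sketch below illustrates that tabulation; it is not NAEP reporting code, and the three example percentages simply echo the first grade 4 reading items in the item-level table later in this section.

# Minimal sketch of binning items into the reliability ranges shown above.
from collections import Counter

item_reliability = {"R017001": 86, "R017003": 80, "R017004": 90}  # example values

def band(pct: float) -> str:
    if pct >= 90:
        return "90% and above"
    if pct >= 80:
        return "80% to 89%"
    if pct >= 70:
        return "70% to 79%"
    return "below 70%"

counts = Counter(band(pct) for pct in item_reliability.values())
print(counts)  # Counter({'80% to 89%': 2, '90% and above': 1})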

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:
• to display all first readings versus all second readings; or
• to display all readings by an individual rater that were also scored by another rater, versus the scores assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or the individual rater's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. Each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second readings and the overall percentage of reliability on the item. Since the interrater reliability reports are cumulative, a printed copy of the reliability report for each item is made periodically and compared to previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.
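
The matrix display described above can be approximated in a few lines of code. The sketch below is an illustration under invented score pairs, not the operational reliability tool.

# Sketch of the mode-one display: first-reading scores in rows, second-reading
# scores in columns, with exact agreement on the diagonal. Pairs are invented.
from collections import Counter

pairs = [(3, 3), (2, 3), (3, 3), (1, 1), (2, 2)]  # (first score, second score)
cells = Counter(pairs)
total_second_readings = len(pairs)

for (first, second), n in sorted(cells.items()):
    pct = 100 * n / total_second_readings
    print(f"row {first}, column {second}: {n} ({pct:.0f}%)")

exact = sum(n for (first, second), n in cells.items() if first == second)
overall = 100 * exact / total_second_readings
print(f"second readings: {total_second_readings}, overall reliability: {overall:.0f}%")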


Item-by-item rater reliability, grade 4 reading national main assessment: 2000
Item Total R017001 R017003 R017004 R017006 R017007 R017009 R012102 R012104 R012106 R012108 R012109 R012111 R012112 R012601 R012604 R012607 R012611 R017301 R017303 R017305 R017307 R017309 R012702 R012703 R012705 R012706 R012708 R012710 R015702 R015703 R015704 R015705 R015707 R015709 R015802 R015803 R015804 R015806 R015807 R015809 R012503 R012504 Score points † 2 3 2 2 4 3 2 2 2 2 2 4 2 2 2 4 2 2 3 3 4 3 2 2 2 2 4 2 3 3 3 3 4 3 2 3 4 3 3 3 2 2 Number scored (1st and 2nd) 2000 reliability 1998 reliability 1994 reliability 123,100 2,674 2,674 2,674 2,674 2,674 2,674 2,697 2,697 2,697 2,697 2,697 2,697 2,697 2,666 2,666 2,666 2,666 2,683 2,683 2,683 2,683 2,683 2,684 2,684 2,684 2,684 2,684 2,684 2,679 2,679 2,679 2,679 2,679 2,679 2,670 2,670 2,670 2,670 2,670 2,670 2,650 2,650 † 86 80 90 91 76 87 95 93 91 96 96 91 92 93 93 81 91 94 88 94 81 89 91 87 92 83 83 94 81 90 83 92 85 95 96 88 77 86 89 93 90 98 † 93 87 95 94 79 89 98 95 90 96 96 89 93 89 93 84 91 § § § § § 96 93 93 88 87 95 85 89 84 90 88 90 91 87 80 86 89 91 93 98 † § § § § § § 95 93 92 96 96 92 95 91 95 90 96 § § § § § 94 92 95 92 86 96 86 88 85 90 88 91 91 84 83 84 83 84 90 96

See notes at end of table.


Item-by-item rater reliability, grade 4 reading national main assessment: 2000 (continued)
Item       Score points    Number scored (1st and 2nd)    2000 reliability    1998 reliability    1994 reliability
R012506    2               2,650                          93                  96                  92
R012508    2               2,650                          97                  97                  97
R012511    2               2,650                          98                  97                  95
R012512    4               2,650                          84                  85                  83

† Not applicable. § Item had not been created in the year noted in this column heading. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.


Scoring of the 2000 Reading Assessment Large-Print Booklets
A subset of the items scored came from large-print booklets. These booklets were administered to students with disabilities who had met the criteria for participation with accommodations. Since these booklets were non-scannable, they were transported to the scoring center after processing. A log and score sheet were created to account for these booklets. As a rater scored an item, he or she marked the score for that response, his or her rater ID, and the date on which the item was scored. Once all items in each booklet for a given subject were scored, the reading scoring director returned the sheets to NAEP clerical staff to enter those scores manually into the records for these booklets. In the 2000 assessment, there was one large-print reading booklet.
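
The log and score sheet described above amount to one manually recorded entry per scored response. The sketch below shows one plausible shape for such a record; the field names and example values are hypothetical.

# Hypothetical shape of one entry on the large-print score sheet: the score
# awarded, the rater's ID, and the date scored, keyed by booklet and item.
from dataclasses import dataclass
from datetime import date

@dataclass
class LargePrintScoreRecord:
    booklet_id: str
    item_id: str
    score: int
    rater_id: str
    scored_on: date

record = LargePrintScoreRecord(
    booklet_id="LP-0001", item_id="R017001", score=3,
    rater_id="rater-042", scored_on=date(2000, 4, 3),
)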


Scoring NAEP Science Assessments
The NAEP science items that are not scored by machine are constructed-response items—those for which the student must write in a response rather than selecting from a printed list of multiple choices. Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item. To measure longitudinal trends in science, NAEP requires trend scoring—replication of scoring from prior assessment years—to demonstrate statistically that scoring is comparable across years. Students' constructed responses are scored on computer workstations using an image-based scoring system. This allows for item-by-item scoring and online, real-time monitoring of science interrater reliabilities and the performance of each individual rater. A subset of these items—those that appeared in large-print booklets—required scoring by hand. The 2000 science assessment included 295 discrete constructed-response items. The total number of constructed responses scored was 4,398,021. The number of raters working on the science assessment and the locations of the scoring are listed here:

Location of scoring activities, science assessment: 2000
Scoring location    Start date    End date     Number of raters    Number of scoring supervisors
Iowa City, Iowa     3/13/2000     6/04/2000    115                 16
Tucson, Arizona     4/13/2000     4/29/2000    40                  4

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.

One unique aspect of the science assessment is the use of "hands-on" tasks that are given to students as a part of the assessment. Each student who performs a hands-on task is given a kit with all of the materials needed to conduct the experiment. For the 2000 assessment, a total of 9 hands-on tasks (3 per grade) originally designed for the 1996 assessment were chosen for use, although the actual kits used by the students were new. During scoring of the hands-on task items, raters actually performed the experiment as part of their training. Each student's experiment was scored as a unit because of the interconnectivity of the questions the student had to answer. Each item's scoring guide identifies the range of possible scores for the item and defines the criteria to be used in evaluating student responses. During the course of the project, each team scores the items using a 2-, 3-, 4-, or 5-point scale as outlined below:

Dichotomous Items
3 = complete
1 = unsatisfactory/incorrect

Short Three-Point Items
3 = complete
2 = partial
1 = unsatisfactory/incorrect

Extended Four-Point Items
4 = complete
3 = essential
2 = partial
1 = unsatisfactory/incorrect

Extended Five-Point Items
5 = complete
4 = essential
3 = adequate
2 = partial
1 = unsatisfactory/incorrect

In some cases, student responses do not fit into any of the categories listed in the scoring guide. Special coding categories are assigned to these unscorable responses, but only if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted prior to the assignment of any of the special coding categories. The unscorable categories used for science are outlined below.

Categories for unscorable responses, science assessments
Label    Description
B        Blank responses, random marks on paper, word underlined in prompt but response area completely blank, mark on item number but response area completely blank
X        Completely crossed out, completely erased
IL       Completely illegible response
OT       Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?        "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"

NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D." SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.


Number of constructed-response items, by score-point level and grade, science national main and state assessments: 2000
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 5-point items
Total    246      12                           190                    38                        6
4        60       5                            48                     6                         1
4/8      20       0                            16                     4                         0
8        61       2                            49                     9                         1
8/12     29       2                            24                     3                         0
12       76       3                            53                     16                        4

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.


Number of 1996 constructed-response items rescored in 2000, by score-point level and grade, science national main and state assessments: 2000
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items    Extended 5-point items
Total    200      9                            149                    36                        6
4        50       5                            38                     6                         1
4/8      20       0                            16                     4                         0
8        43       1                            33                     8                         1
8/12     29       2                            24                     3                         0
12       58       1                            38                     15                        4

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.


Science Interrater Reliability
A subsample of the science responses for each constructed-response item is scored by a second rater to obtain statistics on interrater reliability. In general, items administered only to the national main sample receive 25 percent second scoring, while those given in state samples receive 6 percent. This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject-area coordinator. Printed copies are reviewed daily by lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of efficiency. Interrater reliability ranges, by assessment year, science national main and state assessments: 2000
Assessment      Number of unique items    Number of items between 80% and 89%    Number of items above 90%
2000 science    295                       25                                     270
1996 science    249                       41                                     208

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.
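
One way to realize the 25 percent and 6 percent second-scoring rates described above is to route every fourth (or roughly every seventeenth) response to a second rater. The sketch below is only an assumed implementation of such systematic sampling, not a description of the actual routing software, and the function name is hypothetical.

# Assumed systematic sampler for second scoring: every 4th response for
# national-main-only items (25 percent) and roughly every 17th response for
# state-sample items (about 6 percent). The rates come from the text above;
# the routing logic itself is an assumption.
def needs_second_score(response_index: int, national_only: bool) -> bool:
    interval = 4 if national_only else 17  # 1/4 = 25%; 1/17 is about 6%
    return response_index % interval == 0

second_scored = sum(needs_second_score(i, national_only=False) for i in range(1000))
print(second_scored)  # 59 of 1,000 state-sample responses, about 6 percent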

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:
• to display all first readings versus all second readings; or
• to display all readings by an individual rater that were also scored by another rater, versus the scores assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or the individual rater's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. Each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second readings and the overall percentage of reliability on the item. Since the interrater reliability reports are cumulative, a printed copy of the reliability report for each item is made periodically and compared to previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.


Item-by-item rater reliability, by grade, science national main and state assessment: 2000
Grade Total 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Item † K031001 K031002 K031003 K031004 K031005 K031006 K031007 K031101 K031102 K031103 K031104 K031105 K031107 K031301 K031309 K031302 K031303 K031304 K031401 K031402 K031403 K031404 K031407 K031408 K031409 K031410 K103901 K103101 K031602 K031603 K031604 K031606 K031607 K031608 K031609 K031901 K032001 K032501 K032502 K032601 K032602 K099501 K098301 K098201 K092201 K034001 Score points † 3 3 3 2 3 3 3 2 2 2 2 3 4 4 4 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 4,398,021 19,578 19,578 19,578 19,578 19,578 19,578 19,578 19,429 19,429 19,429 19,429 19,429 19,428 19,845 19,845 19,845 19,845 19,845 26,088 26,090 26,089 26,090 26,088 26,089 26,088 26,091 26,043 26,042 26,250 26,254 26,251 26,253 26,256 26,251 26,250 19,524 19,525 19,523 19,523 19,521 19,520 19,621 19,623 19,624 19,621 19,551 † 95 90 87 91 84 93 88 97 96 94 98 100 94 92 91 94 94 97 88 92 93 95 90 98 97 96 93 95 98 98 98 92 91 95 97 94 94 93 92 93 96 94 91 91 92 96 † 96 89 91 96 88 94 90 97 95 94 99 99 93 94 93 96 95 96 88 94 94 92 90 98 95 95 § § 98 98 99 96 93 94 97 88 97 96 96 90 94 § § § § 92

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 Item K034101 KW34101 KX34101 KY34101 KZ34101 K034401 K034501 K034502 K034802 K034901 K034902 K035201 K035301 K035601 K035801 K035901 K036101 K036301 K037301 K037401 K037501 K037601 K037701 K037702 K096901 K098401 K100701 K099601 K039801 K039901 K040001 K040301 K040401 K040501 K040601 K040603 K040604 K040605 K040606 K040607 K040608 K040609 K040610 K040801 K040802 K040808 Score points 3 3 3 3 3 3 4 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 5 4 3 4 3 3 4 4 4 4 3 3 3 4 3 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 19,554 19,554 19,554 19,554 19,554 19,552 19,552 19,551 19,878 19,881 19,882 19,883 19,881 26,087 26,086 26,086 26,087 26,086 19,570 19,567 19,570 19,569 19,569 19,570 19,825 19,826 19,828 19,826 19,658 19,656 19,658 19,660 19,659 19,659 18,996 18,996 18,996 18,996 18,996 18,996 18,996 18,996 18,996 19,142 19,142 19,142 93 93 97 91 95 94 91 98 93 93 92 92 92 92 90 97 93 97 96 94 99 92 98 98 92 94 95 92 96 92 92 92 94 98 99 96 93 95 95 93 94 97 95 97 96 98 90 90 96 86 93 92 92 99 94 96 95 93 94 95 94 97 90 97 97 96 97 93 97 94 § § § § 97 91 90 89 96 98 99 96 94 96 96 94 91 95 93 98 97 99

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Item K040809 K040803 K040805 K040806 K031301 K031309 K031302 K031305 K031306 K031307 K031308 K102301 K102001 K101801 K098501 K101201 K097901 K041306 K041307 K041401 K041402 K041403 K031602 K031603 K031604 K031606 K031610 K031607 K031608 K031609 K031611 K031613 K099001 K092601 K095901 K093601 K095801 K096101 K043001 K043101 K043102 K043103 K043501 K043601 K043602 K043603 Score points 3 3 3 2 4 4 3 3 2 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 2 3 3 3 4 3 3 3 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 19,142 19,142 19,142 19,142 19,281 19,281 19,281 19,281 19,281 19,281 19,280 25,535 25,536 25,535 25,536 25,533 25,533 25,614 25,616 25,614 25,614 25,612 25,427 25,429 25,424 25,428 25,423 25,427 25,427 25,427 25,427 25,426 19,163 19,166 19,165 19,166 19,166 19,164 19,147 19,147 19,147 19,145 19,145 19,145 19,145 19,148 97 95 94 94 95 96 90 93 97 92 99 96 87 87 95 91 97 86 90 96 97 91 97 99 99 92 96 89 93 93 98 98 97 96 94 98 92 94 96 89 91 91 93 94 97 89 98 97 97 96 95 97 98 97 89 95 98 § § § § § § 87 90 96 99 94 98 99 100 95 98 90 92 96 97 99 § § § § § § 95 89 85 92 94 90 95 88

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Item K047201 K047301 K047401 K047901 K048001 K048101 K048102 K048103 K048601 K048901 K049001 K049301 K049401 K049402 K049403 K049404 K035601 K035801 K035901 K036101 K036301 K036401 K036403 K036404 K036402 K036701 K036801 K037301 K037401 K037501 K037601 K037701 K037703 K038101 K038201 K038301 K093901 K095401 K092801 K093701 K097001 K094901 K045301 K045601 K045701 K045801 Score points 4 3 3 3 3 4 3 2 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 5 3 3 3 3 3 3 3 4 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 19,153 19,158 19,154 19,156 19,153 19,157 19,157 19,159 19,265 19,265 19,266 19,269 19,263 19,266 19,268 19,268 25,607 25,609 25,602 25,603 25,606 25,601 25,601 25,601 25,603 25,606 25,603 19,180 19,176 19,179 19,179 19,180 19,178 19,180 19,179 19,180 19,185 19,187 19,187 19,185 19,184 19,185 19,068 19,068 19,070 19,069 94 98 95 98 99 93 98 94 97 99 99 100 94 89 93 87 91 93 95 93 95 98 95 94 92 98 96 95 92 95 92 99 97 96 94 88 90 92 99 93 91 94 95 95 92 93 88 96 92 93 99 91 96 95 93 98 100 98 94 90 89 85 93 94 95 89 95 97 92 93 93 97 97 93 93 96 89 99 91 98 92 87 § § § § § § 93 96 90 93

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 8 8 8 8 8 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item K046301 K046401 K046501 K046601 K046701 K049501 K049502 K049503 K049504 K049505 K049506 K040801 K040802 K040808 K040809 K040803 K040804 K040805 K040806 K049701 K049702 K049708 K049703 K049704 K049705 K049706 K049707 K105501 K105601 K106101 K105001 K104501 K104601 K041306 K041307 K041401 K041402 K041403 K041404 K041406 K049901 K049902 K049903 K049904 K049907 K049908 Score points 3 3 3 3 4 4 5 3 3 3 4 3 3 3 3 3 3 3 2 3 3 2 3 3 4 3 5 3 2 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 19,070 19,070 19,071 19,067 19,072 3,234 3,234 3,234 3,234 3,234 3,234 3,195 3,195 3,195 3,195 3,195 3,195 3,195 3,195 3,263 3,263 3,263 3,263 3,263 3,263 3,263 3,263 4,355 4,353 4,354 4,352 4,355 4,354 4,260 4,260 4,259 4,260 4,261 4,262 4,262 4,304 4,301 4,299 4,300 4,298 4,300 96 93 97 91 88 99 93 91 91 94 93 98 98 98 98 97 92 95 92 99 97 98 96 96 91 90 89 96 99 97 97 98 98 89 92 96 96 91 96 100 91 90 92 95 94 95 93 94 96 92 89 98 93 94 94 94 97 99 98 99 98 97 95 97 92 99 98 98 94 96 90 86 88 § § § § § § 88 90 94 99 94 95 96 94 87 90 93 86 97

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item K049909 K049911 K049914 K049912 K098601 K092701 K090501 K094601 K092101 K090301 K051701 K051801 K052301 K052401 K052402 K052501 K052502 K052503 K047201 K047301 K047401 K047901 K048001 K048101 K048102 K048103 K048601 K048901 K049001 K049301 K049401 K049402 K049403 K049404 K052901 K053001 K053101 K053102 K053601 K053701 K053801 K053901 K054001 K054002 K054003 K054004 Score points 4 3 3 5 3 3 4 3 3 3 3 3 3 4 3 4 4 3 4 3 3 3 3 4 3 2 3 3 3 3 3 3 3 4 4 3 3 3 5 3 3 3 4 3 3 3 Number scored (1st and 2nd) 2000 reliability 1996 reliability 4,300 4,301 4,301 4,301 3,258 3,258 3,258 3,257 3,258 3,257 3,241 3,242 3,242 3,244 3,243 3,242 3,241 3,242 3,221 3,221 3,218 3,221 3,222 3,220 3,220 3,221 3,259 3,257 3,257 3,261 3,258 3,261 3,259 3,259 4,372 4,370 4,373 4,372 4,370 4,370 4,368 4,371 3,256 3,256 3,256 3,257 98 94 98 87 91 94 98 95 96 98 96 92 94 92 95 98 98 98 94 97 95 100 99 85 94 93 95 99 96 99 94 89 92 85 98 98 98 98 94 98 93 98 97 97 100 100 94 94 94 84 § § § § § § 96 91 98 90 97 91 92 94 91 96 92 94 98 86 94 96 92 98 99 97 94 87 88 84 90 91 99 92 89 96 91 94 98 94 99 97

See notes at end of table.

Item-by-item rater reliability, by grade, science national main and state assessment: 2000 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item K054005 K054006 K054007 K054008 K100001 K100101 K100201 K100301 K096501 K092001 K059001 K059101 K059201 K059301 K059801 K059901 K060001 K060101 Score points 3 3 3 4 2 3 3 3 3 3 3 3 4 4 3 4 3 4 Number scored (1st and 2nd) 2000 reliability 1996 reliability 3,258 3,256 3,259 3,256 3,248 3,250 3,248 3,250 3,250 3,250 3,190 3,190 3,191 3,189 3,189 3,191 3,189 3,187 100 98 100 96 100 98 95 97 100 94 95 91 95 99 95 96 95 93 97 97 87 84 § § § § § § 90 94 93 99 92 95 93 90

† Not applicable. § Item had not been created at the time of the assessment noted in this column heading. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.


Scoring of the 2000 Science Assessment Large-Print Booklets
A subset of the items scored came from large-print booklets. These booklets were administered to students with disabilities who had met the criteria for participation with accommodations. Since these booklets were non-scannable, they were transported to the scoring center after processing. A log and score sheet were created to account for these booklets. As a rater scored an item, he or she marked the score for that response, his or her rater ID, and the date on which the item was scored. Once all items in each booklet for a given subject were scored, the science scoring director returned the sheets to NAEP clerical staff to enter those scores manually into the records for these booklets. In the 2000 assessment, there were 28 large-print science booklets.


Item-by-item rater reliability for items in large-print booklets, by grade, science national main and state assessments: 2000
Grade Total 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item † S12S9A_02 S12S9A_03 S12S9A_04 S12S9A_06 S12S9A_07 S12S9A_08 S12S9A_09 S1S21_04 S1S21_05 S1S21_06 S1S21_09 S1S21_10 S1S21_11 W2S7_01 W2S7_04 W2S7_06 W2S7_11 W2S7_16 W2S7_20 S12S15B_04 S12S15B_05 S12S15B_06 S12S15B_07 S12S15B_08 S12S15B_09 S12S15B_14 S12S15B_15 S12S15B_16 W3S7_09 W3S7_10 W3S7_11 W3S7_14 W3S7_18 W3S7_19 S3S15_01 S3S15_02 S3S15_03 S3S15_04 S3S15_05 S3S15_06 S3S15_07 S3S15_08 Score points † 3 3 3 3 4 3 3 3 5 4 3 4 3 3 4 3 3 3 3 3 3 3 3 4 3 3 3 5 3 2 4 3 3 3 4 3 3 3 3 3 3 4 Number scored (1st and 2nd) 679,711 26,250 26,254 26,251 26,253 26,256 26,251 26,250 19,658 19,656 19,658 19,660 19,659 19,659 25,535 25,536 25,535 25,536 25,533 25,533 19,180 19,176 19,179 19,179 19,180 19,178 19,180 19,179 19,180 4,355 4,353 4,354 4,352 4,355 4,354 3,256 3,256 3,256 3,257 3,258 3,256 3,259 3,256 2000 reliability † 98 98 98 92 91 95 97 96 92 92 92 94 98 96 87 87 95 91 97 95 92 95 92 99 97 96 94 88 96 99 97 97 98 98 97 97 100 100 100 98 100 96 1996 reliability † 98 98 99 96 93 94 97 97 91 90 89 96 98 § § § § § § 93 93 96 89 99 91 98 92 87 § § § § § § 98 94 99 97 97 97 87 84

† Not applicable. § Item had not been created at the time of the assessment noted in this column heading. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science Assessment.

Scoring NAEP U.S. History Assessments
The NAEP U.S. history items that are not scored by machine are constructed-response items—those for which the student must write in a response rather than selecting from a printed list of multiple choices. Each constructed-response item has a unique scoring guide that identifies the range of possible scores for the item. To measure longitudinal trends in U.S. history, NAEP requires trend scoring—replication of scoring from prior assessment years—to demonstrate statistically that scoring is comparable across years. Students' constructed responses are scored on computer workstations using an image-based scoring system. This allows for item-by-item scoring and online, real-time monitoring of U.S. history interrater reliabilities, as well as the performance of each individual rater. A subset of these items—those that appear in large-print booklets—requires scoring by hand. The 2001 U.S. history assessment included 47 discrete constructed-response items. The total number of constructed responses scored was 399,182.

Scoring activities, U.S. history assessment: 2001
Scoring location    Start date    End date     Number of raters    Number of scoring supervisors
Iowa City, Iowa     5/7/2001      5/25/2001    81                  9

NOTE: U.S. history was not assessed in 2000. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.

Each item's scoring guide identifies the range of possible scores for the item and defines the criteria to be used in evaluating student responses. During the course of the project, each team scores the items using a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items
1 = Inappropriate
2 = Appropriate

Short Three-Point Items
1 = Inappropriate
2 = Partial
3 = Appropriate

Extended Four-Point Items
1 = Inappropriate
2 = Partial
3 = Essential
4 = Complete

In some cases, student responses do not fit into any of the categories listed in the scoring guide. Special coding categories are assigned to these unscorable responses, but only if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted prior to the assignment of any of the special coding categories. The unscorable categories used for U.S. history are outlined below.

Categories for unscorable responses, U.S. history assessments
Label    Description
B        Blank responses, random marks on paper
X        Completely crossed out, completely erased
IL       Completely illegible response
OT       Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?        "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"

NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D." SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.


Number of constructed-response items, by score-point level and grade, U.S. history national main assessment: 2001
Grade    Total    Short 3-point items    Extended 4-point items
Total    38       35                     3
4        10       10                     0
4/8      5        5                      0
8        5        5                      0
8/12     4        4                      0
12       14       11                     3

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.


Number of 1994 constructed-response items rescored in 2001, by score-point level and grade, U.S. history national main assessment: 2001
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total    66       2                            47                     17
4        10       0                            8                      2
4/8      6        1                            4                      1
8        20       0                            16                     4
8/12     5        0                            4                      1
12       25       1                            15                     9

SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.


U.S. History Interrater Reliability
A subsample of the U.S. history responses for each constructed-response item is scored by a second rater to obtain statistics on interrater reliability. In general, items administered only to the national main sample receive 25 percent second scoring. This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item development subject area coordinator. Printed copies are reviewed daily by lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of efficiency. Interrater reliability ranges, by assessment year, U.S. history national main assessment: 2001
Assessment           Number of unique items    Number of items between 60% and 69%    Number of items between 70% and 79%    Number of items between 80% and 89%    Number of items above 90%
2001 U.S. history    47                        2                                      16                                     16                                     13
1994 U.S. history    79                        †                                      1                                      33                                     45

† The interrater reliability of all 1994 U.S. history items rescored in 2001 exceeded 69 percent. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress using an interrater reliability tool. This display tool functions in either of two modes:
• to display all first readings versus all second readings; or
• to display all readings by an individual rater that were also scored by another rater, versus the scores assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or the individual rater's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall along the diagonal of the matrix. Each cell of the matrix contains the number and percentage of cases of agreement (or disagreement). The display also shows the total number of second readings and the overall percentage of reliability on the item. Since the interrater reliability reports are cumulative, a printed copy of the reliability report for each item is made periodically and compared to previously generated reports. Scoring staff members save printed copies of all final reliability reports and archive them with the training sets.


Item-by-item rater reliability, by grade, U.S. history national main assessment: 2001
Grade 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Item Score Number scored (1st and points 2nd) † 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 4 3 3 3 4 4 3 3 2 2 3 3 447,714 642 3,324 591 3,323 601 3,324 606 3,324 628 3,323 3,283 3,283 3,282 3,283 3,283 3,359 3,359 3,358 3,359 3,359 604 3,354 553 3,354 637 3,353 576 3,354 606 3,354 3,223 3,223 3,223 3,223 3,223 609 3,285 642 3,285 657 3,281 644 3,281 2001 reliability † 88 87 90 91 89 91 88 91 86 90 82 90 92 94 88 92 93 94 95 98 86 86 89 86 96 94 83 86 92 91 97 85 89 88 91 93 94 94 93 100 99 98 99 1994 reliability † 92 § 89 § 91 § 88 § 89 § § § § § § § § § § § 86 § 90 § 97 § 87 § 93 § § § § § § 94 § 97 § 99 § 98 §

Total † H028201 H028201 H028701 H028701 H028702 H028702 H028801 H028801 H029002 H029002 H054301 H054601 H054801 H054901 H055301 H055901 H056101 H056401 H056601 H056801 H031701 H031701 H031801 H031801 H031802 H031802 H032301 H032301 H032503 H032503 H057501 H057701 H057801 H058601 H058701 H034101 H034101 H034401 H034401 H034501 H034501 H034702 H034702

See notes at end of table.


Item-by-item rater reliability, by grade, U.S. history national main assessment: 2001 (continued)
Grade Item 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 H035001 H035001 H035101 H035101 H035801 H035801 H035901 H035901 H035902 H035902 H036101 H036101 H036402 H036402 H059001 H059201 H059701 H059801 H060201 H038103 H038103 H038301 H038301 H038601 H038601 H038702 H038702 H039001 H039001 H039401 H039401 H039901 H039901 H040001 H040001 H040103 H040103 H040201 H040201 H057501 H057701 H057801 H058601 H058701 H034101 H034101 Score points 3 3 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 4 3 3 3 4 4 Number scored (1st and 2nd) 627 3,281 637 3,282 639 3,177 630 3,177 613 3,177 593 3,177 572 3,178 3,195 3,195 3,195 3,195 3,195 617 3,222 611 3,222 594 3,223 601 3,223 630 3,222 614 3,257 609 3,256 632 3,256 640 3,256 610 3,256 3,204 3,204 3,204 3,204 3,204 609 3,285 2001 reliability 89 92 88 90 92 92 94 93 84 87 87 87 63 83 88 82 91 85 87 88 88 91 90 87 85 85 88 95 98 89 89 89 89 93 95 81 89 92 95 96 83 83 85 92 87 90 1994 reliability 90 § 92 § 92 § 89 § 85 § 87 § 82 § § § § § § 92 § 90 § 90 § 85 § 93 § 90 § 89 § 94 § 90 § 93 § § § § § § 85 §

See notes at end of table.

Item-by-item rater reliability, by grade, U.S. history national main assessment: 2001 (continued)
Grade Item 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12 H034401 H034401 H034501 H034702 H034702 H035001 H035001 H035101 H035101 H060701 H061501 H061601 H061801 H042201 H042201 H042801 H042801 H042902 H042902 H043001 H043001 H043101 H043101 H043201 H043201 H043401 H043401 H043501 H043501 H043601 H043601 H043701 H043701 H043705 H043705 H044001 H044001 H044301 H044301 H044501 H044501 H044702 H044702 H045102 H045102 H045301 Score points 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 4 4 4 4 4 4 3 3 3 3 3 Number scored (1st and 2nd) 642 3,285 3,285 636 3,285 587 3,285 640 3,285 3,156 3,156 3,156 3,156 643 3,209 615 3,209 622 3,209 607 3,209 608 3,209 615 3,101 612 3,101 616 3,101 601 3,101 606 3,101 582 3,101 602 3,101 620 3,071 617 3,071 601 3,071 627 3,071 634 2001 reliability 93 94 99 92 94 80 84 88 89 96 93 92 98 99 97 89 92 94 94 85 89 94 94 87 88 89 86 81 87 86 83 90 89 78 83 79 77 70 78 84 85 91 93 86 92 89 1994 reliability 95 § § 93 § 81 § 92 § § § § § 96 § 86 § 91 § 92 § 89 § 90 § 89 § 92 § 86 § 90 § 81 § 83 § 87 § 92 § 87 § 92 § 97

See notes at end of table.

Item-by-item rater reliability, by grade, U.S. history national main assessment: 2001 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item H045301 H045501 H045501 H045901 H046001 H046001 H046101 H046101 H046301 H046301 H062001 H062201 H063101 H063401 H063601 H048901 H048901 H049401 H049401 H049503 H049503 H049601 H049601 H049701 H049701 H050101 H050101 H050201 H050201 H051002 H051002 H051101 H051101 H051102 H051102 H051301 H051301 H052301 H052301 H052501 H052501 H052601 H052601 H052701 H052701 Score Number scored (1st points and 2nd) 3 4 4 4 2 2 3 3 3 3 3 3 3 4 3 3 3 4 4 3 3 4 4 3 3 4 4 4 4 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3,071 560 3,027 3,026 650 3,026 621 3,026 600 3,026 3,100 3,100 3,100 3,100 3,100 645 2,996 639 2,996 620 2,996 596 2,997 648 2,996 592 3,063 590 3,063 639 3,063 601 3,063 558 3,064 590 3,055 628 3,055 626 3,055 607 3,055 614 3,055 2001 reliability 93 61 78 80 100 99 96 95 88 91 79 88 87 84 86 97 99 84 86 85 86 84 86 99 98 79 80 76 78 97 98 84 87 78 81 79 85 82 71 87 92 81 86 87 88 1994 reliability § 78 § § 99 § 96 § 90 § § § § § § 99 § 88 § 92 § 88 § 98 § 86 § 82 § 98 § 90 § 81 § 83 § 92 § 93 § 85 § 88 §

See notes at end of table.


Item-by-item rater reliability, by grade, U.S. history national main assessment: 2001 (continued)
Grade 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 Item H060701 H061501 H061601 H061801 H042201 H042201 H042801 H042801 H042902 H042902 H043001 H043001 H043101 H043101 H063801 H064001 H064101 H064401 H064901 H065101 H065201 H065301 H065401 Score Number scored (1st points and 2nd) 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 4 3 3 3 3 4 3,038 3,037 3,038 3,037 632 3,045 598 3,045 606 3,045 618 3,045 610 3,045 3,054 3,054 3,054 3,053 3,054 3,054 3,054 3,054 3,054 2001 reliability 94 82 91 96 97 96 89 91 91 93 81 86 96 93 80 84 81 79 81 79 76 81 88 1994 reliability § § § § 96 § 88 § 87 § 87 § 91 § § § § § § § § § §

† Not applicable. § Item had not been created in 1994. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.


Scoring of the 2001 U.S. History Assessment Large-Print Booklets
A subset of the items scored came from large-print booklets. These booklets were administered to students with disabilities who had met the criteria for participation with accommodations. Since these booklets were non-scannable, they were transported to the scoring center after processing. A log and score sheet were created to account for these booklets. As a rater scored an item, he or she marked the score for that response, his or her rater ID, and the date on which the item was scored. Once all items in each booklet for a given subject were scored, the U.S. history scoring director returned the sheets to NAEP clerical staff to enter those scores manually into the records for these booklets. In the 2001 assessment, there were eight large-print U.S. history booklets.


Item-by-item rater reliability for items in large-print booklets, by grade, U.S. history national main assessment: 2001
Score points † 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3 4 4 3 3 4 4 3 Number scored (1st and 2nd) 2001 reliability 1994 reliability 111,515 3,359 3,359 3,358 3,359 3,359 604 3,354 553 3,354 637 3,353 576 3,354 606 3,354 617 3,222 611 3,222 594 3,223 601 3,223 630 3,222 614 3,257 609 3,256 632 3,256 640 3,256 610 3,256 645 2,996 639 2,996 620 2,996 596 2,997 648 † 92 93 94 95 98 86 86 89 86 96 94 83 86 92 91 88 88 91 90 87 85 85 88 95 98 89 89 89 89 93 95 81 89 92 95 97 99 84 86 85 86 84 86 99 † § § § § § 86 § 90 § 97 § 87 § 93 § 92 § 90 § 90 § 85 § 93 § 90 § 89 § 94 § 90 § 93 § 99 § 88 § 92 § 88 § 98

Grade Total 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 12 12 12 12 12 12 12 12 12

Item † X1H5_03 X1H5_05 X1H5_08 X1H5_10 X1H5_12 Q1H602 Q1H602 Q1H603 Q1H603 Q1H604 Q1H604 Q1H611 Q1H611 Q1H615 Q1H615 Q2H505 Q2H505 Q2H508 Q2H508 Q2H511 Q2H511 Q2H513 Q2H513 Q2H517 Q2H517 Q2H604 Q2H604 Q2H609 Q2H609 Q2H610 Q2H610 Q2H613 Q2H613 Q2H614 Q2H614 Q3H604 Q3H604 Q3H610 Q3H610 Q3H613 Q3H613 Q3H614 Q3H614 Q3H615

See notes at end of table.


Item-by-item rater reliability for items in large-print booklets, by grade, U.S. history national main assessment: 2001 (continued)
Grade    Item      Score points    Number scored (1st and 2nd)    2001 reliability    1994 reliability
12       Q3H615    3               2,996                          98                  §
12       Q3H702    4               592                            79                  86
12       Q3H702    4               3,063                          80                  §
12       Q3H703    4               590                            76                  82
12       Q3H703    4               3,063                          78                  §
12       Q3H714    3               639                            97                  98
12       Q3H714    3               3,063                          98                  §
12       Q3H715    3               601                            84                  90
12       Q3H715    3               3,063                          87                  §
12       Q3H716    3               558                            78                  81
12       Q3H716    3               3,064                          81                  §

† Not applicable. § Item had not been created in 1994. SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.
