NAEP Scoring
NAEP assessments include multiple-choice items, which are machine-scored by optical mark reflex scanning, and constructed-response items, which are scored by trained scoring staff. These trained scorers ("raters") use an image-based scoring system that routes student responses directly to each rater. Focused, explicit scoring guides are developed to match the criteria emphasized in the assessment frameworks. Consistency of scoring between raters is monitored during the process through ongoing reliability checks and frequent backreading.
Throughout the scoring process, three types of personnel make up individual scoring teams:
Raters are professional scorers who are hired to rate the individual student responses.
Scoring supervisors lead teams of raters throughout the scoring process on a daily basis.
Trainers provide training for the scoring raters, continually monitoring the progress of each scoring team.
Team members are required to have, at a minimum, a baccalaureate degree from a four-year college or university. An advanced degree, scoring experience, and/or teaching experience is preferred. Scoring teams use the training process to determine whether each individual rater is sufficiently prepared to score. Following training, each rater is given a pre-scored "qualification set" and is expected to attain 80 percent correct in order to proceed.
All scoring is carried out via image processing. To assign a score, raters click a button displayed in a scoring window. Because buttons are included only for valid scores, no editing for out-of-range scores is needed. Two significant advantages of the image-scoring system are the ease of regulating the flow of work to raters and the ease of monitoring scoring. The image system provides scoring supervisors with tools to determine rater qualification, to backread raters, to determine rater calibration, to reset trend rescore items, to monitor trend rescore items through t-statistics reports, to monitor interrater reliability, and to gauge the rate at which scoring is being completed.
The scoring supervisors monitor work flow for each item using a status tool that displays the number of responses scored, the number of responses first-scored that still need to be second-scored, the number of responses remaining to be first-scored, and the total number of responses remaining to be scored. This allows the scoring directors and project leads to accurately monitor the rate of scoring and to estimate the time needed for completion of the various phases of scoring.
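The qualification threshold and the status counts described above can be illustrated with a minimal Python sketch. It is not the NAEP image-scoring system: the names `qualifies`, `Response`, `item_status` and `estimate_hours` are hypothetical, as is the `needs_second` flag. The sketch also assumes that "total remaining" is the sum of responses awaiting a first score and responses awaiting a second score, and that completion time is estimated from a constant scoring rate.

```python
from dataclasses import dataclass
from typing import Optional


def qualifies(rater_scores, prescored, required_percent=80):
    """True if the rater matches at least `required_percent` of a pre-scored set."""
    if not prescored or len(rater_scores) != len(prescored):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(r == p for r, p in zip(rater_scores, prescored))
    # Integer arithmetic avoids floating-point surprises at exactly 80 percent.
    return matches * 100 >= required_percent * len(prescored)


@dataclass
class Response:
    first_score: Optional[int] = None
    second_score: Optional[int] = None
    needs_second: bool = False  # hypothetical flag: selected for a second score


def item_status(responses):
    """Counts comparable to those the status tool is described as displaying."""
    scored = sum(r.first_score is not None for r in responses)
    awaiting_second = sum(
        r.first_score is not None and r.needs_second and r.second_score is None
        for r in responses
    )
    awaiting_first = sum(r.first_score is None for r in responses)
    return {
        "scored": scored,
        "awaiting_second": awaiting_second,
        "awaiting_first": awaiting_first,
        # Assumption: total remaining = awaiting first + awaiting second.
        "total_remaining": awaiting_first + awaiting_second,
    }


def estimate_hours(status, responses_per_hour):
    """Rough time to completion, assuming a constant team scoring rate."""
    if responses_per_hour <= 0:
        raise ValueError("responses_per_hour must be positive")
    return status["total_remaining"] / responses_per_hour
```

For example, a rater who matches four of five pre-scored responses reaches exactly 80 percent and would qualify under this sketch.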
After scoring begins, NAEP scoring supervisors review each rater's progress using a backreading utility that allows the scoring supervisor to review papers scored by each rater on the team. Scoring supervisors make certain to note the score the rater awards each response as well as the score a second rater gives that same paper. This is done as an interrater reliability check. Typically, a scoring supervisor reviews approximately 10 percent of all item responses scored by each rater.
Alternatively, a scoring supervisor can choose to review all responses given a particular score to determine if the team as a whole is scoring consistently. Both of these review methods use the same display screen, showing the ID number of the rater and the scores awarded. If the scoring supervisor disagrees with the score given an item, he or she discusses it with the rater for possible correction.
Whether or not the scoring supervisor agrees with the score, he or she assigns a scoring supervisor score in backreading. If this score agrees with the first score, the score is recorded only for statistical purposes. If the scores disagree, the scoring supervisor score overrides the first score as the reported score. Replacement of scores by the scoring supervisor is done only with the knowledge and approval of the rater, thereby serving as a learning experience for the rater. Changing the score does not change the measurement of interrater reliability.
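The backreading rules above can be sketched in a few lines of Python. This is an illustration only, and every function name is hypothetical. The text does not name the statistic used for interrater reliability, so percent exact agreement between first and second raters is used here as an assumption. Because that figure is computed from the two raters' scores alone, a supervisor override changes the reported score without changing the reliability measure.

```python
import random


def sample_for_backreading(response_ids, fraction=0.10, seed=None):
    """Pick roughly `fraction` of a rater's scored responses for review."""
    ids = list(response_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def reported_score(first_score, supervisor_score=None):
    """Score of record: the supervisor's score overrides on disagreement."""
    if supervisor_score is None or supervisor_score == first_score:
        return first_score  # agreement is recorded for statistics only
    return supervisor_score


def exact_agreement(pairs):
    """Percent of (first_score, second_score) pairs that match exactly.

    Assumption: percent exact agreement stands in for the reliability
    statistic; supervisor scores are deliberately not an input.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no double-scored responses")
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)
```

With 30 scored responses and the default fraction, `sample_for_backreading` returns three response IDs for the supervisor to review.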
Calibration of Scoring Raters
For new assessment items, the scoring supervisor of each team invokes calibration as needed from the tool used in backreading. During backreading, the scoring supervisor has a pool of 300 responses for each item to use in the calibration process. The scoring supervisor views samples of these responses together with the scores assigned by the first and, if applicable, second rater. From this pool, the scoring supervisor selects responses for calibration: responses that have been scored correctly and are a good measure to keep scoring on track. From the selected responses, the supervisor builds sets with the desired number of responses, usually between five and ten. These sets are then released on the image-based scoring system for raters to score.
When raters invoke the calibration window, all raters receive the same responses and score them. After all raters have finished scoring the set, the scoring supervisor can look at reliability reports, which include only the data from the calibration set just run. This process serves to refresh training and avoid drift in scoring. If pre-scored paper calibration sets already exist, these can be used to calibrate raters instead of the image-based sets created by the scoring supervisor.
In general, each team scores calibration sets whenever they take a break longer than fifteen minutes, such as when returning from lunch or at the beginning of a shift. Raters scoring trend rescore items are calibrated using trend rescore responses that are already loaded into the system with the scores given during prior year scoring.
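The calibration routine can be sketched as follows. This is a minimal illustration under stated assumptions, not the reporting tool itself: `needs_calibration` and `calibration_report` are hypothetical names, and the report is assumed to show each rater's percent exact agreement with the scores already attached to the calibration responses. The text says only that the reliability report is limited to the set just run.

```python
def needs_calibration(break_minutes, start_of_shift=False, threshold_minutes=15):
    """True at the start of a shift or after a break longer than the threshold."""
    return start_of_shift or break_minutes > threshold_minutes


def calibration_report(vetted_scores, rater_scores):
    """Per-rater percent agreement on a single calibration set.

    vetted_scores: dict mapping response ID to the score the supervisor accepted.
    rater_scores:  dict mapping rater ID to a dict of response ID -> score.
    Assumption: agreement with the vetted score is the reported measure.
    """
    if not vetted_scores:
        raise ValueError("calibration set is empty")
    report = {}
    for rater_id, scores in rater_scores.items():
        # A response the rater did not score counts as a disagreement.
        hits = sum(scores.get(rid) == s for rid, s in vetted_scores.items())
        report[rater_id] = 100.0 * hits / len(vetted_scores)
    return report
```

A supervisor could use a report like this to spot a rater who has drifted on one set, before releasing further responses to that rater.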

