Figure 2: Summary of the NAEP Item Scoring Process

Phase I. Scoring Guide Development and Pilot Scoring

Stage 1: Develop Scoring Guides
• Scoring guides are developed along with the items.
• Guides are refined during item reviews by the item development contractor, the standing committee (content experts), state item review, ongoing NCES reviews, and NAGB reviews.
• Items and guides then proceed to the pilot administration.

Stage 2: Score Pilot
Multiple-choice items are scanned and processed electronically with quality control and validity checks. Constructed-response items are scanned and scored by qualified and trained scorers using an electronic image-processing and scoring system, in the following stages:

Stage 2A: Refine Scoring Guides and Prepare Training Materials
• Content experts check student papers against the scoring guides.
• Scoring guides are refined.
• Examples and practice papers are selected for training packets.
• Pilot scoring guides and training packets are reviewed by NCES and the standing committee.

Stage 2B: Score with Quality Control Checks
• Train scorers and ensure scorer quality.
• Monitor scoring accuracy and reliability, both for individual scorers (backreading) and between scorers (inter-rater reliability, with 25% of responses double scored; see the illustrative sketch following this phase).

Stage 2C: Conduct Scoring Debriefing and Document Potential Refinements to Items and Scoring Guides

After pilot scoring, data are submitted for item analysis and items are selected for the operational assessment. Selected items are further refined during reviews by the standing committee, ongoing NCES reviews, and NAGB reviews.
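The figure does not specify how inter-rater reliability is computed on the double-scored responses. Purely as an illustration, and not as NAEP's actual procedure, the sketch below computes two common agreement statistics for a set of double-scored constructed responses: exact percent agreement and Cohen's kappa. The 0–3 rubric, the scores, and the function names are all hypothetical.

```python
from collections import Counter

def percent_agreement(first_scores, second_scores):
    """Share of double-scored responses on which both scorers agree exactly."""
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return matches / len(first_scores)

def cohens_kappa(first_scores, second_scores):
    """Chance-corrected agreement between two scorers."""
    n = len(first_scores)
    observed = percent_agreement(first_scores, second_scores)
    # Marginal score distributions for each scorer.
    p1 = Counter(first_scores)
    p2 = Counter(second_scores)
    # Agreement expected if the two scorers rated independently.
    expected = sum(p1[c] * p2[c] for c in p1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores on a 0-3 rubric for 10 double-scored responses.
first  = [3, 2, 2, 0, 1, 3, 2, 1, 0, 2]
second = [3, 2, 1, 0, 1, 3, 2, 2, 0, 2]
print(f"exact agreement: {percent_agreement(first, second):.0%}")
print(f"Cohen's kappa:   {cohens_kappa(first, second):.2f}")
```

Percent agreement is easy to interpret, but kappa also corrects for the agreement two scorers would reach by chance alone, which matters when most responses fall into a single score category.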

Phase II. First Operational Scoring (or Pre-Calibration)

In reading and mathematics, items next go to a pre-calibration administration; in all other subjects, they go directly to the first operational administration.

Stage 3: Score First Operational Administration (or Pre-Calibration)
Multiple-choice items are scanned and processed electronically with quality control and validity checks. Constructed-response items are scanned and scored by qualified and trained scorers using an electronic image-processing and scoring system, in the following stages:

Stage 3A: Refine Scoring Guides and Prepare Training Materials
• Content experts check student papers against the scoring guides.
• Scoring guides are refined.
• Examples and practice papers are selected for training packets.
• Operational scoring guides and training packets are reviewed by NCES and the standing committee.

Stage 3B: Score with Quality Control Checks
• Train scorers and ensure scorer quality (see the backreading sketch following this phase).
• Monitor scoring accuracy and reliability, both for individual scorers (backreading) and between scorers (inter-rater reliability, with 5–25% of responses double scored).

Stage 3C: Archive Final Scoring Guides and Training Materials

Data are then submitted for analysis. In reading and mathematics, item pre-calibration is followed by the first operational administration; in all other subjects, scaling and reporting are followed by the second operational administration.
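Backreading, in which a scoring supervisor re-reads a sample of each scorer's work, is listed as a quality-control check but not described in detail in the figure. As a hypothetical sketch only, the code below computes each scorer's exact agreement with the backreader and flags scorers who fall below an invented 90% threshold; the log format, threshold, and scorer IDs are all assumptions, not NAEP's actual procedure.

```python
from collections import defaultdict

# Hypothetical backreading log: (scorer_id, scorer_score, supervisor_score).
backreads = [
    ("S01", 2, 2), ("S01", 3, 3), ("S01", 1, 1), ("S01", 0, 0),
    ("S02", 2, 1), ("S02", 3, 2), ("S02", 1, 1), ("S02", 2, 2),
]

AGREEMENT_THRESHOLD = 0.90  # invented cutoff for this illustration

def backreading_report(records, threshold=AGREEMENT_THRESHOLD):
    """Report per-scorer agreement with the backreader; flag low scorers."""
    by_scorer = defaultdict(list)
    for scorer, score, supervisor in records:
        by_scorer[scorer].append(score == supervisor)
    for scorer, matches in sorted(by_scorer.items()):
        rate = sum(matches) / len(matches)
        status = "OK" if rate >= threshold else "FLAG for retraining"
        print(f"{scorer}: {rate:.0%} agreement over {len(matches)} backreads -> {status}")

backreading_report(backreads)
```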

Phase III. Subsequent Operational Scoring

Stage 4: Score Operational Administration (Trend Scoring)
Multiple-choice items are scanned and processed electronically with quality control and validity checks. Constructed-response items are scanned and scored by qualified and trained scorers using an electronic image-processing and scoring system, in the following stages:

Stage 4A: Score with Quality Control Checks
• Train scorers and ensure scorer quality.
• Calibrate scoring: qualify scorers for consistency with scoring in previous years.
• Monitor scoring accuracy and reliability, both for individual scorers (backreading) and between scorers (inter-rater reliability, with 5–25% of responses double scored).
• Monitor across-year scoring consistency and verify that it meets the criteria for trend reporting (see the illustrative check following this phase).

Stage 4B: Update Documentation on Scoring and Training After Each Administration

Data are then submitted for analysis, scaling, and reporting.
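The figure requires that trend scoring "meet criteria for trend" without stating what those criteria are. As an invented illustration of the idea, the sketch below compares this year's rescoring of a sample of prior-year responses against their archived scores and checks the agreement rate against a hypothetical 85% criterion; the criterion and the data are assumptions, not NAEP's actual rules.

```python
def trend_check(archived, rescored, criterion=0.85):
    """Compare current-year rescoring of prior-year papers with archived scores.

    `criterion` is an invented agreement threshold for this illustration.
    """
    agree = sum(a == r for a, r in zip(archived, rescored)) / len(archived)
    return agree, agree >= criterion

# Hypothetical: 12 prior-year responses, archived vs. current-year scores.
archived = [2, 3, 1, 0, 2, 2, 3, 1, 1, 0, 2, 3]
rescored = [2, 3, 1, 0, 2, 1, 3, 1, 2, 0, 2, 3]
rate, ok = trend_check(archived, rescored)
print(f"across-year agreement: {rate:.0%} -> {'meets' if ok else 'fails'} trend criterion")
```

A check along these lines would flag items whose current-year scoring has drifted from the prior year's standard before the data enter trend reporting.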
