The GRE® FAME Report Series

(Vol. 2)

New Directions in Assessment for Higher Education:

FAIRNESS, ACCESS, MULTICULTURALISM, & EQUITY
(FAME)
Papers by Brent Bridgeman, Kurt F. Geisinger, and Marcia C. Linn

Graduate Record Examinations

Editor: Shilpi Niyogi
Sponsored by the Graduate Record Examinations Program.
Published by Educational Testing Service, Princeton, NJ 08541-6000.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, Graduate Record Examinations, and GRE are registered trademarks of Educational Testing Service. The modernized ETS logo is a trademark of Educational Testing Service.
Copyright 1998 by Educational Testing Service.


Preface
In March of 1997, a conference was held to explore issues related to Fairness, Access, Multiculturalism, and Equity (FAME) in higher education. This two-day conference was co-sponsored by the Graduate Record Examinations Board and Xavier University of Louisiana. A primary goal of the FAME conference was to increase and document our store of knowledge about equity issues in higher education and assessment as they relate to racial and ethnic minority status, language and national background, gender, disabilities, and poverty. FAME conference speakers addressed assessment issues, issues of institutional policy and practice, and psychological and educational issues for individuals. As an additional outcome of the conference, we wished to identify important areas where research is needed to help increase equity in higher education and assessment.

Our goals for the FAME report series reflect the conference goals, but encompass more than dissemination. Our hope is that the reports will highlight the multiple perspectives that exist on important FAME issues, and raise hard questions about them. This monograph is the second in the series. In this monograph, we have gathered three papers that address important technical aspects of equity in assessment.

In the first paper, Brent Bridgeman of Educational Testing Service summarizes what we know, and what we still need to find out, about the critical fairness issues concerning the impact of computer-based assessments on test takers with different experiences, interests, and backgrounds. Dr. Bridgeman shows us just how many aspects of assessment, some of them quite surprising, are affected by the move to computer-based testing. In the second paper, Kurt Geisinger of Le Moyne College analyzes developments in testing accommodations for candidates with various types of physical and cognitive disabilities, and how practices are changing with the use of computers to deliver tests. Dr. Geisinger, who is also an expert in test validity, puts these issues in a larger context of test quality and fairness for all test takers. In the final paper, Marcia Linn of the University of California, Berkeley, offers some commentary on technology and educational equity. She underscores the importance of knowing how speed affects candidates' ability to show their best work, and offers some advice on considering the place of assessments in the larger world of educational indicators such as grades. As always, Dr. Linn's astute observations draw on her extensive technical knowledge and her deep concern for educational equity.

Carol Ann Dwyer
Executive Director, Program & Education Policy Research
Educational Testing Service


Table of Contents

Fairness in Computer-Based Testing: What We Know and What We Need to Know
Brent Bridgeman, Educational Testing Service

Testing Accommodations for the New Millennium: Computer-Administered Testing in a Changing Society
Kurt F. Geisinger, Le Moyne College

Equity and Knowledge Integration
Marcia C. Linn, University of California at Berkeley


Fairness in Computer-Based Testing: What We Know and What We Need to Know
Brent Bridgeman
Educational Testing Service
The movement from paper-based to computer-based testing has raised a number of concerns related to the fairness of computer-based tests. Are computer-based tests fair to students from gender and ethnic groups in which computer use may be relatively low? Do women or minority groups score lower on computer-based tests than on paper-based tests? Are there characteristics of computer-based tests that might make them differentially difficult for women or minority group members?

The Graduate Record Examinations (GRE®) General Test is currently offered as both a paper-based test and a computer-adaptive test (CAT). Three different methods have been used to study gender and ethnic differences across these formats. In one method, gender and ethnic differences among all students who took the paper-based test were compared to similar differences among those who took the CAT. In a second method, differences were evaluated among students who took the test in one mode and then repeated it in the other mode. Finally, a special study was conducted among students who agreed to be randomly assigned to one mode or the other. With all three methods, gender and ethnic differences were about as large on the CAT as on the paper-based test; the small differences occasionally observed tended to indicate that scores for ethnic minorities were relatively higher for the CAT.

Representative data from the random assignment study are presented in Figures 1-3. These figures show differences from the White mean for African American, Hispanic, and Asian groups. For example, Figure 1 shows that for the Verbal part of the General Test, African American students score about 125 points below the White mean, Hispanic students about 50 points below, and Asian students close to the White mean; it also shows that these differences are about the same for both the CAT and the paper-and-pencil (P&P) versions of the test. The small differences across formats are not practically significant. A similar lack of format effects was evident for gender groups.

An absence of format-related gender and ethnic differences does not necessarily imply equivalence on all characteristics that make CATs and paper-based tests different; a negative impact for one characteristic may be counterbalanced by a positive impact on another. Furthermore, a more complete understanding of the ways in which CATs and paper-based tests differ may have implications for the development of the next generation of CATs. Thus, it should be useful to summarize what we know and what we need to know about how these test formats differ. Among the ways paper-based tests and CATs differ are the following: question difficulty, ability to review and change answers, possibility for guessing correction, amount of text that can be displayed at one time, writing with a word processor versus writing with pencil and paper, timing issues, and penalties for incomplete tests.

FIGURE 1
Verbal Mean Score Difference, CAT vs. P&P, by Ethnic Group
[Bar graph: GRE Verbal score differences from the White mean for paper-and-pencil (P&P) and CAT scores, for African American, Hispanic, and Asian test takers.]


FIGURE 2
Quantitative Mean Score Difference, CAT vs. P&P, by Ethnic Group
[Bar graph: GRE Quantitative score differences from the White mean for P&P and CAT scores.]

FIGURE 3
Analytical Mean Score Differences, CAT vs. P&P, by Ethnic Group
[Bar graph: GRE Analytical score differences from the White mean for P&P and CAT scores. Note: For Figures 1-3, the numbers of test takers by ethnicity are as follows: for the paper-and-pencil test, there were 541 African Americans, 123 Hispanics, 168 Asians, and 1,060 Whites; for the computer-adaptive test, there were 483 African Americans, 121 Hispanics, 143 Asians, and 1,098 Whites.]

Difficulty

In a paper-based test, questions are typically arranged from easy to hard, and overall success levels, in terms of percent correct, depend on individual ability and skill. High-ability students may get very few questions wrong, and low-ability students may get few correct, especially as they approach the end of the test. On a CAT, questions are tailored to the individual's ability level, so nearly all students may correctly answer about half of the questions administered to them. Thus, high-ability students who are used to a high success rate on a paper-based test may be frustrated by a CAT, and students who are used to a lower success rate may be encouraged by their performance on the CAT, particularly near the end of the test when they are still getting questions that are not unreasonably difficult for them. Experiencing success on part of a test may lead to improved performance on subsequent parts of the test (Bridgeman, 1974); such success experiences may be especially beneficial for students who respond to a cultural stereotype that people in their group are perceived as less able in the area being tested (Steele & Aronson, 1995). A study by Lawrence Stricker and Isaac Bejar that is currently in progress will directly address the question of whether manipulating the proportion of questions answered correctly on the GRE-CAT has an impact on performance, especially for groups that have been stereotyped as relatively poor performers.
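The tailoring described above can be made concrete with a small sketch. The snippet below is a minimal illustration of adaptive item selection under a Rasch (one-parameter logistic) model, in which an examinee has about a 50% chance of answering an item whose difficulty matches his or her current ability estimate. It is not the GRE's actual item-selection or scoring algorithm; the item pool, the crude step-size update, and all parameter values are invented for illustration.

```python
import random
import math

def prob_correct(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def simulate_cat(true_ability, item_pool, n_items=20):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate, simulate a response, and nudge the estimate up or
    down (a toy step-size update, not a real IRT scoring rule)."""
    estimate, step = 0.0, 1.0
    used = set()
    n_correct = 0
    for _ in range(n_items):
        difficulty = min((d for i, d in enumerate(item_pool) if i not in used),
                         key=lambda d: abs(d - estimate))
        used.add(item_pool.index(difficulty))
        correct = random.random() < prob_correct(true_ability, difficulty)
        n_correct += correct
        estimate += step if correct else -step
        step = max(step * 0.8, 0.2)   # shrink the adjustment over time
    return estimate, n_correct / n_items

random.seed(0)
pool = [d / 10.0 for d in range(-30, 31)]   # item difficulties from -3 to +3
for ability in (-2.0, 0.0, 2.0):
    est, pct = simulate_cat(ability, pool)
    print(f"true ability {ability:+.1f}: estimate {est:+.2f}, proportion correct {pct:.2f}")
```

Whatever true ability is plugged in, the proportion of items answered correctly hovers near one half, which is the success-rate experience the paragraph above describes.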

Answer Changing

On paper-based tests, students may initially omit questions and then return to them later, or they can review previously answered questions and change their initial answers. In a CAT, questions must be answered in order, without any omissions or opportunities to return to previously answered questions. Although folklore suggests that it is unwise to change answers, because the first answer is the "best" answer, research suggests the opposite. Students are more likely to change answers from wrong to right than from right to wrong (Mueller & Wasser, 1977; Schwarz, McMorris, & DeMers, 1991); students may simply be more likely to remember the answer they changed from right to wrong than the one they changed from wrong to right.
An early computerized version of the GRE was a linear version (not a CAT) that allowed students to omit questions and to return to previously answered questions. As indicated in Table 1, mean gains from answer changing were small (Craig Mills, personal communication). Nevertheless, across all three measures — quantitative, verbal, and analytical — there were more changes from wrong to right than from right to wrong. Thus, it appears that scores would be higher, though only slightly higher, if students had an opportunity to review and change their answers on the CAT. It is also possible that differences would be even greater for a CAT, because more of the questions would be right at the limit of the person's ability. There would presumably be more uncertainty (hence more answer changing) for such questions than for questions that were clearly very easy or beyond the ability of the examinee. Models are currently being explored that would permit answer changing within small blocks of questions, such as all of the questions that were associated with a particular reading passage.

TABLE 1
Answer Changing on Linear GRE-CBT

                                      Quant.   Verbal   Analytical
N items                               60       76       50
Average Attempted                     59.23    75.34    48.54
Max N Reviewed                        8.09     17.73    6.84
Average N Revised (including omits)   3.52     8.21     4.85
From Omit to Right                    1.37     3.35     2.22
From Omit to Wrong                    1.31     2.68     1.89
From Wrong to Wrong                   0.27     0.73     0.21
From Right to Wrong                   0.13     0.43     0.09
From Wrong to Right                   0.44     1.02     0.43
Net Gain                              0.4      1.3      0.7
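The report does not define the Net Gain row, but one reading that reproduces it to rounding is to count each revision that ends on the correct answer as a gain and each revision that ends on an incorrect answer (other than wrong-to-wrong) as a loss. The quick check below is only that, a consistency check on my reading of the table, not a formula stated in the report.

```python
# Consistency check on the Net Gain row of Table 1 (values copied from the table).
changes = {
    "Quant.":     {"omit_to_right": 1.37, "omit_to_wrong": 1.31,
                   "right_to_wrong": 0.13, "wrong_to_right": 0.44},
    "Verbal":     {"omit_to_right": 3.35, "omit_to_wrong": 2.68,
                   "right_to_wrong": 0.43, "wrong_to_right": 1.02},
    "Analytical": {"omit_to_right": 2.22, "omit_to_wrong": 1.89,
                   "right_to_wrong": 0.09, "wrong_to_right": 0.43},
}

for measure, c in changes.items():
    gained = c["omit_to_right"] + c["wrong_to_right"]   # revisions ending right
    lost = c["omit_to_wrong"] + c["right_to_wrong"]     # revisions newly ending wrong
    print(f"{measure}: net gain of about {gained - lost:.2f} items")
# Prints roughly 0.37, 1.26, and 0.67 items, matching the rounded
# Net Gain row (0.4, 1.3, 0.7).
```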

Amount of Text Displayed

Long reading passages cannot be displayed on a computer screen at one time. Instead, the examinee taking a CAT must scroll or page through passages. Although this scrolling may be problematic for certain individuals, there is no reason to expect that it would contribute to subgroup differences. Nevertheless, research into possible negative consequences of scrolling should be conducted.

Guessing Correction

Some paper-based tests have a guessing correction in which wrong responses are penalized more than omitted responses. The formula typically used is R - (W/(k-1)), in which R is the number of questions answered with the right answer, W is the number answered with the wrong answer, and k is the number of response options for each multiple-choice question. Although the GRE General Test, in both computer and paper formats, does not have a guessing correction, the GRE Subject Tests, the SAT, and the paper-based GMAT do. Because CATs require an answer to every question before the next question is presented, no omitting is allowed and there is no guessing correction. If the guessing correction on a paper-based test negatively affected a particular population subgroup (perhaps by making its members unwilling to guess even when they knew enough to make an informed guess), scores for that group might be relatively higher on a test with no guessing correction. Although there is some evidence that women are more reluctant to guess than men, this reluctance seems to have a minimal impact on observed gender differences in test scores (Grandy, 1987; Ben-Shakhar & Sinai, 1991). Perhaps the best evidence for the nonimpact of the guessing correction comes from the GRE General Test, which dropped its guessing correction in the 1981-82 test year. Gender and ethnic differences were essentially the same for the last year of the test with the guessing correction as for the first year without it. Therefore, there is little reason to expect that the absence of a guessing correction on CATs would have a noticeable effect on group differences.
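To make the formula concrete, here is a minimal sketch of formula scoring versus rights-only scoring for a five-option multiple-choice test. The item counts for the two hypothetical examinees are invented for illustration; this is not ETS's scoring code.

```python
def formula_score(right, wrong, options=5):
    """Formula score R - W/(k-1): wrong answers are penalized, omits are not."""
    return right - wrong / (options - 1)

def rights_only_score(right, wrong, options=5):
    """Rights-only score: omitted and wrong answers both earn zero."""
    return right

# A hypothetical 60-question test with five options per question.
# Examinee A omits 10 questions; examinee B guesses on those 10 and,
# by chance alone, gets about 2 of them right (10 * 1/5).
a_right, a_wrong = 40, 10   # 10 omitted
b_right, b_wrong = 42, 18   # no omits

for name, r, w in [("A (omits)", a_right, a_wrong), ("B (guesses)", b_right, b_wrong)]:
    print(f"{name}: formula score = {formula_score(r, w):.1f}, "
          f"rights-only score = {rights_only_score(r, w)}")
# Under formula scoring the two come out about the same (40 - 2.5 = 37.5 vs.
# 42 - 4.5 = 37.5); under rights-only scoring, willingness to guess is worth
# roughly two extra raw points.
```

This is the mechanism behind the concern above: a group reluctant to guess is expected to be at a disadvantage only under rights-only scoring, not under a guessing correction.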

Word-Processed Versus Handwritten Essays
If an essay examination is part of a computer-based assessment, the essay may either be written by hand or entered on a word processor. If all students taking the test are presumed to be experienced word-processor users, they could be required to use a word processor. The Graduate Management Admissions Test® (GMAT®) does just that. (The computerized GMAT became operational in October 1997.)


On the other hand, the essay associated with the basic skills test that is part of the computerized Praxis Series: Professional Assessments for Beginning Teachers may be either written by hand or entered with a word processor. Preliminary plans are to offer a similar option when an essay is added to the GRE-CAT in 1999. Allowing choice of response mode and not allowing choice both raise fairness and equity concerns. If all students are forced to use a word processor, less experienced word-processor users could be disadvantaged. Even experienced users could have problems if the features on the system used for the test were different from those on the system they typically used. However, forcing all students to handwrite essays is not an equitable solution, because students who routinely use a word processor could be disadvantaged by being forced to use what for them is an unfamiliar response mode.

Allowing choice of production mode solves some equity concerns, but raises others. Raters may be influenced by the differences in physical appearance of handwritten and word-processed essays. Powers and his colleagues addressed this issue in their study titled "Will they think less of my handwritten essay if others word process theirs?" (Powers, Fowles, Farnum, & Ramsey, 1994). They took essays that were produced either by hand or on a word processor and had them rewritten in the other mode; that is, the word-processed essays were rewritten by hand and the handwritten essays were reproduced with a word processor. Contrary to initial expectations, raters assigned higher scores to handwritten essays than to word-processed versions of the same essays; this was true regardless of the mode in which the essay was originally produced. The authors speculate that this could be related to an expectancy that word-processed essays should be more refined; minor grammar and spelling errors that may be overlooked in a hastily handwritten, unpolished first draft may be penalized in a superficially more professional-appearing word-processed essay, even though the word-processed essay was actually produced under the same hurried conditions. Making raters aware of this tendency to downgrade word-processed essays reduced the effect but did not entirely eliminate it. Whether students are all forced to produce essays in the same mode or raters are forced to make fair evaluations of essays produced in different modes, ensuring equity in essay assessments will require continued research and constant attention.


Timing
Time limits on tests such as the GRE General Test serve a number of functions. They are an administrative convenience that allows tests to be scheduled in specific blocks, with fixed payments to test administrators. For CATs, time limits help control the costs related to renting computer time at a testing center. Time limits may contribute to the validity of an assessment if rapid performance is an important part of performance on the criterion of interest. However, time limits raise equity concerns if the time limit is imposed for administrative convenience rather than as an essential part of what the test is supposed to measure.

Evidence from the paper-based GRE General Test suggests that a moderate amount of extra time has no impact on the size of gender and ethnic differences. Wild and Durso (1979) studied the effect of adding 10 minutes to a 20-minute verbal section containing 26 questions and the effect of adding 10 minutes to a 20-minute quantitative section that contained 14 questions. Test centers were randomly assigned to one of the timing conditions; 550 centers and 9,800 examinees were included in the study. Scores were slightly higher with the 30-minute time limit. Some individuals may have benefited more from the extra time than others, but ethnic and gender differences remained the same.

A study of GRE quantitative questions in an experimental open-ended format (Bridgeman, 1992) used a fixed time limit but had forms with either 19 or 24 questions. Scores based on 14 questions that were common to both the long and short forms suggested that women might be at a slight advantage on the form with fewer seconds per question. This unexpected finding could be related to the way the questions were arranged, with the most difficult questions at the end (as is common on most standardized tests). Students with a strong mathematics background in college (who, statistics show, are more likely to be men) would be more likely to get these hard items at the end of the test correct if they had time to fully consider them.
Under these circumstances, a longer time limit could favor men. Note, however, that in a CAT, the questions at the end are not uniformly difficult; rather, they are tailored to the individual test taker's ability. Thus, results from timing studies for paper-based test administrations may not generalize to CATs. Furthermore, the studies of paper-based tests deal with only modest time extensions; the effect on group differences of providing unlimited time, or of extending time limits by a factor of two or three, is not known.

The time management problem is very different on CATs than it is on paper-based tests. In a paper-based test, any time left over at the end may be used by the test taker to return to previously skipped questions or to recheck previously answered questions. But because there is no opportunity to go back in a CAT, time left over at the end of the test is wasted. If a student goes too fast, there is no opportunity to recheck hastily answered questions, and if the student is too slow, time will expire before the end of the test is reached. Although examinees are advised to pace themselves through the test, knowing what should be done and actually being able to do it are not necessarily the same. The increased importance of self-pacing on CATs suggests that students who have completed several practice tests in order to find their optimal test-taking speed will be at an advantage. Fair assessment will then require that all students be given the opportunity to take a practice CAT under realistic timing conditions.

Penalty for Incomplete Tests
In paper-based tests, the score is directly related to the proportion of the questions in the test that are answered correctly. Typically, it is impossible to receive a high score if a large number of questions are left unanswered. However, with a CAT, an ability estimate can be made after each question in the test is answered; and on tests such as the Analytical scale of the GRE General Test (GRE-A), these estimates become quite stable after about 28 questions have been answered. With a few simple assumptions, which I will address momentarily, one's estimated ability after completing 28 questions would not likely vary drastically from that same person's ability estimate after all 35 questions had been administered. Thus, when the GRE-CAT was first introduced, an 80% rule was established stating that a student could receive a score as long as 80% of the questions in the test had been answered. This was intended to ensure that students who failed to reach the last few questions could still get a score and to avoid penalizing students who were relatively slow.

An unintended consequence of this 80% scoring rule eventually became obvious, as some test-wise students began pacing themselves so that they would finish only the minimum 80% needed for a score — thus allowing themselves more time per question. The time savings could be substantial. For example, GRE-A students who answered all 35 questions in the allotted 60 minutes would have 103 seconds per question; students who answered only the minimally required 28 questions had 129 seconds per question.

As noted earlier, "a few simple assumptions" are necessary if an ability estimate based on an 80% completion rate is to represent one's ability on the entire test. One of these assumptions is that the likelihood of answering a question correctly would not improve if more time were spent on that question. This may be reasonable for some verbal questions — for which additional time will not help a student respond to a question about a vocabulary word that she or he does not know — but it is undoubtedly not true for many GRE Analytical problems and at least some GRE quantitative questions.

The strategy of intentionally not completing the test was apparently used more by some subgroups than others. For example, in September of 1996, the last month that the 80% rule was in effect, only 57% of the White students completed the test compared to 71% of African American students. This difference is especially striking given that African American students, on average, score about 150 points lower than White students on the GRE-A and would be expected to complete less, not more, of the test in the absence of an intentional strategy to obtain more time per question by not finishing. In October of 1996, the 80% rule was dropped and a proportional scoring system was instituted. As explained in the GRE 1996-97
Information and Registration Bulletin, “Your score on the CAT will now be dependent on how well you do on the questions presented as well as on the number of questions you answer. Therefore, it is to your advantage to answer every question, even if you have to guess to complete the test” (bold in original). As indicated in Figure 4, the percent of students answering all Analytical questions increased dramatically when this new scoring rule took effect. The increase was especially large for White students, who were lowest, in terms of percent finishing, in the month prior to the introduction of proportional scoring and highest in the first month of the new scoring rule. The impact of the change to proportional scoring on the size of ethnic group differences can be seen in Figure 5. The bars show how far each ethnic group is below the mean for White students. The three major ethnic groups all gained, relative to White students, when proportional scoring was introduced, with the largest gain (20 points) noted for African American students. White students had apparently been disproportionately benefiting from the 80% rule — and when they could no longer use this rule to gain more time per question than the test designers had intended, their relative advantage became smaller (although it remained large in absolute terms). The experience with the 80% rule underscores the importance of being vigilant for unintended consequences. This rule was instituted with the best of intentions, as a way to make the GRE fairer for relatively slow students, but it eventually became apparent that it was really benefiting the most test-wise students.
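The time advantage that motivated the strategy is simple arithmetic, sketched below with the figures quoted above (60 minutes, 35 Analytical questions, and the 28-question minimum under the 80% rule). The snippet illustrates only the incentive, not how proportional scoring itself was computed.

```python
SECTION_MINUTES = 60
TOTAL_QUESTIONS = 35
MINIMUM_UNDER_80_PERCENT_RULE = 28   # 80% of 35, rounded up

def seconds_per_question(questions_answered, minutes=SECTION_MINUTES):
    """Average time available per answered question."""
    return minutes * 60 / questions_answered

full = seconds_per_question(TOTAL_QUESTIONS)
minimum = seconds_per_question(MINIMUM_UNDER_80_PERCENT_RULE)
print(f"Answering all {TOTAL_QUESTIONS} questions: {full:.0f} seconds each")
print(f"Stopping at {MINIMUM_UNDER_80_PERCENT_RULE} questions: {minimum:.0f} seconds each")
print(f"Extra time per question gained by stopping early: {minimum - full:.0f} seconds")
# 103 vs. 129 seconds per question, a gain of about 26 seconds per item.
```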

FIGURE 4
Percent Answering All Analytical Questions, by Ethnicity
[Bar graph: percent of Asian, African American, Hispanic, and White test takers answering all GRE-CAT Analytical questions under the 80% rule (September 1996) and under proportional scoring (October 1996).]

FIGURE 5
GRE Analytical Score Differences, 80% Rule vs. Proportional Scoring, by Ethnic Group
[Bar graph: GRE-CAT Analytical score differences from the White mean under the 80% rule (September 1996) and under proportional scoring (October 1996). Asian: -34 vs. -22; African American: -166 vs. -146; Hispanic: -81 vs. -67.]


Conclusion
Many of the features of existing CATs have a potential impact on fairness and access. Data from the GRE program suggest that, taken together, these features have a relatively small impact on gender and ethnic group differences. Nevertheless, a particular feature could still have a substantial impact on certain individuals (for example, those with especially poor word-processing skills or those who have a problem pacing themselves on timed tests). Furthermore, as the experience with the 80% rule suggests, some modifications in the ways CATs are administered and scored could have consequences for group differences that are the opposite of what was originally intended.

More study of existing CATs is needed, but the next generation of CATs has the potential to make a much larger impact on issues of fairness, access, multiculturalism, and equity. Future GRE-CATs will go beyond an exclusive reliance on multiple-choice question types and will be able to measure areas that go beyond the verbal, quantitative, and analytical skills currently assessed. Ethnic and gender differences in these new areas may be larger or smaller than those in the domains traditionally assessed. What we need to know about these new assessments must include information on their impact on fairness and access.
Note: The author would like to thank Rob Durso for providing the data for Figures 1-5.


References
Ben-Shakhar, G., & Sinai, Y. (1991). Gender differences in multiple-choice tests: The role of differential guessing tendencies. Journal of Educational Measurement, 28, 23-35.

Bridgeman, B. (1974). Effects of test score feedback on immediately subsequent test performance. Journal of Educational Psychology, 66, 62-67.

Bridgeman, B. (1992). A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement, 29, 253-271.

Grandy, J. (1987). Characteristics of examinees who leave questions unanswered on the GRE General Test under rights-only scoring (GRE Board Professional Report No. 83-16P; ETS RR-87-38). Princeton, NJ: Educational Testing Service.

Mueller, D. J., & Wasser, V. (1977). Implications of changing answers on objective test items. Journal of Educational Measurement, 14, 9-13.

Powers, D. E., Fowles, M. E., Farnum, M., & Ramsey, P. (1994). Will they think less of my handwritten essay if others word process theirs? Effects on essay scores of intermingling handwritten and word-processed essays. Journal of Educational Measurement, 31, 220-233.

Schwarz, S. P., McMorris, R. F., & DeMers, L. P. (1991). Reasons for changing answers: An evaluation using personal interviews. Journal of Educational Measurement, 28, 163-171.

Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797-811.

Wild, C. L., & Durso, R. (1979). Effect of increased test-taking time on test scores by ethnic and gender subgroups (GRE 76-06R). Princeton, NJ: Educational Testing Service.


Testing Accommodations for the New Millennium: Computer-Administered Testing in a Changing Society
Kurt F. Geisinger
Le Moyne College
One of the themes of the FAME initiative is inclusiveness. We know that American society has changed in many ways. With regard to this issue of inclusiveness, our society no longer wishes to relegate individuals who differ in some way from the majority group in our nation to statuses any different from those held by majority group members. One such group is composed of those with disabilities. We no longer expect that individuals with disabilities of one type or another will be relegated to institutions or even to limited rights and responsibilities. Rather, many such individuals can both succeed individually and improve our society through their achievements — attainments that sometimes require proper accommodations. We believe this just as we believe that those who come from other countries and whose English may not yet be fluent, or at least fluent in the academic context of test taking, should nevertheless be permitted to succeed to the best of their abilities, or that those whose developmental opportunities have been limited should be afforded improved chances for success. This freedom of opportunity has been a hallmark of our society throughout the history of the United States of America.

Legislation Regarding Those With Disabilities

The assessment of students with disabilities has taken on considerable importance since the passing of the Americans with Disabilities Act (ADA) of 1990 (PL 101-336), although most of the legal requirements for assessing students were previously justified by Section 504 of the Rehabilitation Act of 1973. Generally, the best methods for assessing students with disabilities coincide with the legally defensible methods for this activity. Under ADA, a "disability is defined as (a) a physical or mental impairment that substantially limits one or more life activities, (b) a record of such an impairment, or (c) being regarded as having an impairment, despite whether or not the impairment substantially limits major life activities" (Geisinger, 1994, p. 123). ADA requires that assessment of individuals with disabilities be performed with all reasonable accommodations being made. The word reasonable, of course, is ambiguous and differs depending upon the circumstances of the assessment. This paper presents some of the kinds of accommodations that have been traditionally offered to test takers with disabilities, but it does not address the so-called issue of "flagging" — that is, of identifying when an individual has not taken a measure under standardized conditions — because I believe this issue is a legal and procedural one, rather than a psychometric one per se.

Accommodations With Paper-and-Pencil Tests

As noted above, ADA has extended particular rights to individuals with disabilities. While ADA was primarily oriented to employment testing, its statutes are rather broad and its spirit encourages us to permit those with disabilities to achieve "whatever status they seek if they meet the essential requirements of that status, regardless of their disability" (Geisinger, 1994, p. 124). As noted above, tests must be administered to individuals with disabilities using reasonable and appropriate accommodations. This requirement, however, conflicts with the notion that we must uniformly administer tests to all individuals to maintain both validity and fairness. Furthermore, in educational settings, mandates that all individuals should receive testing would appear to necessitate such accommodations, even when it is difficult, if not impossible, to make such assessments with valid meaning. Finally, the use of ambiguous terms such as reasonable provides insight into why our nation has the largest number of attorneys in the world.

The intent of most of the legislation concerned with the testing of individuals with disabilities has been to enforce valid and nondiscriminatory measurement of levels of academic achievement, aptitude, and other constructs, independent of any handicapping conditions.


Such idealistic goals are laudable but difficult to implement, given the impact of disabling conditions on the educational and developmental experiences of individuals. To be sure, laws passed in the last 25 years have attempted to safeguard the rights of people with disabilities. Tests used in admissions decision making in higher education have been under the jurisdiction of Section 504 of the Rehabilitation Act of 1973, which mandates both that admissions tests administered to individuals with disabilities be validated and that scores resulting from such instruments reflect ability and aptitude — whatever the test was intended to measure — rather than any disabilities extraneous to what is being assessed.

As previously noted, ADA has both replaced and enhanced the rights extended under Section 504 of the Rehabilitation Act. Essentially, ADA extends to persons with disabilities the rights extended to other protected groups (e.g., racial minorities). With regard to educational testing, the ADA does not guarantee that individuals with disabilities will be admitted, passed, or graduated, or receive any other status as a result of their test scores. Rather, it assures that they may achieve "whatever status they seek if they meet the essential requirements of that status, regardless of their disability" (Geisinger, 1994, p. 124), as long as their disability does not make success impossible — a rare event indeed. In addition, any means used to assess whether such individuals have met the essential requirements of an educational status must not be unnecessarily difficult as a result of disabilities irrelevant to what is being assessed.

The number and types of reasonable accommodations needed to administer many large-scale tests to individuals with every variety of disability are themselves almost overwhelming. Further, we must multiply that number by the almost infinite variations in degrees of disability, thereby producing a staggering number of potential departures from the standard format. Thus, although the uniform administration of educational tests generally has limited sources of variation to those stemming from individual differences among test takers, once scores from standard and nonstandard test administrations are considered as one set of scores, sources of variation are introduced from differences in test administration as well.

In group-administered standardized testing situations, a number of discrete modifications are likely to be offered regularly, but specific, individually developed accommodations may not be available. Although such accommodations are likely to be used with computer-administered and computer-adaptive tests, they may differ from those used with paper-and-pencil standardized tests.

Willingham, Ragosta, Bennett, Braun, Rock, and Powers (1988) studied the use of standardized admissions tests in higher education for applicants with disabilities. They divided those with disabilities into four categories: those with visual impairments, those with hearing impairments, those with physical handicaps, and those with learning disabilities. There are different types of testing accommodations associated with each of these disabilities. The general objective of any modified test administration is to "provide a test that eliminates, insofar as possible, sources of difficulty that are irrelevant to the skills and knowledge being measured" (Willingham et al., 1988, p. 3).

A large variety of accommodations in testing can be provided, alone or in combination. Some accommodations are made in a group format; others are individualized, even for tests that are generally group administered. Accessibility is a basic requirement for individuals with many kinds of disabilities. For performance assessments, a larger number of accommodations may be possible than for paper-and-pencil multiple-choice tests. In extreme circumstances, changes in some of the abilities and content covered on examinations may be required. These changes, which certainly could have a serious impact on the validity of an assessment, are described later, under the "General accommodations" subhead.

The Numbers of Individuals With Disabilities
Willingham and his associates (1988) found that approximately 2.21% of the test takers on the Scholastic Aptitude Test (SAT) and 1.7% of those on the Graduate Record Examinations (GRE) self-identified as disabled, and that these numbers have increased over time. Most increases have been in the learning disabled group. Between 80% and 90% of the group self-identified as disabled took the standard test administration, while approximately 0.43% of the entire test-taking population on the SAT and 0.18% of the test takers on the GRE required nonstandard test administrations. The majority of those requiring nonstandard administrations (accommodations) on the SAT were those with learning disabilities; on the GRE, the majority were those with visual impairments. In any case, the small number of persons with handicapping conditions who take these nationally administered, standardized tests makes it extremely, and perhaps prohibitively, difficult to validate empirically and otherwise study even the most common of the so-called "accommodations" on educational assessments.

The validation of large-scale assessments, such as standardized achievement tests that are justified on the basis of content validation, demands a qualitatively different form of analysis. Whether individuals with various handicapping conditions have been (equally) exposed to the content knowledge and cognitive processes assessed on a given instrument should be evaluated, and could be carefully considered, by a panel of individuals who are expert both in the subject matter and in the disability (or disabilities) in question. They may also need to be knowledgeable about the education of individuals with specific disabilities.

Table 1 presents a listing of the numbers of students who are in federally supported programs for the disabled (National Center for Educational Statistics, 1995). Obviously, many individuals who will be taking any of a variety of educational tests, including admissions tests, will not be in such programs. But the data in this table provide some information on the relative numbers of those with various handicapping conditions who may be interested in test administration accommodations.

TABLE 1
Numbers of Children With Disabilities in Federally Supported Programs for the Disabled (1992-93), by Type of Disability [1]

Type of Disability               Number [2]   % of Those Disabled   % of Total
Specific learning disabilities        2,354                  45.9         5.50
Speech/language impairments             996                  19.4         2.33
Mental retardation                      519                  10.1         1.21
Serious emotional disturbances          401                   7.8         0.94
Hearing impairments                      60                   1.2         0.14
Orthopedic impairments                   52                   1.0         0.12
Other health impairments                 65                   1.3         0.15
Visual impairments                       23                   0.5         0.05
Multiple disabilities                   102                   2.0         0.24
Deaf-blindness                            1                   0.0 [3]     0.00 [4]
Autism and other                         21                   0.4         0.04
Preschool disabled [5]                  531                  10.4         1.24
Total                                 5,125                 100.0        11.97

[1] Data taken from National Center for Educational Statistics (1995).
[2] Numbers are in thousands.
[3] Less than 0.05%.
[4] Less than 0.005%.
[5] Includes preschool children ages 3-5 and 0-5 served under Chapter I and IDEA, respectively.

For most written standardized tests, those with specific learning disabilities, speech/language impairments, mental retardation, and serious emotional disturbances will be able to take the standard assessment, perhaps with increased time. This group of individuals with disabilities accounted for more than 80% of all such persons represented in elementary and secondary federally supported programs during the 1992-93 academic year. Some of the groups are less likely to be represented in the higher education admissions test-taking population, of course, but we need to be prepared to make the appropriate accommodations for them, should they choose to join this student population. Table 2 indicates that there has been an increase of 13.33% in the numbers of children and youth with disabilities served in selected federally funded educational programs during the time period 1989-94.


This increase appears to have emerged primarily due to increases in the number of pupils with learning disabilities, multiple disabilities, other health impairments, and autism. The overall increase occurred in spite of decreases in the numbers of individuals identified in the speech impairment, mental retardation, and hearing impairment groups. We must continue to bear in mind that although we can make groupings of individuals with various disabilities, there are gradations within each disability grouping. Some individuals with a disability may be able to take a test without accommodation; others may require a simple accommodation, and still others may need radical changes. Such adaptations will generally require communication between the potential test taker or his/her representatives and the test administrator.

TABLE 2
Changes in the Numbers of Children With Disabilities in Federally Funded Programs, 1989-94, by Type of Disability [6]

Type of Disability              Number [7], 1989   Number, 1994   % Change
Learning disabilities                       47.8           50.9       6.49
Speech/language impairments                 23.1           21.4      -7.36
Mental retardation                          13.8           11.3     -18.12
Serious emotional disturbance                8.9            8.7      -2.25
Hearing impairments                          1.4            1.3      -7.14
Orthopedic impairments                       1.1            1.2       9.09
Other health impairments                     1.2            1.7      42.67
Visual impairments                           0.5            0.5       0.00
Multiple disabilities                        2.0            2.3      15.00
Deaf-blindness [8]                           0.0            0.0       0.00
Autism                                        NA            0.4         NA
Traumatic brain injury                        NA            0.1         NA
Total                                    4,125.5        4,730.4      13.33

[6] Data taken from U.S. Bureau of the Census (1996).
[7] Numbers are in thousands.
[8] Less than 0.05%.

Accommodations With Computer-Administered Tests
In the coming decade, many more examinations will be administered by computer than are at present. It is likely that some standardized assessments will be offered only by computer administration.

Let us consider what accommodations may be appropriate for this type of assessment, using for this initial analysis the four groupings employed by Willingham and his associates: individuals with visual impairments, those with hearing impairments, those with physical handicaps, and those with learning disabilities. Each of these is addressed in turn. After a discussion of the above four groups, some general accommodations are suggested; we may wish to consider these accommodations for several groups. I also consider some of the accommodations that may be appropriate for those in the "other health impairments" group, which has been growing. Please note that in considering many of these accommodations, rather than discussing them with testing specialists, I have discussed them with academic specialists who work with individuals who have some of the disabilities. In some instances, I have cloned the techniques they have been using to help their students learn and communicate, and adapted them for computer-administered testing.

Visual impairments. Currently, paper-and-pencil test forms are provided in regular (standard), improved-type, large-type, Braille, and audio-cassette formats. Time limits can be enforced, extended, or waived altogether. Test takers with varying degrees of visual disability may need a reader; an amanuensis, or personal recorder; a tape recorder to register answers; or extra rest pauses to meet their particular requirements. It is possible that tests with especially clear type or larger type could be provided by using higher quality and larger monitors. Larger type may be needed on the keys of the keyboard as well. In some instances, software adaptations would be needed to increase the font size on the screen. In such cases, the tests could still be provided in an adaptive manner, where performance on each item influences subsequent item selection. However, software and headphones may also permit the computer to present the items to a test taker with a visual disability orally rather than on the screen. Voice recognition software used in conjunction with computers with microphone technology will even permit oral responses by the test taker (although in such instances, test takers will need to be assessed alone so that their vocalizations neither help nor hinder others taking the test who might be able to hear them).

With such software and hardware adaptations, it is now possible to imagine a test taker with a visual disability hearing rather than reading objective test questions through a headset. This test taker could make oral responses, with the computer scoring the performance of the test taker just as it does for those using a keyboard. The testing software would then choose questions based on the test taker's performance on the preceding questions. Clearly, however, validation demonstrating the comparability of such assessments to the standardized format would be useful; but given that many of our current college students with visual impairments "read" their books by listening to them on tape, the format of the assessment would initially appear comparable and appropriate.

In some testing circumstances, it may be best to continue to offer standardized assessments to those with visual disabilities in a paper-and-pencil format. It is difficult to imagine a computer-adaptive test using Braille items, for example. The costs of large-screen monitors, high-resolution fonts, and computer-enunciated items, as well as of developing the necessary software and oral response technology for test takers, may suggest that it would be more efficient, and perhaps as valid, to continue to offer paper-and-pencil tests in high-resolution, large-type, and Braille versions. Furthermore, in a paper-and-pencil format, it is clear that the test continues to measure — to a great extent — the skills and abilities called for in the original test plan. In extreme circumstances, changes in some of the abilities and content covered on examinations may be required. These changes, which obviously could have a serious impact on the validity of an assessment, are described under "General accommodations."

Deaf and hard-of-hearing. Most paper-and-pencil standardized tests do not require significant modifications for those with hearing disabilities. In situations where instructions are presented orally, it might be possible to modify the instructions so that they are presented in a written format, orally with significantly louder volume, or through sign language. It would be possible, using CD-ROM technology, for example, to present instructions or other components of a test that are normally presented orally by showing an individual on the screen who is providing the communication using American Sign Language (ASL), oral interpretation, or another communication approach.

Assessments that depend directly upon oral communication, such as tests of listening skills, will certainly be more difficult to adapt for those with hearing difficulties. However, in the case of admissions testing, assessments of ASL or electronic communication skills may suffice, depending upon the nature of the accommodations that are present in the actual educational setting.

Physical disabilities. There are, of course, many types of physical handicaps, and no single accommodation will be correct for all of them in a given assessment. For test takers in wheelchairs, accessible test centers and computers at which to take the examinations are minimally necessary. There are numerous physical and software adaptations for keyboards that have improved the performance of individuals with various kinds of physical impairments. Among the physical changes are keyguards (which permit test takers to rest their hands without causing keys to fire), mini-keyboards that do not necessitate significant movement between keys (but which demand fine motor control), membrane keyboards that are extremely responsive to touch, and expanded keyboards that can be used by individuals who may have difficulty with fine motor control. Among the software adaptations that have been found useful by those working with individuals with physical disabilities are systems that allow the test taker to shut off keys; word prediction software that essentially guesses what a typist is keying in, thereby potentially saving a test taker valuable keystrokes; and software patches that permit individuals to touch multiple keys sequentially, rather than simultaneously. Of course, in some instances, such as the case of quadriplegics, mouth wands are often the accommodation of choice. As in the case of those with visual disabilities, dictation software that permits individuals to speak their answers aloud would be very useful for some test takers.

Learning disabilities. Like the category of physical disabilities, the category of learning disabilities includes many different problems and patterns of problems, which no single descriptive title can adequately capture. Few learning disabilities are precisely defined. Some of the accommodations previously cited also apply to individuals with learning disabilities.


For example, some students with learning disabilities use dictation software because they find it significantly easier to speak their answers than to type them. As with individuals with visual impairments, readers, large-type test forms, and other adaptations of test materials can also be used. However, the most common accommodation is that of extended time. I think that the advent of computer-administered testing provides the testing profession with the rare opportunity to rethink some of the issues related to the timing of standardized tests. This topic is discussed in the following section.

General accommodations. In this section, I wish to address two general test-taking matters: time to take a test and rest pauses in testing. In addition, I will consider the ability to offer all content to all test takers.

Many standardized paper-and-pencil tests are administered to individuals with additional time provided to take the test. In fact, it is frequently stated that numerous standardized tests of developed academic ability are power tests, or tests that can be given in an untimed manner and yield results that parallel those of the same test given in a timed administration. Such statements have sometimes been made about the Miller Analogies Test, for example, although I have not been able to validate them (that is, I have not found any empirical research that confirms the above premise). Sternberg (1977) reports the results of a small study he performed in which a form of the Miller Analogies Test composed entirely of practice items was administered in a timed fashion along with a group of other measures. Results indicated that scores were highly correlated with performance on reasoning tests but were not correlated with scores on tests of perceptual speed. He therefore concluded that it was indeed a power test and not a test of speededness. For apparently similar reasons, the Watson-Glaser Critical Thinking Appraisal (The Psychological Corporation, 1994) may now be administered in either a timed or an untimed version, although I have questioned the meaningfulness of the norms when it is given in an untimed fashion (Geisinger, 1998).

Test taking has changed with the advent of computer-adaptive tests. In most paper-and-pencil test-taking formats, examinees can revisit items to change or reconsider their answers, or to provide an answer for a question they skipped rather than spending undue time on it when they first encountered it on the test form.

On the computerized administrations of the same test, however, they may not be able to return to an item; indeed, subsequent selections of test items may have been influenced by the test taker's performance on the previous item. On this basis, it would appear that extra time will not help sophisticated test takers on a computer-administered test in the same way that it might on a paper-and-pencil test.

I would like to see research conducted that provides individuals with two chances to take a computer-administered admissions measure, once in a timed fashion and once in an untimed manner. If correlations are high and mean differences trivial, then perhaps we could permit all test takers on some instruments to take these measures in an untimed manner. In fact, with computer-administered tests, in addition to the scores earned, we could record the time taken to complete a test, for those who believe that such information is important. (I would not necessarily argue that we do so.) A few pragmatic issues would result, such as the fact that scheduling the computer workstations used for testing would become more difficult. We may find that many of our admissions tests are power tests to the extent that additional time would not influence scores appreciably, especially if all test takers had the additional time. Indeed, if everyone took the tests under untimed conditions, I believe many issues related to the flagging of test scores would be eliminated.

If, on the other hand, we choose to retain time limits, we must also engage in research to answer the question, "How much additional time is enough?" Willingham and his associates (1988) have shown convincingly that providing additional time for test takers with learning disabilities may lead to overpredictions of academic performance. We may need to engage in research to determine an answer to the above question; or, alternatively, we may need to request accommodation forms that permit the endorsing professional to quantify in some manner the extent of the learning disability so that an adequate, but not unlimited, amount of time is permitted.
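A minimal sketch of the timed-versus-untimed comparison proposed above follows. The data are simulated stand-ins (no real scores are involved), and the summary shown, a correlation between the two administrations plus the mean difference, is only one simple way such a study could be analyzed.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

# Simulated examinees who take the measure twice: once timed, once untimed.
random.seed(1)
true_ability = [random.gauss(500, 100) for _ in range(200)]
timed = [a + random.gauss(0, 40) for a in true_ability]
untimed = [a + random.gauss(10, 40) for a in true_ability]  # assumed small benefit of extra time

r = pearson_r(timed, untimed)
mean_diff = statistics.mean(untimed) - statistics.mean(timed)
print(f"timed vs. untimed correlation: {r:.2f}")
print(f"mean (untimed - timed) difference: {mean_diff:.1f} points")
# A high correlation and a trivial mean difference would support treating
# the instrument as a power test that could be given without time limits.
```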

A second timing matter related to test taking is that of rest pauses. Individuals taking accommodated test administrations for any of a number of reasons — pregnant women, for example — may require breaks in their test taking. Taking a test with an accommodation may become possible, but it may also be quite tiring, and rest pauses, perhaps extending the testing to a second day, may be necessary in some circumstances. With paper-and-pencil test forms, such rest pauses — or interludes to accommodate some medical or physical need — are potentially problematic, because test takers can look ahead on the test, and the dishonest test taker can use the breaks to determine answers to some components of the test in one way or another. With computer-adaptive testing (CAT), however, the test taker typically does not encounter test materials until he or she must respond. Thus, some of the problems associated with pauses are eliminated with CAT.

When test materials are adapted in some ways, certain kinds of test items or testing formats may not be robust to the changes. That is, they may not translate easily to the revised format. For example, there may not be any way to test validly whether a test taker with a total visual impairment can interpret a graph, or whether a test taker with an auditory deficit can take a test of oral communication skills. The adjustment of the content of an examination as part of an accommodation is perhaps the most troublesome potential change we will encounter. Can we, on a test of verbal ability, for example, legitimately replace reading questions with vocabulary questions? I propose the following way to determine whether this is possible: We must ask whether it would be possible for a test taker to receive a CAT similar to the one received under the accommodation. If the rubric that determines which test items are administered does not, for example, build into each assessment a strict content adherence to the test outline, then it may be an acceptable accommodation to provide a test that explicitly avoids certain content that cannot be tested using the accommodation. It would be useful, as part of the development of each test outline, to assess the robustness (Klimoski & Palmer, 1994) to accommodation of each component of the planned test.

(Klimoski & Palmer, 1994) to accommodation of each component of the planned test.

Psychometric Issues Involved in Test Accommodation
In other settings (e.g., Geisinger, 1994), I have addressed psychometric issues that need consideration in regard to the testing of individuals with disabilities. Most of these issues relate to the comparability of scores resulting from accommodated test administrations. Validation of such accommodations is clearly needed, yet the diversity of disabilities and the huge differences in the extent of disabilities, even within categories of disability, make such validation efforts almost impossible. The small number of individuals who have a given disability, receive the same accommodation, and enter programs similar enough that comparable criteria exist and can be accumulated is what makes validation so difficult. The efforts of Willingham, Bennett, and their associates (Willingham et al., 1988; Bennett et al., 1987, 1988, 1989) are clearly the best to date. We may need to pare down the number of accommodations that we provide for some tests. Such efforts will help us to know, at least initially, how well different accommodated test administrations work.

Perhaps we also need simpler validation designs that help us ascertain the comparability and usefulness of the accommodations. Assume for a moment an argument that I would make: most validation relates at some level to test scores or test results being compared with professional judgments. Now imagine the following design: Teachers who know test takers requiring accommodations well are asked to estimate how they would expect those students to perform on a nationally standardized admissions test. We request the same data from teachers of students not requiring such accommodations. We then compare the relationships between scores and teacher judgments across the two groups. Here we have a simple design that could be used to estimate the potential appropriateness of each of a number of accommodations.

I raise a final set of questions. Should there be separate norms tables for accommodated test administrations? Should scores that result from accommodated test administrations be included in norms groups? I would probably argue that we cannot interpret such scores with the precision that we desire unless we have such data, but there are certainly legitimate arguments against doing so. I place this question on the table for discussion.
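The simple validation design described in this section — comparing the relationship between teacher judgments and test scores for accommodated and non-accommodated groups — could be summarized with two correlations and a test of their difference. The sketch below is only an illustration of that idea, not part of the original paper; the data, group sizes, and function names are hypothetical.

```python
# Minimal sketch of the proposed validation design: does the teacher-judgment/
# test-score relationship look similar for accommodated and comparison groups?
# All data here are simulated, purely for illustration.
import numpy as np
from scipy import stats

def judgment_score_correlation(judgments, scores):
    """Correlation between teachers' predicted performance and observed test scores."""
    r, _ = stats.pearsonr(judgments, scores)
    return r

rng = np.random.default_rng(1)
judg_comparison = rng.normal(0, 1, 300)                               # comparison group
score_comparison = 0.6 * judg_comparison + rng.normal(0, 0.8, 300)
judg_accommodated = rng.normal(0, 1, 60)                              # accommodated group
score_accommodated = 0.6 * judg_accommodated + rng.normal(0, 0.8, 60)

r_comp = judgment_score_correlation(judg_comparison, score_comparison)
r_acc = judgment_score_correlation(judg_accommodated, score_accommodated)

# Fisher z test for the difference between two independent correlations
n_comp, n_acc = len(judg_comparison), len(judg_accommodated)
z = (np.arctanh(r_acc) - np.arctanh(r_comp)) / np.sqrt(1 / (n_acc - 3) + 1 / (n_comp - 3))
p = 2 * stats.norm.sf(abs(z))
print(f"r (comparison) = {r_comp:.2f}, r (accommodated) = {r_acc:.2f}, z = {z:.2f}, p = {p:.3f}")
```

Similar correlations in the two groups would be consistent with the accommodation preserving the meaning of the scores; a markedly weaker relationship in the accommodated group would flag the accommodation for closer study.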

Conclusion
In the coming years, the widespread implementation of computer-administered testing will present significant challenges. In fact, it already has. Accommodating the reasonable needs of test takers with disabilities will also present formidable tasks. However, it is critical that we remember the goals: access and inclusion. We must engage in research that examines certain testing adaptations, and we must know how comparable the results will be. We want these underrepresented members of our society to succeed, and we believe that their chances of success can be better realized with valid assessments. Clearly, testing accommodations are needed to make such assessments possible. I hope we will begin the process of determining what accommodations can be provided, and how appropriate and valid the scores are that emerge from these accommodations. I encourage ETS to engage in a needs assessment to determine which of the previously discussed adaptations may be possible and which could be delivered at various testing centers. This paper has tried to set the stage for a potentially fascinating play. It's time to lift the curtain.


References
Americans with Disabilities Act of 1990, 42 U.S.C. § 12101 et seq. (1990).

Bennett, R. E., Rock, D. A., & Jirele, T. (1987). GRE score level, test completion, and reliability for three handicapped groups. Journal of Special Education, 21, 9-21.

Bennett, R. E., Rock, D. A., & Kaplan, B. A. (1987). SAT differential item performance for nine handicapped groups. Journal of Educational Measurement, 24, 41-55.

Bennett, R. E., Rock, D. A., Kaplan, B. A., & Jirele, T. (1988). Psychometric characteristics. In W. W. Willingham, M. Ragosta, R. E. Bennett, H. Braun, D. A. Rock, & D. E. Powers, Testing handicapped people (pp. 84-97). Needham Heights, MA: Allyn and Bacon.

Bennett, R. E., Rock, D. A., & Novatkowski, I. (1989). Differential item functioning on the SAT-M Braille Edition. Journal of Educational Measurement, 26, 67-79.

Chmielewski, M. A. (1996). Leveling the playing field: Alternative testing arrangements. In L. L. Walling (Ed.), Hidden abilities in higher education: New college students with disabilities (Monograph Series No. 21, pp. 63-67). Columbia, SC: National Resource Center for the Freshman Year Experience & Students in Transition, University of South Carolina.

Geisinger, K. F. (1994). Psychometric issues in testing students with disabilities. Applied Measurement in Education, 7, 121-140.

Geisinger, K. F. (1998). Review of the Watson-Glaser Critical Thinking Appraisal, Form S. In J. C. Impara & B. S. Plake (Eds.), The thirteenth mental measurements yearbook (pp. 1121-1124). Lincoln, NE: Buros Institute, University of Nebraska.

Geisinger, K. F., & Carlson, J. F. (1995). Testing students with disabilities. ERIC Clearinghouse on Counseling and Student Services Digest (ERIC #EDO-CG-95-27). In W. D. Schafer (Guest Ed.), Assessment in counseling and therapy: An ERIC/CASS special digest collection. Washington, DC: ERIC Clearinghouse on Counseling and Student Services.

Green, B. F. (1990). System design and operations. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 23-40). Hillsdale, NJ: Erlbaum.

Klimoski, R., & Palmer, S. N. (1994). The ADA and the hiring process in organizations. In S. M. Bruyere & J. O'Keeffe (Eds.), Implications of the Americans with Disabilities Act for psychology (pp. 37-83). Washington, DC: American Psychological Association.

National Center for Education Statistics. (1995). Digest of education statistics 1995. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

Nester, M. A. (1994). Psychometric testing and reasonable accommodation for persons with disabilities. In S. M. Bruyere & J. O'Keeffe (Eds.), Implications of the Americans with Disabilities Act for psychology (pp. 25-36). Washington, DC: American Psychological Association.

Section 504 of the Rehabilitation Act of 1973, 29 U.S.C. § 701 et seq. (1973).

Sternberg, R. J. (1977). Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities. Hillsdale, NJ: Erlbaum.

The Psychological Corporation. (1994). Watson-Glaser Critical Thinking Appraisal, Form S: Manual. San Antonio, TX: Author.

U.S. Bureau of the Census. (1996). Statistical abstract of the United States: 1996 (116th ed.). Washington, DC: Author.

Vess, S. (1993). The Americans with Disabilities Act. NASP Communique, June 23, 22-24.

Willingham, W. W., Ragosta, M., Bennett, R. E., Braun, H., Rock, D. A., & Powers, D. E. (Eds.). (1988). Testing handicapped people. Needham Heights, MA: Allyn & Bacon.


Equity and Knowledge Integration

Marcia C. Linn
University of California at Berkeley

Important questions and issues are being addressed in moving to computer-delivered tests, which I think we all believe are really the wave of the future. Everybody, whether we like it or not, is using technology more and more. In listening to Brent Bridgeman and others, several issues occurred to me.

First is the issue of equating the computer test with the paper-and-pencil test. The equating process is going along well, and we can trust the experts to continue to ask the right questions and then explore them in a very effective way. Second is the issue of whether the effort that is currently under way will advance our Fairness, Access, Multiculturalism, and Equity (FAME) agenda. Indeed, it seems clear that an equating effort is not consistent with the equity agenda. If the new test is the same as the previous one, then equity concerns we believe are problematic in the current test will simply continue to exist in the new test.

What are those equity concerns? I would like to raise two questions and then suggest a framework that we might use to think about equity. In listening to these presentations, one gets a firm sense that speededness — the rate at which questions are presented and the kinds of responses that are required — is an equity issue. Changing the rate of responding required of individuals changes the pattern of responses of different ethnic, gender, and cultural groups. What is the mechanism behind these effects? Why does the rate of question response influence the performance of gender, ethnic, and cultural groups? More than knowing what the numbers are, understanding the mechanism is important.

Second, how do these findings connect to the well-established discrepancy between performance in grades and performance on tests? Numerous research studies show that women earn better grades than men. They get better college grades. They get better grades in chemistry. They get better grades in computer science. They get better grades in engineering. At the University of California at Berkeley, Leonard et al. (1996) found that women earn higher grades than men and that SAT scores underpredict college grades for students selected on the basis of scores.

This difference casts some doubt on the validity of the SAT, provided you take the faculty's perspective. At Berkeley, the faculty believe they can assess performance. The question is, again, what is the mechanism — how do we account for this difference?

In thinking about these issues it helps me to pay attention to how students learn. And I find it helpful to think about the process of learning, understanding, and responding to instruction in terms of knowledge integration. To illustrate, here at this conference we have come together to try to jointly integrate our ideas about fairness, equity, multiculturalism, and other related issues. We bring different models to the process. By models, I refer to ideas, conjectures, theories, patterns, and other perspectives. Indeed, for any of the issues at hand, we all have a whole repertoire of models. And the real challenge for us is to try to make sense of those models: to organize them, to integrate them, to disattenuate them.

We heard today at least five or six reasonable models for fairness in testing. Is it fair if the questions are the same? Is it fair if the constructs are the same? Is it fair if you have the same amount of time? Is it fair if you have equal opportunity to learn? Is multiple choice fairer than an essay? We all have multiple models, so anytime you ask us a question about fairness, all of them are potentially available to us and we have to make a decision concerning which of those models applies to the situation. Knowledge integration is a dynamic process, so in fact, we are constantly redefining which models apply under given circumstances. And this dynamic process, as well as the repertoire of models held by the learner, needs attention when we design tests, because anytime we give a question, we are asking individuals to sort through however many models they have, select one or several depending on the question, and apply their selection(s). If the question is multiple choice, the computer does not care about the critical competitors. Students have to come up with an answer. If it is an essay, it may be valuable to bring up several models and write down how you weigh those alternatives.

So students in testing and grading situations can be seen as engaging in knowledge integration and then giving us an indication of how they did so. Can we use this view to find a mechanism to account for the interesting models that we are observing in test performance?

Applying this knowledge integration perspective to the grades-and-scores discrepancy raises some interesting conjectures. Certainly for a research project, students who have a broad repertoire of models, as well as the ability to distinguish among them, may be most successful. Under circumstances where you quickly need to come up with a right answer, it may be better to have only the model that the test developer had in mind, not six or seven others to choose from. Indeed, I was taken by a recent news report about an SAT item, where it turned out that one of the respondents integrated understanding across a different range of models than the developers did. This respondent questioned the answer that had been selected by the SAT developers and approved by the reviewers. The respondent seemed to have generalized the question to all rational numbers, while the individuals reviewing the item, and perhaps the test developers, were thinking that the question applied to positive integers. So here is an example of how knowledge integration and the things that one brings to bear in a situation influence performance. It is a fairly constrained context, one where you might think everybody would really select the same set of models. So knowledge integration raises an important question concerning the mechanisms governing test performance.

We've heard a lot of results about the effect of context on performance. This knowledge integration and multiple models perspective is very useful in thinking about context. If people expand and build their knowledge integration around the issues that are of interest to them, then they may have a much richer set of information to bring to this area of interest. They may be able to evaluate critical competitors more effectively, and they may also have good intuitions about which models are likely to apply to a given situation. People who have thought less about something, have less interest in it, or have less experience with it, may have more trouble selecting the right model.

I’ve been told that my time is up, which is quite shocking. Remember what I said about speededness — we should reflect on that! At any rate, I think that a knowledge integration perspective could be extended to some of the other issues that have been raised and that concern us when we think about equity. For example, I think this is something that we might use to try to make sense of stereotype vulnerability, to explain why certain subgroups perform better in areas that have closed systems and why others perform better in areas that are more multiply contextualized, and to distinguish essay and multiple choice test performance.

Responses to Informal Questions
Test and grade performance

Can a knowledge integration perspective help us understand gender differences in the performance of males and females on standardized multiple-choice tests such as the SAT and in high school and college courses where grades are awarded? Males outperform females reliably and consistently on the SAT; females outperform males reliably and consistently on grades in high school and college mathematics and science courses. What is the mechanism governing these differences, and does knowledge integration as a model help us to interpret them? In particular, do tests such as the SAT-Mathematics elicit learning models for some students that are different from those elicited for other students? Could we design instruction such that more similar learning models were elicited for all individuals taking the test, or design a test more closely matched to learning models that are equitably distributed? Similarly, do course contexts elicit a different set of links and connections among ideas, resulting in differential performance for individuals from specific cultural groups, and can we alter learning experiences such that all students have an appropriate repertoire of learning models and the propensity to select them in the course learning context?

Examining this conundrum offers some insight into design for equity. Given that males outperform females in testing situations and females outperform males in grading situations, two distinct possibilities exist.


One possibility is that the repertoire of learning models held by students from these groups differs, and one set of models is more relevant to test contexts while another is more relevant to course contexts. An alternative interpretation is that males and females have the same repertoire of models, but the testing context promotes selection of adaptive models by males and unadaptive models by females, while the grading context elicits the opposite effect.

Stereotype vulnerability and knowledge integration

Why would individuals select different models in learning contexts, and how might cultural group membership influence model selection? One mechanism to account for model selection is called stereotype vulnerability. Steele (1997) argues that under conditions of stereotype vulnerability, individuals from vulnerable groups perform less successfully than they would if stereotype vulnerability were not in operation. Individuals are vulnerable to the impact of stereotype when they perceive themselves as members of a cultural or social group likely to perform poorly on a given indicator. In contrast, stereotype-enhanced individuals expect to perform well on the same set of tasks. Steele has studied performance on the GRE® and other standardized tests under conditions of stereotype vulnerability and under conditions of stereotype neutrality, in which all respondents were told their subgroup would do well on the test. Steele found that stereotype-neutral groups were more successful than stereotype-labeled groups. This mechanism explained the performance of females in male-dominated fields and of African-Americans in Caucasian-dominated fields.

Combining this perspective on stereotype vulnerability with the perspective on knowledge integration proposed earlier suggests that individuals under stereotype-vulnerable conditions would make less sensible selections from their repertoire of models than might occur when they were not vulnerable to stereotype. In particular, under conditions where they expect failure, students might select less promising learning models than under conditions where they expect success. How might this work?

A major mechanism governing conditions where one expects failure yet hopes for success is heightened anxiety. Under conditions of heightened anxiety, respondents typically devote some of their reasoning capacity to the anxiety and have less capacity for selecting from their repertoire of models. What happens when individuals have less capacity for selecting from their repertoire of models? For example, do learners select more conservative or safer strategies under vulnerable conditions, or do they just make poor choices? Results from spatial reasoning studies (Linn & Petersen, 1985) indicate that those with heightened anxiety might check to be sure their selection procedures were appropriate for a given task. They might check these procedures against alternatives they have, and they might also seek additional indicators that confirm their procedures were effective.

How would this work on a high-stakes test? Scores on the SAT and other similar measures depend on rapid responding. Reflection on model selection and review of selection practices waste valuable testing time. Furthermore, safe practices might rely more on class-taught models than on shortcuts and hunches. Both of these practices would increase the time required to solve individual problems and therefore stand in the way of earning the highest possible scores. Research by Gallagher (1994), Fennema (Linn et al., in prep), and others suggests that stereotype-vulnerable individuals are more likely to rely on class-taught heuristics under testing conditions.

In course situations, social and context supports might improve model selection. Many additional cues are available with regard to model selection, including the context of the class. Instructors regularly report that stereotype-vulnerable groups of students demand additional assistance from the instructor and frequently seek reassurance that they are using appropriate learning models. Stereotype-vulnerable students may engage in adaptive practices of using social context cues to determine which learning models make the most sense for a given class. They may also consult peers and former participants in the class to gain further assurance that their approaches make sense. These practices increase the probability of course grade success.

Furthermore, a preference for safe, perhaps more time-consuming, learning models is less likely to be penalized in course contexts than in testing contexts.

Open and closed systems and knowledge integration

Stereotype vulnerability might influence performance in fields characterized by open or closed systems. Fields that can be represented as closed systems, like algebra, Mendelian genetics, mechanics, and geometry, advantage learners who refine and hierarchically organize their ideas. In contrast, fields like ecology, literature, and sociology frequently reward a complex web of ideas and a creative path through that web — an approach that is likely to be more consistent with the retention of multiple models. Entertaining multiple, linked, and socially identified ideas is compatible with fields like medicine, law, business, and customer-oriented aspects of engineering.

Another area where knowledge integration and stereotype vulnerability may interact concerns sustained and reflective work.

Individuals who review, revise, and reconsider information can better carry out more complex and linked projects. Such individuals may be more successful on essays, projects, and written assignments than on multiple-choice questions.

Conclusion
As we seek to make standardized tests as equitable as possible, it makes sense to reflect on mechanisms like knowledge integration that govern learning and problem solving. Current standardized college and graduate admissions tests underpredict grades for females under some circumstances. One conjecture is that knowledge-integration practices and stereotype vulnerability combine to produce this effect.
This material is based upon research supported by the National Science Foundation under grants MDR-8954753, MDR-9155744 and MDR-9453861. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation. Thanks to Dawn Davidson, Liana Seneriches, Erica Peck, and Mio Sekine for assistance with the production of this manuscript. Thanks also to the Knowledge Integration Environment and Computer as Learning Partner projects for helpful ideas about these issues.

References
Gallagher, A. M., & DeLisi, R. (1994). Gender differences in the Scholastic Aptitude Test — Mathematics problem solving among high-ability students. Journal of Educational Psychology, 86, 204-211.

Linn, M., et al. (in prep). Affirmative action in the 1990's: What works. Educational Researcher.

Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex differences in spatial ability: A meta-analysis. Child Development, 56, 1479-1498.

Spencer, S. J., Steele, C. M., & Quinn, D. (1997). Stereotype vulnerability and women's math performance. Manuscript submitted for publication.
