You are on page 1of 13

Assessment - Dynamic Assessment, National Assessment Of Educational Progress, Performance Assessment, Portfolio Assessment - CLASSROOM ASSESSMENT CLASSROOM ASSESSMENT

Classroom assessments are those developed or selected by teachers for use during their day-to-day instruction. They are different from the standardized tests that are conducted annually to gauge student achievement, and are most frequently used to serve formative purposes, that is, to help students learn. However, classroom assessments also can be used summatively to determine a student's report card grade. Standardized tests, on the other hand, tend to be considered summative assessments, as they are used to judge student progress over an extended period of time. As the research summarized below reveals, assessment used during instruction can have a profound impact on student achievement. But to do so, the assessments must provide accurate information and they must be used in appropriate ways. Research on Impact In 1984, Benjamin Bloom published a summary of research on the impact of mastery learning models, comparing standard whole-class instruction (the control condition) with two experimental interventionsa mastery learning environment (where students aspire to achieving specific learning standards) and one-on-one tutoring of individual students. One hallmark of both experimental conditions was extensive use of formative classroom assessment during the learning process. Analysis of summative results revealed unprecedented gains in achievement for students in the experimental treatments when compared to the control groups. To be sure, the entire effect cannot be attributed to the effective use of classroom assessment. But, according to Bloom, a major portion can. Based on his 1988 compilation of available research, Terry Crooks concluded that classroom assessment can have a major impact on student learning when it: Places great emphasis on understanding, not just recognition or recall of knowledge; as well as on the ability to transfer learning to new situations and other patterns of reasoning Is used formatively to help students learn, and not just summatively for the assignment of a grade Yields feedback that helps students see their growth or progress while they are learning, thereby maintaining the value of the feedback for students Relies on student interaction in ways that enhance the development of self-evaluation skills Reflects carefully articulated achievement expectations that are set high, but attainable, so as to maximize students' confidence that they can succeed if they try and to prevent them from giving up in hopelessness Consolidates learning by providing regular opportunities for practice with descriptive, not judgmental, feedback Relies on a broad range of modes of assessment aligned appropriately with the diversity of achievement expectations valued in most classrooms

Covers all valued achievement expectations and does not reduce the classroom to focus only on that which is easily assessed

A decade later, Paul Black and Dylan Wiliam examined the measurement research literature worldwide in search of answers to three questions: (1) Is there evidence that improving the quality and effectiveness of use of formative (classroom) assessments raises student achievement as reflected in summative assessments? (2) Is there research evidence that formative assessments are in need of improvement?(3) Is there evidence about the kinds of improvements that are most likely to enhance student achievement? They uncovered forty articles that addressed the first question with sufficiently rigorous research designs to permit an estimation of the effects of improved classroom assessment on subsequent standardized test scores. They also uncovered profoundly large effects, including score gains that, if realized in the international math and science tests of the 1990s, would have raised the United States and England from the middle of the pack in the rank order of forty-two participating nations to the top five. Black and Wiliam go on to reveal that "improved formative assessment helps low achievers more than other students, and so reduces the range of achievement while raising achievement overall"(p. 141). They contend that this result has direct implications for districts having difficulty reducing achievement gaps between minorities and other students. The answer to their second question is equally definitive. Citing a litany of research similar to that referenced above, they describe the almost complete international neglect in assessment training for teachers. Their answer to the third question, asking what specific improvements in classroom assessment are likely to have the greatest impact, is the most interesting of all. They describe the positive effects on student learning of (a) increasing the accuracy of classroom assessments, (b) providing students with frequent informative feedback, rather than infrequent judgmental feedback, and (c) involving students deeply in the classroom assessment, record keeping, and communication processes. They conclude that "self-assessment by pupils, therefore, far from being a luxury, is in fact an essential component of formative assessment. When anyone is trying to learn, feedback about the effort has three elements: redefinition of the desired goal, evidence about present position, and some understanding of a way to close the gap between the two. All three must be understood to some degree by anyone before he or she can take action to improve learning"(p. 143). Standards of Quality To have such positive effects, classroom assessments must be carefully developed to yield dependable evidence of student achievement. If they meet the five standards of quality described below, they will, in all probability, produce accurate results. These standards can take the form of the five questions that the developer can ask about the assessment: (1) Am I clear about what I want to assess?(2) Do I know why I am assessing? (3) Am I sure about how to gather the evidence that I need? (4) Have I gathered enough evidence? (5) Have I eliminated all relevant sources of bias in results? Answers to these questions help judge the quality of classroom assessments. Each is considered in greater detail below. Standard 1. In any classroom assessment context, one must begin the assessment development process by defining the precise vision of what it means to succeed. Proper assessment methods can be selected only when one knows what kind of achievement needs to be assessed. Are students expected to master subjectmatter contentmeaning to know and understand? If so, does this mean they must know it outright, or does it mean they must know where and how to find it using reference sources? Are they expected to use their knowledge to reason and solve problems? Should they be able to demonstrate mastery of specific performance skills, where it's the doing that is important, or to use their knowledge, reasoning, and skills to create products that meet standards of quality?

Because there is no single assessment method capable of assessing all these various forms of achievement, one cannot select a proper method without a sharp focus on which of these expectations is to be assessed. The main quality-control challenge is to be sure the target is clear before one begins to devise assessment tasks and scoring procedures to measure it. Standard 2. The second quality standard is to build each assessment in light of specific information about its intended users. It must be clear what purposes a particular assessment will serve. One cannot design sound assessments without asking who will use the results, and how they will use them. To provide quality information that will meet people's needs, one must analyze their needs. For instance, if students are to use assessment results to make important decisions about their own learning, it is important to conduct the assessment and provide the results in a manner that will meet their needs, which might be distinctly different from the information needs of a teacher, parent, or principal. Thus, the developer of any assessment should be able to provide evidence of having investigated the needs of the intended user of that assessment, and of having conducted that assessment in a manner consistent with that purpose. Otherwise the assessment is without purpose. The quality-control challenge is to develop and administer an assessment only after it has been determined precisely who will use its results, and how they will use them. Within this standard of quality, the impact research cited above suggests that special emphasis be given to one particular assessment user, the student. While there has been a tendency to think of the student as the subject (or victim) of the assessment, the fact is that the decisions students make that are based on teacher assessments of their success drive their ultimate success in school. Thus, it is essential that they remain in touch with and feel in control of their own improvement over time. Standard 3. Since there are several different kinds of achievement to assess, and since no single assessment method can reflect them all, educators must rely on a variety of methods. The options available to the classroom teacher include selected response (multiple choice, true/false, matching, and fill-in), essays, performance assessments (based on observation and judgment), and direct personal communication with the student. The assessment task is to match a method with an intended target, as depicted TABLE 1 in Table 1. The quality-control challenge is to be sure that everyone concerned with quality assessment knows and understands how the various pieces of this puzzle fit together. Standard 4. All assessments rely on a relatively small number of exercises to permit the user to draw inferences about a student's mastery of larger domains of achievement. A sound assessment offers a representative sample of all those possibilities that is large enough to yield dependable inferences about how the respondent would perform if given all possible exercises. Each assessment context places its own special constraints on sampling procedures, and the quality-control challenge is to know how to adjust the sampling strategies to produce results of maximum quality at minimum cost in time and effort. Standard 5. Even if one devises clear achievement targets, transforms them into proper assessment methods, and samples student performance appropriately, there are still factors that can cause a student's score on a test to misrepresent his or her real achievement. Problems can arise from the test, from the student, or from the environment where the test is administered. For example, tests can consist of poorly worded questions; they can place reading or writing demands on respondents that are confounded with mastery of the material being tested; or they can have more than one correct response, be incorrectly scored, or contain racial or ethnic bias. The student can experience extreme evaluation anxiety or interpret test items differently from the author's intent, and students may cheat, guess, or lack motivation. In addition, the assessment environment could be uncomfortable, poorly lighted, noisy, or

otherwise distracting. Any of these factors could give rise to inaccurate assessment results. Part of the qualitycontrol challenge is to be aware of the potential sources of bias and to know how to devise assessments, prepare students, and plan assessment environments to deflect these problems before they ever have an impact on results.


PSYCHOMETRIC AND STATISTICAL The place of psychometric and statistical tools in assessment must be understood in terms of their use within a process of evidence gathering and interpretation. To see this, consider the assessment triangle featured in a recent National Research Council Report and shown in Figure 1. The central problem in assessment is making inferences about cognition from limited observations. Psychometric and statistical tools are located at the interpretation vertex of the triangle, where their role is to negotiate between the other two vertices. To appreciate the way these tools work, one must first understand the other two vertices. Hence, each vertex is described below. Robert Mislevy and Mark Wilson describe similar and more elaborated approaches. The triangle should be seen as a process that is repeated multiple times in both assessment development and in the application of assessment in education. The Assessment Triangle Ideally, all assessment design should begin with a theory of student cognition, or set of beliefs about how students represent knowledge and develop competence in a particular area. This theory should be the basis for a construct: the cognitive characteristic to be measured. Constructs are based on what the "assessment designers" know from foundational research in psychology and instruction and on experts' judgments about how students develop in a particular area. The construct distinguishes expert performance from other levels of performance in the area and considers those levels from the perspective of a developmental process. Furthermore, educational outcomes are not as straightforward as measuring FIGURE 1 height or weightthe attributes to be measured are often mental activities that are not directly observable. Having delineated the nature of the construct, one then has to determine what kinds of observations of behavior, products, and actions can provide evidence for making inferences concerning the construct, at the same time avoiding data that hold little value as evidence for the construct. In the classroom context, observations of learning activity are the relevant things that learners say and do (such as their words, actions, gestures, products, and performances) that can be observed and recorded. Teachers make assessments of student learning based on a wide range of student activity, ranging from observation and discussion in the classroom, written work done at home or in class, quizzes, final exams, and so forth. In large-scale assessment, standardized assessment tasks are designed to elicit evidence of student learning. These may range across a similar variation of types of performances as classroom-based assessments, but are often drawn from a much narrower range. At the interpretation vertex is located the chain of reasoning from the observations to the construct. In classroom assessment, the teacher usually interprets student activity using an intuitive or qualitative model of reasoning, comparing what she sees with what she would expect competent performance to look like. In largescale assessment, given a mass of complex data with little background information about the students' ongoing learning activities to aid in interpretation, the interpretation model is usually a psychometric or

statistical model, which amounts to a characterization or summarization of patterns that one would expect to see in the data at different levels of competence. Standard Psychometric Models Probably the most common form that the cognition vertex takes in educational assessment is that of a continuous latent variablecognition is seen as being describable as a single progression from less to more or lower to higher. This is the basis for all three of the dominant current approaches to educational measurement: Classical test theory (CTT), which was thoroughly summarized by Frederick Lord and Melvin Novick; generalizability theory (GT), which was surveyed by Robert Brennan, and item response theory (IRT), which was surveyed in the volume by Wim van der Linden and Ronald Hambleton. In CTT, the continuous variable is termed the true score, and is thought of as the long-run mean of the observed total score. In GT, the effects of various measurement design factors on this variable are analyzed using an approach that analyzes the variance. In IRT, the focus shifts to modeling individual item responses: The probability of an item response is seen as a function of the underlying latent variable representing student competence (often denoted by [.theta]), and parameters representing the item and the measurement context. The most fundamental item parameter is the item difficulty, but others are also used where they are thought to be useful to represent the characteristics of the assessment situation. The difficulty parameter can usually be seen as relating the item responses to the construct (i.e, the [.theta] variable). Other parameters may relate to characteristics of the observations. For example, differing item slopes can be interpreted as indicating when item responses are also possibly dependent on other (unmodeled) dimensions besides the [.theta] variable (these are sometimes called discrimination parameters, although that can lead to confusion with the classical discrimination index). Parameters for lower-item and upper-item asymptotes can be interpreted as indicating where item responses have "floor" and "ceiling" rates (where the lower asymptote is often called a guessing parameter). There is a debate in the literature about whether to include parameters beyond the basic difficulty parameters in the response model in order to make the model flexible, or whether to see them as indicating deviations from a regularity condition (specific objectivity) that sees them as threatening the interpretability of the results. A second possible form for the construct is as a set of discrete classes, ordered or unordered depending on the theory. The equivalent psychometric models are termed latent class (LC) models, because they attempt to classify the students on the basis of their responses as in the work of Edward Haertel. In these models, the form of cognition, such as problem-solving strategy, is thought of as being only pos sible within certain classes. An example might be strategy usage, where a latent class approach would be seen as useful when students could be adequately described using only a certain number of different classes. These classes could be ordered by some criterion, say, cognitive sophistication, or they could have more complex relations to one another. There are other complexities of the assessment context that can be added to these models. First, the construct can be seen as being composed of more than a single attribute. In the continuous construct approach, this possibility is generally termed the factor analysis model when a classical approach is taken, and a multidimensional item response model (MIRM) when starting from the continuum approach as in the work by Mark Reckase and Raymond Adams and his colleagues. In contrast to the account above, where parameters were added to the models of the item to make it more complex, here the model of the student is what is being enhanced. These models allow one to incorporate evidence about different constructs into the assessment situation. There are other ways that complexities of the as sessment situation can be built into the measurement models. For example, authors such as Susan Embretson, Bengt Muthen, and Khoo Siek-Toon have shown how repeated assessments over time can be seen as indicators of a new construct: a construct related to patterns of change in the original con struct. In another type of example, authors such as Gerhard Fischer have added

linear effect parameters, similar to those available in GT, to model observa tional effects such as rater characteristics and item design factors, and also to model complexities of the construct (e.g., components of the construct that in fluence item difficulty, such as classes of cognitive strategies). Incorporating Cognitive Elements in Standard Psychometric Models An approach called developmental assessment has been developed, by Geoffrey Masters and colleagues, building on the seminal work of Benjamin Wright and using the Rasch model, to enhance the interpretability of measures by displaying the variable graphically as a progress map or construct map. Mark Wilson and Kathryn Sloane have discussed an example shown in Figure 2, where the levels, called Criterion Zones in the figure, are defined in Figure 3. The idea is that many important features of assessments can be displayed in a technically accurate way by using the strength of the idea of a map to convey complicated measurement techniques and ideas. For example, one central idea is that the meaning of the construct can be conveyed by examining the order of item locations along the map, and the same technique has been used by Wilson as the basis for gathering validity evidence. In Figure 2, one can see how an individual student's assessments over time can be displayed in a meaningful way in terms of the Criterion Zones. The same approach can be adapted for reporting group results of assessments, and even large national surveys (e.g., that of Australia's Department of Employment, Education and Youth Affairs in 1997). One can also examine the patterns of results of individual students to help diagnose individual differences. An example from the Grade Map software developed by Wilson and his colleagues is shown in Figure 4. Here, an overall index of "fit" was used to flag the responses of subject Amy Brown that needed extra attention. In the figure, the expected result for each item for Amy Brown is shown using the gray band across the middle, while the observed results are shown by the black shading. Clearly Amy has responded in surprising ways to several items, and a content analysis of those items may prove interesting. An analogous technique has been developed by Kikumi Tatsuoka (1990, 1995) with the advantage of focusing attention on specific cognitive diagnoses. Adding Cognitive Structure to Psychometric Models One can go a step further than the previous strategy of incorporating interpretative techniques into the assessment reportingelements of the construct can be directly represented as parameters of the psychometric model. From a statistical point of view, this would most often be the preferred tactic, but in FIGURE 2 practice, it may add to the complexity of interpretation, so the merits should be considered for each application. A relatively straightforward example of this is the incorporation of differential item functioning (DIF) parameters into the psychometric model. Such parameters adjust other parameters (usually item difficulty parameters) for different effects between (known) groups of respondents. Most often it has been seen as an item flaw, needing to be corrected. But in this context, such parameters could be used to allow for different construct effects, such as using different solution strategies or linguistic differences. Another general strategy is the delineation of hierarchical classes of observation that group together the original observations to make them more interpretable. This can be seen as acting on either the student or the item aspects of the psychometric model. This could be seen as a way to split up the students into latent groups for diagnostic purposes as in the work of Edward Haertel and David Wiley. Or it could be seen as a way to split up the items into classes, allowing interpretation of student results at the level of, say, classes of skills rather than at the individual item level, as in the work of Rianne Janssen and her colleagues. Wilson has combined the continuum and latent class approaches, thus allowing constructs that are partly continuous and partly discontinuous. For example, the Saltus Model is designed to incorporate stage-like developmental changes along with more standard incremental increases in skill, as illustrated in the work of Mislevy and Wilson.

Generalized Approaches to Psychometric Modeling of Cognitive Structures Several generalist approaches have been proposed. One is the Unified Model, developed by Louis de Bello and his colleagues, which is based on the assumption that task analyses can classify students' performances into distinct latent classes. A second general approach, Peter Pirolli and Mark Wilson's M2RCML, has been put forward and is based on a distinction between knowledge level learning, as manifested by variations in solution strategies, and symbol-level learning, as manifested by variations in the success of application of those strategies. In work by Karen Draney and her colleagues, this approach has been applied to data related to both learning on a Lisp tutor and a rule assessment analysis of reasoning involving the balance scale. A very general approach to modeling such structures called Bayes Nets has been developed by statisticians working in other fields. Two kinds of variables appear in a Bayes Net for educational assessment: those that concern aspects of students' knowledge and skill, and others that concern aspects of the things they say, do, or make. All the psycho-metric models discussed in this entry reflect this kind of reasoning, and all can be expressed as particular implementations of Bayes Nets. The models described above each evolved in their own special niches; researchers in each gain experience in use of the model, write computer programs, and develop a catalog of exemplars. Bayes Nets have been used as the statistical model underlying such complex assessment contexts as intelligent tutoring systems as in the example by Mislevy and Drew Gitomer. Appraisal of Psychometric Models and Future Directions The psychometric models discussed above provide explicit, formal rules for integrating the many pieces of information that may be relevant to specific inferences drawn from observation of assessment tasks. Certain kinds of assessment applications require the capabilities of formal statistical models for the interpretation element of the assessment triangle. The psychometric models available in the early twentyfirst century can support many of the kinds of inferences that curriculum theory and cognitive science suggest are important to pursue. In particular, it is possible to characterize students in terms of multiple aspects of proficiency, rather than a single score; chart students' progress over time, instead of simply measuring performance at a particular point in time; deal with multiple paths or alternative methods of valued performance; model, monitor, and improve judgments based on informed evaluations; and model performance not only at the level of students, but also at the levels of groups, classes, schools, and states. Unfortunately, many of the newer models and methods are not widely used because they are not easily understood or are not packaged in accessible ways for those without a strong technical background. Much hard work remains to focus psycho-metric model building on the critical features of models of cognition and learning and on observations that reveal meaningful cognitive processes in a particular domain. If anything, the task has become more difficult because an additional step is now requireddetermining simultaneously the inferences that must be drawn, the observations needed, the tasks that will provide them, and the statistical models that will express the necessary patterns most efficiently. Therefore, having a broad array of models available does not mean that the measurement model problem is solved. More work is needed on relating the characteristics of measurement models to the specifics of theoretical constructs and types of observations. The longstanding tradition of leaving scientists, educators, task designers, and psychometricians each to their own realms represents perhaps the most serious barrier to the necessary progress.

International Assessments - International Association For Educational Assessment, International Association For The Evaluation Of Educational Achievement, Iea And Oecd Studies Of Reading Literacy - OVERVIEW

International comparisons of student achievement involve assessing the knowledge of elementary and secondary school students in subjects such as mathematics, science, reading, civics, and technology. The comparisons use test items that have been standardized and agreed upon by participating countries. These complex studies have been carried out since 1959 to explicitly compare student performance among countries for students at a common age. To participate in such a comparative study, a country must demonstrate that it has had prior experience in conducting empirical studies of education. Comparing student achievement between countries has several goals. To policymakers, country-to-country comparisons of student performance help indicate whether their educational system is performing as well as it could. To a researcher of education issues, the studies provide a basis for hypothesizing whether some policies and practices in education are necessary or sufficient for high student performance (such as requiring all teachers to obtain college degrees in the subject area they teach). To teachers and school administrators, international studies provide examples of behavior that may be a source of new forms of practice and selfevaluation. Types of Study Results The results of a large international study in 1995 showed that eighth-grade teachers in the United States are often not involved in decisions about the content areas of their teaching, as teachers are in other nations. U.S. teachers work longer hours than those in most other countries, they do not have as much time during the day to prepare for classes, and their daily classroom teaching is disrupted more often by things such as announcements, band practice, and scheduling changes. Moreover, the organization of curriculum used by elementary and middle schools in the United States appears not to be focused on topics that will propel students toward a more advanced understanding of mathematics. Comparisons with other countries show that U.S. students are just as interested in science and mathematics as other students, they study as long, and they watch just as much television. Organizational History Education researchers and policymakers from twelve countries first established a plan for making large-scale cross-national comparisons between countries on student performance in 1958 at the UNESCO Institute for Education in Hamburg, Germany. The first successful large-scale quantitative international study in mathematics was conducted in 1965 by the International Association for the Evaluation of Educational Achievement (IEA) and included Australia, Belgium, England, Finland, France, Germany, Israel, Japan, Netherlands, Scotland, Sweden, and the United States. Since then, studies in fourteen or more countries have been conducted periodically in several subject areas of elementary and secondary education. Between 1965 and 2001 the IEA sponsored studies of mathematics in 1965, 1982, 1995, and 1999; science in 1970, 1986, 1995, and 1999; reading in 1970, 1991, and 2001; civics in 1970 and 1998; and technology in 1990 and 1999. The Educational Testing Service conducted an International Assessment for Education Progress in science and mathematics in 1990. The Adult Literacy and Lifeskills survey is a large-scale comparative survey designed to identify and measure prose literacy, numeracy, and analytical reasoning in the adult population (those between sixteen and sixty-five years of age). This survey was conducted in 1994 and 2001. Studies such as these require the development of a set of test items, which are translated into the languages of the participating countries. The translated items are checked for proper translation and they are pretested in each country to determine whether they have misunderstandings or errors that would make the items unsuitable for use in the final study (about three times as many items are written as are finally used). The participating countries collectively agree upon a framework to define critical aspects of the topic area. For example, an elementary mathematics test would include items in numbers, geometry, algebra, functions, analysis, and measurement, and would also have items that represented different aspects of student

performance, such as knowing the topic, using procedures, solving problems, reasoning, and communicating. However, no single assessment could cover comprehensively an entire topic for all countries. The tests are administered to a sample of students in 100 to 200 schools, which are selected to represent all students in the country. An international referee monitors the school selection process to insure that all countries follow correct sampling procedures. The test items are scored according to internationally agreedupon procedures and are analyzed at an international center to insure cross-national comparability. Countries that do not meet high standards of participation are not included in the comparisons. Problems of Comparability Some educators believe that learning is too elusive and culturally specific to be measured in a statistical survey. They believe that the outcomes of education are too diverse, indirect, and unpredictable to be measured in a single instrument. Others believe that comparisons are "odious" because practices that work in one culture may not be appropriate in another culture due to differences in social context and history. The first IEA study planners were not confident that cross-national comparisons would be valid. They were concerned that the curriculum of different countries would stress different aspects of mathematics, science, or reading, and that any test of student performance might not reflect what students had been taught. To recognize national differences in teaching, the first studies measured the degree to which topics that were emphasized in the school system were actually covered. Curriculum differences were categorized as intended, implemented, or attained curriculum in order to separate the policies of the school district from classroom presentations and actual student performance. The amount of coverage of a topic became an important explanatory variable for between-school and between-country differences in achievement. The analysis showed that students in every country cover the same topics, but that they were often covered in a different order, and with a different emphasis, thus showing that international comparisons of student achievement do reflect the same content areas as other countries and thus they do make sense. Education practices in the countries studied have been found to have more similarities than differences. The differences can be studied, however, and give important insights into which practices can be improved. International studies have helped policymakers understand that student performance is strongly determined by how schools articulate the content areas they are responsible for. For example, a study conducted in 1965 showed significant differences in how countries approached the teaching of mathematics. Subsequent studies showed which topics of mathematics each country considered important, at what age they were introduced, and how the topics were sequenced. These studies led educators to pay closer attention to the underlying curriculum and the training of teachers in the United States. They also led to the earliest efforts by the mathematics education professionals to develop a single set of standards for mathematics teaching. Studies of writing have had difficulty in achieving standards that permit comparison across countries. After several attempts to develop a standard set of principles for grading the writing of students across countries, the IEA gave up its efforts to evaluate writing across cultures. However, a study of reading achievement was successfully conducted in elementary and middle school grades in 1970, and studies are being conducted by the IEA and the Organisation for Economic Co-operation and Development (OECD). International studies have shown that U.S. elementary school students have a high performance level in reading compared with the rest of the participating countries, but only moderate performance at grade nine. These results indicate that U.S. students begin school with sufficient ability to read and interpret texts. Forms of Inquiry

Comparative studies of student achievement require carefully designed statistical surveys for the statistical measurement aspect of the comparison. The populations must be defined in a common way for each country, even though definitions of a grade might differ from country to country. For example, one way to insure comparability is to select a careful sample of all students who attend whatever grade is common for fourteenyear-old students. These surveys involve students taking a test for about an hour and filling in a background questionnaire of their attitudes toward school. Teachers are asked to complete questionnaires about the curriculum topics they cover and their own professional training. Since the 1990s studies have sometimes involved the use of videotape technology to collect information on teaching practices and student activities. For example, large national samples of mathematics classrooms were videotaped in 1995 in Japan, Germany, and the United States, and classrooms for other subjects were videotaped, in additional countries, in 2000. Videotape methods permit a more careful description of teaching practices than classroom surveys, and they provide a check on the validity of teachers' self-reporting of their practices. Detailed case studies of educational practices in several countries have also provided information about the social context in which students are taught. International Assessments in the Twentieth Century The first international studies were carried out by university research centers unaffiliated with government agencies. The results of those studies were published in academic journals, technical volumes, and academic books. During the 1980s these studies influenced policies in American education. Beginning in 1989 government agencies decided that they should have a larger role in organizing and supporting the studies and improving their quality. The National Center for Education Statistics (NCES), an agency of the U.S. Department of Education, and the National Science Foundation provided the leadership and funding support for creating international assessments. The U.S. National Academy of Sciences established an oversight committee called the Board on International Comparative Studies in Education to monitor the progress of these studies. By 1995 international comparative studies had become an accepted continuing aspect of describing the status of the educational outcomes and were being carried out regularly by the NCES. Many countries originally participated in these studies in order to conduct an analysis of a single subject area in a single year. They have since shifted toward a more strategic plan to develop consistently measured trends in educational achievement with international benchmarks. International Assessments in the Twenty-First Century The complexity of conducting standardized comparisons of student achievement in many countries will always challenge researchers, yet they have become institutionalized in many countries. The OECD, which is based in Paris, has gained support from at least twenty-five governments for a continuing series of international comparisons of reading, mathematics, and science. These comparisons began in 2000. Also in 2000 UNESCO established the International Institute of Statistics to further institutionalize a process for improving the use of comparative statistics for policymaking. Studies on the use of technology in schools are being developed to provide new information on forms of instructional technology that are becoming widespread in schools. Schools all over the world have introduced the use of computers and other forms of technology to classroom instruction, and studies seek to determine how educational practices are being altered by these systems.

Testing - Standardized Tests And High-stakes Assessment, Statewide Testing Programs, Test Preparation Programs, Impact Of -


STANDARDIZED TESTS AND EDUCATIONAL POLICY The term standardized testing was used to refer to a certain type of multiple-choice or true/false test that could be machine-scored and was therefore thought to be "objective." This type of standardization is no longer considered capable of capturing the full range of skills candidates may possess. In the early twenty-first century it is more useful to speak of standards-based or standards-linked assessment, which seeks to determine to what extent a candidate meets certain specified expectations or standards. The format of the test or examination is less important than how well it elicits from the candidate the kind of performance that can give us that information. The use of standardized testing for admission to higher education is increasing, but is by no means universal. In many countries, school-leaving (exit) examinations are nonstandardized affairs conducted by schools, with or without government guidelines, while university entrance exams are set by each university or each university department, often without any attempt at standardization across or within universities. The functions of certification (completion of secondary school) and selection (for higher or further education) are separate and frequently noncomparable across institutions or over time. In most countries, however, a school-leaving certificate is a necessary but not a sufficient condition for university entrance. Certification In the United States, states began using high school exit examinations in the late 1970s to ensure that students met minimum state requirements for graduation. In 2001 all states were at some stage of implementing a graduation exam. These are no longer the "minimum competency tests" of the 1970s and 1980s; they are based on curriculum and performance standards developed in all fifty states. All students should be able to demonstrate that they have reached performance standards before they leave secondary school. Certification examinations based on state standards have long been common in European countries. They may be set centrally by the state and conducted and scored by schools (France); set by the state and conducted and scored by an external or semi-independent agency (the Netherlands, the United Kingdom, Romania, Slovenia); or set, conducted, and scored entirely within the schools themselves (Russian Federation), often in accordance with government guidelines but with no attempt at standardization or comparability. Since the objective is to certify a specified level of learning achieved, these exit examinations are strongly curriculum-based, essentiallycriterion-referenced, and ideally all candidates should pass. They are thus typically medium-or lowstakes, and failing students have several opportunities to retake the examination. Sometimes weight is given to a student's in-school performance as well as to exam resultsin the Netherlands, for example, the weightings are 50-50, giving students scope to show a range of skills not easily measured by examination. In practice, however, when constructing criteria for a criterion-referenced test, norm-referencing is unavoidable. Hidden behind each criterion is norm-referenced data: assumptions about how the average child in that particular age group can be expected to perform. Pure criterion-referenced assessment is rare, and it would be better to think of assessment as being a hybrid of norm-and criterion-referencing. The same is true of setting standards, especially if they have to be reachable by students of varying ability: one has to know something about the norm before one can set a meaningful standard. Selection By contrast, university entrance examinations aim to select some candidates rather than others and are therefore norm-referenced: the objective is not to determine whether all have reached a set standard, but to select "the best." In most cases, higher education expectations are insufficiently linked to K 12 standards.

Entrance exams are typically academic and high-stakes, and opportunities to retake them are limited. Where entrance exams are set by individual university departments rather than by an entire university or group of universities, accountability for selection is limited, and failing students have little or no recourse. University departments are unwilling to relinquish what they see as their autonomy in selecting entrants; moreover, the lack of accountability and the often lucrative system of private tutoring for entrance exams are barriers to a more transparent and equitable process of university selection. In the United States the noncompulsory SATs administered by the Educational Testing Service (ETS) are most familiar. SAT I consists of quantitative and verbal reasoning tests, and for a number of years ETS insisted that these were curriculum-free and could not be studied for. Indeed they used to be called "Scholastic AptitudeTests," because they were said to measure candidates' aptitude for higher level studies. The emphasis on predictive validity has become less; test formats include a wider range of question types aimed at eliciting more informative student responses; and the link with curriculum and standards is reflected in SAT II, a new subject test in high school subjects, for example English and biology. Fewer U.S. universities and colleges require SAT scores as part of their admission procedure in the early twenty-first century, though many still do. Certification Combined with Selection A number of countries (e.g., the United Kingdom, the Netherlands, Slovenia, and Lithuania) combine schoolleaving examinations with university entrance examinations. Candidates typically take a national, curriculumbased, high school graduation exam in a range of subjects; the exams are set and marked (scored) by or under the control of a government department or a professional agency external to the schools; and candidates offer their results to universities as the main or sole basis for selection. Students take only one set of examinations, but the question papers and scoring methods are based on known standards and are nationally comparable, so that an "A" gained by a student in one part of the country is comparable to an "A" gained elsewhere. Universities are still entitled to set their own entrance requirements such as requiring high grades in biology, chemistry, and physics for students who wish to study medicine, or accepting lower grades in less popular disciplines or for admittance to less prestigious institutions. Trends in Educational Policy: National Standards and Competence Two main trends are evident worldwide. The first is a move towards examinations linked to explicit national (or state) standards, often tacitly aligned with international expectations, such as Organisation for Economic Co-Operation and Development (OECD) indicators or the results of multinational assessments such as the Third Mathematics and Science Study (TIMSS) or similar studies in reading literacy and civics. The second trend is towards a more competence-based approach to education in general, and to assessment in particular: less emphasis on what candidates can remember and more on what they understand and can do. Standards. The term standards here refers to official, written guidelines that define what a country or state expects its state school students to know and be able to do as a result of their schooling. In the United States all fifty states now have student testing programs, although the details vary widely. Few states, however, have testing programs that explicitly measure student achievement against state standards, despite claims that they do. (Some states ask external assessors to evaluate the alignment of tests to standards.) Standards are also used for school and teacher accountability purposes; about half the states rate schools primarily on the basis of student test scores or test score gains over time, and decisions to finance, close, take over, or otherwise overhaul chronically low-performing schools can be linked to student results. Much debate centers on whether tests designed for one purpose (measuring student learning) can fairly be used to judge teachers, schools, or education systems.

The same debate is heard in England and Wales, where student performance on national curriculum "key stage" testing at ages seven, eleven, fourteen, and sixteen has led to the publication of "league tables" listing schools in order of their students' performance. A painstaking attempt to arrive at a workable "value-added" formula that would take account of a number of social and educational variables ended in failure. The concept of "value added" involves linking a baseline assessment to subsequent performance: the term refers to relative progress of pupils or how well pupils perform compared to other pupils with similar starting points and background variables. A formula developed to measure these complex relationships was scientifically acceptable but judged too laborious for use by schools. Nevertheless, league tables are popular with parents and the media and remain a feature of standards-based testing in England and Wales. Most countries in Central and Eastern Europe are likewise engaged in formulating educational standards, but standards still tend to be expressed in terms of content covered and hours on the timetable ("seat time") for each subject rather than student outcomes. When outcomes are mentioned, it is often in unmeasurable terms: "*Candidates+ must be familiar with the essence, purpose, and meaning of human life *and+ the correlation between truth and error" (State Committee for Higher Education, Russian Federation, p. 35). Competence. The shift from content and "seat-time" standards to specifying desired student achievement expressed in operational terms ("The student will be able to ") is reflected in new types of performance based assessment where students show a range of skills as well as knowledge. Portfolios or coursework may be assessed as well as written tests. It has been argued that deconstructing achievement into a list of specified behaviors that can be measured misses the point: that learning is a subtle process that escapes formulas people seek to impose on it. Nevertheless, the realization that it is necessary to focus on outcomes (and not only on input and process) of education is an important step forward. Apart from these conceptual shifts, many countries are also engaged in practical examinations reform. Seven common policy issues are: (1) changing concepts and techniques of testing (e.g., computer-based interactive testing on demand); (2) shift to standards-and competence-based tests; (3) changed test formats and question types (e.g., essays rather than multiple-choice); (4) more inclusive target levels of tests; (5) standardization of tests; (6) independent, external administration of tests; and (7) convergence of high school exit exams and university entrance. Achieving Policy Goals In terms of monitoring the achievement of education policy goals, standards-linked diagnostic and formative (national, whole-population, or sample-based) assessments at set points during a student's schooling are clearly more useful than scores on high-stakes summative examinations at the end. Trends in annual exam results can still be informative to policy makers, but they come too late for students themselves to improve performance. Thus the key-stage approach used in the United Kingdom provides better data for evidencebased policy making and more helpful information to parents than, for example, the simple numerical scores on SAT tests, which in any case are not systematically fed back to schools or education authorities. However, the U.K. approach is expensive and labor-intensive. The best compromise might be sample-based periodic national assessments of a small number of key subjects (for policy purposes), plus a summative, curriculum-and standards-linked examination at the end of a major school cycle (for certification and selection).