
"Our most affluent kids are getting their lunches eaten by kids in other countries.

The system we have has not served our children well. There is no point pouring more federal money into very broken bottles." -Amy Wilkins, Vice President of the Education Trust (Bracey, 2007) Our goal is to move beyond comparisons among the states and, more importantly begin comparing ourselves to top-performing countries such as Finland, China and others. -Dane Linn, Director of Education, National Governors Association (Education Week, 2008) (Presumably the reference was to Chinese Taipei. China has not participated in international assessments.) International Testing Comparisons and Large Scale Assessments 1. International testing programs involve a mix of standardized tests and background surveys of the test taking populations. Like other forms of technically competent assessment, international testing programs provide much useful information. Large scale assessment programs are highly complex and require substantial expertise to understand their proper uses and limits (Goldstein, 2004b). However, it appears that many end users do not understand the proper uses of the data produced. Few appreciate that the design and execution of these testing programs limit the range of questions for which they can provide valid evidence or answers. End users often display little knowledge of critical defining features of the underlying data (For examples, see Bracey, 1999, 2000, 2003, 2005, 2007). Perhaps it is idealistic to hope that all end users would first acquire a basic literacy in testing and research methods before attempting to use the data to inform policymaking. Unfortunately, when the politically charged world of educational policymaking merges with this age of instant punditry, methodologically sound analysis becomes less likely. It is no small irony that the educational standards and assessment movement itself is open to such criticism. The widespread tendency to summarize international test results in rankings, simple averages, or verbal descriptors such as percent achieving proficiency frequently leads to simplistic and misleading conclusions about complex measurement questions (Goldstein 1995, 2004a). Describing international differences in terms of descriptive statistics without reference to curriculum exposure, prior achievement, teachers inputs, or cultural expectations, to name a few, has little real use, other than as propaganda (Goldstein, 1995, p.19). In addition, the reporting of these measures often conveys a spurious precision, exaggerating small or insignificant differences. These measurement and reporting problems are multiplied as the scale of the assessments grows and covers more units of increasing diversity (e.g., countries, education systems, cultures, languages, socio-demographic backgrounds of test takers, curricula, funding, student years of schooling and ages.) Thus, one reasonably can say that the students tested in Country A averaged having higher scores than the tested
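To make the Country A/Country B point concrete, the sketch below (in Python, with invented numbers) shows how reported country means are typically compared. Operational reports derive standard errors from jackknife or balanced-repeated-replication procedures; a plain two-sample z-test stands in for that machinery here. The point is that a difference can clear the .05 significance bar while remaining trivially small against a student-level standard deviation of roughly 100 scale points.

```python
# Illustrative only: invented means and standard errors for two countries.
import math

mean_a, se_a = 504.0, 2.1   # hypothetical Country A mean and standard error
mean_b, se_b = 497.0, 2.4   # hypothetical Country B mean and standard error

diff = mean_a - mean_b
se_diff = math.sqrt(se_a**2 + se_b**2)   # SE of a difference of independent means
z = diff / se_diff

print(f"difference = {diff:.1f} points, z = {z:.2f}")
# z is about 2.19, beyond the 1.96 cutoff, so the 7-point gap is
# "statistically significant" -- yet it is only ~0.07 student-level
# standard deviations, a substantively small difference.
```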

However, without detailed information about the technical qualities of the tests, sampling procedures, response rates, handling of missing data, curriculum and other school inputs, longitudinal test results for the same sample of students, and the demographic background characteristics of the test takers, it would be inappropriate to draw conclusions about the educational performance of any units (classrooms, schools, states, provinces, countries) above the student level (Goldstein, 2004c). It may even be inappropriate to draw broad conclusions about the knowledge and skills of students beyond the actual test-taking samples. By design, the three major international testing programs collect and report vast quantities of technical information. Unfortunately, that information receives relatively little popular exposure and is frequently ignored by popular commentators, reform advocates, and policymakers. The truly useful information from these studies largely remains buried in voluminous, dense technical reports, or in the secondary research of academics who generally publish their findings in obscure scholarly journals. There's little evidence that this work informs the popular debate.

2. The tests:

a. PIRLS (Progress in International Reading Literacy Study) is conducted by the International Association for the Evaluation of Educational Achievement (IEA). It is a test of reading comprehension administered to 4th graders every 5 years. Forty-five jurisdictions participated in 2006. In the U.S., 183 schools and 5,190 students participated (Baer et al., 2007).

b. TIMSS (Trends in International Mathematics and Science Study) tests 4th and 8th graders every 4 years. It also is conducted by the IEA. Sixty countries participated in 2007; those results will be reported in December of 2008. In 2003, 26 countries participated at the 4th grade level and 48 at the 8th grade level. In the U.S., 212 schools and 9,829 4th grade students participated; at the 8th grade level, 211 schools and 8,912 students participated. The IEA is a nongovernmental organization that dominates international comparative testing. Its member institutions (generally one per national educational system) send representatives to an annual general assembly, the decision-making body for the IEA. The IEA also has a small permanent secretariat (in The Hague) and a technical advisory committee (Goldstein, 1995). The IEA's testing programs are designed to measure attainment of a common curriculum agreed upon by the member institutions. (That curriculum does not reflect the full curriculum of any jurisdiction's educational system.)

c. PISA (Program for International Student Assessment) tests 15-year-olds in Math, Science, and Reading literacy. PISA is conducted by the Organization for Economic Cooperation and Development (OECD). PISA is designed to complement PIRLS and TIMSS. Unlike those two programs, PISA is designed to test "the application of knowledge in reading, mathematics, and science to problems with a real-life context" (Baldi et al., 2007). All subjects are assessed in three-year cycles that began in 2000, with one of the three subjects assessed in depth each cycle. The in-depth subject is the only one in which each tested student answers all questions; in 2006, science literacy was that subject. The questions in the other subjects are distributed across test booklets, so not every student answered both reading and math items. Fifty-seven jurisdictions participated in 2006, and 67 have agreed to participate in 2009. The 2006 U.S. sample, drawn to be representative of the U.S. student population, included a total of 166 public and private schools and 5,611 students. (OECD members include: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Japan, Korea, Luxembourg, Mexico, the Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Spain, Sweden, Switzerland, Turkey, United Kingdom, and the United States.)

3. Performance: (The international rankings, for what they are worth.) These rankings are no more, and probably even less, meaningful than the unadjusted rankings of Pennsylvania schools based on their average PSSA scores. (Note: countries are ranked by their average scaled scores. If a country ranks higher or lower than another, its scaled score is also statistically different from the comparison country's at the .05 level.)

a. PIRLS:

i. Combined Reading Literacy Scale: U.S. 4th grade students scored 540, well above the international average of 500. They ranked higher than 4th graders in 22 of 44 jurisdictions and lower than 10. Asian American students scored the highest in the United States. Results also indicated a strong inverse relationship between the percentage of poor students in a school and scores (Baer et al., 2007).

b. TIMSS:

i. Mathematics: In 2003, U.S. 4th graders scored 518, above the international average of 495. They ranked higher than 4th graders in 13 of 24 countries and lower than 11, and in the middle of the 10 participating OECD countries. In 8th grade Mathematics, U.S. students scored 504, well above the international average of 466. They ranked above their peers in 25 of the 44 other participating countries and below their peers in 9; they ranked above their peers in 2 of 12 participating OECD countries and below those in 5.

ii. Science: In 2003, U.S. 4th graders scored 536, well above the international average of 489. They ranked higher than 4th graders in 16 of 24 countries and lower than 3. They ranked higher than students in 7 participating OECD countries and lower than those in only 1. In 8th grade Science, U.S. students scored 527, well above the international average of 473. They ranked above their peers in 32 of the 44 other participating countries and below their peers in 7. They ranked above their peers in 5 of 12 participating OECD countries and below those in 3.

c. PISA:

i. Combined Science: U.S. students scored lower than students in 16 of 29 OECD jurisdictions and 6 of 27 non-OECD jurisdictions, and higher than students in 5 OECD and 17 non-OECD jurisdictions. (Across all 56 other jurisdictions, OECD plus non-OECD, U.S. students scored lower than students in 22 and higher than students in 22.) U.S. students at the 90th percentile of the U.S. distribution scored above the OECD average for students at their own countries' 90th percentiles, but below the 90th-percentile students in 9 OECD and 3 non-OECD countries. PISA also reports results by proficiency level: only 6 of 56 jurisdictions had a higher percentage of their students scoring at the top level, while 19 of 56 had a higher proportion scoring at the bottom level. (A sketch of how such percentile figures are estimated from a weighted sample follows item iii below.) (The TIMSS and PISA science results are surprisingly good for a country that still spends significant resources on battles over the introduction of creationism into the science curriculum.)

ii. Mathematics: U.S. students scored lower than students in 23 of 29 OECD jurisdictions and 8 of 27 non-OECD jurisdictions, and higher than students in 4 OECD and 16 non-OECD jurisdictions. U.S. students at the 90th percentile scored below the OECD average and below the 90th-percentile students in 23 OECD and 6 non-OECD countries.

iii. Reading: PISA's 2006 Reading results were not reported due to an error in the printing of the test booklets.
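As promised above, here is a bare-bones sketch of how a percentile is estimated from a weighted student sample. The data are simulated; published PISA estimates additionally average over five "plausible values" per student and compute standard errors from replicate weights, none of which is reproduced here.

```python
# Simulated data only; not PISA's actual estimation machinery.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(500, 100, size=5000)     # hypothetical scaled scores
weights = rng.uniform(0.5, 2.0, size=5000)   # hypothetical sampling weights

def weighted_percentile(values, w, q):
    """Estimate the q-th percentile (0-100) of values under weights w."""
    order = np.argsort(values)
    values, w = values[order], w[order]
    cum = np.cumsum(w) - 0.5 * w             # weighted midpoint positions
    return float(np.interp(q / 100 * w.sum(), cum, values))

# For N(500, 100) the true 90th percentile is about 628 (500 + 1.28 * 100);
# the weighted estimate should land near that, up to sampling noise.
print(round(weighted_percentile(scores, weights, 90), 1))
```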

d. Researchers from the American Institutes for Research (AIR) performed a study that statistically linked state performance on the National Assessment of Educational Progress (NAEP) 8th grade Mathematics and Science tests to international performance on the TIMSS 8th grade Mathematics and Science tests (Phillips, 2007). All of the usual technical problems associated with large-scale assessments apply to both NAEP and TIMSS; to those is added the measurement problem of statistically linking performance on two different tests (Feuer et al., 1999). Despite these important caveats, it is interesting to note that the AIR researchers found that Pennsylvania's NAEP performance, projected onto the TIMSS scale, would rank it above the U.S. TIMSS average and above the averages of 36 of the 48 countries in math. It ranked below that of 5 Asian jurisdictions (Singapore, Hong Kong, Korea, Chinese Taipei, and Japan) (Phillips, 2007, p. 64). (Pennsylvania did not participate in the TIMSS Science tests. A toy illustration of one simple linking method follows this item.)
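For readers unfamiliar with linking, the sketch below shows the simplest member of the family: linear "mean-sigma" linking, which maps one scale onto another by matching means and standard deviations in a common population. All numbers are invented, and the Phillips (2007) study uses a more elaborate methodology; this only conveys the basic idea and why the Feuer et al. (1999) caveats bite.

```python
# Toy linear (mean-sigma) linking; all moments and scores are invented.
def mean_sigma_link(x, mean_from, sd_from, mean_to, sd_to):
    """Map score x from one scale to another by matching mean and SD."""
    return mean_to + sd_to * (x - mean_from) / sd_from

# Hypothetical moments for a common linking population on both tests:
naep_mean, naep_sd = 278.0, 36.0     # invented NAEP 8th grade math moments
timss_mean, timss_sd = 504.0, 80.0   # invented TIMSS 8th grade math moments

pa_naep = 286.0                      # invented state average on NAEP
print(round(mean_sigma_link(pa_naep, naep_mean, naep_sd, timss_mean, timss_sd), 1))
# Output: 521.8. The projection is only as trustworthy as the assumption
# that the two tests measure the same construct in comparable populations.
```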

e. Asian-Americans and PSSAs: While much has been written about the superior performance of certain Asian countries on international assessments, less attention has been devoted to what this might indicate about the influence of cultural factors on test performance. Although Asian-Americans are a diverse group originating from many countries and are not monolithically high performers, taken as a group they outperform others. It is interesting to note that students categorized as Asian or Pacific Islander on the Pennsylvania System of School Assessment (PSSA) Math and Reading tests (grades 3 to 8, and 11) outperformed White, Hispanic, and Black students in 2007. Eighty-six percent of Asian or Pacific Islander test takers scored proficient or advanced across all PSSA Math tests, and 78.7 percent scored in those categories on the Reading tests. Whites were the next highest-performing group, scoring 75.8% proficient or advanced in Math and 74.7% in Reading. These results again suggest that rankings based on average scores are not a sound basis for drawing broader conclusions about educational systems. Ignoring cultural and a host of other demographic factors will lead to incorrect inferences.

4. Technical issues affecting the validity of international comparisons:

a. Validity. Is the test validated for the use to which it is being put?

i. Although these international assessments gather both test scores and background survey data on schools and families, cross-national comparisons made by the media, advocacy groups, and policymakers invariably fail to adjust scores or ranks for the background data. There is no validity to conclusions drawn about schools or educational systems based solely on test scores that fail to adjust (or statistically control) for other variables affecting performance. The eminent British statistician Harvey Goldstein goes further, stating that longitudinal data (following students over time) are absolutely essential to any attempt to interpret differences in international test results:

Difficult as this may be, without such longitudinal data, the existing research literature indicates that it is impossible properly to attribute any observed differences to the effects of education per se, despite this being a major aim of comparative studies of achievement. Yet those agencies involved in carrying out these studies, principally the Organization for Economic Cooperation and Development (OECD) and the International Association for the Evaluation of Educational Achievement (IEA), continue to ignore the issue... (Goldstein, 2004a, p. 228)

b. Skills/knowledge measured:

i. PISA tests problem-solving ability in Math, Reading, and Science. It explicitly does not include questions tied to a school curriculum (Prais, 2003). (Unlike PIRLS and TIMSS, it is not designed to test mastery of a common curriculum.) PISA is designed to test "the application of knowledge in reading, mathematics, and science to problems with a real-life context" (Baldi et al., 2007). Its goal is to answer what knowledge and skills students have at age 15, taking into account schooling and other factors that affect their performance (Id.). Note that the tested group may come from different grade levels. There are substantial national differences in the reported grade levels of 15-year-olds: e.g., 100 percent of the tested Japanese students were in 10th grade, while 70.9% of U.S. students and 1.4% of Danish students were in that grade (Id.).

ii. As indicated, TIMSS is designed to test mastery of school curriculum. Consequently, performance is affected by exposure to the curriculum tested, and some critics have pointed out that students in some countries have not had the opportunity to learn some of the tested subjects (Bracey, 2000).

iii. PIRLS is designed to measure reading comprehension, including the ability of students to read for literary experience and to acquire and use information.

c. Selection bias, sampling, and nonresponse. One of the foremost issues in survey research is the introduction of selection bias. A selection effect distorts the evidence or data because of the way in which the data are collected. If selection bias is substantial, the sample may not be representative of the underlying population. In international testing programs, selection issues arising from the different samples of test takers render the test-taking populations noncomparable, and comparisons between them invalid. According to Professor Iris Rotberg, co-director of the Center for Curriculum, Standards, and Technology at George Washington University:

The goal is to test a representative sample of students at all ability levels. But, in practical terms, there is a lot of slippage. There are, inevitably, major practical sampling problems--even with the best intentions and most sophisticated sampling designs--which make it extremely difficult to ensure that comparable samples of students, schools, and regions are tested across countries (Education Week, 2008).

i. PISA has wide variation in the response rates to its surveys (Prais, 2003). In 2006, the weighted U.S. school response rate after replacement, at 79%, was the lowest among participating countries (Programme for International Student Assessment, 2007, p. 356). (School response rates in other countries ranged up to 100%.) Because PISA tests are administered to 15-year-olds, national results also will be affected by dropout rates. PISA also has been criticized for systematic marginalization of special needs students (Hopmann and Brinek, 2007, p. 11).

ii. TIMSS: The U.S. school response rate was 82 percent for 4th grade and 78 percent for 8th grade. Student response rates within these schools were 95 and 94 percent, respectively, resulting in overall response rates of 78 and 73 percent (the arithmetic is sketched after item iii below). Although TIMSS attempts to sample and represent the entire population of U.S. students, it is clear that nonresponse undermines that goal. (The U.S. response rates were among the lowest of the participating countries.) TIMSS 2003 had as its intended target population all students at the end of their eighth and fourth years of formal schooling in the participating countries. However, for comparability with previous TIMSS assessments, the formal definition for the eighth-grade population specified all students enrolled in the upper of the two adjacent grades that contained the largest proportion of 13-year-old students at the time of testing. This grade level was intended to represent eight years of schooling, counting from the first year of primary or elementary schooling, and was indeed the eighth grade in most countries. Similarly, for the fourth-grade population, the formal definition specified all students enrolled in the upper of the two adjacent grades that contained the largest proportion of 9-year-olds (Martin and Mullis, 2004, pp. 4-5). It follows that the students tested in different countries have a different mix of actual ages and years of schooling.

iii. PIRLS: The U.S. school response rate after replacement was 86 percent, and the student rate was 95 percent.
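The overall response rates quoted above are simply the product of the stage-level rates; when either stage slips, the overall rate compounds downward. The sketch below reproduces the U.S. TIMSS arithmetic from the text.

```python
# Overall response rate = school-level rate x within-school student rate.
school_rate_g4, student_rate_g4 = 0.82, 0.95   # U.S. TIMSS 4th grade rates
school_rate_g8, student_rate_g8 = 0.78, 0.94   # U.S. TIMSS 8th grade rates

print(f"grade 4 overall: {school_rate_g4 * student_rate_g4:.0%}")  # ~78%
print(f"grade 8 overall: {school_rate_g8 * student_rate_g8:.0%}")  # ~73%
# If nonresponding schools or students differ systematically from those
# who respond, the remaining sample is biased in ways that post-hoc
# weighting adjustments may not repair.
```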

d. Translation quality and the equivalence of test difficulty across languages (including cultural bias, such as reliance on the metric system). While great pains are taken to assure the equivalence of tests administered in different languages and countries, many have argued that problems persist (see Goldstein, 2004b). For example, translation problems may affect results on PISA (Puchammer, 2007).

e. Demographics matter.

i. Fuchs and Woessmann (2004) examined the student-level PISA database, controlling for a number of student characteristics, family background measures, and school and other institutional inputs. Their model accounted for more than 85% of the between-country variation in performance. Approximately 44% of the variation was explained by institutions, resources, or teachers, so the design and inputs of the educational system do matter. However, the largest single influence they found was family background (e.g., parental education, work status), which accounted for 43% of the variation; other home inputs and incentives (e.g., parental support and homework) accounted for approximately 5%. Thus, international comparisons that ignore these factors can lead to misleading conclusions about the influence of school-related educational inputs. (A toy sketch of this kind of regression adjustment follows item ii below.) Unfortunately, PIRLS, PISA, and TIMSS do not collect parental income data.

ii. Family structure matters. According to a study done for the National Center for Education Statistics, 45 percent of 15-year-old American students reported living in a non-two-parent household (Hampden-Thompson et al., 2006). This was nine percentage points (25%) higher than the next highest of the 19 other (primarily OECD) countries studied and 11 points (32%) above the group average. These researchers found that in all 20 countries, students from two-parent homes outperformed other students on the PISA mathematics literacy assessment. The achievement gap in the U.S. between students from two-parent and other homes was significantly greater than the international average and greater than the gap in 17 other countries (Id.).
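As promised in item i, the sketch below illustrates on simulated data the regression-adjustment logic behind such analyses: a raw cross-country score gap shrinks once a family-background measure enters the model. The variables and coefficients are invented; this is not the Fuchs and Woessmann (2004) specification.

```python
# Simulated illustration of background adjustment; not an actual analysis.
import numpy as np

rng = np.random.default_rng(1)
n = 4000
country = rng.integers(0, 2, n)              # 0 = Country A, 1 = Country B
ses = rng.normal(0.3 * country, 1.0, n)      # B's families are more advantaged
score = 500 + 40 * ses + 5 * country + rng.normal(0, 60, n)

raw_gap = score[country == 1].mean() - score[country == 0].mean()

# OLS of score on a country dummy plus SES; the country coefficient is
# the background-adjusted gap.
X = np.column_stack([np.ones(n), country, ses])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print(f"raw gap: {raw_gap:.1f} points, adjusted gap: {beta[1]:.1f} points")
# The raw gap (~17 points by construction) mostly reflects family
# background; the schooling effect built into the simulation is 5 points.
```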

iii. Poverty obviously matters. While U.S. educational rankings garner much attention in the media and from business groups, the persistence of poverty and its undesirable concomitants (other than the academic achievement gap), such as poor health care, do not generate the same level of response. For example, rising infant mortality rates have reached Third World levels (Vallely, 2005). The U.S. has one of the highest poverty rates and widest distributions of income of all industrialized countries (Watkins, 2005). Student test performance is directly related to socioeconomic status. Educational Testing Service researchers Barton and Coley (2007) found that family characteristics, including the parent-student ratio, reading to children, excessive television watching, and child absenteeism, bore a strong, statistically significant relationship to performance on the National Assessment of Educational Progress. It is not coincidental that the U.S. ranks lower on some international measures of public health than it does on measures of education (Watkins, 2005). By another measure, the U.S. is well above average but not near the top: according to the World Health Organization's 2000 rankings of overall health system performance, the U.S. ranked 37th of 191 countries (Musgrove et al., 2000). As with educational measures, summary rankings of health care systems can mislead as much as they inform. Only by looking at finer-grained analyses can we begin to raise and answer questions that should be of interest to policymakers. Why, for example, do African-Americans have shorter life expectancies than Nicaraguans or Moroccans (Arizona Daily Star, 2007; Kaiser Daily Health Policy Report, 2007)?

5. Recently, the National Governors Association and some business groups have promoted the idea of international educational benchmarking (McNeil, 2008). Any idea embraced by both the National Governors Association and Achieve, Inc. should be viewed skeptically. Both political and business leaders pursue agendas that are not characteristically congruent with a disinterested search for objective truth. (Neither organization has demonstrated a capacity for methodologically sophisticated or nuanced thinking about complex social and economic problems.)

6. Conclusion: Large-scale international assessment programs do provide much useful student-level information regarding the correlates of academic performance. Researchers have found further evidence that family background and demographic factors exert the strongest influence on the level of student performance. However, for the many reasons listed above, the use of international assessments to compare and rank the educational systems of different countries is invalid. In fact, we are not aware of any validation studies that would support such a use. So how do the tested U.S. students stack up against the rest of the world on these international assessments? If one were inclined to follow the superficial lead of popular coverage, for most groups and tests, U.S. students perform above the comparison-group averages. The correct answer, however, is that meaningful and accurate answers depend on the host of factors described in the paragraphs above. For those, we would have to review the international scholarly literature, a task beyond the scope of this project.

References

1. Arizona Daily Star. (2007). U.S. now trails 41 other nations in life expectancy. August 12, 2007. www.dailystar.com
2. Baer, J., S. Baldi, K. Ayotte, and P. Green. (2007). The Reading Literacy of U.S. Fourth-Grade Students in an International Context: Results From the 2001 and 2006 Progress in International Reading Literacy Study (PIRLS) (NCES 2008-017). (U.S. Department of Education, National Center for Education Statistics, Institute of Education Sciences: Washington, DC).
3. Baldi, S., Y. Jin, M. Skemer, P. Green, D. Herget, and H. Xie. (2007). Highlights From PISA 2006: Performance of U.S. 15-Year-Old Students in Science and Mathematics Literacy in an International Context (NCES 2008-016). (U.S. Department of Education, National Center for Education Statistics: Washington, DC).
4. Barton, P. and R. Coley. (2007). The Family: America's Smallest School. (Policy Evaluation and Research Center, Policy Information Center, ETS: Princeton, NJ).
5. Bracey, G. (1999). Lou Gerstner's Lies. EDDRA. (Posted October 10, 1999).
6. Bracey, G. (2000). The TIMSS Final Year Study and Report: A Critique. Educational Researcher. (Vol. 29, No. 4, pp. 4-10).
7. Bracey, G. (2003). PIRLS Before the Press. Phi Delta Kappan. (Vol. 84, No. 10, p. 795).
8. Bracey, G. (2005). Education's Groundhog Day: Point of View Essay. Education Policy Research Unit, Education Policy Studies Laboratory (EPSL-0502-103-EPRU). (Arizona State University: Tempe, AZ).
9. Bracey, G. (2007). The Education Trust's Disinformation Campaign. Huffington Post, July 22, 2007. Retrieved from: http://www.huffingtonpost.com/gerald-bracey/the-education-trusts-dis_b_57327.html
10. Education Week. (2008). Transcript: The Use of International Data to Improve U.S. Schools. (May 21, 2008). Retrieved from: http://www.edweek.org/chat/transcript_05_21_08.html
11. Feuer, M., P. Holland, B. Green, M. Bertenthal, and F.C. Hemphill, eds. (1999). Uncommon Measures: Equivalence and Linkage Among Educational Tests. (National Academy Press: Washington, DC).
12. Fuchs, T. and L. Woessmann. (2004). What Accounts for International Differences in Student Performance? A Re-examination Using PISA Data. (March 24, 2004).
13. Goldstein, H. (1995). Interpreting International Comparisons of Student Achievement. (UNESCO: Paris).
14. Goldstein, H. (2004a). Review Essay: International comparative assessment: how far have we really come? Assessment in Education. (Vol. 11, No. 2, pp. 227-234).
15. Goldstein, H. (2004b). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education. (Vol. 11, pp. 319-330).


16. Goldstein, H. (2004c). The Education World Cup: international comparisons of student achievement. Cadmo. (Anno XII, pp. 63-70).
17. Gonzales, P., J.C. Guzman, L. Partelow, E. Pahlke, L. Jocelyn, D. Kastberg, and T. Williams. (2004). Highlights From the Trends in International Mathematics and Science Study (TIMSS) 2003 (NCES 2005-005). (U.S. Department of Education, National Center for Education Statistics: Washington, DC).
18. Hampden-Thompson, G., J. Johnston, and American Institutes for Research. (2006). Variation in the Relationship Between Nonschool Factors and Student Achievement on International Assessments. (NCES 415-920-9229). (U.S. Department of Education, National Center for Education Statistics: Washington, DC).
19. Hopmann, S.T. and G. Brinek. (2007). Introduction: PISA According to PISA - Does PISA Keep What It Promises? In PISA According to PISA - Does PISA Keep What It Promises?, S. Hopmann, G. Brinek, and M. Retzl (eds.). (LIT Verlag: Berlin and Vienna).
20. Kaiser Daily Health Policy Report. (2007). Coverage and Access: U.S. Life Expectancy Below That of 41 Other Nations. (Henry J. Kaiser Family Foundation). August 13, 2007. Retrieved from: http://www.kaisernetwork.org/daily_reports/health2008dr.cfm?DR_ID=46838
21. Martin, M.O. and I.V.S. Mullis. (2004). Overview of TIMSS 2003. In TIMSS 2003 Technical Report: Findings From the IEA's Trends in International Mathematics and Science Study at the Fourth and Eighth Grades, Martin, Mullis, and Chrostowski (eds.). (IEA/TIMSS & PIRLS Study Center, Lynch School of Education, Boston College: Boston).
22. McNeil, M. (2008). Benchmarks Momentum on Increase: Governors Group, State Chiefs Eyeing International Yardsticks. Education Week Online, March 10, 2008. Retrieved from: http://www.edweek.org/ew/articles/2008/03/12/27nga_eph27.html
23. Musgrove, P., A. Creese, A. Preker, C. Baeza, A. Anell, and T. Prentice. (2000). The World Health Report 2000 - Health Systems: Improving Performance. (World Health Organization: Geneva).
24. Phillips, G. (2007). Chance Favors the Prepared Mind: Mathematics and Science Indicators for Comparing States and Nations. (American Institutes for Research: Washington, DC).
25. Prais, S.J. (2003). Cautions on OECD's Recent Educational Survey (PISA). Oxford Review of Education. (Vol. 29, No. 2, pp. 139-163).
26. Programme for International Student Assessment. (2007). PISA 2006: Science Competencies for Tomorrow's World, Volume 1 - Analysis. (OECD: Paris).
27. Puchammer, M. (2007). Language-Based Item Analysis: Problems in International Comparisons. In PISA According to PISA - Does PISA Keep What It Promises?, S. Hopmann, G. Brinek, and M. Retzl (eds.). (LIT Verlag: Berlin and Vienna).
28. Vallely, P. (2005). UN Hits Back at US in Report Saying Parts of America Are as Poor as Third World. The Independent, Online Edition. September 9, 2005. Retrieved from: http://news.independent.co.uk/world/politics/article311066.ece


29. Watkins, K. (2005). Human Development Report 2005: International Cooperation at a Crossroads - Aid, Trade and Security in an Unequal World. (United Nations Development Programme: New York).

