VALIDITY

When constructing or selecting assessments, the most important questions are (a) to what extent will the interpretation of the scores be appropriate, meaningful, and useful for the intended application of the results? and (b) what are the consequences of the particular uses and interpretations that are made of the results?

Assessments take a wide variety of forms, ranging from the familiar multiple-choice or other types of fixed-response tests to extended observations of performance. They also serve a variety of uses in the school. For example, assessment results might be used to identify student strengths and weaknesses, to plan instructional activities, or to communicate progress to students and parents; achievement tests might be used for selection, placement, diagnosis, or certification; aptitude tests might be used for predicting success in future learning activities or occupations; and appraisals of personal-social development might be used to better understand learning problems or to evaluate the effects of a particular school program. Regardless of the type of assessment used or how the results are to be used, all assessments should possess certain characteristics. The most essential of these are validity, reliability, and usability.

You may wonder why fairness is not listed. Fairness certainly is an essential characteristic of a good assessment. We did not list it separately, however, because fairness is an essential part of the comprehensive view of validity that is presented in this chapter. Validity is the adequacy and appropriateness of the interpretations and uses of assessment results. Clearly, use of an assessment that leads to unfair treatment of girls, or African Americans, or English-language learners would not be evaluated as either adequate or appropriate. Nor would the use of assessment results to make unfair interpretations about the "capacity" of a group of students to learn be considered either adequate or appropriate. In both cases, the lack of fairness would lead to a negative evaluation of the validity of the interpretation or use of assessment results.

An evaluation of the validity of the use and interpretation of an assessment can take many forms. For example, if an assessment is to be used to describe student achievement, then we should like to be able to interpret the scores as a relevant and representative sample of the achievement domain to be measured. If the results are to be used as a measure of students' understanding of mathematical concepts, then we should like our interpretations to be based on evidence that the scores actually reflect mathematical understanding and are not distorted by irrelevant factors, such as the reading demands of the tasks. If the results are to be used to predict students' success in some future activity, then we should like our interpretations to be based on as good an estimate of future success as possible. Basically, then, validity is always concerned with the specific use of assessment results and the soundness and fairness of our proposed interpretations of those results. As we will see later in this chapter, however, this does not mean that validation procedures can be matched to specific assessment uses on a one-to-one basis.

In recent years, our understanding of validation has also come to include an evaluation of the adequacy and appropriateness of the uses that are made of assessment results.
This expanded view of validity leads to a focus on the consequences of particular uses of assessment results. For example, if a state- or district-mandated test led teachers to ignore important content not covered by the test, then that consequence should be taken into account in judging the validity of the test use. In evaluating consequences it is important to distinguish between those consequences that are due to the assessment procedures and those that are the result of social or educational policies (Messick, 1994). Both are relevant to an evaluation of the use of the assessment, but it is consequences tied directly to characteristics of the assessment (e.g., an overemphasis on drill and practice, because the test places an undue emphasis on factual knowledge at the expense of conceptual understanding or problem-solving applications) that are the primary concern in validation of a particular use of an assessment.

Reliability refers to the consistency of assessment results. If we obtain quite similar scores when the same assessment procedure is used with the same students on two different occasions, then we can conclude that our results have a high degree of reliability from one occasion to another. Similarly, if different teachers independently rate student performances on the same assessment task and obtain similar ratings, we can conclude that the results have a high degree of reliability from one rater to another. Like validity, reliability is intimately related to the type of interpretation to be made. For some uses, we may be interested in asking how reliable our assessment results are over a given period of time and, for others, how reliable they are over different samples of the same behavior. In all instances in which reliability is being determined, however, we are concerned with the consistency of the results rather than with the appropriateness of the interpretations made from the results (validity).

The relation between reliability and validity is sometimes confusing to persons who encounter these terms for the first time. Reliability (consistency) of measurement is needed to obtain valid results, but we can have reliability without validity. That is, we can have consistent measures that provide the wrong information or are interpreted inappropriately. The target-shooting illustration in Figure 4.1 depicts the concept that reliability is a necessary but not sufficient condition for validity.

Figure 4.1 Reliability (consistency) is needed to obtain valid results (but one can be consistently "off target"): Target 1, Kit ("Bullseye") Carson (reliable and valid shooting); Target 2, Bill ("Scattershot") Henry (unreliable and invalid shooting); Target 3, Jack ("Rightpull") Armstrong (reliable but invalid shooting).

In addition to providing results that possess a satisfactory degree of validity and reliability, an assessment procedure must meet certain practical requirements. It should be economical from the viewpoint of both time and money, it should be easily administered and scored, and it should produce results that can be accurately interpreted and applied by available school personnel. These practical aspects of an assessment procedure all can be included under the heading of usability. The term usability, then, refers only to the practicality of the procedure and says nothing about the other qualities present.
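As a concrete illustration of reliability as consistency, the minimal sketch below correlates two hypothetical sets of ratings of the same student performances. The data, the 1-5 rating scale, and the use of a simple correlation as the index of consistency are assumptions made only for illustration (the computation relies on statistics.correlation, available in Python 3.10+).

```python
# Minimal sketch with hypothetical data: reliability as rater-to-rater consistency.
# A high correlation between two raters' independent ratings of the same student
# performances indicates consistent results; it says nothing about whether those
# results are interpreted appropriately (validity).
from statistics import correlation

rater_1 = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]  # hypothetical ratings on a 1-5 scale
rater_2 = [4, 3, 4, 2, 5, 3, 5, 2, 2, 4]  # second rater, same ten performances

r = correlation(rater_1, rater_2)
print(f"Rater-to-rater consistency: r = {r:.2f}")
```

Note that both raters could agree closely and still be rating the wrong quality; that is the sense in which reliability is necessary but not sufficient for validity.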
NATURE OF VALIDITY

When using the term validity in relation to testing and assessment, keep the following cautions in mind.

1. Validity refers to the appropriateness of the interpretation and use made of the results of an assessment procedure for a given group of individuals, not to the procedure itself. We sometimes speak of the "validity of a test" for the sake of convenience, but it is more correct to speak of the validity of the interpretation and use to be made of the results.

2. Validity is a matter of degree; it does not exist on an all-or-none basis. Consequently, we should avoid thinking of assessment results as valid or invalid. Validity is best considered in terms of categories that specify degree, such as high validity, moderate validity, and low validity.

3. Validity is always specific to some particular use or interpretation for a specific population of test takers. No assessment is valid for all purposes. For example, the results of a mathematics test may have a high degree of validity for indicating computational skill, a low degree of validity for indicating mathematical reasoning, a moderate degree of validity for predicting success in future mathematics courses, and essentially no validity for predicting success in art or music. When indicating computational skill, the mathematics test may also have a high degree of validity for third- and fourth-grade students but a low degree of validity for second- or fifth-grade students. Thus, when appraising or describing validity, it is necessary to consider the specific interpretation or use to be made of the results. Assessment results are never just valid; they have a different degree of validity for each particular interpretation to be made.

4. Validity is a unitary concept. The conceptual nature of validity has typically been described for the testing profession in a set of standards prepared by a joint committee of members from three professional organizations that are especially concerned with educational and psychological testing and assessment. In the two most recent revisions of the Standards for Educational and Psychological Testing by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) (1999), the traditional view that there are several different types of validity has been discarded. Instead, validity is viewed as a unitary concept based on various kinds of evidence. We will refer to the 1999 AERA, APA, NCME standards simply as the Standards.

5. Validity involves an overall evaluative judgment. It requires an evaluation of the degree to which interpretations and uses of assessment results are justified by supporting evidence and in terms of the consequences of those interpretations and uses.

There are many ways of accumulating evidence to support or challenge the validity of an interpretation or use of assessment results. The Standards discuss five sources of evidence that are proposed for possible use in evaluating the validity of a specific use or interpretation. These sources of evidence are based on (a) test content, (b) response processes, (c) internal structure, (d) relations to other variables, and (e) consequences of testing.
Thus, validation may include a consideration of the content measured, the ways in which students respond, the relationship of individual items to the test scores, the relationship of performance to other assessments, and the consequences of using and interpreting assessment results.

Traditionally, the ways of accumulating evidence have been grouped together in one of three categories (content-related, construct-related, and criterion-related evidence). Each type of evidence is an important consideration in arriving at an overall evaluation of the degree of validity of any given interpretation of scores on a test or other assessment procedure. Although the traditional methods of accumulating evidence include the first four sources of evidence in the Standards (mostly under the broad umbrella of construct validity), these traditional categories do not take into account that the consequences of uses and the interpretation of assessment results also influence validity. Hence, we will discuss four interrelated considerations (content, construct, assessment-criterion relationships, and consequences) in the evaluation of validity rather than a list of distinct validation methods.

MAJOR CONSIDERATIONS IN ASSESSMENT VALIDATION

Four major considerations for validation are briefly described in Table 4.1. The strongest case for validity can be made when evidence is obtained regarding all four considerations. That is, interpretations and uses of assessment results are likely to have greater validity when we have an understanding of (a) the assessment content and the specifications from which it was derived, (b) the nature of the characteristic(s) being measured, (c) the relation of the assessment results to other significant measures, and (d) the consequences of the uses and interpretations of the results.

Table 4.1 Major considerations in validation

Content
  Procedure: Compare the assessment tasks to the specifications describing the task domain under consideration.
  Meaning: How well the sample of assessment tasks represents the domain of tasks to be measured and how it emphasizes the most important content.

Construct
  Procedure: Establish the meaning of the assessment results by controlling (or examining) the development of the assessment, evaluate the cognitive processes used by students to perform tasks, evaluate the relationships of the scores with other relevant measures, and experimentally determine what factors influence performance.
  Meaning: How well performance on the assessment can be interpreted as a meaningful measure of some characteristic or quality. For example, does the performance clearly imply that the student "understands" the relevant concept or principle intended to be used in responding to the task?

Assessment-criterion relationships
  Procedure: Compare assessment results with another measure of performance obtained at a later date (for prediction) or with another measure of performance obtained concurrently (for estimating present status).
  Meaning: How well performance on the assessment predicts future performance or estimates current performance on some valued measures other than the test itself (called a criterion).

Consequences
  Procedure: Evaluate the effects of use of assessment results on teachers and students. Both the intended positive effects (e.g., increased learning) and possible unintended negative effects (e.g., narrowing of instruction, dropping out of school) need to be evaluated.
  Meaning: How well use of assessment results accomplishes intended purposes and avoids unintended effects.
However, for many uses of a test or an assessment, it is not practical or necessary to have evidence dealing with all four considerations. For example, it is not practical to expect that a teacher would provide evidence that a classroom assessment designed to measure student learning is related to other significant measures. In this case, the primary concern would be content, but some of the analyses of the meaning of the scores (construct considerations) and possible effects on student motivation and learning (consequence considerations) would be relevant. Similarly, in using a scholastic aptitude test to predict future success in school, test-criterion relationships would be of major interest; but we would also be concerned about the appropriateness of the content, possible irrelevant factors (not part of the construct) that influence test performance (e.g., motivation, test anxiety, test-taking skills), and possible unintended consequences of using predictions. Thus, content, construct, and consequence considerations would be relevant. One consideration is likely to be of primary importance, but the other three are useful for a fuller understanding of the meaning of the assessment results and, therefore, contribute to the validation of our interpretations.

Although many other considerations are relevant to validity, our discussions of content, construct, assessment-criterion relationships, and consequence considerations will focus on those procedures that are most useful in practical educational settings.

CONTENT CONSIDERATIONS

Content considerations are of special importance when we wish to describe how an individual performs on a domain of tasks that the assessment is supposed to represent. We may, for example, expect students to be able to spell the 200 words on a given list. Because a 200-word spelling test is too time consuming, we may select a sample of 20 words to represent the total domain of 200 spelling words. If Margaret correctly spells 80% of these 20 words, we would like to be able to say that she can probably spell approximately 80% of the 200 words. Thus, we would like to be able to generalize from the student's performance on the sample of words in the test to the performance that the student would be expected to demonstrate on the domain of spelling words that the test represents.

The validity of the interpretation, in which a test score implies that the student can probably spell a given percentage of words in the whole domain, depends on considerations that go beyond the question of content. For example, construct considerations, such as the assumption that Margaret was trying to do her best, that she did not copy her neighbor's spelling words, and that she understood the teacher's pronunciation of the words, influence the validity of the interpretation that she can spell a given fraction of the words. Here, however, our concern is with the extent to which our 20-word test constituted a representative sample of the 200 words. In this instance, we can obtain a fairly representative sample of spelling words by simply starting with our 200-word list and selecting every 10th word. Having thus assured ourselves that we have a reasonably representative sample, we would have good support for the desired interpretation, in terms of content considerations.
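The minimal sketch below illustrates the sampling and generalization just described. The placeholder word list, the systematic selection of every 10th word, and the particular number correct are all assumptions made only for illustration.

```python
# Minimal sketch with a placeholder word list: sample 20 of 200 spelling words
# and generalize from performance on the sample to the whole domain.
domain = [f"word_{i:03d}" for i in range(1, 201)]  # stands in for the real 200-word list

sample = domain[9::10]       # every 10th word gives a 20-word test
assert len(sample) == 20

words_correct = 16           # hypothetical: the student spells 16 of the 20 sampled words
sample_percent = 100 * words_correct / len(sample)

print(f"Performance on the 20-word sample: {sample_percent:.0f}%")
print(f"Estimated performance on the full 200-word domain: about {sample_percent:.0f}%")
```

The generalization is only as good as the sample; that is exactly the question content considerations are meant to answer.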
As we will see shortly, judging how adequately an assessment samples a given domain of achievement is usually much more complex than in this simple example, particularly when the learning outcomes involve more complex understandings or integrated performances.

The essence of the content consideration in validation, then, is determining the adequacy of the sampling of the content that the assessment results are interpreted to represent. More formally, the goal in the consideration of content validation is to determine the extent to which a set of assessment tasks provides a relevant and representative sample of the domain of tasks about which interpretations of assessment results are made.

Focusing only on the issue of the adequacy of the sample from a content domain, of course, begs the question of priorities that should be given to different aspects of the content domain to be assessed. The definition of the domain to be assessed should derive directly from the identification of goals and objectives as discussed in Chapter 3. As emphasized in the quotes in the box "Defining the Content Domain of an Assessment," it is critical that the assessment begin with a content domain that reflects the important goals and objectives.

In classroom assessment, the domains of achievement tasks are determined by applicable content standards, the curriculum, and instruction. Assessment development involves (a) clearly specifying the domain of instructionally relevant tasks to be used to measure student achievement, (b) specifying the emphasis or relative importance according to the priority of goals and objectives, and (c) constructing or selecting a representative set of assessment tasks. Thus, to obtain a valid measure of learning outcomes, we proceed from the instruction (what has been taught) to the achievement domain (what is to be measured) to the priorities for measurement (what should be emphasized in the assessment) and finally to the assessment itself (a representative sample of relevant tasks). As shown in Figure 4.2, content considerations in validation require a judgment that all four are in close harmony.

Figure 4.2 Content considerations in the assessment of classroom achievement
  Instruction: Determines which intended learning outcomes (objectives) are to be achieved by students.
  Achievement Domain: Specifies and delimits a set of instructionally relevant learning tasks to be measured by an assessment.
  Instructional and Assessment Priorities: Specifies the relative importance of learning objectives to be assessed.
  Achievement Assessment: Provides a set of relevant assessment tasks designed to measure a representative sample of tasks from the achievement domain.

Rigorous judgments regarding validity based on content considerations should not be confused with face validity, which refers only to the appearance of the assessment. Based on a superficial examination of the tasks, does the assessment appear to be a reasonable measure? A simple example can be used to draw a clear distinction between making validity claims based on rigorous consideration of content definitions and the adequacy of sampling of tasks and making claims based on face validity (i.e., on the basis of appearance).
On an arithmetic test for a young child, we might phrase an item as follows: "If you had a 10-foot piece of string and you cut it in half, how long would the two pieces be?" If the test was to be given to carpenters, we would substitute the word board for string in this item. Similarly, for plumbers we would use the word pipe and for electricians the word wire. The problem remains the same, but by phrasing it in appropriate terms, it appears more relevant to the test taker (i.e., it has greater face validity). The validity of interpretations of the arithmetic test scores would not be determined by how the test looked, however, but by how well it sampled the domain of arithmetic tasks important to each group (i.e., children, carpenters, plumbers, and electricians). Thus, our arithmetic test may provide an adequate measure of content for one group but not another, even though the items were phrased in terms appropriate to each group. Although a test should look like an appropriate measure to obtain the cooperation of those taking the test, face validity should not be considered a substitute for more rigorous evaluation of content definitions and sampling adequacy.

Defining the Content Domain of an Assessment
"Assessment activities should contribute to instructional improvement by focusing on instruction targets that are consistent with the goals of instructional activities" (Linn & Baker, 1996).
Science Standards: "Achievement data collected focus on the science content that is most important for students to learn" (National Committee on Science Education Standards and Assessment, National Research Council, 1995, p. 79).
Mathematics Standards: "The mathematics standard: Assessment should reflect the mathematics that all students need to know and be able to do" (National Council of Teachers of Mathematics, 1995, p. 11).

Content Considerations in Assessment Development to Enhance Validity

Content issues are typically considered during the development of an assessment. It is primarily a matter of preparing detailed specifications and then constructing an assessment that meets these specifications. Although there are many ways of specifying what an assessment should measure, one widely used procedure in constructing achievement tests uses a two-way chart called a table of specifications. We will use a brief form of it here to help clarify the process of content validation in preparing classroom assessments. More elaborate tables of specifications and other types of specifications will be described and illustrated in Chapter 6.

Table of Specifications. The learning outcomes of a course or curriculum may be broadly defined to include both subject-matter content and instructional objectives. The former is concerned with the topics to be learned and the latter with the types of performance students are expected to demonstrate (e.g., knows, comprehends, applies, analyzes, synthesizes, evaluates). Both of these aspects are of concern in defining the content domain and ensuring adequate sampling from it. We should like any assessment of achievement that we construct to produce results that represent both the content areas and the objectives we wish to measure, and the table of specifications aids in obtaining a sample of tasks that represents both. A table of specifications, in a very simple form, is presented in Table 4.2 to show how such a table is used in test development.
Table 4.2 Sample table of specifications
(Instructional objectives; all entries are percentages of test items)

Content Area                 Knows  Comprehends  Applies  Analyzes  Synthesizes  Evaluates  Total
Air pressure                   2        4          4         6          4           2         22
Air temperature                2        4          4         2          4           4         20
Humidity and precipitation     2        4          2         2          4           2         16
Wind                           2        4          2         2          4           2         16
Clouds                         2        4          2         2          4           0         14
Fronts                         2        4          2         2          0           2         12
Total                         12       24         16        16         20          12        100

The percentages in the table indicate the relative degree of emphasis that each content area and each instructional objective is to be given in the test. Thus, if a 50-item classroom test is to measure a representative sample of subject-matter content, then 22% of the items (i.e., 11 items) should be concerned with air pressure, 20% with air temperature, 16% with humidity and precipitation, 16% with wind, 14% with clouds, and 12% with fronts. Similarly, if the test is to measure a representative sample of the instructional objectives, then 12% of the items (i.e., 6 items) should measure knowledge of concepts, 24% should measure comprehension of concepts, 16% should measure application of concepts, 16% should measure analysis of concepts, 20% should measure synthesis of concepts, and 12% should measure evaluation of concepts. Thus, 52% of the items should measure the three lowest levels of Bloom's taxonomy and 48% of the items should measure the three highest levels of Bloom's taxonomy. This, of course, implies that the emphasis on knowledge, comprehension, application, analysis, synthesis, and evaluation for each content area will follow the percentages in the table of specifications. For example, 2% of the test items (i.e., a single item) should measure knowledge of air pressure concepts, 4% should measure comprehension of air pressure concepts, 4% should measure application of air pressure concepts, 6% should measure analysis of air pressure concepts, 4% should measure synthesis of air pressure concepts, and 2% should measure evaluation of air pressure concepts.

As noted earlier, the specifications describing the achievement domain to be measured should be in harmony with what was taught. Thus, the weights assigned in this table reflect the emphasis that was given during instruction. For example, comprehension outcomes received more emphasis than either knowledge or application outcomes in the instruction, and synthesis outcomes received more emphasis than either analysis or evaluation outcomes. The weights assigned through the table show that 44% of the instruction should emphasize comprehension and synthesis. The table, then, indicates the sample of instructionally relevant learning tasks to be measured, and the more closely the test items correspond to the specified sample, the greater the likelihood of obtaining a valid measure of student learning.

The test items and other kinds of assessment tasks must function as intended if valid results are to be obtained. Test items and assessment tasks may function improperly if they contain inappropriate vocabulary, unclear directions, or some other defect. Similarly, tasks designed to measure comprehension and synthesis may measure only the simple recall of information if the solutions to the problems have been directly taught during instruction. In short, a host of factors can influence the intended function of the tasks and thus the validity of the assessment results.
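The allocation arithmetic described above can be sketched directly from the table. In the minimal sketch below, the weights come from Table 4.2 and the 50-item test length from the example; the dictionary names and the simple rounding rule are assumptions for illustration only.

```python
# Minimal sketch: convert the percentages in Table 4.2 into item counts for a 50-item test.
TOTAL_ITEMS = 50

content_weights = {   # percentage of items per content area (Table 4.2, Total column)
    "Air pressure": 22, "Air temperature": 20, "Humidity and precipitation": 16,
    "Wind": 16, "Clouds": 14, "Fronts": 12,
}
objective_weights = {  # percentage of items per instructional objective (Table 4.2, Total row)
    "Knows": 12, "Comprehends": 24, "Applies": 16,
    "Analyzes": 16, "Synthesizes": 20, "Evaluates": 12,
}

def items_for(weights, total=TOTAL_ITEMS):
    """Number of items to write for each category, given percentage weights."""
    return {name: round(total * pct / 100) for name, pct in weights.items()}

print(items_for(content_weights))    # 22% of 50 -> 11 air pressure items, and so on
print(items_for(objective_weights))  # 12% of 50 -> 6 knowledge items, and so on
```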
Much of what is written in this book concerning the construction of classroom assessments is directed toward producing valid measures of achievement.

Content Considerations in Test Selection

Evidence obtained from an analysis of content domain definition and content coverage (sampling adequacy) is also of concern when selecting published achievement tests. When test publishers prepare achievement tests for use in the schools, they pay special attention to content. Their test specifications, however, are based on what is commonly taught in many different schools. Thus, a published test may or may not fit a particular school situation. To determine whether it does, it is necessary to go beyond the title of the test and examine what the test actually measures. A careful consideration of what the test measures is a necessary step in adopting any test, because the Standards make it clear that the ultimate responsibility for test use and interpretation lies with the user (e.g., local educator) rather than the publisher. Thus, it is important to ask questions such as the following: How well is the test content aligned with the state or district content standards? How closely does the test content correspond to the course content and the curriculum and instructional goals in the local instructional program? Does the test provide a balanced measure of the intended learning outcomes? Are the most important objectives emphasized appropriately, or are some areas overemphasized and others neglected? A published test may provide more valid results for one school program than for another, depending on how closely the set of test tasks matches the achievement to be measured.

The same types of specifications used in preparing classroom assessments can be used in selecting published tests. The detailed descriptions of course content and instructional objectives and the relative emphasis to be given to each can help us determine which of several published tests is most relevant to our particular situation. It is simply a matter of examining the items in each test and comparing them to our test specifications. The test that provides the most balanced measure of the specified achievement domain is the one that will produce the most valid results. Many test publishers include a detailed description of their test specifications in the test manual. Although this makes it easier to judge the potential validity of the test results, there is no substitute for examining the test tasks themselves and judging how validly they measure the intended learning outcomes in the local instructional program.

Content Considerations in Other Areas

Although content considerations are of primary interest in assessing achievement, they are also of interest in other areas. For example, examining the content of a scholastic aptitude test aids in understanding the meaning of the scores and provides some evidence concerning the types of prediction for which it might be best suited. Similarly, when constructing or selecting an attitude scale, we are interested in how adequately the items cover those attitudinal topics included in the domain to be measured. In the same manner, an interest inventory should include samples of items that adequately represent those aspects of interest we wish to measure. In these and other situations, the content validation procedure is essentially the same as that in testing and assessment of achievement.
It is a matter of analyzing the content and tasks included in the measuring instrument and the domain of outcomes to be measured and judging the degree of correspondence between them.

CONSTRUCT CONSIDERATIONS

Although we are considering them second, most measurement specialists would give highest priority to construct considerations in evaluating the validity of an interpretation or use of an assessment. We began with content considerations because the review of the content domain and sampling helps us determine how well test or assessment scores represent a given domain of tasks and is especially useful in both the preparation and the evaluation of all types of assessment of achievement. However, because we usually wish to interpret test and assessment results in terms of more general individual characteristics (e.g., reading comprehension, communication ability, understanding of scientific principles), we need to consider the construct that the assessment is intended to measure. These characteristics may be labeled in a variety of ways (e.g., skills, accomplishments, abilities, psychological traits, personal qualities), but regardless of the label, they involve some inference about a construct that goes beyond the factual statement that a student obtained a given score on a particular test or assessment. For example, rather than only stating that a student correctly solved 75% of the tasks on a particular mathematics test, we might want to infer that the student possesses a certain degree of mathematical reasoning ability. This provides a broad general description of student performance that has implications for many different uses.

Whenever we wish to interpret assessment results in terms of some individual characteristic (e.g., reading comprehension, mathematics problem-solving ability), we are concerned with a construct. A construct is an individual characteristic that we assume exists in order to explain some aspect of behavior. Mathematical reasoning is a construct, and so are reading comprehension, understanding of the principles of electricity, intelligence, creativity, and such personality characteristics as sociability, honesty, and anxiety. These are called constructs because they are theoretical constructions that are used to explain performance on an assessment. When we interpret assessment results as a measure of a particular construct, we are implying that there is such a construct, that it differs from other constructs, and that the results provide a measure of the construct that is little influenced by extraneous factors. Verifying such implications is the task of construct validation.

Construct validation may be defined as the process of determining the extent to which performance on an assessment can be interpreted in terms of one or more constructs. Although construct validation has been commonly associated with theory building and theory testing (Cronbach & Meehl, 1955), it also has implications for the practical use of assessment results. Whenever an assessment is to be interpreted as a measure of a particular construct, the various types of evidence useful for construct validation should be considered during its development or selection. This will almost certainly include consideration of content and may include consideration of assessment-criterion relationships, but a variety of other types of evidence is also relevant. The most appropriate types of evidence will be dictated by the particular construct to be measured.
Two questions are central to any construct validation: (1) Does the assessment adequately represent the intended construct? and (2) Is performance influenced by factors that are ancillary or irrelevant to the construct? In the jargon of measurement specialists, these two questions are concerned with construct underrepresentation and construct-irrelevant variance. Both are stated in the negative. That is, validity is reduced to the degree that important aspects of the construct are underrepresented in the assessment. Validity is also reduced to the degree that performance is influenced by irrelevant factors such as skills that are ancillary to the intent of the assessment (e.g., reading speed on a reading comprehension test or punctuation on a test of understanding scientific principles).

Construct-irrelevant factors can lead to unfairness in the use and interpretation of assessment results. An assessment intended to measure understanding of mathematical concepts, for example, could lead to unfair inferences about the level of understanding of English-language learners because of the heavy reading demands of the assessment tasks, which are presented only in English.

In Chapter 3, a standard from the Colorado geography content standards was used as an illustration of state content standards. It will be repeated here to illustrate the important ideas of construct underrepresentation and construct-irrelevant variance. The standard is as follows: Students understand how economic, political, cultural, and social processes interact to shape patterns of human populations, interdependence, cooperation, and conflict. (Colorado Model Content Standards for Geography, Colorado Department of Education; adopted June 1995, amended November 1995)

The validity of an assessment of this standard would be undermined by construct underrepresentation if it dealt only with economic and political processes and their interactions. The cultural and social processes would be underrepresented (or not represented at all). The validity of the assessment would also be called into question because of construct-irrelevant variance if it were found that the major factors influencing student scores on the assessment were the writing mechanics (e.g., grammar, punctuation, and spelling) of the essays written in response to the tasks. Although writing mechanics are skills that a teacher might want to assess, those skills are ancillary to determining whether a student has acquired the understanding that is the intent of the illustrative geography standard.

Although construct validation is often associated with tests and assessments used to measure theoretical psychological constructs such as anxiety or introversion, it should be clear from the previous example that considerations of construct underrepresentation and construct-irrelevant variance are equally applicable to both published and teacher-developed assessments used in the classroom.

Some aspects of construct underrepresentation were dealt with directly when considering the content basis of validity. For example, the validity of an assessment of a unit on weather that was supposed to be aligned with the table of specifications shown in Table 4.2 would be called into question if there were no tasks that required knowledge or understanding of clouds or if none of the higher cognitive levels (e.g., synthesis and evaluation) were needed to complete the assessment tasks.
Construct considerations, however, also involve questions about student characteristics, such as understanding or problem-solving ability. This requires considering the possibility that a correct answer might simply reflect recall of an answer from the textbook or going through the motions of applying an algorithm to a problem rather than real understanding or problem-solving ability, both of which may require that the task be novel to the student.

In considering how construct-irrelevant factors may undermine validity, it is useful to think about ancillary skills that may have an impact on performance on the assessment. On an assessment in mathematics or science, for example, reading ability is one obvious skill that may be ancillary to the intent of the assessment. Thus, it would be important to review the reading demands of the tasks to ensure that performance of some students was limited not by a lack of understanding of science principles or mathematical concepts but by reading difficulties. Such a review is likely to be particularly important for assessments involving English-language learners.

A wide range of construct-irrelevant factors may undermine validity. In addition to the influence of ancillary skills (e.g., reading on a science test), the test can interact with student characteristics such as test-wiseness, motivation, or anxiety. As a consequence, students who have low test-wiseness (or low motivation or high anxiety) might be expected to score lower than their true ability would indicate on a science test. Instruction can also introduce construct-irrelevant factors when comparing two groups of students who have not had an equal opportunity to learn the material. For example, if one group of students received instruction that emphasized higher-order skills and another group of students received instruction that emphasized lower-level skills, then comparisons of the students' abilities on a test that measures higher-order skills would not lead to reasonable interpretations. In short, anything that affects performance on the assessment that is not the construct of interest introduces a potential source of construct irrelevance, or invalidity.

Construct validation takes place primarily during the development and tryout of a test or an assessment and is based on an accumulation of evidence from varied sources. When selecting a published test that presumably measures a particular construct, such as mathematical reasoning or reading comprehension, the test manual should be examined to determine what evidence is presented to support the validity of the proposed interpretations.

An illustration of some types of evidence that might be used in the construct validation of an assessment of mathematical reasoning is shown in Figure 4.3. Although other types of studies could be added, this listing is sufficient to clarify the variety of types of evidence needed to support the claim that the assessment scores can be interpreted as measures of mathematical reasoning. Notice that considerations of both content and assessment-criterion relationships (considered in greater detail in the next section of this chapter) are included, along with other comparisons and correlations. No single type of evidence is sufficient, but the accumulation of various types of evidence helps describe what the assessment measures and how the scores relate to other significant variables.
This clarifies the meaning of the assessment performance and aids in determining how validly mathematical reasoning is being measured.

In theory building and theory testing, the accumulation of evidence for construct validation may be endless. As new data are gathered, both the theory and the test or assessment are likely to be modified, and the testing of hypotheses continues. For the practical use of test and assessment results, however, we need to employ a more restricted framework when considering construct validation. During the development and selection of tests and assessments, our focus should be on the types of evidence that it seems reasonable to obtain, giving special attention to those data that are most relevant to the types of interpretations to be made. We can thus increase our understanding of what the assessment measures and how validly it does so without becoming involved in an endless task of data gathering.

Figure 4.3 Types of evidence used in construct validation

Methods Used in Construct Validation

Construct validation depends on logical inferences drawn from a variety of types of data. As noted earlier, analyses of content and criterion relationships provide partial support for our interpretations, but this must be supplemented by various studies that further clarify the meaning of the assessment results. Although it is impossible to describe all the specific procedures that might be used in construct validation, the following exemplify some of the more commonly used methods.

1. Defining the domain of tasks to be measured. The specifications should be so well defined that the meaning of the construct is clear and it is possible to judge the extent to which the assessment provides a relevant and representative measure of the task domain. If a single construct is being measured, then the tasks should evoke similar types of responses and be highly interrelated (also a content consideration).

2. Analyzing the response process required by the assessment tasks. The response process called forth by the assessment tasks can be determined both by examining the test tasks themselves and by administering the tasks to individual students and having them "think aloud" as they perform the tasks. Thus, examination of the items in a reading comprehension test may indicate that literal comprehension is emphasized, with relatively few items devoted to inferential comprehension. Similarly, a review of the requirements for performance of a laboratory task in science may reveal a substantial emphasis on accurate recording of results but too little emphasis on the conceptual integration of the laboratory results with theory or everyday experiences. Such judgments can be checked by administering the tasks to individual students and having them explain how they obtain their responses. In the example in Figure 4.3, "thinking aloud" may verify that the tasks call for the intended reasoning process, or it may reveal that most problems can be solved by a simple trial-and-error procedure.

3. Comparing the scores of known groups. In some cases, it is possible to predict that scores will differ from one group to another. These may be age groups, trained and untrained, adjusted and maladjusted, and the like. For example, level of achievement generally increases with age (at least during childhood and adolescence).
Also, it is reasonable to expect that performance on an assessment will differ for groups that have received different amounts of instruction in the subject matter of the assessment and that scores on adjustment inventories will discriminate between groups of adjusted and maladjusted individuals. Thus, a prediction of differences for a particular test or assessment can be checked against groups that are known to differ and the results used as partial support for construct validation.

4. Comparing scores before and after a particular learning experience or experimental treatment. We would like our assessments to be sensitive to some types of experiences and insensitive to others. Certainly, we would like assessments of student achievement in a given subject matter area to improve during the course of instruction. On the other hand, we would not like them to be influenced by such factors as student anxiety. Thus, both a demonstration of increases in performance following instruction and a demonstration that performance was affected little by a treatment designed to reduce student anxiety would lend support to the construct validity of the assessment.

5. Correlating the scores with other measures. The scores of any particular assessment can be expected to correlate substantially with the scores of other measures of the same or a similar construct. (See the next section of this chapter and Appendix A for a discussion of correlation.) By the same token, lower correlations would be expected to be obtained with measures of a different ability or trait. For example, we would expect rather high correlation between two scholastic aptitude tests but much lower correlation between a scholastic and a musical aptitude test. Similarly, we would expect student performances on two assessments of writing to have substantially higher correlations with each other than either would have with an assessment in mathematics. Thus, for any given test or assessment, we would predict higher correlations with like tests and assessments and lower correlations with unlike tests and assessments. In addition, we might also predict that the assessment scores would correlate with various practical criteria. Scholastic aptitude scores, for example, should correlate satisfactorily with school grades, and scores on an assessment of understanding of chemical principles should correlate with performance in chemistry and other measures of achievement in chemistry. This latter type of evidence is obtained by studies of assessment-criterion relationships. Our interest here, however, is not in the immediate problem of prediction, but in using these correlations to support the claim that the test measures scholastic aptitude or that the assessment measures understanding of chemical principles. As indicated earlier, construct validation depends on a wide array of evidence, including that provided by the other validation procedures.

Broadly conceived, construct validation is an attempt to account for the differences in assessment results. During the development of an assessment, an attempt is made to rule out extraneous factors that might distort the meaning of the scores, and follow-up studies are conducted to verify the success of these attempts. The aim is to clarify the meaning of student performance by identifying the nature and strength of all factors influencing the scores on the assessment.

Construct validation is important to all types of testing and assessment: achievement, aptitude, and personal-social development.
Whether constructing or selecting an assessment, the meaning of the resulting scores depends on the care with which the assessment was constructed and the array of evidence supporting the types of interpretations to be made. Construct validation is emphasized in most recent discussions of validity in the technical and theoretical literature, in part because, as we have seen, it subsumes considerations of content and criterion relationships and in part because meaning is crucial in our uses and interpretations of the scores. The latter point was stressed by Messick (1989), who stated, "The meaning of the measure, and hence its construct validity, must always be pursued—not only to support test interpretation but also to justify test use" (p. 17).

ASSESSMENT-CRITERION RELATIONSHIPS

Although few teachers will conduct studies relating assessment results to other measures, it is important to understand the use of assessment-criterion relationships in evaluating validity. Understanding how assessment-criterion relationships are analyzed, for example, will help in evaluating the use of standardized tests to make predictions of student performance in other settings.
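As a minimal sketch of how such a relationship is typically summarized, the example below correlates hypothetical scores on a predictor assessment with a criterion measure obtained later (here, course grades). All of the numbers are invented for illustration, and the Pearson correlation is written out so the computation is visible.

```python
# Minimal sketch with hypothetical data: an assessment-criterion relationship is
# commonly summarized by the correlation between assessment scores and a criterion
# measure obtained later (for prediction) or at about the same time (for estimating
# present status).
from math import sqrt

aptitude_scores = [52, 61, 45, 70, 66, 58, 49, 75, 63, 55]            # predictor (earlier)
course_grades   = [2.6, 3.1, 2.3, 3.7, 3.4, 2.9, 2.5, 3.8, 3.0, 2.8]  # criterion (later)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(aptitude_scores, course_grades)
print(f"Assessment-criterion correlation (validity coefficient): r = {r:.2f}")
```

The closer r is to 1.0, the better the assessment scores predict standing on the criterion; a coefficient near zero would offer no support for the predictive use of the scores.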
