Reliability and Validity

Chong Ho Yu, Ph.D.
Conventional views of reliability (AERA et al., 1985)

Temporal stability: administering the same form of a test on two or more separate occasions to the same group of examinees (test-retest). On many occasions this approach is not practical because repeated measurements are likely to change the examinees. For example, the examinees may adapt to the test format and thus tend to score higher on later tests. Hence, careful implementation of the test-retest approach is strongly recommended (Yu, 2005).

Form equivalence: administering two different forms of a test, based on the same content, on one occasion to the same examinees (alternate form). After alternate forms have been developed, they can be used with different examinees. This is very common in high-stakes examinations as a way to pre-empt cheating: an examinee who took Form A earlier could not share the test items with another student who might take Form B later, because the two forms have different items.

Internal consistency: the coefficient of test scores obtained from a single test or survey (Cronbach Alpha, KR-20, split-half). For instance, say respondents are asked to rate statements in an attitude survey about computer anxiety. One statement is "I feel very negative about computers in general." Another statement is "I enjoy using computers." People who strongly agree with the first statement should strongly disagree with the second statement, and vice versa. If the ratings of both statements are high, or both are low, among several respondents, the responses are said to be inconsistent and patternless. The same principle can be applied to a test: when no pattern is found in the students' responses, probably the test is too difficult and students just guess the answers randomly.
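To make the alpha calculation concrete, here is a minimal sketch in Python. The Likert ratings, the number of items, and the reverse-scoring step are all invented for illustration; they are not data from these notes.

```python
import numpy as np

# Hypothetical Likert ratings (rows = respondents, columns = items, 1-5 scale).
scores = np.array([
    [5, 1, 4, 5],
    [4, 2, 4, 4],
    [2, 4, 2, 1],
    [1, 5, 1, 2],
    [3, 3, 3, 3],
], dtype=float)

# Item 2 ("I enjoy using computers") is keyed in the opposite direction on an
# anxiety scale, so reverse-score it before checking consistency.
scores[:, 1] = 6 - scores[:, 1]

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")     # close to 1 for this consistent toy data
```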

• Reliability is a necessary but not sufficient condition for validity. For instance, if the needle of the scale is five pounds away from zero, I always over-report my weight by five pounds. Is the measurement consistent? Yes, but it is consistently wrong! Is the measurement valid? No! (But if it under-reports my weight by five pounds, I will consider it a valid measurement.)

• Performance, portfolio, and responsive evaluations, where the tasks vary substantially from student to student and where multiple tasks may be evaluated simultaneously, are attacked for lacking reliability. One of the difficulties is that there is more than one source of measurement error in performance assessment. For example, the reliability of a writing-skill test score is affected by the raters, the mode of discourse, and several other factors (Parkes, 2000).

• Replications as unification: Users may be confused by the diversity of reliability indices. Nevertheless, different types of reliability measures share a common thread: what constitutes a replication of a measurement procedure? (Brennan, 2001). Take internal consistency as an example. This measure is used because it is convenient to compute the reliability index based upon data collected from one occasion. However, the ultimate inference should go beyond one single testing occasion to others (Yu, 2005). In other words, any procedure for estimating reliability should attempt to mirror a result based upon full-length replications.

Conventional views of validity (Cronbach, 1971)

• Face validity: Face validity simply means validity at face value. As a check on face validity, test/survey items are sent to teachers to obtain suggestions for modification. Because of its vagueness and subjectivity, psychometricians abandoned this concept a long time ago. However, face validity has come back in another form outside the measurement arena. While discussing the validity of a theory, Lacity and Jansen (1994) define validity as making common sense and being persuasive and seeming right to the reader. For Polkinghorne (1988), the validity of a theory refers to results that have the appearance of truth or reality. Many times, however, professional knowledge is counter-common sense: the internal structure of things may not concur with the appearance.

The criteria of validity in research should go beyond "face," "appearance," and "common sense."

• Content validity: draw an inference from test scores to a large domain of items similar to those on the test. Content validity is concerned with sample-population representativeness, i.e. the knowledge and skills covered by the test items should be representative of the larger domain of knowledge and skills. For example, computer literacy includes skills in operating systems, word processing, spreadsheets, databases, graphics, the internet, and many others. However, it is difficult, if not impossible, to administer a test covering all aspects of computing. Therefore, only several tasks are sampled from the population of computer skills.

Content validity is usually established by content experts. Take computer literacy as an example again: a test of computer literacy should be written or reviewed by computer science professors, because it is assumed that computer scientists know what is important in their discipline. At first glance, this approach looks similar to the validation process of face validity, but there is a difference. In content validity, evidence is obtained by looking for agreement in judgments by judges; in short, face validity can be established by one person, but content validity should be checked by a panel.

However, this approach has some drawbacks. First, it is not uncommon that some tests written by content experts are extremely difficult; experts tend to take their knowledge for granted and forget how little other people know. Second, very often content experts fail to identify the learning objectives of a subject. Take the following question in a philosophy test as an example:

What is the time period of the philosopher Epicurus?

a. 341-270 BC
b. 331-232 BC
c. 280-207 BC
d. None of the above

This type of question tests the ability to memorize historical facts, but not philosophizing. The content expert may argue that "historical facts" are important for a student to further understand philosophy. Let's change the subject to computer science and statistics. Look at the following two questions:

When was the founder and CEO of Microsoft, William Gates III, born?
a. 1949
b. 1953
c. 1957
d. None of the above

Which of the following statements is true about ANOVA?
a. It was invented by R. A. Fisher in 1914
b. It was invented by R. A. Fisher in 1920
c. It was invented by Karl Pearson in 1920
d. None of the above

Any computer scientist or statistician would be hard pressed to accept that the above questions fulfill content validity. As a matter of fact, the memorization approach is a common practice among instructors. Further, sampling knowledge from a larger domain of knowledge involves subjective values. For example, a test regarding art history may include many questions on oil paintings, but fewer questions on watercolor paintings and photography, because of the perceived importance of oil paintings in art history.

Content validity is sample-oriented rather than sign-oriented (Goodenough, 1949). A behavior is viewed as a sample when it is a subgroup of the same kind of behaviors; a behavior is considered a sign when it is an indicator or a proxy of a construct. Construct validity and criterion validity, which will be discussed later, are sign-oriented because both of them indicate behaviors different from those of the test.

• Criterion: draw an inference from test scores to performance. A high score on a valid test indicates that the tester has met the performance criteria. Regression analysis can be applied to establish criterion validity: an independent variable serves as the predictor variable and a dependent variable as the criterion variable, and the correlation coefficient between them is called the validity coefficient, as sketched below.
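As a rough illustration of this regression approach, the sketch below regresses a criterion on a predictor using fabricated driving-simulator and road-test scores (the example developed in the next paragraph). The numbers and the simple-regression recipe are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
simulation = rng.normal(70, 10, size=n)                  # predictor: simulator score (made up)
road = 5 + 0.9 * simulation + rng.normal(0, 6, size=n)   # criterion: road-test score (made up)

slope, intercept = np.polyfit(simulation, road, 1)       # simple linear regression
validity_coef = np.corrcoef(simulation, road)[0, 1]      # the validity coefficient
print(f"road = {intercept:.1f} + {slope:.2f} * simulation, r = {validity_coef:.2f}")
```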

For instance, scores on the driving test by simulation are the predictor variable while scores on the road test are the criterion variable. It is hypothesized that if the tester passes the simulation test, he or she should meet the criterion of being a safe driver. In other words, if the simulation test scores can predict the road test scores in a regression model, the simulation test is claimed to have a high degree of criterion validity.

According to Hunter and Schmidt (1990), criterion validity is about prediction rather than explanation. Prediction is concerned with non-causal or mathematical dependence, whereas explanation pertains to causal or logical dependence. For example, one can predict the weather based on the height of mercury inside a thermometer. In this case, the height of mercury could satisfy criterion validity as a predictor. However, one cannot explain why the weather changes by the change of mercury height. Because of this limitation of criterion validity, an evaluator has to conduct construct validation.

• Construct: draw an inference from test scores to a psychological construct. Because it is concerned with abstract and theoretical constructs, construct validity is also known as theoretical construct validity. Construct validity can be measured by the correlation between the intended independent variable (construct) and the proxy independent variable (indicator, sign) that is actually used. For example, an evaluator wants to study the relationship between general cognitive ability and job performance, but may not be able to administer a cognitive test to every subject. Thus, he can use a proxy variable such as "amount of education" as an indirect indicator of cognitive ability. After he administers a cognitive test to a portion of all subjects and finds a strong correlation between general cognitive ability and amount of education, the latter can be used with the larger group because its construct validity is established. However, construct validity is a quantitative question rather than a qualitative distinction such as "valid" or "invalid"; it is a matter of degree.
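A toy version of this proxy-indicator check might look like the following. The cognitive-test scores and years of education are simulated, so the correlation only illustrates the kind of evidence an evaluator would look for; it is not an estimate from real data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
cognitive = rng.normal(100, 15, size=n)                                  # cognitive test scores
education = 12 + 0.08 * (cognitive - 100) + rng.normal(0, 1.5, size=n)   # years of schooling

# A sizeable correlation is the kind of evidence that would justify using
# education as a proxy indicator for the cognitive-ability construct.
r = np.corrcoef(cognitive, education)[0, 1]
print(f"construct-indicator correlation r = {r:.2f}")
```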

Other authors (e.g. Angoff, 1988; Cronbach & Quirk, 1976) argue that construct validity cannot be expressed in a single coefficient; there is no mathematical index of construct validity. Rather, the nature of construct validity is qualitative. There are two types of indicators:
○ Reflective indicator: the effect of the construct.
○ Formative indicator: the cause of the construct.
When an indicator is expressed in terms of multiple items of an instrument, factor analysis is used for construct validation (see the sketch at the end of this section).

Test bias is a major threat against construct validity, and therefore test bias analyses should be employed to examine the test items (Osterlind, 1983). The presence of test bias definitely affects the measurement of the psychological construct. However, the absence of test bias does not guarantee that the test possesses construct validity. In other words, the absence of test bias is a necessary, but not a sufficient, condition.

• Construct validation as unification: The criterion and the content models tend to be empirical-oriented while the construct model is inclined to be theoretical. Nevertheless, all models of validity require some form of interpretation: What is the test measuring? Can it measure what it intends to measure? In standard scientific inquiries, it is important to formulate an interpretative (theoretical) framework clearly and then to subject it to empirical challenges. In this sense, theoretical construct validation is considered to function as a unified framework for validity (Kane, 2001).
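Returning to the factor-analysis point above: the sketch below simulates six reflective indicators of two hypothetical constructs and checks whether an exploratory factor analysis recovers the two-factor structure. It assumes scikit-learn is available, and both the data and the loadings are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
anxiety = rng.normal(size=n)      # latent construct 1 (simulated)
enjoyment = rng.normal(size=n)    # latent construct 2 (simulated)

# Items 1-3 are written as reflective indicators of anxiety, items 4-6 of enjoyment.
items = np.column_stack([anxiety + rng.normal(scale=0.5, size=n) for _ in range(3)] +
                        [enjoyment + rng.normal(scale=0.5, size=n) for _ in range(3)])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
# Each row of components_ is a factor; items written for the same construct
# should load together on one row (up to sign and rotation).
print(np.round(fa.components_, 2))
```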

A modified view of reliability (Moss, 1994)

• There can be validity without reliability if reliability is defined as consistency among independent measures. Indeed, it has been a tradition that multiple factors are introduced into a test to improve validity but decrease internal-consistency reliability.
• As assessment becomes less standardized, distinctions between reliability and validity blur. In many situations, such as searching for a faculty candidate or conferring a graduate degree, committee members are not trained to agree on a common set of criteria and standards.
• Inconsistency in students' performance across tasks does not invalidate the assessment. Rather, it becomes an empirical puzzle to be solved by searching for a more comprehensive interpretation, blending psychometrics and hermeneutics, in which a holistic and integrative approach to understand the whole in light of its parts is used.
• Initial disagreement (e.g. among students, teachers, and parents in responsive evaluation) would not invalidate the assessment. Rather, it would provide an impetus for dialog.
• Reliability is an aspect of construct validity.

An extended view of Moss's reliability (Mislevy, 2004)

• Being inspired by Moss, Mislevy went further to ask whether there can be reliability without reliability (indices). Mislevy demanded that psychometricians think about what they intend to make inferences about.
• In many cases we don't present just one argument; rather, problem solving involves arguments or chains of reasoning with massive evidence. Off-the-shelf inferential machinery (e.g. computing reliability indices) may fail if we quantify things or tasks that we don't know much about.
• Probability-based reasoning applied to more complex assessments based upon cognitive psychology is needed.

A radical view of reliability (Thompson et al., 2003)

• Tests are not reliable: reliability is not a property of the test; rather, it is attached to the property of the data. In this sense, psychometrics is "datametrics."
• It is important to explore reliability in virtually all studies.
• Reliability generalization, which can be used in a meta-analysis application similar to validity generalization, should be implemented to assess variance in measurement error across studies.

Li (2003) argued that the preceding view is incorrect:

• Reliability should be defined in terms of classical test theory: the squared correlation between observed and true scores, or the proportion of true variance in obtained test scores.
• Reliability is a unitless measure and thus it is already model-free or standard-free.
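Li's classical-test-theory definition can be illustrated with a small simulation. The true-score and error variances below are assumed values; the point is only that the proportion of true variance and the squared observed-true correlation give the same reliability.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true = rng.normal(50, 10, size=n)      # true scores T (assumed SD = 10)
error = rng.normal(0, 5, size=n)       # random error E (assumed SD = 5)
observed = true + error                # observed scores X = T + E

reliability_variance = true.var() / observed.var()            # proportion of true variance
reliability_corr_sq = np.corrcoef(observed, true)[0, 1] ** 2   # squared obs-true correlation
print(round(reliability_variance, 3), round(reliability_corr_sq, 3))  # both roughly 0.80
```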

An updated perspective of reliability (Cronbach, 2004)

In a 2004 article, Lee Cronbach, the inventor of Cronbach Alpha as a way of measuring reliability, reviewed the historical development of Cronbach Alpha. He asserted, "I no longer regard the formula (of Cronbach Alpha) as the most appropriate way to examine most data. Over the years, my associates and I developed the complex generalizability (G) theory" (p. 403). Discussion of the G theory is beyond the scope of this document. Nevertheless, Cronbach did not object to the use of Cronbach Alpha, but he recommended that researchers take the following into consideration while employing this approach:

• Standard error of measurement: it is the most important piece of information to report regarding the instrument, not a coefficient (see the sketch after this list).
• Independence of sampling
• Heterogeneity of content
• How the measurement will be used: decide whether future uses of the instrument are likely to be exclusively for absolute decisions, for differential decisions, or both.
• Number of conditions for the test
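For the first recommendation, a common textbook formula for the standard error of measurement is SEM = SD * sqrt(1 - reliability). The sketch below applies it to made-up values; the formula is standard classical test theory, not something quoted from Cronbach's article.

```python
import math

sd_observed = 12.0    # standard deviation of observed scores (hypothetical)
reliability = 0.84    # e.g. a Cronbach's Alpha estimate (hypothetical)

sem = sd_observed * math.sqrt(1 - reliability)
print(f"SEM = {sem:.2f}")   # under normal errors, an observed score falls within
                            # about +/- 1 SEM of the true score roughly 68% of the time
```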

A critical view of validity (Pedhazur & Schmelkin, 1991)

• Content validity is not a type of validity at all, because validity refers to inferences made about scores, not to an assessment of the content of an instrument.
• There is no sharp distinction between test content and test construct; the very definition of a construct implies a domain of content.

A modified view of validity (Messick, 1995)

The conventional view (content, criterion, construct) is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use. Validity is not a property of the test or assessment, but rather of the meaning of the test scores. Messick identified six aspects of validity:

• Content: evidence of content relevance, representativeness, and technical quality
• Substantive: theoretical rationale
• Structural: the fidelity of the scoring structure
• Generalizability: generalization to the population and across populations
• External: applications to multitrait-multimethod comparison
• Consequential: bias, fairness, and justice; the social consequences of the assessment to society

Critics argued that consequences should not be a component of validity, because test developers should not be held responsible for the consequences of misuse; accountability should lie with the misuser. Messick (1998) counter-argued that social consequences of score interpretation include the value implications of the construct label, which may or may not be commensurate with the construct's trait implications and need to be addressed in appraising score meaning. While test developers should not be accountable for the misuse of tests, they should still pay attention to the unanticipated consequences of legitimate score interpretation.

A different view of reliability and validity (Salvucci, Walter, Conley, Fink, & Saba, 1997)

Some scholars argue that the traditional view that "reliability is a necessary but not a sufficient condition of validity" is incorrect. This school of thought conceptualizes reliability as invariance and validity as unbiasedness. A sample statistic may have an expected value over samples equal to the population parameter (unbiasedness), but have very high variance from a small sample size; in this view, a measure can be unreliable (high variance) but still valid (unbiased). Conversely, a sample statistic can have very low sampling variance but an expected value far departed from the population parameter (high bias): reliable but invalid.
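This invariance-versus-unbiasedness framing can be mimicked with a toy simulation of two estimators. All the numbers are invented, and the labels simply echo the distinction drawn above.

```python
import numpy as np

rng = np.random.default_rng(7)
parameter = 100.0    # the population value we want to estimate
draws = 10_000

unbiased_noisy = rng.normal(parameter, 15, size=draws)      # centered on target, high variance
biased_precise = rng.normal(parameter + 8, 2, size=draws)   # off target, low variance

for label, est in [("unreliable but valid", unbiased_noisy),
                   ("reliable but invalid", biased_precise)]:
    print(f"{label}: bias = {est.mean() - parameter:+.2f}, SD = {est.std():.2f}")
```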

[Figure: population parameter (red line) = sample statistic (yellow line), with high variance (green line) --> unbiased: unreliable but valid]
[Figure: population parameter (red line) <> sample statistic (yellow line), with low variance (green line) --> biased: invalid but reliable]

Caution and advice

There is a common misconception that if someone adopts a validated instrument, he or she does not need to check the reliability and validity with his or her own data. Imagine this: when I buy a drug that has been approved by the FDA and my friend asks me whether it heals me, I tell him, "I am taking a drug approved by the FDA and therefore I don't need to know whether it works for me or not!" A responsible evaluator should still check the instrument's reliability and validity with his or her own subjects and make any modifications if necessary.

In the pretest, where subjects are not exposed to the treatment and thus are unfamiliar with the subject matter, a low reliability caused by random guessing is expected; low reliability is less detrimental to the performance pretest. Still, low reliability is a signal of high measurement error, which reflects a gap between what students actually know and what scores they receive. One easy way to overcome this problem is to include "I don't know" among the multiple choices. In an experimental setting where students' responses would not affect their final grades, the experimenter should explicitly instruct students to choose "I don't know" instead of making a guess if they really don't know the answer. The choice "I don't know" can help close this gap.

Last Updated: 2008

http://www.creative-wisdom.com/teaching/assessment/reliability.html

Reliability - overview

Reliability is the extent to which a test is repeatable and yields consistent scores. An observed test score is made up of the true score plus measurement error. All measurement procedures have the potential for error, so the aim is to minimize it. Measurement errors are essentially random: a person's test score might not reflect the true score because they were sick, anxious, hungover, in a noisy room, etc. The goal of estimating reliability (consistency) is to determine how much of the variability in test scores is due to measurement error and how much is due to variability in true scores.

Reliability can be improved by:
• getting repeated measurements using the same test, and
• getting many different measures using slightly different techniques and methods.

You would not consider one multiple-choice exam question to be a reliable basis for testing your knowledge of "individual differences". Consider that university assessment for grades involves several sources: many questions are asked in many different formats (e.g. exam, essay, presentation) to help provide a more reliable score.

Note: In order to be valid, a test must be reliable, but reliability does not guarantee validity.
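The "repeated measurements" point can be illustrated with a quick simulation: averaging several noisy parallel measures of the same true score yields a composite that tracks the true score more closely. The variances below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50_000, 4
true = rng.normal(0, 1, size=n)                            # true scores

single = true + rng.normal(0, 1, size=n)                   # one noisy measurement
repeated = true[:, None] + rng.normal(0, 1, size=(n, k))   # k parallel measurements
averaged = repeated.mean(axis=1)                           # their average

rel_single = np.corrcoef(single, true)[0, 1] ** 2
rel_average = np.corrcoef(averaged, true)[0, 1] ** 2
print(round(rel_single, 2), round(rel_average, 2))         # roughly 0.50 vs 0.80
```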

Types of reliability

There are several types of reliability, and there are a number of ways to ensure that a test is reliable. I'll mention a few of them now:

1. Test-retest reliability: The test-retest method of estimating a test's reliability involves administering the test to the same group of people at least twice. Then the first set of scores is correlated with the second set of scores; the correlation between the two scores is the estimate of the test reliability. Correlations range between 0 (low reliability) and 1 (high reliability); it is highly unlikely they will be negative! Remember that change might be due to measurement error: if you use a tape measure to measure a room on two different days, any difference in the result is likely due to measurement error rather than a change in the room size. However, if you measure children's reading ability in February and then again in June, the change is likely due to changes in the children's reading ability. Also, the actual experience of taking the test can have an impact (called reactivity), e.g. a history quiz: look up the answers and do better next time. Examinees also might remember their original answers.

2. Alternate forms: Administer Test A to a group and then administer Test B to the same group, and correlate the two sets of scores.

3. Split-half reliability: the relationship between half the items and the other half.

4. Inter-rater reliability: Compare scores given by different raters. For important work in higher education (e.g. theses), there are multiple markers to help ensure accurate assessment by checking inter-rater reliability.

5. Internal consistency: A good test must also be internally consistent. Internal consistency is commonly measured as Cronbach's Alpha (based on inter-item correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar items, the greater the internal consistency. That's why you sometimes get very long scales asking a question a myriad of different ways: if you add more items you get a higher Cronbach's alpha (see the sketch after this list). Generally, an alpha of .80 is considered a reasonable benchmark.
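The "more items, higher alpha" point in item 5 is often quantified with the Spearman-Brown prophecy formula. That formula is not named in these notes; it is a standard tool offered here as an aside, and the reliabilities below are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened (or shortened) by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current_alpha = 0.70                                  # hypothetical 10-item scale
print(round(spearman_brown(current_alpha, 2.0), 2))   # doubling the items -> about 0.82
print(round(spearman_brown(current_alpha, 0.5), 2))   # a half-length form  -> about 0.54
```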

How reliable should tests be? Some reliability guidelines:
• .90 = high reliability
• .80 = moderate reliability
• .70 = low reliability

High reliability is required when tests are used to make important decisions, or when individuals are sorted into many different categories based upon relatively small individual differences (e.g. intelligence). Note: most standardized tests of intelligence report reliability estimates around .90 (high).

Lower reliability is acceptable when tests are used for preliminary rather than final decisions, or when tests are used to sort people into a small number of groups based on gross individual differences (e.g. height or sociability/extraversion). For most testing applications, reliability estimates of .80 or higher are typically regarded as moderate to high (approx. 16% of the variability in test scores attributable to error), while estimates below .70 are usually regarded as low.

Levels of reliability typically reported for different types of tests and measurement devices are shown in Table 7-6 of Murphy and Davidshofer (2001, p. 142). For most psychological tests, reliability is higher than validity; even a coefficient of .70 represents only about 49% consistent variation (.7 to the power of 2), which underscores the need for additional information, such as interviews.

Importance

Reliability and validity are crucial to quality psychological testing. Reliability refers to whether a test is consistent, over time, in its results. Validity refers to whether test results describe a person's actual behavior. If reliability isn't present, then the test can't be trusted to make predictions about behavior in a real-world setting. If validity isn't present, then the test can't be trusted to make valid assessments in a clinical setting.

Read more: The Reliability & Validity of Psychological Tests | eHow.com http://www.ehow.com/facts_7282618_reliability-validity-psychological-tests.html#ixzz17cE1VCcH