CHAPTER 5: Reliability

Reliability
- Synonym for dependability or consistency.
- In the psychometric sense, it refers only to something that is consistent—not necessarily consistently good or bad, but simply consistent.
- Not an all-or-none matter; a test may be reliable in one context and unreliable in another. There are different types and degrees of reliability.

Reliability Coefficient
- An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance.

The Concept of Reliability
- A score on an ability test reflects not only the test taker's true score on the ability being measured but also error.
- Error refers to the component of the observed test score that does not have to do with the test taker's ability.
- X = T + E
  o X = observed score
  o T = true score
  o E = error
- A statistic useful in describing sources of test score variability is the variance (σ²).
- Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.
- σ² = σ²tr + σ²e
  o In this equation, the total variance in an observed distribution of test scores (σ²) equals the sum of the true variance (σ²tr) plus the error variance (σ²e).
- The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test.
  o True differences are assumed to be stable; they will yield consistent scores on repeated administrations of the same test as well as on equivalent forms of the test.
  o Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.
- The term measurement error refers to all of the factors associated with the process of measuring some variable, other than the variable being measured. (Ex: administering an English-language test on the subject of 12th-grade algebra to students from China)
  o Measurement error can be categorized as being either random or systematic.
  o Random error – a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as "noise," this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. (Ex: unanticipated events in the vicinity of the test environment, like a rally or a lightning strike; unanticipated physical events happening within the test taker, like a surge in blood pressure.)
  o Systematic error – a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. Once a systematic error becomes known, it becomes predictable—as well as fixable. Note also that a systematic source of error does not affect score consistency: a systematic error would not change the variability of the distribution or affect the measured reliability of the instrument. (Ex: in the end, the individual crowned "the biggest loser" would indeed be the contestant who lost the most weight—it's just that he or she would actually weigh 5 pounds more than the weight measured by the show's official scale.)

Sources of Error Variance
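The relationship X = T + E and the definition of reliability as the ratio of true variance to total variance can be sketched numerically. This is an illustrative toy example, not from the chapter: the true scores and errors below are made-up numbers, constructed so the errors average zero and are uncorrelated with the true scores (as the model assumes).

```python
# Illustrative sketch (not from the chapter): reliability as the ratio of
# true variance to total variance, using made-up scores for five test takers.

def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

true_scores = [80, 85, 90, 95, 100]   # T: stable true differences
errors      = [2, -2, 0, -2, 2]       # E: random error, mean zero
observed    = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

total_var = variance(observed)   # sigma^2
true_var  = variance(true_scores)  # sigma^2_tr
error_var = variance(errors)       # sigma^2_e

# Reliability = true variance / total variance (about .94 here)
reliability = true_var / total_var
print(reliability)
```

Because the made-up errors are uncorrelated with the true scores, the total variance here equals the true variance plus the error variance exactly, matching σ² = σ²tr + σ²e.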
- Include test construction, administration, scoring, and/or interpretation.
- Test construction – One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.
  o The extent to which a test taker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance.
  o From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.
- Test administration – Sources of error variance that occur during test administration may influence the test taker's attention or motivation. The test taker's reactions to those influences are the source of one kind of error variance.
  o Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee's face.
  o External to the test environment in a global sense, the events of the day may also serve as a source of error. So, for example, test results may vary depending upon whether the test taker's country is at war or at peace (Gil et al., 2016). A variable of interest when evaluating a patient's general level of suspiciousness or fear is the patient's home neighborhood and lifestyle.
  o Other potential sources of error variance during test administration are test taker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of test taker-related error variance.
  o Examiner-related variables are potential sources of error variance. The examiner's physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. Clearly, the level of professionalism exhibited by examiners is a source of error variance.
- Test scoring and interpretation – In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.
  o Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data.
  o If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally are still confronted by situations where an examinee's response lies in a gray area.
- Other sources of error – Surveys and polls are two tools of assessment commonly used by researchers who study public opinion. In the political arena, for example, researchers trying to predict who will win an election may sample opinions from representative voters and then draw conclusions based on their data.
  o The error in such research may be a result of sampling error—the extent to which the population of voters in the study actually was representative of voters in the election.
  o Alternatively, the researchers may have gotten such factors right but simply did not include enough people in their sample to draw the conclusions that they did. This brings us to another type of error, called methodological error.
  o Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship. A number of studies (O'Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested that underreporting or over-reporting of perpetration of abuse may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors, and over-report abuse if they are seeking help.

Reliability Estimates

Test-Retest Reliability Estimates
- One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
- Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.
- It is generally the case (although there are exceptions) that, as the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases. The passage of time can be a source of error variance: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
- When the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
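The test-retest method described above amounts to computing a Pearson r between two administrations of the same test. A minimal sketch with made-up scores for six test takers:

```python
# Illustrative sketch (not from the chapter): test-retest reliability as the
# Pearson r between two administrations. All scores are invented.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 11, 18, 14, 16]   # first administration
time2 = [13, 14, 12, 17, 15, 17]   # same test, e.g., two weeks later

r_test_retest = pearson_r(time1, time2)   # .925 with these made-up scores
print(r_test_retest)
```

With a longer interval between administrations, intervening events (learning, therapy, developmental change) would tend to pull this coefficient down, as the notes describe.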
  o An estimate of test-retest reliability from a math test might be low if the test takers took a math tutorial before the second test was administered. An estimate of test-retest reliability from a personality profile might be low if the test taker suffered some emotional trauma or received counseling during the intervening period.
- A low estimate of test-retest reliability might be found even when the interval between testings is relatively brief. This may well be the case when the testings occur during a time of great developmental change with respect to the variables the test is designed to assess.
- An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste).
  o However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.

*Psychology's Replicability Crisis
- A low replication rate helped confirm that science indeed had a problem with replicability, the seriousness of which is reflected in the term replicability crisis.
- Contributing factors include (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies.
  o Replication by independent parties provides for confidence in a finding, reducing the likelihood of experimenter bias and statistical anomaly. Indeed, had scientists been as focused on replication as they were on hunting down novel results, the field would likely not be in crisis now.
  o Positive findings typically entail a rejection of the null hypothesis. In essence, from the perspective of most journals, rejecting the null hypothesis as a result of a research study is a newsworthy event. By contrast, accepting the null hypothesis might just amount to "old news."
- Moreover, replication efforts—beyond even that of the Open Science Collaboration—are becoming more common (Klein et al., 2013).
- Overall, it appears that most scientists now recognize replicability as a concern that needs to be addressed with meaningful changes to what has constituted "business-as-usual" for so many years.

Parallel-Forms and Alternate-Forms Reliability Estimates
- The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
- Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures.
  o The term parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
- Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation "parallel," alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
  o The term alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
- Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: (1) two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice).
- An additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient.
- Developing alternate forms of tests can be time-consuming and expensive. Imagine what might be involved in trying to create sets of equivalent items and then getting the same people to sit for repeated administrations of an experimental test! On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways.
- An estimate of the reliability of a test can also be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency.
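As a rough sketch of the ideas above, one can check whether two forms have approximately equal means and variances (the parallel-forms requirement) and then compute the coefficient of equivalence as the correlation between the two forms. All numbers below are invented:

```python
# Illustrative sketch (not from the chapter): checking the parallel-forms
# requirement and computing a coefficient of equivalence. Made-up scores for
# six test takers who took both forms.

def mean(xs):
    return sum(xs) / len(xs)

def pvar(xs):
    """Population variance."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(x, y):
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

form_a = [50, 55, 60, 65, 70, 75]
form_b = [52, 54, 61, 64, 71, 74]

# Parallel forms require (approximately) equal means and variances.
print(mean(form_a), mean(form_b))
print(pvar(form_a), pvar(form_b))

# Coefficient of equivalence: correlation between the two forms.
r_equivalence = pearson_r(form_a, form_b)
print(r_equivalence)
```

Strictly parallel forms would have exactly equal means and variances; these made-up forms only approximate that, which is closer to the "alternate forms" situation the notes describe.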
Split-Half Reliability Estimates
- An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).
  o Step 1. Divide the test into equivalent halves.
  o Step 2. Calculate a Pearson r between scores on the two halves of the test.
  o Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).
- Simply dividing the test in the middle is not recommended because it's likely that this procedure would spuriously raise or lower the reliability coefficient.
- One acceptable way to split a test is to randomly assign items to one or the other half of the test.
- Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability.
- Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty.
- In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "mini-parallel-forms," with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.
- The third step requires the computation of the Spearman–Brown formula, which allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items.
- A formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown (rSB) formula is:

  rSB = (n × rxy) / [1 + (n − 1) × rxy]

  o Where rSB is equal to the reliability adjusted by the Spearman–Brown formula, rxy is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.
- By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of the whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores on the two half tests:

  rSB = (2 × rhh) / (1 + rhh)

- Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test.
- If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test's reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations.
- A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.

Other Methods of Estimating Internal Consistency
- Other formulas for estimating internal consistency were developed by Kuder and Richardson (1937) and Cronbach (1951).

Inter-item consistency
- Refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.
  o Tests are said to be homogeneous if they contain items that measure a single trait. As an adjective used to describe test items, homogeneity (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.
  o In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait.
- The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it can be expected to contain more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
  o A homogeneous test is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable.

The Kuder-Richardson Formulas
- Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson to develop their own measures for estimating reliability.
- The most widely known of the many formulas they collaborated on is their Kuder–Richardson formula 20, or KR-20, so named because it was the 20th formula developed in a series.
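The three split-half steps and the Spearman–Brown adjustment described earlier in this section can be sketched as follows, using an odd-even split on a made-up 0/1 item-score matrix (rows are test takers, columns are items):

```python
# Illustrative sketch (not from the chapter): odd-even split-half reliability
# with the Spearman-Brown adjustment. All item scores are invented.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

scores = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
]

# Step 1: split into halves (odd-numbered items = indexes 0, 2, 4, ...).
odd_totals  = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

# Step 2: Pearson r between the two half-test scores.
r_hh = pearson_r(odd_totals, even_totals)

# Step 3: Spearman-Brown adjustment with n = 2 (whole test is twice as long).
r_sb = (2 * r_hh) / (1 + r_hh)
print(r_hh, r_sb)
```

Note that the adjusted whole-test estimate (r_sb) comes out higher than the half-test correlation (r_hh), consistent with the point that reliability usually increases with test length.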
  o KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
  o KR-20 will yield a lower reliability estimate than the split-half method if the test items are more heterogeneous.
  o KR-20 is computed as:

  rKR20 = [k / (k − 1)] × [1 − (Σpq / σ²)]

  o Where:
    rKR20 = Kuder–Richardson formula 20 reliability coefficient
    k = number of test items
    σ² = variance of total test scores
    p = proportion of test takers who pass the item
    q = proportion of people who fail the item
    Σpq = sum of the pq products over all items
- The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty.
  o Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.
- The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α-20.

Coefficient Alpha
- Developed by Cronbach (1951) and subsequently elaborated on by others, it may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
- Appropriate for use on tests containing nondichotomous items. It is computed as:

  rα = [k / (k − 1)] × [1 − (Σσᵢ² / σ²)]

  o Where:
    rα = coefficient alpha
    k = number of items
    σᵢ² = variance of one item
    Σσᵢ² = sum of the variances of each item
    σ² = variance of the total test scores
- The preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003).
- It is widely used as a measure of reliability, in part because it requires only one administration of the test.
- Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are.
- Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001).
- Also, a myth about alpha is that "bigger is always better." As Streiner (2003b) pointed out, a value of alpha above .90 may be "too high" and indicate redundancy in the items.

Average Proportional Distance (APD)
- Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach's alpha), the APD is a measure that focuses on the degree of difference that exists between item scores.
- Accordingly, we define the average proportional distance method as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
- The general "rule of thumb" for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value of .2 to .25 is in the acceptable range. A calculated APD of .25 is suggestive of problems with the internal consistency of the test.
- One potential advantage of the APD method over using Cronbach's alpha is that the APD index is not connected to the number of items on a measure. Cronbach's alpha will be higher when a measure has more than 25 items (Cortina, 1993).

Measures of Inter-Scorer Reliability
- Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than of anything else.
- Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy (Smith, 1986).
- Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.

The Importance of the Method Used for Estimating Reliability
- A published study by Chmielewski et al. (2015) highlighted the substantial influence that differences in method can have on estimates of inter-rater reliability.
- Diagnostic reliability must be acceptably high in order to accurately identify risk factors for a disorder that are common to subjects in a research study.
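A minimal sketch of KR-20 and coefficient alpha on a made-up matrix of dichotomous item scores. Because the variance of a 0/1 item equals pq, the two statistics coincide here, which illustrates why alpha is described as a variant of KR-20:

```python
# Illustrative sketch (not from the chapter): KR-20 and coefficient alpha on
# a made-up 0/1 item-score matrix (rows = test takers, columns = items).

def pvariance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

scores = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
n = len(scores)                          # number of test takers
k = len(scores[0])                       # number of items
totals = [sum(row) for row in scores]    # total score per test taker
total_var = pvariance(totals)            # variance of total test scores

# KR-20: rKR20 = (k / (k - 1)) * (1 - sum(pq) / total_var)
p = [sum(row[j] for row in scores) / n for j in range(k)]  # pass proportions
sum_pq = sum(pj * (1 - pj) for pj in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / total_var)

# Coefficient alpha: ra = (k / (k - 1)) * (1 - sum(item variances) / total_var)
item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(kr20, alpha)   # identical for dichotomous items
```

The low value obtained here (about .52) is unsurprising for a 4-item toy test; it echoes the earlier point that, other things being equal, longer tests tend to be more reliable.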
- The results suggest that the reliability of diagnoses is far lower than commonly believed. Moreover, the results demonstrate the substantial influence that method has on estimates of diagnostic reliability even when other factors are held constant.
- The utility and validity of a particular diagnosis itself can be called into question if expert diagnosticians cannot, for whatever reason, consistently agree on who should and should not be so diagnosed.

Using and Interpreting a Coefficient of Reliability
- Reliability is a mandatory attribute in all tests we use.

The Purpose of the Reliability Coefficient
- Transient error – a source of error attributable to variations in the test taker's feelings, moods, or mental state over time. Then again, this 5% of the error may be due to other factors that are yet to be identified.

The Nature of the Test
- Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether (1) the test items are homogeneous or heterogeneous in nature; (2) the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) the range of test scores is or is not restricted; (4) the test is a speed or a power test; and (5) the test is or is not criterion-referenced.
- Homogeneity versus heterogeneity of the test items. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency.
- Dynamic versus static characteristics. A static characteristic is unchanging, such as intelligence, whereas a dynamic characteristic changes over time. Unlike in measuring static characteristics, a test-retest measure would be of little help in gauging the reliability of an instrument that measures a dynamic characteristic.
- The child tested just before and again just after a developmental advance may perform very differently on the two testings. In such cases, a marked change in test score might be attributed to error when in reality it reflects a genuine change in the test taker's skills.
- Restriction or inflation of range. Referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance).
  o If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
  o If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
- Speed tests versus power tests. A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula.
- Criterion-referenced tests. Designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or a vocational objective.
  o Some traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
  o In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.

The True Score Model of Measurement and Alternatives to It
- Classical Test Theory (CTT) is also known as the true score (or classical) model of measurement.
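The restriction-of-range effect noted above can be illustrated with simulated paired scores: correlating only the upper part of one variable's distribution yields a lower coefficient than correlating the full range. All numbers below are invented:

```python
# Illustrative sketch (not from the chapter): restriction of range lowers an
# observed correlation. The "restricted" sample keeps only test takers whose
# x score is 60 or above, as a selected (range-restricted) group might.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

pairs = [(40, 42), (45, 50), (50, 48), (55, 60), (60, 58),
         (65, 70), (70, 66), (75, 78), (80, 74), (85, 86)]

x_full = [p[0] for p in pairs]
y_full = [p[1] for p in pairs]

restricted = [p for p in pairs if p[0] >= 60]   # only the upper range of x
x_r = [p[0] for p in restricted]
y_r = [p[1] for p in restricted]

r_full = pearson_r(x_full, y_full)
r_restricted = pearson_r(x_r, y_r)
print(r_full, r_restricted)   # restricted sample yields the lower r
```

This is the same phenomenon that makes validity and reliability coefficients computed on pre-selected groups (e.g., admitted students only) look weaker than they would in the full applicant pool.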
  o One of the reasons CTT has remained the most widely used model has to do with its simplicity, especially when one considers the complexity of other proposed models of measurement.
- CTT defines the true score as a value that genuinely reflects an individual's ability (or trait) level as measured by a particular test.
- One's true score on one test of extraversion, for example, may not bear much resemblance to one's true score on another test of extraversion.
  o Comparing a test taker's scores on two different tests purporting to measure the same thing requires a sophisticated knowledge of the properties of each of the two tests, as well as some rather complicated statistical procedures designed to equate the scores.
- For starters, one problem with CTT has to do with its assumption concerning the equivalence of all items on a test: all items are presumed to be contributing equally to the score total.
- Another problem has to do with the length of tests that are developed using a CTT model. Whereas test developers favor shorter rather than longer tests (as do most test takers), the assumptions inherent in CTT favor the development of longer rather than shorter tests.
- For these reasons, as well as others, alternative measurement models have been developed.

Domain sampling theory
- The 1950s saw the development of a viable alternative to CTT. It was originally referred to as domain sampling theory and is better known today in one of its many modified forms as generalizability theory.
- Domain sampling theory rebels against the concept of a true score existing with respect to the measurement of psychological constructs.
- Proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. (Contrast this with CTT, in which error is simply the portion of the test score not attributable to the true score.)
- Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

Generalizability theory
- In one modification of domain sampling theory called generalizability theory, a "universe score" replaces that of a "true score" (Shavelson et al., 1989).
- Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score.
  o This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
- According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.
- Cronbach (1970): The person will ordinarily have a different universe score for each universe. Mary's universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May. . . . Some testers call the average over a large number of comparable observations a "true score"; e.g., "Mary's true typing rate on 3-minute tests." Instead, we speak of a "universe score" to emphasize that what score is desired depends on the universe being considered. For any measure there are many "true scores," each corresponding to a different universe.
- When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is "accurate," or "reliable," or "generalizable."
- There is a different degree of generalizability for each universe.
- A generalizability study examines how
generalizable scores from a particular test are if the test is administered in different situations. o Examines how much of an impact different facets of the universe have on the test score. o Is the test score affected by the time of day in which the test is administered? The influence of particular facets on the test score is represented by coefficients of generalizability.
- After this, it is recommended that the test
developers do a decision study. o In the decision study, developers examine the usefulness of test scores in helping the test user make decisions. o Designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use. o Taking a better measure improves the sensitivity of an experiment in the same way that increasing the number of subjects does. - From the perspective of generalizability theory, a test’s reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.
Item Response Theory
- Provides a way to model the probability that a person with X ability will be able to perform at a level of Y.
- Also referred to as latent-trait theory, because the ability or trait being modeled is not physically observable (i.e., it is latent).
- Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination.
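One common way to express the item characteristics named above is the two-parameter logistic (2PL) item response function, a standard IRT model (the specific model and the parameter values below are not from these notes; they are a hedged sketch):

```python
# Illustrative sketch (not from the chapter): a 2PL item response function.
# P(theta) is the probability of a correct response given ability theta,
# item difficulty b, and item discrimination a. Parameter values are made up.

import math

def p_correct(theta, a, b):
    """2PL item response function: 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

easy_item = {"a": 1.0, "b": -1.0}   # low difficulty
hard_item = {"a": 1.0, "b": 1.5}    # high difficulty

# A person of average ability (theta = 0) is more likely to pass the easy item.
p_easy = p_correct(0.0, **easy_item)
p_hard = p_correct(0.0, **hard_item)
print(p_easy, p_hard)

# Difficulty b is the ability level at which P = .5; a higher discrimination a
# makes the curve steeper around b, separating ability levels more sharply.
```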