History and Theory of Reliability—Conceptualization of Error
• Measurement in psychology is more difficult than physical measurement of qualities such as height or length
• The difficulty comes from measuring complex traits that often cannot be directly observed but must be inferred, e.g., intelligence, motivation, self-concept, alexithymia, extraversion
  – Psychologists work with "rubber yardsticks"
  – Part of their task is to ask how much rubber is in their measurement tool
  – Measuring error is an important part of many sciences, not just psychology
• 1896: Karl Pearson develops the product moment correlation
• 1904: Charles Spearman publishes an article about reliability theory
• 1904: E. L. Thorndike publishes An Introduction to the Theory of Mental and Social Measurements
• Other experts developed different measures of reliability in the years that followed
Basics of Test Score Theory (1 of 3)
• Error reflects the difference between a true score and an observed score on some trait or quality
  – Observed score = True score + Error: X = T + E
• Classical test theory assumes that errors of measurement are random
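The X = T + E model above can be illustrated with a short simulation. This is a hypothetical sketch (the true score of 100 and error spread of 5 are assumed values, not from the text): one person is "retested" many times, each observed score being the true score plus random error.

```python
import random
import statistics

random.seed(0)
true_score = 100  # T: the person's (unobservable) true score
# X = T + E, with E drawn from a random error distribution
observed = [true_score + random.gauss(0, 5) for _ in range(10_000)]

# Because errors are random, they cancel out over many testings:
# the mean of the observed scores estimates the true score.
print(round(statistics.mean(observed), 1))
```

This is exactly the logic behind Figure 4.1: the center of the distribution of repeated observed scores estimates T, and the spread reflects E.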
Basics of Test Score Theory (2 of 3)
FIGURE 4.1 Distribution of observed scores for repeated testing of the same person. The mean of the distribution is the estimated true score, and the dispersion represents the distribution of random errors.
Basics of Test Score Theory (3 of 3)
• The notion of a "rubber yardstick" explains how distributions may be more or less varied (Figure 4.2)
• Standard error of measurement
FIGURE 4.2 Three distributions of observed scores. The far-left distribution reflects the greatest error, and the far-right distribution reflects the least.
• Using a limited number of measurements to describe a larger, more complex construct creates problems
• Measurements are taken from a smaller number of instances (a sample), and then a reliability estimate is conducted
  – How much error was created by not measuring all instances?
  – Compare sample variance to population variance
• Estimates are made about the scores of instances that are not directly measured
  – Different random samples can create different error
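The sample-versus-population variance comparison above can be made concrete. In this small sketch (the scores are hypothetical), the sample variance divides by N − 1 rather than N, which compensates for the error introduced by measuring only a sample of instances rather than the whole domain.

```python
import statistics

scores = [4, 7, 6, 5, 8, 6, 7, 5]  # hypothetical scores from a sample

pop_var = statistics.pvariance(scores)   # divides by N (treats data as the whole population)
samp_var = statistics.variance(scores)   # divides by N - 1 (estimates population variance from a sample)

# The sample estimate is slightly larger, reflecting sampling uncertainty.
print(pop_var, round(samp_var, 3))
```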
• Measures must have demonstrated reliability before they can be used to make employment or educational decisions
• Reliability estimates are correlations but can also be thought of as mathematical ratios
• Sources of error
  – Situational factors or characteristics of the test
• Test–retest reliability
• Parallel forms reliability
• Internal consistency reliability
• Compares scores on two different measures of the same quality
• Also called the equivalent forms method
• If the two forms are given on the same day, differences in scores can only reflect random error or differences between the two tests
  – A rigorous assessment of reliability
  – Generally underutilized
• The same test is administered to the same person at different points in time
  – Only useful when assessing traits that do not change over time (e.g., intelligence quotient)
  – Would not be useful for qualities that are known to change
• Involves calculating the correlation between the two measurements
• Carryover effects
  – Only an issue when changes over time are random
  – Include practice effects
• The interval between measurements must be considered
Split-Half Method
• A single test is split into two halves (e.g., odd versus even items), and the halves are correlated
  – Because each half contains only half as many items as the full test, the half-test correlation is an underestimate of true reliability
• For the sake of an example, assume the correlation (r) between the "odd" half and "even" half of the psychopathy scale is .360
• The Spearman–Brown formula corrects for the half length and increases the estimate of reliability
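The half-length correction can be sketched directly. The Spearman–Brown formula for estimating full-test reliability from a half-test correlation is 2r / (1 + r); applying it to the example value of .360:

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two half-tests."""
    return (2 * r_half) / (1 + r_half)

corrected = spearman_brown(0.360)
print(round(corrected, 3))  # 0.529
```

Note that the corrected estimate (.529) is substantially higher than the raw half-test correlation (.360), which is the point of the correction.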
• Kuder and Richardson (1937) developed methods for estimating reliability with a single test administration
• The KR20 formula calculates reliability when items are scored dichotomously (e.g., right or wrong)
• Refer to your text for a thorough description of the KR20 formula and the details associated with its calculation
• Produces reliability estimates similar to those of the split-half method but is considered more useful than that approach
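As the slide notes, the text's appendix is the authoritative treatment; the following is only a hedged sketch of the standard KR-20 computation, (k / (k − 1)) × (1 − Σpq / σ²), applied to a hypothetical matrix of dichotomous item scores (rows are examinees, columns are items scored 1 = correct, 0 = incorrect).

```python
def kr20(item_scores):
    """KR-20 reliability for dichotomously scored items."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])

    # Variance of total test scores (population form).
    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum of p*q across items: p = proportion passing, q = 1 - p.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

data = [  # hypothetical item responses for five examinees on four items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3))  # 0.8
```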
• Coefficient alpha (or Cronbach's alpha) assesses reliability for tests in which there is no single "correct" answer to every question (a limitation of the KR20)
  – Similar to the KR20 in other regards
• Like other measures of internal consistency, assesses whether the different items on a test all measure the same ability or trait
  – If they do not, internal consistency will be low
  – Factor analysis is more useful when such situations occur: it can divide the items into subgroups, and each factor may be internally consistent
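Coefficient alpha generalizes the KR-20 logic to items that need not be dichotomous, replacing Σpq with the sum of the item variances: α = (k / (k − 1)) × (1 − Σσᵢ² / σ²). A minimal sketch with hypothetical Likert-style responses (rows are examinees, columns are items):

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha for a matrix of item scores (people x items)."""
    n_items = len(item_scores[0])
    item_vars = [
        statistics.pvariance([person[j] for person in item_scores])
        for j in range(n_items)
    ]
    total_var = statistics.pvariance([sum(person) for person in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

data = [  # hypothetical responses of four people to three items
    [3, 4, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 5],
]
print(round(cronbach_alpha(data), 3))  # 0.9
```

When the items all track the same trait, item scores rise and fall together, total-score variance is large relative to the summed item variances, and alpha is high, as here.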
• Difference scores occur when one value is subtracted from another
• In a difference score, error makes up a larger portion of the score than in either original measure: the shared true-score component is partially subtracted out, while the errors of both measures remain, so the reliability of a difference score is lower
• Difference scores are most aptly calculated after converting each score to a Z score
• Because of their lower reliability, difference scores should not be used for interpreting patterns of scores
Reliability in Behavioral Observation Studies
• Observing behaviors is complicated
  – Samples are taken, since all behaviors cannot be monitored
• Errors occur when observations do not reflect true occurrences
• Interrater reliability may be used, in which a correlation between the observations of two different individuals is calculated
  – Can be achieved in a variety of ways
• Kappa statistic
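The kappa statistic adjusts the raters' raw agreement for the agreement expected by chance: κ = (pₒ − pₑ) / (1 − pₑ). A minimal sketch with hypothetical categorical judgments from two observers (the rating labels are invented for illustration):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportion of agreement.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category proportions.
    p_expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

a = ["on-task", "on-task", "off-task", "on-task", "off-task", "on-task"]
b = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Here the raters agree on 5 of 6 observations (83%), but chance alone would produce 50% agreement with these marginals, so kappa is a more modest .667.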
Connecting Sources of Error with Reliability Assessment Method
TABLE 4.2 Sources of Measurement Error and Methods of Reliability Assessment
• Time sampling
  – Example: same test given at two points in time
  – Method: test–retest
  – How assessed: correlation between scores obtained on the two occasions
• Item sampling
  – Example: different items used to assess the same attribute
  – Method: alternate forms or parallel forms
  – How assessed: correlation between equivalent forms of the test that have different items
• Internal consistency
  – Example: consistency of items within the same test
  – Methods: (1) split-half, (2) KR20, (3) alpha
  – How assessed: (1) corrected correlation between two halves of the test, (2) see Appendix 4.2, (3) see Appendix 4.1
• Observer differences
  – Example: different observers recording
  – Method: kappa statistic
  – How assessed: see Fleiss (1981)
Standard Errors of Measurement and the Rubber Yardstick
• The standard error of measurement allows us to estimate the degree to which a test provides inaccurate readings
  – An assessment of how much "rubber" is in the yardstick
  – Tests are not all equally accurate, nor are they all equally inaccurate
• A higher standard error of measurement means less certainty about the accuracy of a given measurement of an attribute
• Used to create confidence intervals—a range of likely scores within which the true score is believed to fall
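The standard error of measurement is computed from the test's standard deviation and reliability, SEM = SD × √(1 − r), and a 95% confidence interval is the observed score ± 1.96 × SEM. A sketch with hypothetical values (an IQ-style scale with SD = 15, reliability .90, and an observed score of 110):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: how much 'rubber' is in the yardstick."""
    return sd * math.sqrt(1 - reliability)

sd, reliability, observed = 15, 0.90, 110
error = sem(sd, reliability)                 # 15 * sqrt(0.10), about 4.74
low = observed - 1.96 * error
high = observed + 1.96 * error
print(round(error, 2), (round(low, 1), round(high, 1)))
```

Note how reliability drives the interval: at r = .90 the 95% interval spans roughly 19 points, while a perfectly reliable test (r = 1) would have SEM = 0 and no uncertainty at all.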
• The acceptable level of reliability depends largely on what is being measured
• .70 to .80 is seen as "good enough" for basic research
• Researchers often look for reliability of .90 or even .95
  – Some argue that such high levels are less valuable, as they indicate that all items on a test measure the same thing, and thus the test could be shortened
  – But these levels are more necessary in clinical settings, where test results influence treatment decisions
• There are several approaches a researcher can use to deal with low reliability
• Increase the number of items
  – Particularly effective in the domain sampling model
  – The Spearman–Brown formula can be used to calculate the number of items needed
• Factor and item analysis
  – Discriminability analysis
• Correction for attenuation
  – Estimating what a correlation would have been if the measurements had not included error
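Two of the remedies above have simple closed forms, sketched here with hypothetical numbers. The general Spearman–Brown prophecy formula, kr / (1 + (k − 1)r), predicts reliability when a test is lengthened by a factor k; the correction for attenuation, r₁₂ / √(r₁₁ r₂₂), estimates what the correlation between two measures would be if neither contained measurement error.

```python
import math

def lengthened_reliability(r, k):
    """Spearman-Brown prophecy: reliability of a test lengthened by factor k."""
    return (k * r) / (1 + (k - 1) * r)

def correct_for_attenuation(r12, r11, r22):
    """Correlation between two measures corrected for their unreliability."""
    return r12 / math.sqrt(r11 * r22)

# Doubling a test with reliability .60 (hypothetical value):
print(round(lengthened_reliability(0.60, 2), 3))       # 0.75
# Observed correlation .40 between tests with reliabilities .70 and .80:
print(round(correct_for_attenuation(0.40, 0.70, 0.80), 3))
```

The prophecy formula can also be solved for k to find how many items are needed to reach a target reliability, which is the use the slide describes.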