Internal consistency

From Wikipedia, the free encyclopedia
In statistics and research, internal consistency is a measure based on the correlations between different items on the same test (or the same subscale on a larger test). It measures whether several items that purport to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test. Internal consistency is usually measured with Cronbach's alpha, a statistic calculated from the pairwise correlations between items. In practice, internal consistency usually ranges between zero and one. A commonly accepted rule of thumb is that an α of 0.6-0.7 indicates acceptable reliability, and 0.8 or higher indicates good reliability. Very high reliabilities (0.95 or higher) are not necessarily desirable, as they indicate that the items may be entirely redundant. The goal in designing a reliable instrument is for scores on similar items to be related (internally consistent), but for each to contribute some unique information as well.

Reliability (statistics)
In statistics, reliability is the consistency of a set of measurements or of a measuring instrument, often used to describe a test. This can be either whether repeated measurements with the same instrument give, or are likely to give, the same result (test-retest) or, in the case of more subjective instruments such as personality or trait inventories, whether two independent assessors give similar scores (inter-rater reliability). Reliability is inversely related to random error.

Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is precision, while validity is accuracy.

In experimental sciences, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results. Reliability can also be interpreted as the lack of random error in measurement.[1]

In engineering, reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. It is often reported in terms of a probability. Evaluations of reliability involve the use of many statistical tools. See Reliability engineering for further discussion.

A common example used to illustrate the difference between reliability and validity in the experimental sciences is a bathroom scale. If someone who weighs 200 lbs. steps on the scale 10 times, and it reads "200" each time, then the measurement is reliable and valid. If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent. If the scale varied a lot around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid but not reliable.

Reliability may be estimated through a variety of methods that fall into two types: single-administration and multiple-administration. Multiple-administration methods require that two assessments be administered. In the test-retest method, reliability is estimated as the Pearson product-moment correlation coefficient between two administrations of the same measure. In the alternate forms method, reliability is estimated by the Pearson product-moment correlation coefficient of two different forms of a measure, usually administered together.

Single-administration methods include split-half and internal consistency. The split-half method treats the two halves of a measure as alternate forms. This "halves reliability" estimate is then stepped up to the full test length using the Spearman-Brown prediction formula. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.[2] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder-Richardson Formula 20.[2]

Each of these estimation methods is sensitive to different sources of error, and so the estimates might not be expected to be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself and is thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true reliability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,[2] and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computing item difficulties and item discrimination indices, the latter involving the correlations between each item and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
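The item analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a standard routine: the function names and the small data set are hypothetical, items are assumed to be scored 0/1, and discrimination is taken here as the plain (uncorrected) correlation between an item and the total score, which is one of several indices in use.

```python
from statistics import mean, pstdev

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (sa * sb)

def item_analysis(responses):
    """responses: one row per test-taker, one 0/1 entry per item.

    Returns a list of (difficulty, discrimination) pairs, one per item.
    """
    totals = [sum(row) for row in responses]
    report = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        difficulty = mean(item)                 # proportion answering correctly
        discrimination = pearson(item, totals)  # item-total correlation
        report.append((difficulty, discrimination))
    return report

# Hypothetical data: 5 test-takers, 3 items.
demo = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]
for j, (p, r) in enumerate(item_analysis(demo), start=1):
    print(f"item {j}: difficulty={p:.2f}, discrimination={r:.2f}")
```

Items whose difficulty is close to 0 or 1, or whose discrimination is near zero or negative, would be the candidates for replacement.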
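In reliability engineering, the reliability function is R(t) = 1 − F(t), where F(t) is the probability of failure by time t. For a constant failure rate λ, R(t) = exp(−λt).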

Classical test theory
In classical test theory, reliability is defined mathematically as the ratio of the variance of the true score to the variance of the observed score or, equivalently, one minus the ratio of the variance of the error score to the variance of the observed score:

$$\rho_{xx'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}$$

where $\rho_{xx'}$ is the symbol for the reliability of the observed score, X; $\sigma^2_X$, $\sigma^2_T$, and $\sigma^2_E$ are the variances of the measured, true and error scores, respectively. Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test. Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error in the test somewhat differently.
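Although true scores cannot be observed in practice, the definition can be illustrated with simulated data in which they are known. The sketch below assumes normally distributed true scores (standard deviation 10) and independent errors (standard deviation 5); these numbers are arbitrary choices for illustration, and the theoretical reliability under them is 100 / (100 + 25) = 0.8.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the illustration is repeatable
n = 20000

# Simulated (normally unobservable) true scores and independent random errors.
true_scores = [random.gauss(50, 10) for _ in range(n)]
errors = [random.gauss(0, 5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability as true-score variance over observed-score variance ...
ratio = pvariance(true_scores) / pvariance(observed)
# ... or, equivalently, one minus error variance over observed-score variance.
one_minus = 1 - pvariance(errors) / pvariance(observed)

print(f"var(T)/var(X)     = {ratio:.3f}")
print(f"1 - var(E)/var(X) = {one_minus:.3f}")
```

The two figures agree only approximately in a finite sample, because the simulated true scores and errors are not perfectly uncorrelated.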

Item response theory
It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the inverse of the squared conditional standard error of measurement at any given test score. Higher levels of IRT information indicate higher precision and thus greater reliability.

Cronbach's alpha
Cronbach's α (alpha) is a statistic. It has an important use as a measure of the reliability of a psychometric instrument. It was first named alpha by Cronbach (1951), as he had intended to continue with further instruments. It is the extension of an earlier version, the Kuder-Richardson Formula 20 (often shortened to KR-20), which is the equivalent for dichotomous items, and Guttman (1945) developed the same quantity under the name lambda-3. Cronbach's α is a coefficient of consistency and measures how well a set of variables or items measures a single, unidimensional latent construct.

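Cronbach's α is defined as

$$\alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma^2_{Y_i}}{\sigma^2_X}\right)$$

where N is the number of components (items or testlets), $\sigma^2_X$ is the variance of the observed total test scores, and $\sigma^2_{Y_i}$ is the variance of component i.

Alternatively, the standardized Cronbach's α can also be defined as

$$\alpha = \frac{N\,\bar{c}}{\bar{v} + (N-1)\,\bar{c}}$$

where N is the number of components (items or testlets), $\bar{v}$ equals the average variance and $\bar{c}$ is the average of all covariances between the components.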
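As a rough illustration of the first definition, the following Python sketch computes α from a respondents-by-items score table. The function name and the ratings are hypothetical, and sample variances (dividing by the number of respondents minus 1) are assumed for both the item and total-score variances.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: one row per respondent, one column per component (item)."""
    n_items = len(scores[0])
    # Variance of each component across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total (summed) score across respondents.
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 5-point ratings from four respondents on three items.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]
print(round(cronbach_alpha(ratings), 3))
```

Because the same divisor appears in the numerator and denominator of the variance ratio, using population variances throughout would give the same α.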

Cronbach's alpha and internal consistency
Cronbach's alpha will generally increase when the correlations between the items increase. For this reason the coefficient is also called the internal consistency or the internal consistency reliability of the test.

Cronbach's alpha in classical test theory
Alpha is an unbiased estimator of reliability if and only if the components are essentially τ-equivalent (Lord & Novick, 1968[1]). Under this condition the components can have different means and different variances, but their covariances should all be equal, which implies that they have one common factor in a factor analysis. One special case of essential τ-equivalence is that the components are parallel. Although the assumption of essential τ-equivalence may sometimes be met (at least approximately) by testlets, when applied to items it is probably never true. This is because (1) most test developers include items with a range of difficulties (or stimuli that vary in their standing on the latent trait, in the case of personality, attitude or other non-cognitive instruments), and (2) the item scores are usually bounded from above and below. These circumstances make it unlikely that the items have a linear regression on a common factor. A factor analysis may then produce artificial factors that are related to the differential skewnesses of the components. When the assumption of essential τ-equivalence of the components is violated, alpha is not an unbiased estimator of reliability. Instead, it is a lower bound on reliability.

α can take values between negative infinity and 1 (although only positive values make sense). Some professionals, as a rule of thumb, require a reliability of 0.70 or higher (obtained on a substantial sample) before they will use an instrument. This rule should be applied with caution when α has been computed from items that systematically violate its assumptions. Further, the appropriate degree of reliability depends upon the use of the instrument; e.g., an instrument designed to be used as part of a battery may be intentionally designed to be as short as possible (and thus somewhat less reliable). Other situations may require extremely precise measures (with very high reliabilities).

Cronbach's α is related conceptually to the Spearman-Brown prediction formula. Both arise from the basic classical test theory result that the reliability of test scores can be expressed as the ratio of the true score and total score (error and true score) variances:

$$\rho_{xx'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$

Alpha is most appropriately used when the items measure different substantive areas within a single construct. Conversely, alpha (like other internal consistency estimates of reliability) is inappropriate for estimating the reliability of an intentionally heterogeneous instrument (such as screening devices like biodata or the original MMPI). Also, α can be artificially inflated by making scales which consist of superficial changes to the wording within a set of items or by analyzing speeded tests.

Cronbach's alpha in generalizability theory
Cronbach and others generalized some basic assumptions of classical test theory in their generalizability theory. If this theory is applied to test construction, then it is assumed that the items that constitute the test are a random sample from a larger universe of items. The expected score of a person in the universe is called the universe score, analogous to a true score. The generalizability is defined analogously as the variance of the universe scores divided by the variance of the observable scores, analogous to the concept of reliability in classical test theory. In this theory, Cronbach's alpha is an unbiased estimate of the generalizability. For this to be true the assumptions of essential τ-equivalence or parallelness are not needed. Consequently, Cronbach's alpha can be viewed as a measure of how well the sum score on the selected items captures the expected score in the entire domain, even if that domain is heterogeneous.

Cronbach's alpha and the intra-class correlation
Cronbach's alpha is equal to the stepped-up consistency version of the Intra-class correlation coefficient, which is commonly used in observational studies. This can be viewed as another application of generalizability theory, where the items are replaced by raters or observers who are randomly drawn from a population. Cronbach's alpha will then estimate how strongly the score obtained from the actual panel of raters correlates with the score that would have been obtained by another random sample of raters.

Cronbach's alpha and factor analysis
As stated in the section about its relation with classical test theory, Cronbach's alpha has a theoretical relation with factor analysis. There is also a more empirical relation: selecting items such that they optimize Cronbach's alpha will often result in a test that is homogeneous, in that the items roughly satisfy a factor analysis with one common factor. The reason for this is that Cronbach's alpha increases with the average correlation between items, so optimizing it tends to select items that have correlations of similar size with most other items. It should be stressed that, although unidimensionality (i.e. fit to the one-factor model) is a necessary condition for alpha to be an unbiased estimator of reliability, the value of alpha is not related to the factorial homogeneity. The reason is that the value of alpha depends on the size of the average inter-item covariance, while unidimensionality depends on the pattern of the inter-item covariances.

Cronbach's alpha and other disciplines
Although this description of the use of α is given in terms of psychology, the statistic can be used in any discipline.

Construct creation
Coding two (or more) different variables with a high Cronbach's alpha into a construct for regression use is simple. Dividing the used variables by their means or averages results in a percentage value for the respective case. After all variables have been re-calculated in percentage terms, they can easily be summed to create the new construct.

Kuder-Richardson Formula 20
In statistics, the Kuder-Richardson Formula 20 (KR-20) is a measure of internal consistency reliability for measures with dichotomous choices, first published in 1937. It is analogous to Cronbach's α, except that Cronbach's α is also used for non-dichotomous (continuous) measures. [1] A high KR-20 coefficient (e.g., >0.90) indicates a homogeneous test. Values can range from 0.00 to 1.00 (sometimes expressed as 0 to 100), with high values indicating that the examination is likely to correlate with alternate forms (a desirable characteristic). The KR-20 is affected by difficulty, spread in scores and length of the examination. When scores are not tau-equivalent (for example, when the examination items are not homogeneous but instead increase in difficulty), the KR-20 is an indication of the lower bound of internal consistency (reliability).

Note that variance for KR-20 is

$$\sigma^2_X = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

If it is important to use unbiased operators, then the sum of squares should be divided by the degrees of freedom (N − 1) and the probabilities multiplied by

$$\frac{N}{N-1}$$

Since Cronbach's α was published in 1951, there has been no known advantage to KR-20 over Cronbach. KR-20 is seen as a derivative of the Cronbach formula, with the advantage to Cronbach that it can handle both dichotomous and continuous variables.

Estimating Reliability
Estimating Reliability - Forced-Choice Assessment
The split-half model and the Kuder-Richardson formula for estimating reliability will be described here. Given the demands on time and the need for all assessment to be relevant, school practitioners are unlikely to utilize a test-retest or equivalent forms procedure to establish reliability.

Reliability Estimation Using a Split-half Methodology
The split-half design in effect creates two comparable test administrations. The items in a test are split into two tests that are equivalent in content and difficulty. Often this is done by splitting among odd and even numbered items. This assumes that the assessment is homogeneous in content. Once the test is split, reliability is estimated as the correlation of two separate tests with an adjustment for the test length. Other things being equal, the longer the test, the more reliable it will be when reliability concerns internal consistency. This is because the sample of behavior is larger. In split-half, it is possible to utilize the Spearman-Brown formula to correct a correlation between the two halves, as if the correlation used two tests the length of the full test (before it was split).

For demonstration purposes a small sample set is employed here: a test of 40 items for 10 students. The items are then divided into even (X) and odd (Y) halves, giving two simultaneous assessments.

Student   Score (40)   X Even (20)   Y Odd (20)      x       y       x²       y²       xy
A             40           20            20         4.8     4.2    23.04    17.64    20.16
B             28           15            13        -0.2    -2.8     0.04     7.84     0.56
C             35           19            16         3.8     0.2    14.44     0.04     0.76
D             38           18            20         2.8     4.2     7.84    17.64    11.76
E             22           10            12        -5.2    -3.8    27.04    14.44    19.76
F             20           12             8        -3.2    -7.8    10.24    60.84    24.96
G             35           16            19         0.8     3.2     0.64    10.24     2.56
H             33           16            17         0.8     1.2     0.64     1.44     0.96
I             31           12            19        -3.2     3.2    10.24    10.24   -10.24
J             28           14            14        -1.2    -1.8     1.44     3.24     2.16
MEAN          31.0         15.2          15.8
SD                          3.26          3.99
Σ                                                                   95.60   143.60    73.40

From this information it is possible to calculate a correlation using the Pearson Product-Moment Correlation Coefficient, a statistical measure of the degree of relationship between the two halves.

Pearson Product-Moment Correlation Coefficient:

$$r = \frac{\sum xy}{(N-1)(SD_x)(SD_y)}$$

where
x is each student's score minus the mean on even-numbered items.
y is each student's score minus the mean on odd-numbered items.
N is the number of students.
SD is the standard deviation. This is computed by squaring the deviation (e.g., x²) for each student, summing the squared deviations (e.g., Σx²), dividing this total by the number of students minus 1 (N−1), and taking the square root.

The Spearman-Brown formula is usually applied in determining reliability using split halves. When applied, it involves doubling the two halves to the full number of items, thus giving a reliability estimate for the number of items in the original test.
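The split-half calculation for the sample data can be sketched in Python as follows. The even and odd half scores are taken from the 10-student example; the function names are illustrative only. The step-up uses the general Spearman-Brown form, n·r / (1 + (n − 1)·r), with n = 2 for doubling the half-length test.

```python
from math import sqrt

even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]  # X: scores on even items
odd = [20, 13, 16, 20, 12, 8, 19, 17, 19, 14]    # Y: scores on odd items

def pearson_r(xs, ys):
    """r = sum(xy) / ((N - 1) * SD_x * SD_y), using deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum_xy / ((n - 1) * sd_x * sd_y)

def spearman_brown(r, factor=2):
    """Predicted reliability when test length is multiplied by `factor`."""
    return factor * r / (1 + (factor - 1) * r)

r_half = pearson_r(even, odd)
r_full = spearman_brown(r_half)
print(f"correlation between halves: {r_half:.2f}")
print(f"stepped-up reliability:     {r_full:.2f}")
```

For these data the correlation between halves is about 0.63, and the stepped-up estimate for the full 40-item test is about 0.77.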

Estimating Reliability Using the Kuder-Richardson Formula 20
Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937. It has become the standard for estimating reliability for single administration of a single form. Kuder-Richardson measures inter-item consistency. It is tantamount to doing a split-half reliability on all combinations of items resulting from different splitting of the test. When schools have the capacity to maintain item level data, the KR20, which is a challenging set of calculations to do by hand, is easily computed by a spreadsheet or basic statistical package. The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:
1) Securing the mean inter-correlation of the number of items (k) in the test,
2) Considering this to be the reliability coefficient for the typical item in the test,
3) Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.

Item responses (1 = correct, 0 = incorrect)

                        ITEM (k)                        X        x = X − mean
Student (N)   1  2  3  4  5  6  7  8  9 10 11 12     (score)   (score − mean)     x²
A             1  1  1  1  1  1  1  0  1  1  1  1        11          4.5          20.25
B             1  1  1  1  1  1  1  1  0  1  1  0        10          3.5          12.25
C             1  1  1  1  1  1  1  1  1  0  0  0         9          2.5           6.25
D             1  1  1  0  1  1  0  1  1  0  0  0         7          0.5           0.25
E             1  1  1  1  1  0  0  1  1  0  0  0         7          0.5           0.25
F             1  1  1  0  0  1  1  0  0  1  0  0         6         -0.5           0.25
G             1  1  1  1  0  0  1  0  0  0  0  0         5         -1.5           2.25
H             1  1  0  1  0  0  0  1  0  0  0  0         4         -2.5           6.25
I             1  1  1  0  1  0  0  0  0  0  0  0         4         -2.5           6.25
J             0  0  0  1  1  0  0  0  0  0  0  0         2         -4.5          20.25
Σ             9  9  8  7  7  5  5  5  4  3  2  1        65          0            74.50

mean = 6.5, Σx² = 74.50

Item         1     2     3     4     5     6     7     8     9    10    11    12
P-value    0.9   0.9   0.8   0.7   0.7   0.5   0.5   0.5   0.4   0.3   0.2   0.1
Q-value    0.1   0.1   0.2   0.3   0.3   0.5   0.5   0.5   0.6   0.7   0.8   0.9
pq        0.09  0.09  0.16  0.21  0.21  0.25  0.25  0.25  0.24  0.21  0.16  0.09

Σpq = 2.21

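Here,

Variance:

$$\sigma^2 = \frac{\sum x^2}{N-1}$$

Kuder-Richardson Formula 20:

$$KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma^2}\right)$$

where
p is the proportion of students passing a given item
q is the proportion of students that did not pass a given item
σ² is the variance of the total score on this assessment
x is the student score minus the mean score; x is squared and the squares are summed (Σx²); the summed squares are divided by the number of students minus 1 (N−1)
k is the number of items on the test.

For the example,

$$\sigma^2 = \frac{74.50}{9} = 8.28, \qquad KR_{20} = \frac{12}{11}\left(1 - \frac{2.21}{8.28}\right) = 0.80$$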
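The KR-20 computation for this example can be sketched in Python. The response matrix below is the 10-student, 12-item data set of the example (1 = correct, 0 = incorrect); the function name is illustrative. As in the text, p and q are simple proportions, while the total-score variance divides by N − 1.

```python
from statistics import variance

# Rows are students A-J, columns are items 1-12.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]

def kr20(rows):
    """Kuder-Richardson Formula 20 for a students-by-items 0/1 matrix."""
    n = len(rows)      # number of students
    k = len(rows[0])   # number of items
    p = [sum(row[j] for row in rows) / n for j in range(k)]  # proportion passing
    sum_pq = sum(pj * (1 - pj) for pj in p)
    total_var = variance([sum(row) for row in rows])  # divides by N - 1
    return k / (k - 1) * (1 - sum_pq / total_var)

print(f"KR-20 = {kr20(responses):.2f}")
```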

Estimating Reliability Using the Kuder-Richardson Formula 21
When item level data or technological assistance is not available to assist in the computation of a large number of cases and items, the simpler, and sometimes less precise, reliability estimate known as Kuder-Richardson Formula 21 is an acceptable general measure of internal consistency. The formula requires only the test mean (M), the variance (σ²) and the number of items on the test (k). It assumes that all items are of approximately equal difficulty. (N = number of students.) For this example, the data set used for computation of the KR20 is repeated.

Student (N=10)   X (Score)   x = X − mean (score − mean)      x²
A                   11                  4.5                 20.25
B                   10                  3.5                 12.25
C                    9                  2.5                  6.25
D                    7                  0.5                  0.25
E                    7                  0.5                  0.25
F                    6                 -0.5                  0.25
G                    5                 -1.5                  2.25
H                    4                 -2.5                  6.25
I                    4                 -2.5                  6.25
J                    2                 -4.5                 20.25

mean = 6.5, Σx² = 74.50


Kuder-Richardson Formula 21:

$$KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k-M)}{k\sigma^2}\right)$$

where
M is the assessment mean (6.5)
k is the number of items in the assessment (12)
σ² is the variance (8.28).

Therefore, in the example:

$$KR_{21} = \frac{12}{11}\left(1 - \frac{6.5\,(12-6.5)}{12\,(8.28)}\right) = 0.70$$

The ratio [M(k − M)] / kσ² in KR21 is a mathematical approximation of the ratio Σpq/σ² in KR20. The formula simplifies the computation but will usually yield, as evidenced, a lower estimate of reliability. The differences are not great on a test with all items of about the same difficulty.
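A minimal Python sketch of the KR-21 calculation, using only the total scores of the ten students in the example, is given here. The function name is illustrative, and the variance again divides by N − 1.

```python
from statistics import mean, variance

scores = [11, 10, 9, 7, 7, 6, 5, 4, 4, 2]  # total scores of students A-J
k = 12                                     # number of items on the test

def kr21(mean_score, var_score, n_items):
    """Kuder-Richardson Formula 21 from the test mean, variance and length."""
    ratio = mean_score * (n_items - mean_score) / (n_items * var_score)
    return n_items / (n_items - 1) * (1 - ratio)

m = mean(scores)          # 6.5
var = variance(scores)    # about 8.28 (sum of squares 74.5 divided by 9)
print(f"KR-21 = {kr21(m, var, k):.2f}")
```

For these data the result is about 0.70, lower than the KR-20 value of about 0.80 obtained from the item-level data.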

In addition to the split-half reliability estimates and the Kuder-Richardson formulas (KR20, KR21) mentioned above, there are many other ways to compute a reliability index. One of the most commonly used reliability coefficients is Cronbach's alpha (α). It is based on the internal consistency of items in the tests. It is flexible and can be used with test formats that have more than one correct answer. The split-half estimates and KR20 are exchangeable with Cronbach's alpha. When the test is divided into two parts and the scores and variances of the two parts are calculated, the split-half formula is algebraically equivalent to Cronbach's alpha. When the test format has only one correct answer, KR20 is algebraically equivalent to Cronbach's alpha. Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.

Given the universe of concerns which daily confront school administrators and classroom teachers, the importance is not in knowing how to derive a reliability estimate, whether using split halves, KR20 or KR21. The importance is in knowing what the information means in evaluating the validity of the assessment. A high reliability coefficient is no guarantee that the assessment is well-suited to the outcome. It does tell you if the items in the assessment are strongly or weakly related with regard to student performance. If all the items are variations of the same skill or knowledge base, the reliability estimate for internal consistency should be high. If multiple outcomes are measured in one assessment, the reliability estimate may be lower. That does not mean the test is suspect. It probably means that the domains of knowledge or skills assessed are somewhat diverse and a student who knows the content of one outcome may not be as proficient relative to another outcome.

Establishing Interrater Agreement for Performance-Based or Product Assessments (Complex Generated Response Assessments)
In performance-based assessment, where scoring requires some judgment, an important type of reliability is agreement among those who evaluate the quality of the product or performance relative to a set of stated criteria. Preconditions of interrater agreement are: 1) A scoring scale or rubric which is clear and unambiguous in what it demands of the student by way of demonstration. 2) Evaluators who are fully conversant with the scale and how the scale relates to the student performance, and who are in agreement with other evaluators on the application of the scale to the student demonstration. The end result is that all evaluators are of a common mind with regard to the student performance and that one mind is reflected in the scoring scale or rubric and that all evaluators should give the demonstration the same or nearly the same ratings. The consistency of rating is called interrater reliability. Unless the scale was constructed by those who are employing the scale and there has been extensive discussion during this construction, training is a necessity to establish this common perspective. Training evaluators for consistency should include:

• A discussion of the rating scale by all participating evaluators so that a common interpretation of the scale emerges and so diverse interpretations can be resolved or referred to an authority for determination.
• The opportunity to review sample demonstrations which have been anchored to a particular score on the scale. These representative works were selected by a committee for their clarity in demonstrating a value on the scale. This will provide operational models for the raters who are being trained.
• Opportunities to try out the scale and discuss the ratings. The results can be used to further refine common understanding. Additional rounds of scoring can be used to eliminate any evaluator who cannot enter into agreement relative to the scale.

Gronlund (1985) indicated that "rater error" can be related to: 1) Personal bias which may occur when a rater is consistently using only part of the scoring scale, either in being overly generous, overly severe or evidencing a tendency to the center of the scale in scoring. 2) A "halo effect" which may occur when a rater's overall perception of a student positively or negatively colors the rating given to a student. 3) A logical error which may occur when a rater confuses distinct elements of an analytic scale. This confounds rating on the items. Proper training to an unambiguous scoring rubric is a necessary condition for establishing reliability for student performance or product. When evaluation of the product or performance begins in earnest, it is necessary that a percentage of student work be double scored by two different raters to give an indication of agreement among evaluators. The sample of performances or products that are scored by two independent evaluators must be large enough to establish confidence that scoring is consistent. The smaller the number of cases, the larger the percentage of cases that will be double scored. When the data on the double-scored assessments is available, it is possible to compute a correlation of the raters' scores using the Pearson Product Moment Correlation Coefficient. This correlation indicates the relationship between the two scores given for each student. A correlation of .6 or higher would indicate that the scores given to the students are highly related. Another method of indicating the relationship between the two scores is through the establishing of a rater agreement percentage--that is, to take the assessments that have been double scored and calculate the number of cases where there has been exact agreement between the two raters. 
If the scale is analytic and rather extensive, percent of agreement can be determined for the number of cases where the scores are in exact agreement or adjacent to each other (within one point on the scale). Agreement levels should be at 80% or higher to establish a claim for interrater agreement.

Establishing Rater Agreement Percentages
Two important decisions which precede the establishment of a rater agreement percentage are:

• How close do scores by raters have to be to count as in "agreement"? In a limited holistic scale (e.g., 1-4 points), it is most likely that you will require exact agreement among raters. If an analytic scale is employed with 30 to 40 points, it may be determined that exact and adjacent scores will count as being in agreement.
• What percentage of agreement will be acceptable to ensure reliability? 80% agreement is promoted as a minimum standard above, but circumstances relative to the use of the scale may warrant exercising a lower level of acceptance. The choice of an acceptable percentage of agreement must be established by the school or district. It is advisable that the decision be consultative.

After agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

Student   Score: Rater 1   Score: Rater 2   Agreement
A               6                6              X
B               5                5              X
C               3                4
D               4                4              X
E               2                3
F               7                7              X
G               6                6              X
H               5                5              X
I               3                4
J               7                7              X

Dividing the number of cases where student scores between the raters are in agreement (7) by the total number of cases (10) determines the rater agreement percentage (70%). When there are more than two teachers, the consistency of ratings for two teachers at a time can be calculated with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for

• Rater 1 and Rater 2
• Rater 1 and Rater 3
• Rater 2 and Rater 3
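The agreement percentage, including the pair-by-pair comparison for more than two raters, can be sketched in Python as follows. The scores are those of Rater 1 and Rater 2 in the example; the function names and the `tolerance` parameter (0 for exact agreement, 1 for exact-or-adjacent agreement) are illustrative assumptions rather than part of the original procedure.

```python
from itertools import combinations

def agreement_percentage(scores_a, scores_b, tolerance=0):
    """Percent of cases where two raters' scores differ by at most `tolerance`."""
    agreed = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100.0 * agreed / len(scores_a)

def pairwise_agreement(ratings, tolerance=0):
    """ratings: dict mapping rater name to scores, in the same student order.

    Returns the agreement percentage for every pair of raters.
    """
    return {
        (r1, r2): agreement_percentage(ratings[r1], ratings[r2], tolerance)
        for r1, r2 in combinations(ratings, 2)
    }

ratings = {
    "Rater 1": [6, 5, 3, 4, 2, 7, 6, 5, 3, 7],
    "Rater 2": [6, 5, 4, 4, 3, 7, 6, 5, 4, 7],
}
exact = pairwise_agreement(ratings)
adjacent = pairwise_agreement(ratings, tolerance=1)
print(exact[("Rater 1", "Rater 2")])     # exact agreement
print(adjacent[("Rater 1", "Rater 2")])  # exact or adjacent agreement
```

Adding a third rater's scores to the dictionary would yield all three pairings listed above.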

All calculations should exceed the acceptable reliability score. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the scorers as the independent variable can be computed using the sum of squares. In discussion of the various forms of performance assessment, it has been suggested how two raters can examine the same performance to establish a reliability score. Unless at least two independent raters have evaluated the performance or product for a significant sampling of students, it is not possible to obtain evidence that the score obtained is accurate to the stated criteria.
