
The Spearman-Brown Formula: Definition & Example

The Spearman-Brown formula is used to predict the reliability of a test after changing the length of the test.
The formula is:
Predicted reliability = kr / (1 + (k-1)r)
where:
• k: The factor by which the length of the test is changed. For example, if the original test has 10 questions and the new test has 15 questions, then k = 15/10 = 1.5.
• r: The reliability of the original test. We typically use Cronbach’s alpha for this, which ranges from 0 to 1, with higher values indicating higher reliability.
The following example shows how to use this formula in
practice.
Example: How to Use the Spearman-Brown Formula
Suppose a company uses a 15-item test to assess employee
satisfaction and the test is known to have a reliability of 0.74.
If the company increases the length of the test to 30 items,
what is the predicted reliability of the new test?
We can use the Spearman-Brown formula to calculate the
predicted reliability:
 Predicted reliability = kr / (1 + (k-1)r)
 Predicted reliability = 2*.74 / (1 + (2-1)*.74)
 Predicted reliability = 0.85

The new test has a predicted reliability of 0.85.


Note: We calculated k as 30/15 = 2.
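If you’d rather script this than compute it by hand, here is a minimal Python sketch of the Spearman-Brown formula (the function name spearman_brown is just an illustrative choice):

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability after changing the test length by a factor of k."""
    return (k * r) / (1 + (k - 1) * r)

# Example from above: a 15-item test with reliability 0.74 doubled to 30 items.
k = 30 / 15  # k = 2
print(round(spearman_brown(r=0.74, k=k), 2))  # prints 0.85
```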
Cautions on Using the Spearman-Brown Formula
Based on the Spearman-Brown formula, we can see that increasing the number of items on a test by any amount will increase the test’s predicted reliability.
For example, suppose we increase the test from the previous example from 15 items to 16 items. Then we would calculate k as 16/15 = 1.067.

The predicted reliability would be:


 Predicted reliability = kr / (1 + (k-1)r)
 Predicted reliability = 1.067*.74 / (1 + (1.067-1)*.74)
 Predicted reliability = 0.752
The new test has a predicted reliability of 0.752, which is
higher than the reliability of 0.74 on the original test.
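Plugging this case into the same formula confirms the arithmetic (a self-contained snippet using the values from this example):

```python
# Spearman-Brown with k = 16/15: adding a single item to the 15-item test.
k = 16 / 15
r = 0.74
print(round(k * r / (1 + (k - 1) * r), 3))  # prints 0.752
```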
Using this logic, we might think that increasing the length of the test by a massive number of items is a good idea, because we could push the reliability closer and closer to 1.
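To see this numerically, here is a quick sketch that prints the predicted reliability for increasingly large values of k; the values climb toward 1 but never quite reach it:

```python
# Predicted reliability for an original reliability of 0.74 as k grows.
r = 0.74
for k in (1, 2, 5, 10, 100):
    predicted = k * r / (1 + (k - 1) * r)
    print(f"k = {k:>3}: predicted reliability = {predicted:.3f}")
```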

However, we should keep in mind the following:


1. Using too many items can cause fatigue effects.
If a test has too many questions, individuals may become fatigued as they answer more and more of them, producing less reliable answers as the test drags on.
2. The new items added to the test should be of equal difficulty to the existing items.
If we do decide to increase the length of a test, we need to make sure the new items or questions we add are of equal difficulty to the existing items; otherwise, the predicted reliability will not be accurate.
Cronbach’s alpha

Cronbach’s alpha is a convenient test used to estimate the reliability, or internal consistency, of a composite score.

Cronbach’s alpha coefficient measures the internal consistency, or reliability, of a set of survey items. Use this statistic to help determine whether a collection of items consistently measures the same characteristic. Cronbach’s alpha quantifies the level of agreement on a standardized 0 to 1 scale. Higher values indicate higher agreement between items.

Cronbach’s alpha ranges from 0 to 1:

Zero indicates that there is no correlation between the items at all. They are entirely independent. Knowing the value of a response to one question provides no information about the responses to the other questions.

One indicates that they are perfectly correlated. Knowing the value of
one response provides complete information about the other items.

Now, what on Earth does that mean? Let’s start with reliability. Say an individual takes a Happiness Survey. The happiness score would be highly reliable (consistent) if the survey produces the same or similar results when the same individual re-takes it under the same conditions. However, if an individual at the same level of real happiness takes the Happiness Survey twice back-to-back, and one score shows high happiness while the other shows low happiness, that measure would not be reliable at all.

Cronbach’s alpha gives us a simple way to measure whether or not a score is reliable. It is used under the assumption that you have multiple items measuring the same underlying construct: so, for the Happiness Survey, you might have five questions all asking different things, but when combined, could be said to measure overall happiness.

Theoretically, Cronbach’s alpha results should give you a number from 0 to 1, but you can get negative numbers as well. A negative number indicates that something is wrong with your data; perhaps you forgot to reverse score some items. The general rule of thumb is that a Cronbach’s alpha of .70 and above is good, .80 and above is better, and .90 and above is best.
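To make the computation concrete, here is a minimal Python sketch of Cronbach’s alpha using its standard formula, applied to a small fabricated response matrix for a hypothetical five-item Happiness Survey (both the data and the function name are illustrative):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Fabricated responses from 6 people to a 5-item happiness survey (1-5 scale).
scores = np.array([
    [4, 5, 4, 4, 5],
    [3, 3, 4, 3, 3],
    [5, 5, 5, 4, 5],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 5, 4],
    [1, 2, 1, 2, 1],
])
print(round(cronbach_alpha(scores), 2))
```

The result here is very high (about 0.97) only because the fabricated responses were written to be consistent; real survey data rarely behaves this neatly.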

Cronbach’s alpha does come with some limitations: scores that have a low number of items associated with them tend to have lower reliability, and sample size can also influence your results for better or worse. However, it is still a widely used measure, so if your committee is asking for proof that your instrument was internally consistent or reliable, Cronbach’s alpha is a good way to go!

Inter-Rater Reliability

Inter-rater reliability measures the agreement between subjective ratings by multiple raters, inspectors, judges, or appraisers. It answers the question: is the rating system consistent? High inter-rater reliability indicates that multiple raters’ ratings for the same item are consistent. Conversely, low reliability means they are inconsistent.

For example, judges evaluate the quality of academic writing samples using ratings of 1 – 5. When multiple raters assess the same writing, how similar are their ratings?

Evaluating inter-rater reliability is vital for understanding how likely a measurement system is to misclassify an item. A measurement system is invalid when ratings do not have high inter-rater reliability because the judges frequently disagree. For the writing example, if the judges give vastly different ratings to the same writing, you cannot trust the results because the ratings are inconsistent. However, if the ratings are very similar, the rating system is consistent.

Ratings data can be binary, categorical, or ordinal. Examples of these ratings include the following:

Inspectors rate parts using a binary pass/fail system.

Judges give ordinal scores of 1 – 10 for ice skaters.

Doctors diagnose diseases using a categorical set of disease names.
In all these examples, inter-rater reliability studies have
multiple raters evaluate the same set of items, people, or
conditions. Then subsequent analysis quantifies the
consistency of ratings between raters.

Methods for Evaluating Inter-Rater Reliability

Evaluating inter-rater reliability involves having multiple raters assess the same set of items and then comparing the ratings for each item. Are the ratings a match, similar, or dissimilar?

There are multiple methods for evaluating rating consistency. I’ll start with percent agreement because it highlights the concept of inter-rater reliability at its most basic level. Then I’ll explain how several more sophisticated analyses improve upon it.

Percent Agreement

Percent agreement is simply the average amount of agreement expressed as a percentage. Using this method, the raters either agree, or they don’t. It’s a binary outcome with no shades of grey. In other words, this form of inter-rater reliability doesn’t give partial credit for being close.

Imagine we have three judges evaluating the writing samples. They use a rating scale of 1 to 5. The school performs a study to see how closely they agree and records the results in the table below.

Writing Sample   Judge 1   Judge 2   Judge 3
1                5         4         4
2                3         3         3
3                4         4         4
4                2         2         3
5                1         2         1

Next, count the number of agreements between pairs of judges in each row. With
three judges, there are three pairings and, hence, three possible agreements per
writing sample. I’ll add columns to record the rating agreements using 1s and 0s for
agreement and disagreement, respectively. The final column is the total number of
agreements for that writing sample.
Writing Sample   Judge 1   Judge 2   Judge 3   1 & 2   1 & 3   2 & 3   Total
1                5         4         4         0       0       1       1
2                3         3         3         1       1       1       3
3                4         4         4         1       1       1       3
4                2         2         3         1       0       0       1
5                1         2         1         0       1       0       1

Finally, we sum the number of agreements (1 + 3 + 3 + 1 + 1 = 9) and divide by the total number of possible agreements (3 * 5 = 15). Therefore, the percentage agreement for the inter-rater reliability of this dataset is 9/15 = 60%.
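As a quick check on the arithmetic, here is a short Python sketch that computes percent agreement directly from the ratings table above:

```python
from itertools import combinations

# Each row is one writing sample; columns are Judge 1, Judge 2, Judge 3.
ratings = [
    [5, 4, 4],
    [3, 3, 3],
    [4, 4, 4],
    [2, 2, 3],
    [1, 2, 1],
]

agreements = 0
possible = 0
for row in ratings:
    for a, b in combinations(row, 2):  # the three judge pairings per sample
        agreements += (a == b)
        possible += 1

print(f"{agreements}/{possible} = {agreements / possible:.0%}")  # 9/15 = 60%
```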

Weaknesses of Percent Agreement

While this is the simplest form of inter-rater reliability, it falls short in several ways.
First, it doesn’t account for agreements that occur by chance, which causes the
percent agreement method to overestimate inter-rater reliability. Second, it
doesn’t factor in the degree of agreement, only absolute agreement. It’s either a
match or not. On a scale of 1 – 5, two judges scoring 4 and 5 is much better than
scores of 1 and 5!

Unsurprisingly, statisticians have developed various methods to account for both facets. In the following sections, I provide an overview of these more sophisticated inter-rater reliability methods. Then I’ll perform a more thorough analysis using these methods and interpret the inter-rater reliability results using an example dataset.
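As a preview of what chance correction looks like, here is a minimal sketch of Cohen’s kappa for two raters, one widely used chance-corrected agreement statistic (the analysis in the sections that follow may use this or other methods):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Judge 1 vs. Judge 2 from the table above.
print(round(cohens_kappa([5, 3, 4, 2, 1], [4, 3, 4, 2, 2]), 2))  # 0.5
```

For this pair, observed agreement is 3/5 = 0.6, agreement expected by chance is 0.2, and kappa works out to 0.5, noticeably lower than the raw percent agreement.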
