Types of reliability and how to measure them
Date published August 8, 2019 by Fiona Middleton. Date updated: January 13, 2020

When you do quantitative research, you have to consider the reliability and validity of
your research methods and instruments of measurement.

Reliability tells you how consistently a method measures something. When you apply the same
method to the same sample under the same conditions, you should get the same results. If not,
the method of measurement may be unreliable.

There are four main types of reliability. Each can be estimated by comparing different sets of
results produced by the same method.

Reliability

Type of reliability     Measures the consistency of…
Test-retest             The same test over time.
Interrater              The same test conducted by different people.
Parallel forms          Different versions of a test which are designed to be equivalent.
Internal consistency    The individual items of a test.

Test-retest reliability
Test-retest reliability measures the consistency of results when you repeat the same test on the
same sample at a different point in time. You use it when you are measuring something that you
expect to stay constant in your sample.
A test of colour blindness for trainee pilot applicants should have high test-retest reliability, because
colour blindness is a trait that does not change over time.

Why it’s important


Many factors can influence your results at different points in time: for example, respondents
might experience different moods, or external conditions might affect their ability to respond
accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time.
The smaller the difference between the two sets of results, the higher the test-retest reliability.

How to measure it
To measure test-retest reliability, you conduct the same test on the same group of people at two
different points in time. Then you calculate the correlation between the two sets of results.
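
The correlation step described above can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical, and `pearson_r` is a helper written for this sketch that computes the Pearson correlation by hand.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six participants, tested at two points in time.
first_session = [98, 105, 110, 120, 93, 101]
second_session = [100, 103, 112, 118, 95, 99]

test_retest_r = pearson_r(first_session, second_session)
print(f"test-retest correlation: {test_retest_r:.3f}")
```

A value close to 1 means the two sessions rank and space the participants in nearly the same way, which is what high test-retest reliability looks like.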

Test-retest reliability example


You devise a questionnaire to measure the IQ of a group of participants (a property that is
unlikely to change significantly over time). You administer the test two months apart to the same
group of people, but the results are significantly different, so the test-retest reliability of the IQ
questionnaire is low.

Improving test-retest reliability

• When designing tests or questionnaires, try to formulate questions, statements and tasks in a
way that won’t be influenced by the mood or concentration of participants.
• When planning your method of data collection, try to minimize the influence of external factors,
and make sure all samples are tested under the same conditions.
• Remember that changes can be expected to occur in the participants over time, and take these
into account.

Interrater reliability
Interrater reliability (also called interobserver reliability) measures the degree of agreement
between different people observing or assessing the same thing. You use it when data is collected
by researchers assigning ratings, scores or categories to one or more variables.

In an observational study where a team of researchers collect data on classroom behavior, interrater
reliability is important: all the researchers should agree on how to categorize or rate different types of
behavior.

Why it’s important


People are subjective, so different observers’ perceptions of situations and phenomena naturally
differ. Reliable research aims to minimize subjectivity as much as possible so that a different
researcher could replicate the same results.
When designing the scale and criteria for data collection, it’s important to make sure that
different people will rate the same variable consistently with minimal bias. This is especially
important when there are multiple researchers involved in data collection or analysis.

How to measure it
To measure interrater reliability, different researchers conduct the same measurement or
observation on the same sample. Then you calculate the correlation between their different sets
of results. If all the researchers give similar ratings, the test has high interrater reliability.
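
With more than two researchers, one possible way to apply this is to correlate every pair of raters and look at the individual and average values. The sketch below assumes hypothetical ratings on a 1 to 5 scale; the rater names and the `pearson_r` helper are illustrative only.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings: three raters score the same eight subjects.
ratings = {
    "rater_a": [1, 2, 3, 4, 5, 3, 2, 4],
    "rater_b": [1, 2, 3, 5, 5, 3, 2, 4],
    "rater_c": [2, 2, 3, 4, 5, 3, 1, 4],
}

# Correlate every pair of raters, then average the pairwise values.
pairwise = {
    (a, b): pearson_r(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}
mean_pairwise_r = sum(pairwise.values()) / len(pairwise)

for pair, r in pairwise.items():
    print(pair, round(r, 3))
print("mean pairwise correlation:", round(mean_pairwise_r, 3))
```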

Interrater reliability example


A team of researchers observe the progress of wound healing in patients. To record the stages of
healing, rating scales are used, with a set of criteria to assess various aspects of wounds. The
results of different researchers assessing the same set of patients are compared, and there is a
strong correlation between all sets of results, so the test has high interrater reliability.

Improving interrater reliability

• Clearly define your variables and the methods that will be used to measure them.
• Develop detailed, objective criteria for how the variables will be rated, counted or categorized.
• If multiple researchers are involved, ensure that they all have exactly the same information and
training.

Parallel forms reliability


Parallel forms reliability measures the correlation between two equivalent versions of a test. You
use it when you have two different assessment tools or sets of questions designed to measure the
same thing.

Why it’s important


If you want to use multiple different versions of a test (for example, to avoid respondents
repeating the same answers from memory), you first need to make sure that all the sets of
questions or measurements give reliable results.

In educational assessment, it is often necessary to create different versions of tests to ensure that
students don’t have access to the questions in advance. Parallel forms reliability means that, if the same
students take two different versions of a reading comprehension test, they should get similar results in
both tests.

How to measure it
The most common way to measure parallel forms reliability is to produce a large set of questions
to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the
results. High correlation between the two indicates high parallel forms reliability.
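
The procedure above (pool the questions, divide them randomly, correlate the two totals) might look like this in Python. The response matrix is hypothetical, the fixed seed is there only to make the sketch repeatable, and `pearson_r` is a helper defined for the example.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical answers: one row per respondent, one column per question (1-5 scale).
responses = [
    [1, 2, 1, 2, 1, 1, 2, 1],
    [2, 2, 3, 2, 2, 3, 2, 2],
    [3, 3, 3, 4, 3, 3, 4, 3],
    [4, 4, 3, 4, 4, 5, 4, 4],
    [5, 4, 5, 5, 4, 5, 5, 5],
    [2, 1, 2, 2, 1, 2, 1, 2],
]

# Randomly divide the question pool into two forms of equal size.
n_questions = len(responses[0])
question_ids = list(range(n_questions))
random.Random(42).shuffle(question_ids)  # fixed seed only for repeatability
form_a = sorted(question_ids[: n_questions // 2])
form_b = sorted(question_ids[n_questions // 2 :])

# Each respondent's total on each form, then the correlation between forms.
scores_a = [sum(row[q] for q in form_a) for row in responses]
scores_b = [sum(row[q] for q in form_b) for row in responses]
parallel_forms_r = pearson_r(scores_a, scores_b)
print("form A:", form_a, "form B:", form_b)
print("parallel forms correlation:", round(parallel_forms_r, 3))
```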

Parallel forms reliability example


A set of questions is formulated to measure financial risk aversion in a group of respondents. The
questions are randomly divided into two sets, and the respondents are randomly divided into two
groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The
results of the two tests are compared, and the results are almost identical, indicating high parallel
forms reliability.

Improving parallel forms reliability

• Ensure that all questions or test items are based on the same theory and formulated to measure
the same thing.

Internal consistency
Internal consistency assesses the correlation between multiple items in a test that are intended to
measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers,
so it’s a good way of assessing reliability when you only have one data set.

Why it’s important


When you devise a set of questions or ratings that will be combined into an overall score, you
have to make sure that all of the items really do reflect the same thing. If responses to different
items contradict one another, the test might be unreliable.

To measure customer satisfaction with an online store, you could create a questionnaire with a set of
statements that respondents must agree or disagree with. Internal consistency tells you whether the
statements are all reliable indicators of customer satisfaction.

How to measure it
Two common methods are used to measure internal consistency.

Average inter-item correlation: For a set of measures designed to assess the same construct,
you calculate the correlation between the results of all possible pairs of items and then calculate
the average.

Split-half reliability: You randomly split a set of measures into two sets. After testing the entire
set on the respondents, you calculate the correlation between the two sets of responses.
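
As a rough illustration of the first method, the average inter-item correlation can be computed by correlating every pair of items and taking the mean. The item names and responses below are hypothetical, and `pearson_r` is a helper written for this sketch.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical responses: each list holds six respondents' ratings for one item.
items = {
    "item_1": [4, 5, 2, 3, 5, 1],
    "item_2": [4, 4, 2, 3, 5, 2],
    "item_3": [5, 5, 1, 3, 4, 1],
    "item_4": [3, 5, 2, 2, 5, 1],
}

# Correlate all possible pairs of items, then average the results.
pair_correlations = [
    pearson_r(items[a], items[b]) for a, b in combinations(items, 2)
]
average_inter_item_r = sum(pair_correlations) / len(pair_correlations)
print("average inter-item correlation:", round(average_inter_item_r, 3))
```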

Internal consistency example


A group of respondents are presented with a set of statements designed to measure optimistic and
pessimistic mindsets. They must rate their agreement with each statement on a scale from 1 to 5.
If the test is internally consistent, an optimistic respondent should generally give high ratings to
optimism indicators and low ratings to pessimism indicators. The correlation is calculated
between all the responses to the “optimistic” statements, but the correlation is very weak. This
suggests that the test has low internal consistency.

Improving internal consistency

• Take care when devising questions or measures: those intended to reflect the same concept
should be based on the same theory and carefully formulated.

Which type of reliability applies to my research?


It’s important to consider reliability when planning your research design, collecting and
analyzing your data, and writing up your research. The type of reliability you should calculate
depends on the type of research and your methodology.

Spearman's Rank-Order Correlation
This guide will tell you when you should use Spearman's rank-order correlation to
analyse your data, what assumptions you have to satisfy, how to calculate it, and how to
report it. If you want to know how to run a Spearman correlation in SPSS Statistics, go
to our Spearman's correlation in SPSS Statistics guide.

When should you use the Spearman's rank-order correlation?


The Spearman's rank-order correlation is the nonparametric version of the Pearson
product-moment correlation. Spearman's correlation coefficient (ρ, also signified by rs)
measures the strength and direction of association between two ranked variables.

What are the assumptions of the test?

You need two variables that are either ordinal, interval or ratio (see our Types of
Variable guide if you need clarification). Although you would normally hope to use a
Pearson product-moment correlation on interval or ratio data, the Spearman correlation
can be used when the assumptions of the Pearson correlation are markedly violated.
However, Spearman's correlation determines the strength and direction of
the monotonic relationship between your two variables rather than the strength and
direction of the linear relationship between your two variables, which is what Pearson's
correlation determines.

What is a monotonic relationship?

A monotonic relationship is a relationship that does one of the following: (1) as the value
of one variable increases, so does the value of the other variable; or (2) as the value of
one variable increases, the other variable value decreases. Examples of monotonic and
non-monotonic relationships are presented in the diagram below:

Why is a monotonic relationship important to Spearman's correlation?

Spearman's correlation measures the strength and direction of monotonic association
between two variables. Monotonicity is a "less restrictive" requirement than linearity.
For example, the middle image above shows a relationship that is monotonic, but not
linear.
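
To make the difference concrete, here is a small sketch with made-up data in which y always rises with x but not along a straight line. It treats Spearman's coefficient as the Pearson correlation of the ranks; the helper names are illustrative and the ranking helper assumes no tied values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank distinct values so that the largest gets rank 1 (no ties assumed)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical data: y increases with x, but along a curve rather than a line.
x = list(range(1, 11))
y = [v ** 3 for v in x]

pearson = pearson_r(x, y)                  # strength of the linear relationship
spearman = pearson_r(ranks(x), ranks(y))   # strength of the monotonic relationship
print("Pearson:", round(pearson, 3), "Spearman:", round(spearman, 3))
```

The ranks of x and y agree exactly, so the Spearman coefficient reaches 1 even though the Pearson coefficient falls short of it.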

A monotonic relationship is not strictly an assumption of Spearman's correlation. That is,
you can run a Spearman's correlation on a non-monotonic relationship to determine if
there is a monotonic component to the association. However, you would normally pick
a measure of association, such as Spearman's correlation, that fits the pattern of the
observed data. That is, if a scatterplot shows that the relationship between your two
variables looks monotonic you would run a Spearman's correlation because this will
then measure the strength and direction of this monotonic relationship. On the other
hand if, for example, the relationship appears linear (assessed via scatterplot) you
would run a Pearson's correlation because this will measure the strength and direction
of any linear relationship. You will not always be able to visually check whether you
have a monotonic relationship, so in this case, you might run a Spearman's correlation
anyway.

How to rank data?

In some cases your data might already be ranked, but often you will find that you need
to rank the data yourself (or use SPSS Statistics to do it for you). Thankfully, ranking
data is not a difficult task and is easily achieved by working through your data in a table.
Let us consider the following example data regarding the marks achieved in a maths
and English exam:

  Marks

English 56 75 45 71 61 64 58 80 76 61

Maths 66 70 40 60 65 56 59 77 67 63

The procedure for ranking these scores is as follows:

First, create a table with four columns and label them as below:

English (mark) Maths (mark) Rank (English) Rank (maths)

56 66 9 4

75 70 3 2

45 40 10 10

71 60 4 7

61 65 6.5 5

64 56 5 9

58 59 8 8

80 77 1 1

76 67 2 3

61 63 6.5 6

You need to rank the scores for maths and English separately. The score with the
highest value should be labelled "1" and the lowest score should be labelled "10" (if
your data set has more than 10 cases then the lowest score will be how many cases
you have). Look carefully at the two individuals that scored 61 in the English exam
(highlighted in bold). Notice their joint rank of 6.5. This is because when you have two
identical values in the data (called a "tie"), you need to take the average of the ranks
that they would have otherwise occupied. We do this because, in this example, we have
no way of knowing which score should be put in rank 6 and which score should be
ranked 7. Therefore, you will notice that the ranks of 6 and 7 do not exist for English.
These two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each of these
"tied" scores.

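The ranking rule described above, including the averaging of tied ranks, can be sketched as a short Python function. The function name `average_ranks` is made up for this illustration; the marks are the ones from the table.

```python
def average_ranks(scores):
    """Rank scores so the highest gets rank 1; tied scores share the mean of their ranks."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for value in set(scores):
        first = ordered.index(value) + 1          # best rank this value occupies
        count = ordered.count(value)              # number of scores tied at this value
        rank_of[value] = first + (count - 1) / 2  # mean of ranks first..first+count-1
    return [rank_of[s] for s in scores]

english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

english_ranks = average_ranks(english)
maths_ranks = average_ranks(maths)
print(english_ranks)
print(maths_ranks)
```

The two English marks of 61 would otherwise occupy ranks 6 and 7, so both receive 6.5.
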
What is the definition of Spearman's rank-order correlation?

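There are two methods to calculate Spearman's correlation depending on whether: (1)
your data does not have tied ranks or (2) your data has tied ranks. The formula for when
there are no tied ranks is:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where d_i = difference in paired ranks and n = number of cases. The formula to use when
there are tied ranks is:

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

where i = paired score, and x_i and y_i are the ranks of the i-th pair.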

What values can the Spearman correlation coefficient, rs, take?

The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1
indicates a perfect positive association of ranks, an rs of zero indicates no association
between ranks and an rs of -1 indicates a perfect negative association of ranks. The closer
rs is to zero, the weaker the association between the ranks.

An example of calculating Spearman's correlation


To calculate a Spearman rank-order correlation on data without any ties we will use the
following data:

  Marks

English 56 75 45 71 62 64 58 80 76 61

Maths 66 70 40 60 65 56 59 77 67 63

We then complete the following table:

English (mark)  Maths (mark)  Rank (English)  Rank (maths)  d  d²
56              66            9               4             5  25
75              70            3               2             1  1
45              40            10              10            0  0
71              60            4               7             3  9
62              65            6               5             1  1
64              56            5               9             4  16
58              59            8               8             0  0
80              77            1               1             0  0
76              67            2               3             1  1
61              63            7               6             1  1

Where d = difference between ranks and d² = difference squared.

We then calculate the following:

$$\sum d_i^2 = 25 + 1 + 0 + 9 + 1 + 16 + 0 + 0 + 1 + 1 = 54$$

We then substitute this into the main equation with the other information as follows:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 0.67$$

as n = 10. Hence, we have a ρ (or rs) of 0.67. This indicates a strong positive
relationship between the ranks individuals obtained in the maths and English exam.
That is, the higher you ranked in maths, the higher you ranked in English also, and vice
versa.
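
The same calculation can be checked with a short Python sketch that ranks the marks, sums the squared rank differences and applies the no-ties formula. The function names are made up for this illustration, and the ranking helper assumes there are no tied scores.

```python
def ranks_desc(scores):
    """Rank distinct scores so that the highest score gets rank 1 (no ties assumed)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def sum_squared_rank_diffs(x, y):
    """Sum of squared differences between the paired ranks of x and y."""
    return sum((a - b) ** 2 for a, b in zip(ranks_desc(x), ranks_desc(y)))

def spearman_no_ties(x, y):
    """Spearman's coefficient via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    return 1 - 6 * sum_squared_rank_diffs(x, y) / (n * (n ** 2 - 1))

english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

sum_d2 = sum_squared_rank_diffs(english, maths)
rho = spearman_no_ties(english, maths)
print(sum_d2, round(rho, 2))  # 54 0.67
```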

How do you report a Spearman's correlation?

How you report a Spearman's correlation coefficient depends on whether or not you
have determined the statistical significance of the coefficient. If you have simply run the
Spearman correlation without any statistical significance tests, you can simply
state the value of the coefficient as shown below:

ρ = 0.67 (or rs = 0.67)

However, if you have also run statistical significance tests, you need to include some
more information as shown below:

rs(df) = [coefficient], p = [p-value]

where df = N – 2, where N = number of pairwise cases.

How do you express the null hypothesis for this test?

The general form of a null hypothesis for a Spearman correlation is:

H0: There is no [monotonic] association between the two variables [in the population].

Remember, you are making an inference from your sample to the population that the
sample is supposed to represent. However, as this is a general understanding of
an inferential statistical test, it is often not included. A null hypothesis statement for the
example used earlier in this guide would be:

H0: There is no [monotonic] association between maths and English marks.

How do I interpret a statistically significant Spearman correlation?

It is important to realize that statistical significance does not indicate the strength of
Spearman's correlation. In fact, the statistical significance testing of the Spearman
correlation does not provide you with any information about the strength of the
relationship. Thus, achieving a value of p = 0.001, for example, does not mean that the
relationship is stronger than if you achieved a value of p = 0.04. This is because the
significance test is investigating whether you can reject or fail to reject the null
hypothesis. If you set α = 0.05, achieving a statistically significant Spearman rank-order
correlation means that, if the null hypothesis were true, there would be less than a 5%
chance of observing an association at least as strong as the one you found (your ρ
coefficient).

Split-Half Basic Concepts
One way to test the reliability of a test is to repeat the test. This is not always possible.
Another approach, which is applicable to questionnaires, is to divide the test into even
and odd questions and compare the results.

Example 1: 12 students take a test with 50 questions. For each student the total score is
recorded along with the sum of the scores for the even questions and the sum of the
scores for the odd questions as shown in Figure 1. Determine whether the test is reliable
by using the split-half methodology.

Figure 1 – Split-half methodology for Example 1


The statistical test consists of looking at the correlation coefficient (cell G3 of Figure 1).
If it is high then the questionnaire is considered to be reliable.

r = CORREL(C4:C15,D4:D15) = 0.667277


See Basic Concepts of Correlation for more information about the correlation
coefficient r.
One problem with the split-half reliability coefficient is that, since only half the number
of items is used, the reliability coefficient is reduced. To get a better estimate of the
reliability of the full test, we apply the Spearman-Brown correction, namely:

$$\rho = \frac{2r}{1 + r} = \frac{2(0.667277)}{1 + 0.667277} = 0.800439$$

This result shows that the test is quite reliable.
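
As a quick check of the arithmetic, the equal-halves Spearman-Brown correction 2r / (1 + r) can be written as a one-line Python function. The name `spearman_brown` is chosen for this sketch and is not the Real Statistics function.

```python
def spearman_brown(r):
    """Spearman-Brown correction for a split into two halves of equal length."""
    return 2 * r / (1 + r)

half_test_r = 0.667277  # split-half correlation from Example 1
corrected = spearman_brown(half_test_r)
print(round(corrected, 6))  # 0.800439
```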

This version of the Spearman-Brown correction works properly when the two halves
have equal length. If not, then we can use the following formula (provided r ≠ ±1):

$$\rho = \frac{r\sqrt{r^2 + 2c(1 - r^2)} - r^2}{c(1 - r^2)}$$

where c = 2p(1–p) and p = the proportion of the test due to the first half. Note that if
the two halves are equal, then c = 2(.5)(.5) = .5, and so

$$\rho = \frac{r\sqrt{r^2 + (1 - r^2)} - r^2}{.5(1 - r^2)} = \frac{2r(1 - r)}{(1 - r)(1 + r)} = \frac{2r}{1 + r}$$

Note that if a test has an odd number of items 2n + 1, then p = n/(2n+1), and so

$$c = \frac{2n(n + 1)}{(2n + 1)^2}$$

For example, if we obtain a correlation coefficient of .6 on a 2-3 split of a 5 question test,
then c = 2(.4)(.6) = .48 and so

$$\rho = \frac{.6\sqrt{.36 + .96(.64)} - .36}{.48(.64)} = \frac{.2323}{.3072} = .756$$

which is slightly higher than the result that would be obtained if we assumed an even
number of questions, i.e.

$$\rho = \frac{2(.6)}{1 + .6} = .75$$

Note that SB_CORRECTION(.6,5,2) = .756 using the Real Statistics function described
next.
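
The unequal-length case can be sketched in Python as well. This sketch assumes the correction takes the form (r·sqrt(r² + 2c(1 − r²)) − r²) / (c(1 − r²)) with c = 2p(1 − p), which reproduces both the .756 figure above and the usual 2r / (1 + r) when the halves are equal. The function `sb_unequal` is a hypothetical stand-in, not the Real Statistics SB_CORRECTION function itself.

```python
from math import sqrt

def sb_unequal(r, n_items, m_first):
    """Split-half correction when the first part has m_first of n_items items.

    Assumed form: (r*sqrt(r^2 + 2c(1 - r^2)) - r^2) / (c(1 - r^2)), c = 2p(1 - p).
    """
    if abs(r) == 1:
        raise ValueError("the formula is undefined for r = +1 or r = -1")
    p = m_first / n_items          # proportion of the test in the first part
    c = 2 * p * (1 - p)
    return (r * sqrt(r ** 2 + 2 * c * (1 - r ** 2)) - r ** 2) / (c * (1 - r ** 2))

uneven = sb_unequal(0.6, 5, 2)     # 2-3 split of a 5 question test
even = sb_unequal(0.6, 10, 5)      # equal halves reduce to 2r / (1 + r)
print(round(uneven, 3), round(even, 3))  # 0.756 0.75
```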

Real Statistics Functions: The Real Statistics Resource Pack contains the following
functions:
SB_CORRECTION(r, n, m) = Spearman-Brown correction when the split-half
correlation based on an m vs. n-m split is r. If n is omitted, then it is assumed that there
is a 50-50 split. If n is present, but m is omitted, then it is assumed that m = n/2.
SB_SPLIT(R1, s) = split half coefficient (after Spearman-Brown correction) for data in
R1 based on the split described by the string s. String s consists of 0’s and 1’s where each
character in the string corresponds to one column in R1 (thus the length of s must be
equal to the number of columns in R1).
SPLIT_HALF(R1, R2) = split half coefficient (after Spearman-Brown correction) for
data in ranges R1 and R2; assumes a 50-50 split.
SPLITHALF(R1, type) = split-half measure for the scores in the first half of the items
in R1 vs. the second half of the items if type = 0 and the odd items in R1 vs. the even
items if type = 1.
The SPLIT_HALF function ignores any empty cells and cells with non-numeric values.
This is not so for the SPLITHALF function.

For Example 1, SPLIT_HALF(C4:C15, D4:D15) = .800439.

Example 2: Calculate the split half coefficient of the ten question questionnaire using a
Likert scale (1 to 7) given to 15 people whose results are shown in Figure 2.
Figure 2 – Data for Example 2
We first split the questions into the two halves: Q1-Q5 and Q6-Q10, as shown in Figure
3.

Figure 3 – Split-half coefficient (Q1-Q5 v. Q6-Q10)


E.g. the formula in cell B23 is =SUM(B4:F4) and the formula in cell C23 is
=SUM(G4:K4). The coefficient 0.64451 (cell H24) can be calculated as in Example 1.
Alternatively, the coefficient can be calculated by the worksheet formula
=SPLIT_HALF(B23:B37,C23:C37) or =SPLITHALF(B4:K18,0).

We can also split the questionnaire into odd and even questions, as shown in Figure 4.
Figure 4 – Split-half coefficient (odd v. even)
E.g. the formula in cell L23 is =B4+D4+F4+H4+J4 and the formula in cell M23 is
=C4+E4+G4+I4+K4. The coefficient 0.698813 (cell R24) can be calculated as in
Example 1. Alternatively, the coefficient can be calculated by the Real Statistics formula
=SPLIT_HALF(L23:L37,M23:M37) or =SPLITHALF(B4:K18,1).
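
Outside a spreadsheet, both kinds of split can be sketched in Python. The data below are hypothetical (the values from Figure 2 are not reproduced here), the function `split_half` is an illustrative helper rather than the Real Statistics function, and it assumes an even number of items so that the equal-halves Spearman-Brown correction applies.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half(rows, odd_even=False):
    """Spearman-Brown corrected split-half coefficient (even item count assumed)."""
    if odd_even:
        first = [sum(row[0::2]) for row in rows]    # items 1, 3, 5, ...
        second = [sum(row[1::2]) for row in rows]   # items 2, 4, 6, ...
    else:
        half = len(rows[0]) // 2
        first = [sum(row[:half]) for row in rows]   # first half of the items
        second = [sum(row[half:]) for row in rows]  # second half of the items
    r = pearson_r(first, second)
    return 2 * r / (1 + r)

# Hypothetical Likert responses (1 to 7): one row per person, one column per question.
answers = [
    [7, 6, 7, 6, 7, 6],
    [2, 3, 2, 2, 1, 2],
    [5, 5, 4, 5, 4, 4],
    [3, 4, 3, 3, 4, 3],
    [6, 6, 5, 7, 6, 6],
    [1, 2, 2, 1, 2, 1],
]

first_vs_second = split_half(answers)
odd_vs_even = split_half(answers, odd_even=True)
print(round(first_vs_second, 3), round(odd_vs_even, 3))
```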
