ERPV38 1. Drost E. 2011. Validity and Reliability in Social Science Research
Introduction
An important part of social science research is the quantification
of human behaviour, that is, using measurement instruments to
observe human behaviour. The measurement of human behaviour
belongs to the widely accepted positivist view, or empirical-analytic approach, to discern reality (Smallbone & Quinton, 2004).
Because most behavioural research takes place within this
paradigm, measurement instruments must be valid and reliable.
The objective of this paper is to provide insight into these two
important concepts, and to introduce the major methods to assess
validity and reliability as they relate to behavioural research. The
paper has been written for the novice researcher in the social
sciences. It presents a broad overview taken from traditional
literature, not a critical account of the general problem of validity
of research information.
The paper is organised as follows. The first section presents what
reliability of measurement means and the techniques most
frequently used to estimate reliability. Three important questions
researchers frequently ask about reliability are discussed: (1) what
Ellen Drost
affects the reliability of a test?, (2) how can a test be made more
reliable?, and (3) what is a satisfactory level of reliability? The
second section presents what validity means and the methods to
develop strong support for validity in behavioural research. Four
types of validity are introduced: (1) statistical conclusion validity,
(2) internal validity, (3) construct validity and (4) external validity.
Approaches to substantiate them are also discussed. The paper
concludes with a summary and suggestions.
Reliability
Reliability is a major concern when a psychological test is used to
measure some attribute or behaviour (Rosenthal and Rosnow,
1991). For instance, to understand the functioning of a test, it is
important that the test which is used consistently discriminates
individuals at one time or over a course of time. In other words,
reliability is the extent to which measurements are repeatable
when different persons perform the measurements, on different
occasions, under different conditions, with supposedly alternative
instruments which measure the same thing. In sum, reliability is
consistency of measurement (Bollen, 1989), or stability of
measurement over a variety of conditions in which basically the
same results should be obtained (Nunnally, 1978).
Data obtained from behavioural research studies are influenced by
random errors of measurement. Measurement errors come either in
the form of systematic error or random error. A good example is a
bathroom scale (Rosenthal and Rosnow, 1991). Systematic error
would be at play if you repeatedly weighed yourself on a
bathroom scale which provided you with a consistent measure of
your weight, but was always 10lb. heavier than it should be.
Random error would be at work if the scale was accurate, but you
misread it while weighing yourself. Consequently, on some
occasions, you would read your weight as being slightly higher
and on other occasions as slightly lower than it actually was.
These random errors would, however, cancel out, on the average,
over repeated measurements on a single person. On the other
hand, systematic errors do not cancel out; these contribute to the
mean score of all subjects being studied, causing the mean value
to be either too big or too small. Thus, if a person repeatedly
weighed him/herself on the same bathroom scale, he/she would
not get the exact same weight each time, but assuming the small
variations are random and cancel out, he/she would estimate
his/her weight by averaging the values. However, should the scale
always give a weight that is 10lb. too high, taking the average will
not cancel this systematic error; it can, however, be compensated for by
subtracting 10lb. from the person's average weight. Systematic
errors are a main concern of validity.
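The bathroom scale example can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the true weight of 150 lb, the normally distributed reading error, and the function name `simulate_readings` are hypothetical and do not come from the literature cited above. It combines both kinds of error in a single scale to show that averaging removes the random component but leaves the constant 10 lb offset.

```python
import random
import statistics

def simulate_readings(true_weight, bias, noise_sd, n, seed=0):
    """Return n simulated readings: true weight + constant bias + random noise."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [true_weight + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_WEIGHT = 150.0  # hypothetical true weight in lb (assumption)
BIAS = 10.0          # systematic error: the scale always reads 10 lb heavy

readings = simulate_readings(TRUE_WEIGHT, BIAS, noise_sd=1.0, n=10_000)

# Averaging cancels the random error but not the systematic error.
mean_reading = statistics.mean(readings)

# The systematic error is removed only by subtracting the known offset.
corrected = mean_reading - BIAS

print(f"mean reading:   {mean_reading:.2f} lb")
print(f"bias-corrected: {corrected:.2f} lb")
```

The mean of the readings settles near the true weight plus the offset, which is why a measure of this kind can be highly repeatable and still be wrong.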
There are many ways that random errors can influence
measurements in tests. For example, if a test only contains a small
number of items, how well students perform on the test will
depend to some extent on their luck in knowing the right answers.
Also, when a test is given on a day that the student does not feel
well, he/she might not perform as strongly as he/she would
normally. Lastly, when the student guesses answers on a test, such
guessing adds an element of randomness or unreliability to the
overall test results (Nunnally, 1978).
In sum, numerous sources of error may be introduced by the
variations in other forms of the test, by the situational factors that
influence the behaviour of the subjects under study, by the
approaches used by the different examiners, and by other factors
of influence. Hence, the researcher (or science, in general) is
limited by the reliability of the measurement instruments and/or by
the reliability with which he/she uses them.
Somewhat confusing to the novice researcher is the notion that a
reliable measure is not necessarily a valid measure. Bollen (1989)
explains that reliability is that part of a measure that is free of
purely random error and that nothing in the description of
reliability requires that the measure be valid. It is possible to have
a very reliable measure that is not valid. The bathroom scale
example described earlier clearly illustrates this point. Thus,
reliability is a necessary but not a sufficient condition for validity
(Nunnally, 1978).
Estimates of reliability
Because reliability is consistency of measurement over time or
stability of measurement over a variety of conditions, the most
commonly used technique to estimate reliability is with a measure
of association, the correlation coefficient, often termed reliability
coefficient (Rosenthal and Rosnow, 1991). The reliability
coefficient is the correlation between two or more variables (here
tests, items, or raters) which measure the same thing.
Typical methods to estimate test reliability in behavioural research
are: test-retest reliability, alternative forms, split-halves, inter-rater
reliability, and internal consistency. There are three main concerns
in reliability testing: equivalence, stability over time, and internal
consistency. These concerns and approaches to reliability testing
are depicted in Figure 1. Each will be discussed next.
Test-retest reliability. Test-retest reliability refers to the temporal
stability of a test from one measurement session to another. The
procedure is to administer the test to a group of respondents and
then administer the same test to the same respondents at a later
date. The correlation between scores on the identical tests given at
different times operationally defines its test-retest reliability.
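As a minimal sketch of this procedure, the test-retest coefficient can be computed as the Pearson correlation between the two administrations. The scores below are invented for illustration, and `pearson_r` is a hypothetical helper name, not a function from any cited source.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of the same six respondents on the same test,
# administered at time t and again at time t + 1.
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 10]

test_retest = pearson_r(time_1, time_2)
print(f"test-retest reliability: {test_retest:.3f}")
```

A coefficient close to 1 indicates that respondents kept roughly the same rank order across the two sessions; it says nothing about whether the test measures what it is intended to measure.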
Despite its appeal, the test-retest reliability technique has several
limitations (Rosenthal & Rosnow, 1991). For instance, when the
interval between the first and second test is too short, respondents
might remember what was on the first test and their answers on
the second test could be affected by memory. Alternatively, when
the interval between the two tests is too long, maturation happens.
Maturation refers to changes in the subject factors or respondents
(other than those associated with the independent variable) that
occur over time and cause a change from the initial measurements
to the later measurements (t and t + 1). During the time between
the two tests, the respondents could have been exposed to things
which changed their opinions, feelings or attitudes about the
behaviour under study.
[Figure 1. Reliability: the three concerns (stability over time, equivalence, internal consistency) and the approaches to testing them (test-retest, alternative forms, split-half, inter-rater, Cronbach alpha).]
[Table: ratings given by each of the two judges to Subjects 1 through 10.]
The correlation between the ratings made by the two judges will
tell us the reliability of either judge in the specific situation. The
composite reliability of both judges, referred to as effective
reliability, is calculated using the Spearman-Brown formula (see
Rosenthal & Rosnow, 1991, pp. 51-55).
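The Spearman-Brown calculation mentioned above can be sketched as follows. The formula R = n r / (1 + (n - 1) r) is the standard Spearman-Brown form, where r is the (mean) correlation between judges and n is the number of judges; the function name `effective_reliability` and the example value r = .60 are assumptions made for illustration.

```python
def effective_reliability(mean_r, n_judges):
    """Spearman-Brown composite reliability of n_judges judges.

    mean_r is the correlation between judges (for two judges, the single
    inter-judge correlation; for more, the mean of the pairwise correlations).
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    denominator = 1 + (n_judges - 1) * mean_r
    if denominator == 0:
        raise ValueError("formula is undefined for this combination of inputs")
    return n_judges * mean_r / denominator

# Hypothetical example: two judges whose ratings correlate .60.
r_single = 0.60
r_both = effective_reliability(r_single, n_judges=2)
print(f"reliability of either judge:  {r_single:.2f}")
print(f"effective reliability of both: {r_both:.2f}")
```

With these illustrative numbers the composite of the two judges is more reliable than either judge alone, which is the reason for pooling ratings.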
correlate .40 with true scores, and a 12-item test might correlate
.80 with true scores.
Consequently, the individual item would be expected to have only
a small correlation with true scores. Thus, if coefficient alpha
proves to be very low, either the test is too short or the items have
very little in common. Coefficient alpha is useful for estimating
reliability for item-specific variance in a unidimensional test
(Cortina, 1993). That is, it is useful once the existence of a single
factor or construct has been determined (Cortina, 1993). To
conclude this section, three important questions researchers
frequently ask about reliability are considered next.
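For readers who want to see how coefficient alpha is obtained from item scores, a minimal sketch follows. It uses the standard variance form of Cronbach's (1951) alpha, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), with k the number of items. The response data and the function name `cronbach_alpha` are hypothetical.

```python
import statistics

def cronbach_alpha(scores):
    """Coefficient alpha for a table of scores.

    scores: list of rows, one row per respondent and one column per item.
    """
    n_respondents = len(scores)
    k = len(scores[0])
    if k < 2 or n_respondents < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item across respondents.
    item_variances = [
        statistics.variance([row[j] for row in scores]) for j in range(k)
    ]
    # Sample variance of each respondent's total score.
    total_variance = statistics.variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of five respondents to a four-item scale (1 to 5).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"coefficient alpha: {alpha:.3f}")
```

Because the formula only compares item variances with total-score variance, a high value does not by itself show that the items measure a single construct; as noted above, unidimensionality has to be established separately.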
What factors affect the reliability of a test?
There are many factors that prevent measurements from being
exactly repeatable or replicable. These factors depend on the
nature of the test and how the test is used (Nunnally, 1978). It is
important to make a distinction between errors of measurement
that cause variation in performance within a test, and errors of
instrumentation that are apparent only in variation in performance
on different forms of a test.
Sources of error within a test. A major source of error within a
test is attributable to the sampling of items. Because each person
has the same probability of answering an item correctly, the higher
the number of items on the test, the lower the amount of error in
the test as a whole. However, error due to item sampling is
entirely predictable from the average correlation, thus coefficient
alpha would be the correct measure of reliability. Other examples
of sources of errors on tests are: guessing on a test, marking
answers incorrectly (clerical errors), skipping a question
inadvertently, and misinterpreting test instructions.
On subjective tests, such as essay tests, measurement errors are
often caused by fluctuations in standards by the individual grader
and by the differences in standards of different graders. For
example, on an essay examination the instructor might grade all
Validity
Validity is concerned with the meaningfulness of research
components. When researchers measure behaviours, they are
concerned with whether they are measuring what they intended to
measure. Does the IQ test measure intelligence? Does the GRE
actually predict successful completion of a graduate study
program? These are questions of validity and even though they
can never be answered with complete certainty, researchers can
develop strong support for the validity of their measures (Bollen,
1989).
Construct validity
If a relationship is causal, what are the particular cause and effect
behaviours or constructs involved in the relationship? Construct
validity refers to how well you translated or transformed a
concept, idea, or behaviour (that is, a construct) into a
functioning and operating reality, the operationalisation (Trochim,
2006). Substantiating construct validity involves accumulating
evidence for six validity types: face validity, content validity,
concurrent and predictive validity, and convergent and
discriminant validity. Trochim (2006) divided these six types into
two categories: translation validity and criterion-related validity.
These two categories and their respective validity types are
depicted in Figure 2 and discussed in turn, next.
Translation Validity. Translation validity centres on whether the
operationalisation reflects the true meaning of the construct.
Translation validity attempts to assess the degree to which
constructs are accurately translated into the operationalisation,
using subjective judgment (face validity) and by examining the content
domain (content validity).
Face Validity. Face validity is a subjective judgment on the
operationalisation of a construct. For instance, one might look at a
measure of reading ability, read through the paragraphs, and
decide that it seems like a good measure of reading ability. Even
though subjective judgment is needed throughout the research
process, the aforementioned method of validation is not very
convincing to others as a valid judgment. As a result, face validity
is often seen as a weak form of construct validity.
[Figure 2. Construct validity: translation validity (face validity, content validity) and criterion-related validity (predictive validity, concurrent validity, convergent validity, discriminant validity).]
[Figure: Behaviours 1 to 5 measured with each of Methods 1 to 5.]
External validity
If there is a causal relationship from construct X to construct Y,
how generalisable is this relationship across persons, settings, and
times? External validity of a study or relationship implies
generalising to other persons, settings, and times. Generalising to
well-explained target populations should be clearly differentiated
from generalising across populations. Each is truly relevant to
external validity: the former is critical in determining whether any
research objectives which specified populations have been met,
and the latter is crucial in determining which different populations
have been affected by a treatment to assess how far one can
generalise (Cook & Campbell, 1979).
For instance, if there is an interaction between an educational
treatment and the social class of children, then we cannot infer that
the same result holds across social classes. Thus, Cook and
Campbell (1979) prefer generalising across achieved (my
emphasis) populations, in which case threats to external validity
Conclusion
This paper was written to provide the novice researcher with
insight into two important concepts in research methodology:
reliability and validity. Based upon recognised and classical works
from the literature, the paper has clarified the meaning of
reliability of measurement and the general problem of validity in
behavioural research. The most frequently used techniques to
assess reliability and validity were presented to highlight their
conceptual relationships. Three important questions researchers
frequently ask about what affects reliability of their measures and
how to improve reliability were also discussed with examples
from the literature. Four types of validity were introduced:
statistical conclusion validity, internal validity, construct validity
and external validity or generalisability. The approaches to
substantiate the validity of measurements have also been presented
with examples from the literature. A final discussion on common
method variance has been provided to highlight this prevalent
threat to validity in behavioural research. The paper was intended
to provide an insight into these important concepts and to
encourage students in the social sciences to continue studying to
advance their understanding of research methodology.
References
Avolio, B. J., Yammarino, F. J. and Bass, B. M. (1991).
Identifying Common Methods Variance With Data Collected
From A Single Source: An Unresolved Sticky Issue. Journal of
Management, 17 (3), 571-587.
Bollen, K. A. (1989). Structural Equations with Latent Variables
(pp. 179-225). John Wiley & Sons.
Brinberg, D. and McGrath, J. E. (1982). A Network of Validity
Concepts Within the Research Process. In Brinberg, D. and
Kidder, L. H., (Eds), Forms of Validity in Research, pp. 5-23.
Campbell, D. T. and Fiske, D. W. (1959). Convergent and
discriminant validation by the multitrait-multimethod matrix.
Psychological Bulletin, 56, 81-105.
Chapman, L. J. and Chapman, J. P. (1969). Illusory correlations as
an obstacle to use of valid psychodiagnostic signs. Journal of
Abnormal Psychology, 74, 271-280.
Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation:
Design & Analysis Issues for Field Settings. Boston: Houghton
Mifflin Company, pp. 37-94.
Cortina, J. M. (1993). What is Coefficient Alpha? An Examination
of Theory and Applications. Journal of Applied Psychology, 78
(1), 98-104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure
of tests. Psychometrika, 16(3), 297-334.
Dansereau, F., Alutto, J. A. and Yammarino, F. J. (1984). Theory
testing in organizational behavior: The variant approach.
Englewood Cliffs, NJ: Prentice Hall.
Fiske, D. W. (1982). Convergent-Discriminant Validation in
Measurements and Research Strategies. In Brinberg, D. and
Kidder, L. H., (Eds), Forms of Validity in Research, pp. 77-93.
Nunnally, J. C. (1978). Psychometric Theory. McGraw-Hill Book
Company, pp. 86-113, 190-255.
Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric Theory.
New York, NY: McGraw Hill.
Podsakoff, P. M. and Organ, D. W. (1986). Self-reports in
organizational research: Problems and prospects. Journal of
Management, 12, 531-544.