The Place of Inter-Rater Reliability in Qualitative Research: An Empirical Study

Article in Sociology · August 1997
DOI: 10.1177/0038038597031003015



Sociology August 1997 v31 n3 p597(10)

The place of inter-rater reliability in qualitative research: an empirical study.
by David Armstrong, Ann Gosling, John Weinman and Theresa Marteau

Assessing inter-rater reliability, whereby data are independently coded and the codings compared
for agreement, is a recognised process in quantitative research. However, its applicability to
qualitative research is less clear: should researchers be expected to identify the same codes or
themes in a transcript or should they be expected to produce different accounts? Some
qualitative researchers argue that assessing inter-rater reliability is an important method for
ensuring rigour, others that it is unimportant; and yet it has never been formally examined in an
empirical qualitative study. Accordingly, to explore the degree of inter-rater reliability that might
be expected, six researchers were asked to identify themes in the same focus group transcript.
The results showed close agreement on the basic themes but each analyst ‘packaged’ the themes
differently.

Key words: inter-rater reliability, qualitative research, research methods.

© COPYRIGHT 1997 British Sociological Association Publication Ltd. (BSA)

Reliability and validity are fundamental concerns of the quantitative researcher but seem to have
an uncertain place in the repertoire of the qualitative methodologist. Indeed, for some
researchers the problem has apparently disappeared: as Denzin and Lincoln have observed, ‘Terms
such as credibility, transferability, dependability and confirmability replace the usual
positivist criteria of internal and external validity, reliability and objectivity’ (1994:14).
Nevertheless, the ghost of reliability and validity continues to haunt qualitative methodology
and different researchers in the field have approached the problem in a number of different ways.

One strategy for addressing these concepts is that of ‘triangulation’. This device, it is claimed,
follows from navigation science and the techniques deployed by surveyors to establish the accuracy
of a particular point (though it bears remarkable similarities to the psychometric concepts of
convergent and construct validity). In this way, it is argued, diverse confirmatory instances in
qualitative research lend weight to findings. Denzin (1978) suggested that triangulation can
involve a variety of data sources; multiple theoretical perspectives to interpret a single set of
data; multiple methodologies to study a single problem; and several different researchers or
evaluators. This latter form of triangulation implies that the difference between researchers can
be used as a method for promoting better understanding. But what role is there for the more
traditional concept of reliability? Should the consistency of researchers’ interpretations, rather
than their differences, be used as a support for the status of any findings?

In general, qualitative methodologies do not make explicit use of the concept of inter-rater
reliability to establish the consistency of findings from an analysis conducted by two or more
researchers. However, the concept emerges implicitly in descriptions of procedures for carrying
out the analysis of qualitative data. The frequent stress on an analysis being better conducted as
a group activity suggests that results will be improved if one view is tempered by another.
Waitzkin described meeting with two research assistants to discuss and negotiate agreements and
disagreements about coding in a process he described as ‘hashing out’ (1991:69). Another example
is afforded by Olesen and her colleagues (1994) who described how they (together with their
graduate students - a standard resource in these reports) ‘debriefed’ and ‘brainstormed’ to pull
out first-order statements from respondents’ accounts and agree them. Indeed, in commenting on
Olesen and her colleagues’ work, Bryman and Burgess (1994) wondered whether members of teams
should produce separate analyses and then resolve any discrepancies, or whether joint meetings
should generate a single, definitive coded set of materials.

Qualitative methodologists are keen on stressing the transparency of their technique, for example,
in carefully documenting all steps, presumably so that they can be ‘checked’ by another
researcher: ‘by keeping all collected data in well-organized, retrievable form, researchers can
make them available easily if the findings are challenged or if another researcher wants to
reanalyze the data’ (Marshall and Rossman 1989:146). Although there is no formal description of
how any reanalysis of data might be used, there is clearly an assumption that comparison with the
original findings can be used to reject, or sustain, any challenge to the original
interpretations. In other words, there is an implicit notion of reliability within the call for
transparency of technique.

Unusually for a literature that is so opaque about the importance of independent analyses of a
single dataset,
Mays and Pope explicitly use the term ‘reliability’ and, moreover, claim that it is a significant
criterion for assessing the value of a piece of qualitative research: ‘the analysis of qualitative
data can be enhanced by organising an independent assessment of transcripts by additional skilled
qualitative researchers and comparing agreement between the raters’ (1995:110). This approach,
they claim, was used by Daly et al. (1992) in a study of clinical encounters between cardiologists
and their patients when the transcripts were analysed by the principal researcher and ‘an
independent panel’, and the level of agreement assessed. However, ironically, the procedure
described by Daly et al. was actually one of ascribing quantitative weights to pregiven
‘variables’ which were then subjected to statistical analysis (1992:204).

A contrary position is taken by Morse who argues that the use of ‘external raters’ is more suited
to quantitative research; expecting another researcher to have the same ‘insights’ from a limited
data base is unrealistic: ‘No-one takes a second reader to the library to check that indeed he or
she is interpreting the original sources correctly, so why does anyone need a reliability checker
for his or her data?’ (Morse 1994:231). This latter position is taken further by those so-called
‘post-modernist’ qualitative researchers (Vidich and Lyman 1994) who would challenge the whole
notion of consistency in analysing data. The researcher’s analysis bears no direct correspondence
with any underlying ‘reality’ and different researchers would be expected to offer different
accounts as reality itself (if indeed it can be accessed) is characterised by multiplicity. For
example, Tyler (1986) claims that a qualitative account cannot be held to ‘represent’ the social
world, rather it ‘evokes’ it - which means, presumably, that different researchers would offer
different evocations. Hammersley (1991) by contrast argues that this position risks privileging
the rhetorical over the ‘scientific’ and argues that quality of argument and use of evidence
should remain the arbiters of qualitative accounts; in other words, a place remains for some sort
of correspondence between the description and reality that would allow a role for ‘consistency’.
Presumably this latter position would be supported by most qualitative researchers, particularly
those drawing inspiration from Glaser and Strauss’s seminal text which claimed that the virtue of
inductive processes was that they ensured that theory was ‘closely related to the daily realities
(what is actually going on) of substantive areas’ (1967:239).

In summary, the debates within qualitative methodology on the place of the traditional concept of
reliability (and validity) remain confused. On the one hand are those researchers such as Mays and
Pope who believe reliability should be a benchmark for judging qualitative research; and, more
commonly, those who reject the term but allow the concept to creep into their work. On the other
hand are those who adopt such a relativist position that issues of consistency are meaningless as
all accounts have some ‘validity’ whatever their claims. A theoretical resolution of these
divergent positions is impossible as their core ontological assumptions are so different. Yet this
still leaves a simple empirical question: do qualitative researchers actually show consistency in
their accounts? The answer to this question may not resolve the methodological confusion but it
may clarify the nature of the debate. If accounts do diverge then for the modernists there is a
methodological problem and for the postmodernists a confirmation of diversity; if accounts are
similar, the modernists’ search for measures of consistency is reinforced and the postmodernists
need to recognise that accounts do not necessarily recognise the multiple character of reality.

The purpose of the study was to see the extent to which researchers show consistency in their
accounts and involved asking a number of qualitative researchers to identify themes in the same
data set. These accounts were then themselves subjected to analysis to identify the degree of
concordance between them.

Method

As part of a wider study of the relationship between perceptions of disability and genetic
screening, a number of focus groups were organised. One of these focus groups consisted of adults
with cystic fibrosis (CF), a genetic disorder affecting the secretory tissues of the body,
particularly the lung. Not only might these adults with cystic fibrosis have particular views of
disability but theirs was a condition for which widespread genetic screening was being advocated.
The aim of such a screening programme was to identify ‘carriers’ of the gene so that their
reproductive decisions might be influenced to prevent the birth of children with the disorder.

The focus group was invited to discuss the topic of genetic screening. The session was introduced
with a brief summary of what screening techniques were currently available and then discussion
from the group on views of genetic screening was invited and facilitated. The ensuing discussion
was tape recorded and transcribed. Six experienced qualitative investigators in Britain and the
United States who had some interest in this area of work were approached and asked if they would
‘analyse’ the transcript and prepare an independent report on it, identifying, and where possible
rank ordering, the main themes emerging from the discussion (with a maximum of five themes). The
analysts were offered a fee for this work.
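
One simple way of comparing independent reports of this kind is to tally, for each theme, the
proportion of analysts who named it. The Python sketch below illustrates the idea only. The analyst
labels, theme names and data are hypothetical and are not taken from the study, and the function
`theme_agreement` is an illustrative helper rather than a procedure used by the authors, who
compared the reports by judgement. The tally also assumes that differently worded labels have
already been mapped by hand onto a common vocabulary.

```python
from collections import Counter


def theme_agreement(reports):
    """Return, for each theme, the fraction of analysts whose report names it.

    `reports` maps an analyst label to the list of theme labels in that
    analyst's report (hypothetical structure, assumed for illustration).
    """
    n_analysts = len(reports)
    # set() ensures a theme is counted at most once per analyst.
    counts = Counter(theme for themes in reports.values() for theme in set(themes))
    return {theme: count / n_analysts for theme, count in counts.items()}


# Hypothetical data: three analysts, theme labels already harmonised by hand.
reports = {
    "A1": ["visibility", "ignorance", "screening"],
    "A2": ["visibility", "screening"],
    "A3": ["visibility", "ignorance", "health care", "screening"],
}

agreement = theme_agreement(reports)
for theme in sorted(agreement):
    print(f"{theme}: named by {agreement[theme]:.0%} of analysts")
```

A tally of this kind captures only whether a theme was named; it says nothing about how each
analyst framed the theme or linked it to others.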

The choice of method for examining the six reports was made on pragmatic grounds. One method,
consistent with the general approach, would have been to ask a further six researchers to write
reports on the degree of consistency that they perceived in the initial accounts. But then, these
accounts themselves would have needed yet further researchers to be recruited for another
assessment, and so on. At some point a ‘final’ judgement of consistency needed to be made and it
was thought that this could just as easily be made on the first set of reports. Accordingly, one
of the authors (DA) scrutinised all six reports and deliberately did not read the original focus
group transcript. The approach involved listing the themes that were identified by the six
researchers and making judgements from the background justification whether or not there were
similarities and differences between them.

Results

The focus group interview with the adults with cystic fibrosis was transcribed into a document
13,500 words long and sent to the six designated researchers. All six researchers returned
reports. Five of the reports, as requested, described themes: four analysts identified five themes
each, the other identified four. The sixth analyst returned a lengthy and discursive report that
commented extensively on the dynamics of the focus group, but then discussed a number of more
thematic issues. Although not explicitly described, five themes could be abstracted from this
text.

In broad outline, the six analysts did identify similar themes but there were significant
differences in the way they were ‘packaged’. These differences can be illustrated by examining
four different themes that the researchers identified in the transcript, namely, ‘visibility’,
‘ignorance’, ‘health service provision’ and ‘genetic screening’.

Visibility. All six analysts identified a similar constellation of themes around such issues as
the relative invisibility of genetic disorders, people’s ignorance, the eugenic debate and health
care choices. However, analysts frequently differed in the actual label they applied to the theme.
For example, while ‘misperceptions of the disabled’, ‘relative deprivation in relation to visibly
disabled’, and ‘images of disability’ were worded differently, it was clear from the accompanying
description that they all related to the same phenomenon, namely the fact that the general public
were prepared to identify - and give consideration to - disability that was overt, whereas genetic
disorders such as CF were more hidden and less likely to elicit a sympathetic response.

Further, although each theme was given a label it was more than a simple descriptor; the theme was
placed in a context that gave it coherence. At its simplest this can be illustrated by the way
that the theme of the relative invisibility of genetic disorders as forms of disability was
handled. All six analysts agreed that it was an important theme and in those instances when the
analysts attempted a ranking, most placed it first. For example, according to the third rater:

The visibility of the disability is the single most important element in its representation. [R3]

But while all analysts identified an invisibility theme, all also expressed it as a comparative
phenomenon: traditional disability is visible while CF is invisible.

The stereotypes of the disabled person in the wheelchair; the contrast between visible, e.g. gross
physical, and invisible, e.g. specific genetic, disabilities; and the special problems posed by
the general invisibility of so many genetic disabilities. [R2]

In short, the theme was contextualised to make it coherent, and give it meaning. Perhaps because
the invisibility theme came with an implicit package of a contrast with traditional images of
deviance, there was general agreement on the theme and its meaning across all the analysts. Even
so, the theme of invisibility was also used by some analysts as a vehicle for other issues that
they thought were related: a link with stigma was mentioned by two analysts; another pointed out
the difficulty of managing invisibility by CF sufferers.

Ignorance. Whereas the theme of invisibility had a clear referent of visibility against which
there could be general consensus, other themes offered fewer such ‘natural’ backdrops. Thus, the
theme of people’s ignorance about genetic matters was picked up by five of the six analysts, but
presented in different ways. Only one analyst expressed it as a basic theme while others chose to
link ignorance with other issues to make a broader theme. One linked it explicitly with the need
for education.

The main attitudes expressed were of great concern at the low levels of public awareness and
understanding of disability, and of great concern that more educational effort should be put into
putting this right. [R2]

Three other analysts tied the population’s ignorance to the eugenic threat. For example:

Ignorance and fear about genetic disorders and screening, and the future outcomes for society. The
group saw the public as associating genetic technologies with Hitler, eugenics, and sex selection,
and confusing minor gene
therapy alterations with alterations of the whole person. [R6]

But this did not mean that the need for education or the potential eugenic threat was absent from
those analysts who did not link the theme with ignorance. For example, one analyst thought that
education deserved to be labelled as a separate theme, another placed it in the context of ‘the
parenting dilemma’:

The dilemmas for prospective parents of genetic screening: an ignorant public needs education,
those involved need information, protection and confidentiality. [R5]

And a third in the context of choice and genetic screening:

The participants . . . see [screening] as a way of allowing choice, especially if it goes hand in
hand with better education about genetics. [R1]

Health service provision. The theme of health service provision for the genetically ‘disabled’ was
another common one but this was even more widely contextualised than ‘ignorance’. Health care was
mentioned in some form by four of the analysts, all in the context of a wider resource allocation
debate. One analyst simply recognised the problem of limited health care resources.

The cost to the health service for CF patients is high and is recognised as such. With the health
service debate dominated by cost the CF patients are keenly aware that their position is
precarious. [R5]

Two others set these demands in a wider moral context.

Theme: MEETING THE COST OF MEDICAL SUPPORT. The costs of providing for disability; the need for
adequate provision through the NHS; the moral responsibilities of the able-bodied majority and the
disabled minority; etc . . . and of concern that government was generally reluctant to invest
sufficiently in medical treatment. [R2]

The need for adequate and non-discriminatory provision of treatment and facilities for people with
disabilities under the NHS, to come if necessary from increased NI contributions. Unfair to
penalise the ‘innocent party’ - the child born with the disability. [R6]

The fourth analyst placed health care needs within a social welfare model. But this analyst also
identified a cultural context for the focus group transcript that presumably reflects on the sense
of strangeness that the analyst found in the discussion.

In terms of social welfare, there was quite an acute, and arguably, very British, framing of
themes within a discourse of what kind of support should be provided and how it should be paid
for. [R4]

Genetic screening. Two analysts made no mention of the NHS and health care in general. However,
they, like all the other analysts, commented on issues of genetic screening and therapy - though
each used the theme in different ways. Three linked screening explicitly with increasing
individual choice. For example:

The participants are in favour of generally available carrier screening. They see it as a way of
allowing choice, especially if it goes hand-in-hand with better education about genetics. [R1]

A fourth analyst also discussed genetic screening in the context of choice, but perhaps because
the notion of screening was embedded in a wider theme (genetic research and therapy), also linked
it with other issues.

The attitudes expressed towards genetic science are a bit muted, and I found it hard to assess
attitudes towards, for example, gene therapy. On the one hand, there was criticism from one male
speaker in particular of ‘public misunderstanding’ of genetic science. On the other hand,
attitudes towards genetic screening tended to dissolve into an awareness of choice and diversity.
[R4]

A fifth analyst also placed screening in the context of choice but packaged the link with the
dilemmas introduced by such a choice, in particular resonating with themes of ignorance and
education (as described above).

The parenting dilemma: The dilemmas for prospective parents of genetic screening: an ignorant
public needs education, those involved need information, protection and confidentiality. The
personal dilemmas for those found to be gene carriers, the decision to go ahead with a pregnancy
or to abort is clearly a difficult decision but, in the final analysis, it should be left to the
parents. [R5]

In similar fashion, the sixth analyst chose to use the screening theme to make a statement about
eugenics - a theme that several of the other analysts had addressed in relationship to the
‘education’ theme.

GENETIC ENGINEERING AND EUGENICS: The general unease and misapprehension surrounding modern
genetic screening and genetic treatment, due principally to the historical association of genetics
with Nazi eugenics. [R2]

In short, analysts tackled the ‘core’ theme of genetic
screening in different ways. In a sense it was impossible for ‘genetic screening’ to stand on its
own as a theme, like, for example, ignorance, as the explicit purpose of the focus group
discussion had been to look at the problem of screening in the context of genetic disability.
Instead, rather than state the obvious, analysts had to make choices of how to present this
important facet of the discussion. The theme of choice was a common one but even so this could
link with other issues such as education which other analysts had chosen to treat as separate
themes. And then one analyst who had already identified an explicit theme of ignorance and
education needed to make a different link - in this case with eugenics.

Discussion

This study was necessarily limited using only a single transcript as its dataset. This was a limit
imposed by the need to restrict the amount of reading so as to obtain the co-operation of the
researchers in question. It is possible that different results would have been obtained had a
wider range of material been presented, though the main findings would seem to be independent of
the amount of material that was scrutinised. Moreover, the transcript was not based on an
interview with a single respondent but with a focus group voicing a number of different
perspectives on the problem. Criticism can also be levelled at the failure to involve the analysts
at all stages of the research, as is advised by many writers on qualitative methodology (Burgess
1984), but this too was impractical.

The study attempted to begin to provide an answer to the question of reliability in qualitative
research. By assessing the extent of consensus on the basic building blocks of this form of
research, namely the raw themes that emerge from relatively unselected data, it is possible to see
the range of bias. The six researchers who took part in this study were all highly experienced
qualitative researchers (several with national and international reputations): it might be
expected that if any group was to show agreement it would be these experienced researchers; the
corollary is that any idiosyncrasies identified in this study are likely to be considerably
magnified if a group of less experienced researchers had been recruited.

The main finding is that there is indeed a degree of consensus in the identification of themes
between the different analyses but that the ‘packaging’ of these themes showed a number of
different configurations. Social representation theory has argued that people’s representations
are embedded in a network of other related representations (Moscovici 1981). Thus an interview
elicits not only the representation under consideration but the map of other representations in
which it is located. The results of this study would suggest that an exactly analogous process
occurs in qualitative research. The analysts involved in this study all chose to embed the themes
they identified in a wider context of other themes. Such contexts might have reflected geography
(analysts were drawn from Britain and the United States), or discipline (they include
anthropology, psychology and sociology), or personal differences in experience or views, but the
number of analysts was too few to draw any definite conclusions.

It is important to separate out the level at which concordance might be expected. In qualitative
research the raw data initially yields the basic themes, though clearly these might be used to
construct many different ‘stories’. In exactly the same way, a quantitative dataset may yield a
number of different ‘stories’ depending on how it is analysed and the interpretation placed on
those findings by the researcher. Even so, at a purely descriptive level there should be some
concordance about, say, 45 per cent of the sample being male or that 73 per cent had visited a
doctor in the last year. What is not clear is whether there would be concordance among qualitative
researchers on the basic ‘themes’ contained within a data set irrespective of how these might be
interwoven into a ‘story’ if the material were being placed in a more formal paper.

What then are the implications for ‘inter-rater reliability’ in qualitative research? For the
modernists, concerned to establish some degree of ‘accuracy’ in representation, it does seem that
the technique of triangulation - at least by using different researchers - is limited by the
processes inherent in qualitative data analysis. As the analysts used in this study showed, all
analysis is a form of interpretation and interpretation involves a dialogue between researcher and
data in which the researcher’s own views have important effects. Of course, this inherent
subjectivity is freely acknowledged in qualitative research, indeed it is often cited as a
hallmark of this approach unlike the spurious claims to objectivity of more traditional
quantitative methods. Nevertheless, subjectivity does not necessarily mean singularity: the views
of the analysts in this study were socially patterned, and this determined their interpretations.
In other words, while quantitative methods are rightly criticised for encoding the researchers’
biases in apparently objective ‘numbers’, the same can be said for qualitative method. An
interview transcript might represent ‘raw’ data but the basic themes that are extracted have
already been ‘contaminated’ by the researcher.

In some ways this argument lends support to the postmodernists in their claim that all accounts
are unique
in that they represent the differing perspectives of different observers. None the less, this
study did not find completely divergent interpretations but a concordance at a level of situating
themes within a wider framework.

Acknowledgements

We would like to thank the Medical Research Council for funding the project from which this study
is drawn. Theresa Marteau is funded by the Wellcome Trust.

References

BRYMAN, A. and BURGESS, R. G. (eds.) 1994. Analyzing qualitative data. London: Routledge.

BURGESS, R. G. 1984. In the Field: An Introduction to Field Research. London: Allen and Unwin.

DALY, J., MACDONALD, I. and WILLIS, E. 1992. ‘Why Don’t You Ask Them? A Qualitative Research
Framework for Investigating the Diagnosis of Cardiac Normality’, pp. 189-206 in J. Daly, I.
MacDonald and E. Willis (eds.) Researching Health Care: Designs, Dilemmas, Disciplines. London:
Tavistock.

DENZIN, N. K. 1978. The Research Act: A Theoretical Introduction to Sociological Methods. New
York: McGraw Hill.

DENZIN, N. K. and LINCOLN, Y. S. 1994. ‘Introduction: Entering the Field of Qualitative
Research’, pp. 1-18 in N. K. Denzin and Y. S. Lincoln (eds.) Handbook of Qualitative Research.
London: Sage.

GLASER, B. and STRAUSS, A. 1967. The Discovery of Grounded Theory: Strategies for Qualitative
Research. Chicago: Aldine.

HAMMERSLEY, M. 1991. Reading Ethnographic Research: A Critical Guide. London: Longman.

MARSHALL, C. and ROSSMAN, G. B. 1989. Designing Qualitative Research. London: Sage.

MAYS, N. and POPE, C. 1995. ‘Rigour and Qualitative Research’. British Medical Journal
311:109-12.

MORSE, J. M. 1994. ‘Designing Funded Qualitative Research’, pp. 220-35 in N. K. Denzin and Y. S.
Lincoln (eds.) Handbook of Qualitative Research. London: Sage.

MOSCOVICI, S. 1981. ‘On Social Representation’, in J. P. Forgas (ed.) Social Cognition:
Perspectives on Everyday Life. London: Academic Press.

OLESEN, V., DROES, N., HATTON, D., CHICO, N. and SCHATZMAN, L. 1994. ‘Analyzing together:
recollections of a team approach’, in A. Bryman and R. G. Burgess (eds.) Analyzing qualitative
data. London: Routledge.

TYLER, S. A. 1986. ‘Post-modern Ethnography: From Document of the Occult to Occult Document’, pp.
122-40 in J. Clifford and G. E. Marcus (eds.) Writing Culture: The Poetics and Politics of
Ethnography. Berkeley: University of California Press.

VIDICH, A. J. and LYMAN, S. M. 1994. ‘Qualitative Methods: Their History in Sociology and
Anthropology’, pp. 23-59 in N. K. Denzin and Y. S. Lincoln (eds.) Handbook of Qualitative
Research. London: Sage.

WAITZKIN, H. 1991. The Politics of Medical Encounters. New Haven: Yale.

Biographical note: DAVID ARMSTRONG is Reader in Sociology as applied to Medicine in the Department
of General Practice; ANN GOSLING is a Research Associate and THERESA MARTEAU Professor of Health
Psychology in the Psychology and Genetic Research Group; and JOHN WEINMAN is Professor of
Psychology as applied to Medicine in the Unit of Psychology, all at UMDS, London.

Address: Dr David Armstrong, Department of General Practice, UMDS, 5 Lambeth Walk, London SE11
6SP.

