• covers 18 key skills including corpus building, the role of frequency, different
corpus methods, transcription and annotation;
• demonstrates the use of available corpora and desktop and online corpus ana-
lysis tools to conduct original analyses;
• features case studies and step-by-step guides within each chapter;
• emphasises the use of interview data in research projects.
Corpus Linguistics for Education is an essential guide for students and researchers
studying or conducting their own corpus-based research in education.
Routledge Corpus Linguistics Guides provide accessible and practical introductions to using
corpus linguistic methods in key sub-fields within linguistics. Corpus linguistics is one of
the most dynamic and rapidly developing areas in the field of language studies, and use
of corpora is an important part of modern linguistic research. Books in this series provide
the ideal guide for students and researchers using corpus data for research and study in a
variety of subject areas.
List of figures x
List of tables xii
Preface xv
Acknowledgements xviii
2 Analysing text 20
2.1 Different approaches to text analysis 20
2.2 Text as register 27
2.2.1 Corpus linguistics and the analysis of register 27
References 32
8 Conclusion 168
References 170
Index 171
the latest interface, which is expected to remain stable for some time. Other excellent software packages are available to those wishing to familiarise themselves with corpus work and, of course, readers are free to use their preferred software while following our discussion, or at least most of it. We use the terms corpora and datasets interchangeably, as no difference between them is implied in this book.
Our intention is to present a stimulating and informative discussion with plenty
of visuals and relevant information that can be accessed readily by the reader. We
have broken down a selection of corpus linguistics methods into 18 skills that can
serve as guidance to educational researchers wishing to explore the usefulness of
corpus linguistics in their projects. These skills have been distributed across all the
chapters in the book. This is the list of skills that we have put together:
In Chapters 4 and 7, readers will find activities that seek to encourage self-reflection and further thinking about how to implement corpus linguistics research methods.
We sincerely hope that the book can stimulate the use of corpus linguistics in
education research. We are confident that future contributions from readers of this book will enrich our understanding of how language and education research can work together to provide a deeper understanding of the discourses, practices and lives of increasingly complex societies.
Acknowledgements
I would like to thank Michael McCarthy and Anne O’Keeffe for their advice
and for encouraging me to write this book. Special thanks to Anne O’Keeffe and
Geraldine Mark: I learn so much from you whenever we discuss corpus analysis,
not to mention how much fun we have. Thanks to my colleagues at the Faculty of
Education at the University of Cambridge for inspiring me in so many ways, and
Adam Woods and Lizzie Cox for their patience and guidance. I would also like to
thank my PhD students at the University of Cambridge: you are such a fabulous
bunch. Also, thanks to David, Diane, Encarna and Fernando. You were always
there when I most needed you.
Thank you, too, to Laurence Anthony and Sketch Engine for their permission
to use screenshots from AntConc and Sketch Engine, respectively; to Mark Davies
for his permission to use screenshots from www.english-corpora.org/; and to Phil
Durrant for his permission to use the GIG corpus.
Chapter 1
Introduction
Corpus linguistics and education research
Figure 1.1 Corpora as a research method
Before we look at how these questions are examined by means of a corpus, can
you think of other research methods that can be used to answer these questions?
How will your data be collected? How will it be analysed?
Arguably, these questions can be answered by drawing on different research
methodologies and methods, as suggested in Table 1.1. However, corpus linguis-
tics will put more emphasis on the notion of usage and the need to use a repre-
sentative body of textual evidence. The three questions above can be answered
by putting together and querying different corpora that can be used as proxies of
the phenomena under investigation. In the case of the first research question (A),
we may want to use the English-Corpora.org TV Corpus. This corpus contains
325 million words from 75,000 episodes of TV shows dating from the 1950s
to 2018 produced and recorded in, among other countries, the US, the UK,
Australia and New Zealand. Depending on your area of interest, you may want
to narrow down your focus and examine only talk shows or, for example, all
soaps or just one specific genre (e.g. comedy). Bednarek (2018) used the Sydney
Corpus of Television Dialogue (SydTV)2 to investigate dialogues in American
TV series and developed a categorisation of their functions, namely, narrative-
related functions (progressing the plot or filling out character) and medium-
related functions (endorsing products or engaging audience emotions).
Question B calls for the compilation of a specific collection of texts represen-
tative of student writing in HE. The British Academic Written English Corpus
(BAWE)3 seems like a good fit when approaching this research question. The
BAWE contains university-level student writing: around 3,000 good-standard stu-
dent assignments totalling 6,506,995 words. The corpus showcases different types
of text (essays, critiques, explanations, literature reviews, etc.) across different dis-
ciplines (agriculture, economics, biological sciences, business, classics, engineering,
etc.). Nesi & Gardner (2018) have discussed how the genres in this corpus can be
linked to different social purposes:
For each of these purposes, Nesi & Gardner have analysed an inventory of
subgenres and have developed specific materials that can be used to teach HE
writing across different levels of expertise and disciplines. These findings are
solely based on the evidence provided by the texts included in the British Academic
Written English Corpus.
Question C represents one of the areas where corpus linguistics has been
most productive in recent decades: the analysis of specialised languages (Bhatia,
Sánchez Hernández & Pérez-Paredes, 2011). The use of professional registers has
attracted the attention of applied linguists who have found in corpora an oppor-
tunity to examine evidence of how specialised discourse is used and applications
of evidence-based knowledge in education. Thus, Biber & Conrad (2009) have
stressed the educational potential of the analysis of corpora:
Text varieties and the differences among them constantly affect people’s daily
lives. Proficiency with these varieties affects not only success as a student, but
also as a practitioner of any profession, from engineering to creative writing
to teaching. Receptive mastery of different text varieties increases access to
information, while productive mastery increases the ability to participate in
varying communities. And if you cannot analyze a variety that is new to you,
you cannot help yourself or others learn to master it.
(Biber & Conrad, 2009: 4)
avoids confirmation bias and, in simple terms, text cherry-picking. Let us revisit
our three questions and exemplify how the principle of total accountability works
(Table 1.2).
Corpora are finite and, inescapably, represent a selection of data. McEnery &
Hardie (2012: 15) have noted that researchers ‘can only seek total accountability
relative to the dataset that [they] are using, not to the entirety of language itself ’.
This is a key point that we need to bear in mind as researchers. When knowledge
claims are made in terms of usage, we need to be aware that our claims will neces-
sarily be constrained by the instrument that was designed and compiled to extract
our results and findings (Figure 1.1).
McEnery & Hardie (2012) have suggested that it is essential that results are
replicated, and their consistency tested across different methods, that is, across
different corpora. While the use of Mark Davies’ TV corpus for question A above
appears quite robust in terms of replicability (other corpora will necessarily include
most of the TV shows already in this corpus), questions B and C will benefit
from further scrutiny with other datasets, that is, with other students writing other
texts, in the case of question B, and with other texts and subgenres in the case of
question C. McEnery & Hardie have noted that corpus designers and researchers
need to engage with the notion of validity:
Total accountability to the data at hand ensures that our claims meet the
standard of falsifiability; total accountability to other data in the process of
checking and rechecking ensures that they meet the standard of replicability;
and the combination of falsifiability and replication can make us increas-
ingly confident in the validity of corpus linguistics as an empirical, scientific
enterprise.
(2012: 16)
So far, corpus linguistics has been successfully used in different areas of linguistics
and applied linguistics such as contrastive linguistics, discourse analysis, language
learning and teaching, lexicography, pragmatics, semantics and sociolinguistics.
However, we must never forget that corpora are proxies. As McEnery & Hardie
(2012: 26) have observed, ‘corpora allow us to observe language, but they are not
language itself ’. With this caveat in mind, and with a good grasp of the prin-
ciple of accountability, we are better equipped as researchers to make the most of
corpus linguistics methods even in areas outside linguistics.
You may think that you are not interested in describing language use exclusively
from a linguistic perspective. However, CL is not just used by linguists researching
linguistics. Corpus linguistics has been used in anthropology, economics, history,
law and, among other areas, sociology. If, as a researcher, you are interested in
the implications of language usage within a community of users, then there may
be something in CL for you. In education research, you may perhaps use data
collection methods such as interviews or focus groups – in other words, you will be looking at textual data and, possibly, discourse or discourses. These texts can also be explored by means of the corpus research methods that we will discuss throughout this book. Figure 1.2 offers a visual rendering of the role of bigger, representative corpora in our use of corpus research methods to process and analyse data from interviews and other qualitative instruments.
For some research questions all we need is a representative corpus of the
domains, registers or language users that are the focus of our research. As we have
seen, questions A, B and C can be explored using such instruments. If, for example,
you are analysing language policy, CL research methods may similarly be of interest. Let us take the UK government’s Education Act 2011.4 According to its
Figure 1.2 Using corpora to examine textual data from data elicitation methods such as
interviews or focus groups
1 A word list that gives us the raw frequency of all the words in the text. This
is usually a great starting point to get the gist of the lexical items used in a
text and therefore its overall content. We can read such a list either in descending or ascending order of frequency. In the latter, we start with the words that are used only once in the text. In the former, quite
unsurprisingly, the majority of the top ten most frequent words are function
words such as the, of, in, a, to, and, for, by and or. The tenth most frequent word
is section. If we zoom out, we will start to notice that the more frequent items
among, say, the top 30 most frequent words, are those related to legal jargon
such as annotation, force, commencement, paragraph, person, substitute or provision. It is
in the top 100 most frequent ranking where we will start to find lexical items
that we may want to explore further such as staff, corporation or teacher. For each
of these words, we can then explore the concordance lines and contexts where
they occur. We will learn how to do this in the forthcoming chapter.
2 A list can include specific items only. For example, we can generate a list of the
top verbs in the Education Act 2011 and acquire a sense of those actions that
have been specified by the legislator. A list of adjectives will give us insight into
what is considered as serious or interim, for example, within the Education Act.
3 A more sophisticated keyword analysis list using inferential statistics will reveal
that the multiword terms that characterise the Education Act 2011 are transfer
scheme, alternative provision, education corporation and service provider (all in the top 15).
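The first kind of list described above can be sketched in a few lines of Python; this is a minimal illustration using a short invented snippet of legal-sounding text rather than the Education Act itself:

```python
from collections import Counter
import re

# A tiny invented sample standing in for a much larger legal text.
text = """The governing body of the corporation may appoint staff.
A teacher employed by the corporation is subject to this section.
The corporation must notify staff of any substitute provision."""

# Tokenise: lowercase the text and extract alphabetic word forms.
tokens = re.findall(r"[a-z]+", text.lower())

# A raw-frequency word list, in descending order of frequency.
# Function words and domain items dominate the top of the list.
word_list = Counter(tokens).most_common()
print(word_list[:5])

# Reading the list in ascending order means starting from the words
# that occur only once in the text.
print(word_list[-5:])
```

Generating a list of specific items only (skill 2 above) or a statistically informed keyword list (skill 3) requires part-of-speech tagging and a reference corpus respectively, which dedicated tools such as AntConc or Sketch Engine provide out of the box.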
Figure 1.3 Research in education: epistemology (paradigms), theoretical perspectives, research methodology and research methods
Based on Pring (2004) and Gray’s (2004) adaptation of the work of Crotty (1998)
many realities as there are conceptions of it’ (Pring, 2004: 50). Choosing a meth-
odology and a method without a detailed understanding of the epistemology that
supports it will create unnecessary tensions between how we analyse and interpret
our findings. For example, if we adopt positivism as our theoretical stance and
the scientific paradigm as our epistemology, we will most likely choose a research
methodology that uses scientific observation and controlled empirical inquiry. If
we adopt the constructivist paradigm and interpretivism as our epistemology, we
will most likely use a research methodology that looks at the lived realities of those
people being researched and methods that examine how meanings emerge out of
social interaction, and may want to use, among other options, phenomenological
or ethnographic methodologies. Table 1.3 explores the differences between posi-
tivism and phenomenological education research.
Research methodologies are sensitive to our paradigm inclination as
researchers. If we situate our research within the positivist paradigm, an experi-
mental or survey methodology will seem fit for purpose. If, on the contrary, we
situate ourselves as researchers within the constructivist paradigm, case studies,
ethnography or grounded theory are good candidates as methods for our research
project. Methodologically speaking, a researcher who perceives reality as objective will, very likely, use mathematical models and quantitative analysis to operationalise an abstraction of reality. Those educational researchers who perceive reality
as subjective will rely on methods that look at the ‘representation of reality for
purposes of comparison’ (Cohen, Manion & Morrison, 2018: 7), very frequently
by analysing language and meaning. Despite this interest in analysing language,
CL methodology and methods do not feature in major accounts of educational
research such as Gray (2004) or Cohen, Manion & Morrison (2018).
There are many reasons why CL methods are not widely used in education research. One reason is that CL is a relatively young discipline that has met with considerable criticism, even within linguistics. Timmis (2015) has pointed out that corpora
can only reflect usage, not other areas of language-related phenomena that are
Table 1.3 Positivism and phenomenology in education research

                 Positivism                             Phenomenology
Epistemology     Reality is external and objective      Reality is socially constructed
Researchers      Explore causality between variables    Construct theories and models
                                                        from the data
Methods          • Mainly quantitative methods          • Mainly qualitative methods
                 • Concepts are operationalised         • Phenomena are complex and
                   and measured                           multiple methods are required
                 • Use of large samples and             • Use of small samples
                   generalisation                         researched in depth

Based on Cohen, Manion & Morrison, 2018
private or which the users are not willing to share. While this is a fair criticism, this
concern applies also to other research instruments such as interviews or diaries.
Ultimately, these instruments can only record language that is voluntarily shared
by speakers. Large, representative corpora are, admittedly, different. Most of the
language in the 100 million-word British National Corpus (BNC) was not elicited
to be part of the corpus; that is, the texts in the BNC were included as a result of a design process that deemed particular types of texts appropriate. According to the BNC reference guide,6 published written texts were selected partly at random from Whitaker’s Books in Print 1992 and partly according to selection features such as domain (subject field), time (within certain dates) and medium (book, periodical, etc.). Another
criticism is that, even with robust, big corpora, the language in each corpus will
always be a partial representation of usage. As we have seen, corpus triangulation
can provide researchers with more robust results. We will see in the forthcoming
pages how this can be achieved. A third area of criticism revealed by Timmis
(2015: 184) is that most big corpora only give us basic textual information. In the
case of spoken language this may be an important limitation as we are not, as
yet, given access to recorded files and annotated prosodic and paralinguistic oral
features. This may change in the forthcoming years but there is no doubt that the
compilation of written data is favoured as it presents the researchers with fewer
challenges in terms of the ethical and logistical issues involved.
We propose in this book that corpus linguistics be seen and conceptualised both
as a research methodology and as a set of research methods. For example, the use of
keyword analysis can be massively useful to analyse the content of texts. Culpeper
& Demmen (2015:105) have noted that ‘the full potential of [keyword] analyses
across the humanities and social sciences has yet to be realised’. We believe that
this full potential can only be achieved if we further clarify the different roles of
CL when seen either as a methodology or as a set of methods. Mautner (2019: 9)
believes researchers should be cautious when it comes to interpreting results when
using corpus methods to examine either individual texts or corpora:
Let us turn our attention now to the role of frequency in language and in corpus
analysis.
[…] repetition of actions brings about the formation of structures; thus in lan-
guage, too, we see that repetition is a necessary component of grammar forma-
tion […] The reason frequency or repetition plays a role in grammar formation
is that the mind is sensitive to repetition. This is a domain general principle;
that is, it does not apply just to language but to other cognitive domains as well.
(Bybee, 2007: 8)
Bybee (2007) stresses the fact that, according to psycholinguistic and cognitive accounts
of language learning and cognition, repetition strengthens memory representations
for linguistic forms. This fact makes highly frequent forms more accessible from a
cognitive perspective and more likely to be used by more and more members of a
community of speakers. Put simply, speakers who use language while enacting certain types of register (conversation, academic language, etc.) will tend to notice and use (and reuse) highly frequent items (or constructions, as they are known in the specialised literature). Not only do language constructions become entrenched in our
minds as we learn our first language(s), they also become entrenched for a commu-
nity of users of that language. Interestingly, frequency plays a major role in how we
learn language. Ellis (2002) has summarised some of these effects:
[…] judged to be words in lexical decision tasks […], and they are spelled more
accurately […]. Auditory word recognition is better for high-frequency than
low-frequency words […]
(Ellis, 2002)
Bybee (2010) has proposed that grammar is the cognitive organisation of one’s
experience with language and Ellis (2019) has recently defined language as the quint-
essence of distributed cognition. Frequency, according to usage-based perspectives,
functions both as the main factor impacting the cognitive representation of lan-
guage and also usage: ‘each instance of language use impacts representation’ (Bybee,
2010: 9). Corpora offer evidence of such representations as used by speakers when
engaging in communication. As such, corpora offer researchers a fertile ground to
test how repetition of items triggers chunking, that is, linguistic units of meaning of
varied size that are easily stored and retrieved from our memory.
So far, we have seen that the frequency of occurrence of different language
units (from morphemes to complex constructions) has an impact on our learning
and use of those very same units and beyond. We note that frequency affects how
we acquire language and how we use language, which in turn affects how others
learn and use language.
Baker goes on to suggest that the use of large representative corpora can give us
access to what is considered to be normal or usual in a given community of users.
After consulting the frequency of use and the collocational behaviour of confine
and wheelchair in the BNC, Baker concluded the following:
Tognini-Bonelli (2010: 19) has noted that, in corpora, the significant elements are
‘the patterns of repetition and patterns of co-selection’. She holds that it is the fre-
quency of occurrence that ‘takes pride of place’. By using corpora, researchers can
study the repeated patterns of usage and establish whether these patterns show
evidence of hegemonic discourses, majority common-sense ways of viewing the
world or even resistant discourses (Baker, 2006). These analyses start with observing
the frequencies of occurrence of linguistic units. Then, the researchers can move
on to qualitative examinations of the contexts and situations of use. Let’s look at
some examples. Corpus-assisted discourse analysis (CADA) uses corpora to ana-
lyse, among other areas, how minorities are represented. An important study in
this area is Baker, Gabrielatos & McEnery (2013), who looked at the representa-
tion of Muslims and Islam in the British press between 1998 and 2009. In their
research, these authors used a corpus of 143 million words and included everything
published in papers as diverse as The Guardian and The Sun that dealt with Muslims
or Islam during that time. In total, the corpus includes over 200,000 articles. The
authors concluded that while overt negative representation of Muslims was carefully
avoided, a number of strategies were identified that favoured a distorted represen-
tation of Muslims, in particular in right-wing tabloids. Based on the evidence
provided, left-leaning broadsheets were found to present a more balanced reporting
of Muslims and were ‘more likely to give voices to Muslims and reflect on issues to
do with terminology or representation’ (Baker, Gabrielatos & McEnery, 2013: 259).
It is impossible to present a fair summary of this study in just a few lines. For now, suffice it to say that the words Islam, Islamic, Islamism, Islamist and Islamists all
collocated with the words terror and terrorism. In simple terms this means that, in the
context of written media in the UK, it is very likely that these words occur together.
The authors wonder whether this fact contributes to the emergence and perpetu-
ation of a public discourse that tends to pigeonhole Muslims as a violent community
against other British citizens. This is evidenced by the fact that almost 50% of the
topic indicators in the corpus are concerned with the idea of conflict.
This type of study has been used to understand the representation of other minor-
ities such as migrants and asylum seekers, gay men and members of the LGBT
community as well as people with diseases such as cancer or mental health issues.
Figure 1.4 Using corpora as primary or secondary data
texts (T) that can be explored using CL data analysis methods. In the context of
this approach, a larger, representative corpus will be queried as a source of usage
information that will inform the researchers’ analysis and interpretation of their
main data source (T). Coming back to our long definition of CL provided in the
first paragraph of this chapter (CL studies the usage of language by examining how repre-
sentative texts of a given genre reflect the discursive practices of actual language users), we are
now better equipped to understand that while our interest is the ‘T’ in Figure 1.4,
it is the ‘C’ that will be used as a baseline to judge how our data differs from what
can be understood as ‘expected’, ‘normal’ or ‘usual’ in usage.
Let us turn our attention now to the first two skills that need to be mastered in
the context of this book.
and social class. It also features spoken language from different contexts such as
business and government meetings, and radio shows including phone-ins.
Corpora can offer a detailed account of the frequency of different linguistic
items and identify those units that are most frequently used. Potentially, most
corpus management software can offer frequencies of the following:
As you can see, there is some variety in terms of the units that can be counted
in a corpus. This fact gives researchers plenty of freedom in terms of searching
for a diverse range of items. If, for instance, our interest is to examine the use of nouns, the BNC can give us interesting insights into how frequent nouns are in British English. Overall, it seems that 21% of English words in the BNC are nouns. However, there are important differences across registers. While in spoken English nouns account for 14% of all words, in academic English they account for 25%, while in fiction the figure decreases to 17%.
This shows evidence of how naturally occurring language is affected by the
medium (spoken vs. written) and the type of register through which commu-
nication happens (conversation, academic language, fiction, news, etc.). While
all texts use language, different registers tend to reflect different distribution
of linguistic elements. One of the tasks of the linguist is to interpret these
differences in terms of linguistic theory, for example Biber’s functional linguis-
tics approach (Biber, 1988; Biber & Conrad, 2009; Biber, Johansson, Leech,
Conrad & Finegan, 1999). It is the task of the educational researcher to inter-
pret different frequency patterns in the light of their specific inquiry. Table 1.4
summarises why frequency matters.
normalised frequency ‘answers the question “how often might we assume we will see the word per x words of running text?” ’ Normalised frequencies are calculated using this simple formula:

normalised frequency = (raw frequency ÷ total number of words in the corpus) × base of normalisation
In the case of the BAWE we know that the corpus is made up of 6,968,089
words.8 So the normalised frequency of education in the BAWE is 236.5
per million words.
236.5 per million words reflects how often education would be expected to occur
on average in each million words of the BAWE. In the case of the BNC, we find
that education occurs 25,947 times in the total 96,134,547 words. The normalised frequency of education in the BNC is therefore (25,947 ÷ 96,134,547) × 1,000,000 = 269.9.
269.9 per million words reflects how often education would be expected to occur
on average in each million words of the BNC. With two normalised frequencies
calculated using the same base of normalisation, we can now start to compare the
frequency of use across the BAWE and the BNC. Of course, we could decide to
use a different base of normalisation, for example 1,000 words. Our normalised
frequencies would be 0.23 per 1,000 words in the BAWE and 0.26 per 1,000
words in the BNC.
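As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (using the BNC figures quoted in the text; the function name is ours):

```python
def normalised_frequency(raw_freq, corpus_size, base=1_000_000):
    """How often a word would be expected to occur per `base` words of running text."""
    return raw_freq / corpus_size * base

# 'education' in the BNC: 25,947 occurrences in 96,134,547 words.
per_million = normalised_frequency(25_947, 96_134_547)
print(round(per_million, 1))   # 269.9 per million words

# The same figure with a base of normalisation of 1,000 words.
per_thousand = normalised_frequency(25_947, 96_134_547, base=1_000)
print(round(per_thousand, 4))  # 0.2699 per 1,000 words
```

Because both corpora are normalised to the same base, the resulting figures are directly comparable regardless of the corpora’s very different sizes.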
Table 1.5 gives a useful breakdown of our second skill, how to interpret
frequency.
Notes
1 www.english-corpora.org/iweb/
2 www.syd-tv.com/
3 www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-written-english-corpus-bawe/
4 www.legislation.gov.uk/ukpga/2011/21/contents/enacted
5 Cohen, Manion & Morrison (2018) go as far as to identify six paradigms: empirical-
analytic, pragmatic, interpretive, critical, post-structuralist and transcendental.
6 www.natcorp.ox.ac.uk/docs/URG/BNCdes.html#BNCpurp
7 www.natcorp.ox.ac.uk/corpus/index.xml
8 These totals are calculated using the Corpus Info sections of these corpora on Sketch
Engine: www.sketchengine.eu/
References
Baker, P. (2006). Using corpora in discourse analysis. London: Continuum.
Baker, P., Gabrielatos, C. & McEnery, T. (2013). Discourse analysis and media attitudes: The
representation of Islam in the British press. Cambridge: Cambridge University Press.
Bednarek, M. (2018). Language and television series: A linguistic approach to TV dialogue. Cambridge:
Cambridge University Press.
Bhatia, V., Sánchez Hernández, P. & Pérez-Paredes, P. (Eds.) (2011). Researching specialised
languages. Amsterdam: John Benjamins Publishing.
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. (1999). Longman grammar of
written and spoken English. Harlow: Longman.
Bybee, J. (2007). Frequency of use and the organisation of language. Oxford: Oxford University Press.
Bybee, J. (2010). Language, usage and cognition. Cambridge: Cambridge University Press.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor & Francis.
Crosthwaite, P. & Cheung, L. (2019). Learning the Language of Dentistry: Disciplinary corpora in
the teaching of English for Specific Academic Purposes. Amsterdam: John Benjamins Publishing.
Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research process.
London: Sage Publications Limited.
Culpeper, J. & Demmen, J. (2015). Keywords. In Biber, D. & Reppen, R. (Eds.) The Cambridge
handbook of English corpus linguistics, 90–105.
Ellis, N. (2002). Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in Second Language Acquisition,
24(2), 143–188.
Ellis, N. C. (2019). Essentials of a theory of language cognition. The Modern Language Journal,
103, 39–60.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Mautner, G. (2019). A research note on corpora and discourse: Points to ponder in research
design. Journal of Corpora and Discourse Studies, 2, 2–13.
McEnery, A.M. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University
Press.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
Nesi, H. & Gardner, S. (2018). The BAWE corpus and genre families classification of
assessed student writing. Assessing Writing, 38, 51–55.
Pérez-Paredes, P. (2017). A keyword analysis of the 2015 UK Higher Education Green
Paper and the Twitter debate. In Orts, M. A., Breeze, R. & Gotti, M. (Eds.) Power, persua-
sion and manipulation in specialised genres: providing keys to the rhetoric of professional communities.
Bern: Peter Lang, 161–191.
Pring, R. (2004). The philosophy of education. London: Bloomsbury.
Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. London: Routledge.
Tognini-Bonelli, E. (2010). Theoretical overview of the evolution of corpus linguis-
tics. In O’Keeffe, A. & McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics.
London: Routledge, 42–56.
Viana, V., Zyngier, S. & Barnbrook, G. (Eds.). (2011). Perspectives on corpus linguistics.
Amsterdam: John Benjamins Publishing.
Chapter 2
Analysing text
[…] enable the researcher to identify similar information. The researcher can
search, retrieve and assemble the data in terms of those items that bear the
same code. Codes can be regarded as an indexing or categorizing system, like
the index in a book […] and the data can be stored under the same code, with
an indexed entry for that code.
(Cohen, Manion & Morrison, 2018: 669)
Content analysis is, therefore, oriented to a system of categories that relies on pre-
formulated models. In addition, it is theory-dependent as the coding is theoretically
underpinned (Sandelowski & Barroso, 2003). An example of this is the T-SEDA1
project, which seeks to give teachers and researchers tools to reflect on the quality
of educational dialogue across a wealth of subjects and school contexts. The
authors developed a coding framework (Hennessy, Rojas-Drummond, Higham
et al., 2016) based on educational dialogue theory that includes top-level concepts
such as invitation to build on ideas, challenging students, making reasoning explicit,
inviting students to reason, coordination of ideas and agreement, reflection on the
activity, or, among others, expressing ideas. Turns that include teacher’s talk like
the following:
were coded as invitation to build on ideas. Vrikki, Kershner, Calcagni et al. (2019)
have revisited this coding scheme by grouping the most relevant codes into eight
clusters and proposed an interesting approach that compares situated live coding
were audio recorded and transcribed. The thematic analysis approach used by the
researchers follows a five-stage process:
1 the researchers first familiarise themselves with the data by reading the
transcriptions and listening to the children’s actual voices;
2 initial codes are generated by both researchers independently and consistency
is discussed;
3 codes are organised into themes;
4 themes are reviewed and sub-themes are grouped;
5 themes are defined and named, and a thematic map is developed so as to
represent the relationships between themes and sub-themes.
In the case of this research, three main themes were identified: break from routine,
learning through play, and collaboration and teamwork. The authors claim
that this analysis has facilitated their understanding of how ‘specific facets of play
create meaning in the learning journeys of children' (Coates & Pimlott-Wilson, 2019: 35).
Sometimes grounded theory is used when approaching theme analysis of phenom-
enological data. In short, as opposed to grand theories, grounded theory emerges
from the data. It is a bottom-up theory that looks at the data as defining and
shaping the phenomenon of concern. Hadley (2014: 11) notes that ‘grounded
theorists adopt a stance of ‘theoretical agnosticism’ […] meaning that although
there is an area of specific interest that has motivated them to begin, they reflex-
ively recognize that their own sets of personal constructs could limit what they see
and hear’. Hadley used grounded theory to examine the lived experiences and
practices of English for Academic Purposes (EAP) teachers, HE administrators
and students in neoliberal universities in the UK and the US.
Conversational analysis (CA) is a type of discourse analysis that is primarily
focused on spoken discourse. Mercer (2010) has pointed out that CA deserves to
be described as a methodology rather than a method. According to Bergmann
(2004: 296), CA examines social interaction ‘as a continuing process of produ-
cing and securing meaningful social order’. Originally based on ethnomethod-
ology and interaction analysis (Bergmann, 2004), CA looks at reality as created
in situated contexts where people attribute and interpret meanings by using the
language conventionalised by a community of speakers, hence the interest in
language structure and, in particular, turn-taking and pragmatics. In CA, talk is
considered as an intersubjective phenomenon. Wooffitt (2005: 73) highlights the
usefulness of CA when exploring interaction as ‘social action is accomplished
through the participants’ use of tacit, practical reasoning skills and competen-
cies’. Conversational analysts strive to preserve the observed phenomena as
fully as possible and that is why so much attention is given to the recording
(Wooffitt, 2005), its transcription (Jefferson, 2004) and the sequence of turns
(Schegloff, 2007). Bergman (2004: 297–299) outlines the following analytical
maxims in CA:
Despite the openness to data and analysis, Wilkinson & Kitzinger (2019: 556–557)
note that CA has in particular provided considerable insight into the practices of
talk interaction in six areas: turn-taking, sequence organisation, action-formation,
repair, word-selection and overall structural organisation of talk. In schools, CA
has been used extensively to research classroom talk (Sinclair & Coulthard, 1975;
Seedhouse & Walsh, 2010). However, Mercer (2010) has noted that using CA to
analyse classroom talk may not be practical when handling large sets of data.
According to his estimate, transcribing and analysing one hour of classroom talk may
take between five and 12 hours of research time. In our own experience,
transcribing a 13-minute dialogue with some minimal annotation for pauses, turn-taking
and dysfluency phenomena may take even longer. Mercer has
also noted that making convincing generalisations may prove extremely challenging
as only specific illustrative examples can be offered in standard research publications.
Discourse analysis (DA) is a different approach to textual data analysis that has
been used extensively to look at changes in ideas and viewpoints over time. DA is
useful to track down how groups of people or concepts have been construed in
discourse. Rogers (2011: 1) maintains that three areas, at least, justify the use of
critical discourse analysis in educational research. The first is the communicative
nature of texts, talk and other semiotic interactions that are found in learning
and education; second, DA is particularly sensitive to sociocultural theory (SCT)
and some of its tenets, mainly, the fact that discourse constructs and reflects the
social world through a myriad of sign systems; finally, discourse and educational
research are ‘both socially committed paradigms that address problems through a
range of theoretical perspectives’.
Bergström, Ekström & Boréus (2017) note that the delimitation of the discourses
to analyse is paramount and stress that it is crucial that researchers discuss and
justify their choices in an explicit way. A tenet of DA is that social relations in
discourse are revealed through language. Parker (2004) argues that, because
discourse is organised around patterns and structures that fix the meaning of
symbolic material, researchers can study the ‘ideological force of language’ (Parker,
2004: 310) in discourse and, accordingly, understand how entities and concepts
are defined. Although there are many approaches to DA (e.g. James Gee, Norman
Fairclough, Gunther Kress), Parker (2004) suggests that, in practical terms, DA can
be done following a set of steps such as itemising the objects in the text by looking
at the nouns, keeping a distance from the text so that the text is seen as one more
object in the context of the wider research project, itemising the subjects in the
text, reconstructing the rights and the responsibilities of the subjects and mapping
the networks of relationships into patterns. For Parker, discourses are ‘located in
relations of ideology, power and institutions’ (Parker, 2004: 311), so it seems that
DA is most suitable when one of these areas is the object of our inquiry. Discourse,
when seen as social practice, process and product, needs to be considered critically
as a sort of battlefield where meanings are invented, negotiated, used and, often,
imposed. In particular, critical discourse analysis (CDA) has examined the study
of power in society, social structures and individuals by looking at different foci
(Wodak & Meyer, 2009): power as the result of specific resources of actors; power
as an attribute of interactions; and power as a systemic element of society. Rogers
(2011) argues that educational practices are suitable to be examined by CDA as
interactions and practices are constructed across time and contexts in education:
We can see this analysis as a way to give visibility to the otherwise invisible process
that, in discourse and through discourse, affects social structures, social relations,
and agendas.
Swales’ reflection brings home some of the frictions that CL users experience when
using research methods: corpus-based vs. corpus-driven approaches, and theory-driven
vs. data-driven research designs are just some of the tensions we face
when reading and engaging with other colleagues’ research. Guy Aston (Viana,
Zyngier & Barnbrook, 2011) notes that CL is both a methodology and a science
and that it is only an emphasis on applications that swings the pendulum towards
CL as a set of methods.
Corpus linguistics uses both qualitative7 and quantitative methods to derive
knowledge from observed, attested uses of language. Among the former we find
concordance lines, while among the latter we find collocation or keyword analyses.
What characterises corpus linguistics is its emphasis on the study of data that
has been produced while the users are engaged in communication and therefore
communicating within the boundaries of a language register (Biber & Conrad,
2009). Indeed, an emphasis on usage and the blurring of the lexical and gram-
matical (formal) distinctions are the blueprint of most corpus linguistics research.
However, most of us in the field of corpus linguistics will agree that corpus
research methods are characterised by the fact that, in most research designs,
a control corpus from a reference-variety is compared against an experimental
corpus (often the researched area or question) by examining normalised frequency
counts, applying statistical tests and procedures or by manually coding more com-
plex patterns of usage.
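To make the comparison of frequency counts concrete: normalising raw counts to a common base (often per million words) is what allows a small study corpus to be set against a much larger reference corpus. The following is a minimal sketch with invented figures rather than counts from any real corpus:

```python
# A sketch of normalised frequency comparison; the counts below are
# illustrative assumptions, not figures from any actual corpus.

def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# Hypothetical study corpus (e.g. classroom transcripts) vs. reference variety
study_hits, study_size = 128, 450_000
reference_hits, reference_size = 910, 10_000_000

study_pmw = per_million(study_hits, study_size)
reference_pmw = per_million(reference_hits, reference_size)

print(f"study: {study_pmw:.1f} pmw, reference: {reference_pmw:.1f} pmw")
```

A raw count of 910 looks much larger than 128, yet once both are normalised the item turns out to be over three times as frequent in the (hypothetical) study corpus; statistical tests such as log-likelihood would then assess whether such a difference is meaningful.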
Callies (2015) has defined corpus data as hard, quantitative data that can be
identified, quantified, classified and is subject to refined statistical analysis. Callies
notes that corpus linguistics research methods make findings more generalisable
and our research more easily replicable. Unfortunately, replication studies are not
frequent in corpus linguistics, and caution is required before generalising in most
corpus studies and, in particular, in corpus research involving language learners.
Callies (2015: 36) has defined mainstream corpus research methodology as
follows: ‘The research methodology that underlies the quantitative analyses […]
is primarily deductive, product-oriented and designed to test a specific hypoth-
esis, which can then be confirmed or rejected, or refined and re-tested’. Certainly,
an emphasis on hypothesis testing and a lack of explicitness about the research
subject in language research dominated the first waves of CL research during
the last decades of the 20th century and the first decades of the 21st.
The notion of representativeness has been a central topic in the design of cor-
pora, predominantly in the field of descriptive linguistics. It is so entrenched in
corpus linguistics that we hardly stop to think what the implications are for our
research ontology and epistemology. Corpus linguistics is based on two empirical
principles according to Stubbs (2007: 130):
So, where do we look? What do we look at? Hunston (2002: 14) distinguishes different
uses of corpora. One of them is the use of ‘general corpora […] to establish norms
of frequency and usage against which individual texts can be measured’. This is
an excellent instance of the assumed epistemology which also infuses the study of
usage: objectivist epistemology (Gray, 2018). Objectivism has met huge criticism
in the social sciences (Gray, 2018), and we surely cannot escape the fact that there
is some danger in believing that, because an objectivist paradigm is in place, our
research necessarily presents ‘objective facts and established truths’ (Gray, 2004: 18).
In this book, we have tried to adopt an approach that is aware of how contrast,
frequency and representativeness are affected by the corpora analysed, as well as
by the inquiry methods employed – among others, the tagger, the corpus management
tools, or the very finiteness of the corpora used. In doing so, we aim to
strengthen our findings and claims by presenting a critical perspective on how
researchers investigate our field, and on how we try to distance ourselves from
our own observations on usage.
When we speak or write, we unconsciously use language as the vehicle to
express our ideas. When a friend calls, or when we reply to an email, we use the
language that we deem fit for the purpose of maintaining communication. Corpus
linguistics has provided evidence that the language we use when speaking on the
phone, or writing an email, adapts to different situations, in such a way that each
register displays a distinctive frequency and distribution of linguistic
features. Biber & Conrad (2009: 47) have suggested that registers can be studied by
‘describing the situational characteristics of the register; [by] analyzing the [most
common] linguistic characteristics of the register; and [by] identifying the func-
tional forces that help to explain why those linguistic features tend to be associated
with those situational characteristics’. Table 2.1 offers guidance in understanding
the basics of register.
Table 2.1 Basic situational characteristics of a register: participants; relations
among participants; topic; communicative purposes; channel of communication
In other words, all phone conversations, emails, textbooks, laws, texts, service
encounters and so on share a common set of formal, linguistic features that can
be analysed through CL methods. Even within the same broad register category,
differences tend to be significant. For example, the linguistic differences between
a textbook and a research paper can be reduced and quantified in terms of the
features used in those texts. In order to understand how a register is constituted,
we need to explore the situation in which it is used, paying special attention to a
myriad of factors that are based on both systemic functional linguistics and func-
tional linguistics (Halliday, 1978; Biber, 1988).
Register analysis seeks to explain linguistic data in the light of language vari-
ation across different dimensions of use, that is, different sets of co-occurring
linguistic features that display distinct functional underpinnings (see Figure 2.2,
based on Biber & Conrad, 2009). Douglas Biber’s seminal work in this area (Biber,
1988) identified five broad dimensions of use that explain the underlying motiv-
ations of speakers when using their language. These dimensions are:
that-deletion (i.e. omitting ‘that’ from a sentence such as ‘I think [that] you
are right’). Texts that score high on the other end of the continuum typically
display dense informational structures in noun phrases, with multiple nouns
premodifying noun headwords in noun phrases.
• Dimension 2: narrative vs. non- narrative concerns. Texts that display a
narrative orientation, such as different types of fiction, novels, etc. show a
high frequency of past tenses, third person pronouns, perfect aspect tenses
and opinion verbs.
• Dimension 3: explicit versus situation-dependent reference. Texts with explicit
reference tend to be written registers, while situation-dependent ones tend to
be spoken. Texts that score
high on explicit features display a high frequency of wh-relative clauses in
object positions, wh-relative clauses in subject positions, phrasal coordination
and nominalisations.
• Dimension 4: overt expression of persuasion. This dimension is associated
with the expression of our point of view and/or with the use of argumen-
tation to persuade the interlocutor. Texts that score high on this dimension
display a high frequency of infinitives, prediction modals, persuasion verbs,
conditional subordination and necessity modals.
• Dimension 5: abstract vs. non-abstract information. Texts scoring high on
this dimension are highly abstract and display a high frequency of conjuncts,
agentless passives and by-passives.
These dimensions explain how the frequency and distribution of formal linguistic
features affect usage. Although we will not approach the study of register using
Biber’s multidimensional analysis methodology in this book, an understanding
of the theoretical foundations of the register-related linguistic theory will let us
explore how situational differences correspond to systematic linguistic differences.
Corpus linguistics often explores the linguistic properties of texts against the
backdrop of the register to which they belong, which provides opportunities to home
in on usage across users or other comparable corpora. This awareness of the differences
in language use will make us more appreciative of how registers can constrain
usage. Table 2.2 gives a breakdown of textual features and textual data that are
likely to be examined in register analysis.
Notes
1 www.educ.cam.ac.uk/research/projects/tseda/index.html
2 www.educ.cam.ac.uk/research/projects/tseda/Information%20for%20teachers%20
T-SEDA%20180618.pdf
3 www.qsrinternational.com/nvivo/home
4 www.maxqda.com
5 www.laurenceanthony.net/software/antconc/
6 https://info.leximancer.com
7 Mike Scott, widely known for developing WordSmith Tools and for his research at
Aston University, has noted that the sheer power of the tools and the corpora have
brought about not a simple quantitative change but a qualitative one, too (Viana,
Zyngier & Barnbrook, 2011).
References
Atkinson, J.M. & Heritage, J. (1984). Introduction. In Atkinson, J.M. & Heritage, J. (Eds.),
Structures of Social Action. Cambridge: Cambridge University Press, 1–15.
Bergmann, J.R. (2004). Conversation analysis. In Flick, U., von Kardoff, E. & Steinke, I.
(Eds.) A companion to qualitative research. London: Sage Publications Limited, 296–302.
Bergström, G., Ekström, L. & Boréus, K. (2017). Discourse analysis. In Boréus, K. &
Bergström, G. (Eds.) Analysing text and discourse: Eight approaches for the social sciences.
London: Sage Publications Limited.
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University
Press.
Bond, M., Zawacki-Richter, O. & Nichols, M. (2019). Revisiting five decades of educational
technology research: A content and authorship analysis of the British Journal of Educational
Technology. British Journal of Educational Technology, 50(1), 12–63.
Boréus, K. & Bergström, G. (2017). Analysing text and discourse: Eight approaches for the social
sciences. London: Sage Publications Limited.
Braun, V. & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in
Psychology, 3, 77–101.
Callies, M. (2015). Learner corpus methodology. In Granger, S., Gilquin, G. & Meunier, F.
(Eds.) The Cambridge handbook of learner corpus research. Cambridge University Press, 35–55.
Coates, J. K. & Pimlott-Wilson, H. (2019). Learning while playing: Children’s forest school
experiences in the UK. British Educational Research Journal, 45(1), 21–40.
Flynn, N. (2007). What do effective teachers of literacy do? Subject knowledge and peda-
gogical choices for literacy. Literacy, 41(3), 137–146.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Gray, D.E. (2018). Doing research in the real world. 4th Edition. London: Sage Publications
Limited.
Hadley, G. (2014). English for academic purposes in neoliberal universities: A critical grounded theory.
New York: Springer.
Halliday, M.A.K. (1978). Language as social semiotic: the social interpretation of language and
meaning. London: Edward Arnold.
Hennessy, S., Rojas-Drummond, S., Higham, R., Torreblanca, O., Barrera, M.J.,
Marquez, A.M., García Carrión, R., Maine, F. & Ríos, R.M. (2016). Developing a coding
scheme for analysing classroom dialogue across educational contexts. Learning, Culture and
Social Interaction, 9, 16–44.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. Pragmatics and
Beyond, 125, 13–34.
Kellams, S. (1975). Research studies on higher education: A content analysis. Research in
Higher Education, 3(2), 139–154.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
Mercer, N. (2010). The analysis of classroom talk: Methods and methodologies. British
Journal of Educational Psychology, 80(1), 1–14.
Parker, I. (2004). Discourse analysis. In Flick, U., von Kardoff, E. & Steinke, I. (Eds.) A com-
panion to qualitative research. London: Sage Publications Limited, 308–312.
Rogers, R. (2011). Critical approaches to discourse analysis in educational research. In Rogers,
R. (Ed.) An introduction to critical discourse analysis in education. London: Routledge, 29–48.
Sandelowski, M. & Barroso, J. (2003). Classifying the findings in qualitative studies.
Qualitative Health Research, 13, 905–923.
Schegloff, E. (2007). Sequence organisation in interaction: A primer in conversation analysis.
Cambridge: Cambridge University Press.
Seedhouse, P. & Walsh, S. (2010) Learning a second language through classroom inter-
action. In Seedhouse P., Walsh S. & Jenks C. (Eds.) Conceptualising ‘learning’ in applied lin-
guistics. London: Palgrave Macmillan, 127–146.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J.M. & Coulthard, M. (1975). Towards an analysis of discourse: The English used by
teachers and pupils. Oxford: Oxford University Press.
Stubbs, M. (2007) On Texts, Corpora and Models of Language. In Stubbs, M., Hoey,
M., Teubert, W. & Mahlberg, M. (Eds.) Text, Discourse and Corpora: Theory and Analysis.
London: Continuum, 127–162.
Vaismoradi, M., Turunen, H. & Bondas, T. (2013). Content analysis and thematic analysis:
Implications for conducting a qualitative descriptive study. Nursing and
Health Sciences, 15, 398–405.
Vrikki, M., Kershner, R., Calcagni, E., Hennessy, S., Lee, L., Hernández, F., Estrada,
N. & Ahmed, F. (2019). The teacher scheme for educational dialogue analysis (T-
SEDA): developing a research-based observation tool for supporting teacher inquiry into
and the co-author of a book about corpus analysis in language education (Scott &
Tribble, 2006) has stated the following on this point:
We could not agree more. Corpus linguistics methods can be used by any researcher
interested in exploring how language is used in textual data, whether it is a text
originally intended to be printed or published electronically, or a transcription of
an interview or a conversation. In CL methods, we can see a type of reduction
effort that, either by examining or comparing large datasets, extracts regularities
and patterns. Scott & Tribble have described this process almost in terms of what
a cook does when reducing a sauce or stock. A researcher will eventually boil down
language and keep a refined extract:
Scott & Tribble (2006) argue that corpora can serve different purposes and so
researchers can decide to focus on a myriad of research questions, not necessarily
or exclusively linguistic or grammar-related. Scott & Tribble have identified four
aspects of words that can be examined with CL methods:
But from an intermental perspective, we can see that language genres are also
related to conventional, collective ways of thinking in particular communi-
ties and societies. People unfamiliar with a community’s ways with words are
likely to be excluded from its activities. Those familiar with its genres know
how to use language to participate, how to work with others to get things
done. Expert members of communities can use language features to recog-
nize when a particular kind of activity is taking place, and this enables them
easily to draw on past experience relevant to the joint intellectual activity they
become engaged in. Genres are templates for interthinking, which, like all
social conventions, both facilitate and constrain what we do.
(Mercer, 2002: 6)
(Figure: formal aspects of language; language in society)
1 The frequent use of the first-person pronoun I together with believe, feel,
think and, particularly, the use of I know with the adverb not suggests that the
interviewees expressed different degrees of certainty by means of different
linguistic devices. The less frequent use of we suggests the construction of a
group identity, something which was not expected by the researcher.
2 The use of modal verbs and of the subjunctive suggests a range of attitudes
towards the tool analysed that cannot be grasped through theme analysis. The
interviews were conducted in German, so these remarks refer to the peculiarities
of the German modal verb system.
3 The use of quantifiers is peculiar in the dataset analysed. Fest (2015: 63)
suggests that her interviews contain implicit comments ‘phrased as suggestions’
(2015: 64) and found a pattern of use:
When looking at the immediate contexts of these words as given in the con-
cordance lines, the first notable result is that the two most frequent ones, ganz
(pretty) and sehr (very), both co-occur most frequently with the same adjective,
namely gut (good). There are 20 instances of ganz gut (pretty good) and 17
of sehr gut (very good) in the corpus, which shows a slight tendency towards
stressing positive aspects. Praise for the tool is often emphasized by the inten-
sifier very, whereas the construction ganz gut rather equals an only slightly
better than neutral evaluation, like okay.
(Fest, 2015: 63)
This is a great example of how we can use some of the corpus methods to go
beyond the meanings of words (as found in dictionaries) to examine the meanings
of words in context as used by speakers. The analysis of accumulated concord-
ance lines can provide researchers with the chance to examine data and identify
patterns that are difficult to spot due to the amount of variation in actual usage
(Pérez-Paredes, 2017). We will come back to this point when we examine skill six,
reading concordances.
The data were collected as part of case studies of four most improved and
effective secondary schools across diversified school populations in different
socio-economic contexts. These focussed upon the ways in which government
reforms were mediated by principals, senior and middle leaders and teachers in
order to assess the extent to which the primary intentions had been translated
into practice and sustained; and if not, why not. Data were collected over
three phases. Firstly, school principals identified policy initiatives which had
the greatest impact on their schools and explained how they interpreted/
mediated policy. Second, interviews with senior and middle leaders (n=6–8)
and classroom teachers (n=6–8) permitted progressive focusing […] on how
policies were understood, communicated and enacted in each school. Third,
a further visit explored emergent themes with key staff members.
(Gu, 2015)
The original interviews were digitally recorded and transcribed. The interview
protocol, the participant consent form as well as the participation information
sheet can also be found online. The author looked at the transcripts, categorised
and refined emergent themes, and identified topic patterns using grounded
theory to define the emerging variables in the investigation. The transcriptions
of these interviews can be downloaded and further analysed using a wide range
of data analysis methods, including corpus methods. Although the original data
was not intended as a corpus per se, the fact that the transcriptions are available
in text format facilitates our use of this resource as a corpus. Note that when
using resources that have not been designed within our own research project, we
will need to make sure that, if needed, we clean up some of the markup in the
text, such as interview boundaries (‘end of recording’), identification of speakers
(‘interviewer’, ‘teacher’, ‘head’), page numbers, and footers and headers. Once we
have made sure that only the transcribed interviews are in .txt format, we are
ready to upload our data to a corpus management tool.
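The clean-up step described above can be scripted. Below is a minimal sketch in Python; the speaker labels and markers it strips (‘Interviewer:’, ‘End of recording’, bare page numbers) are assumptions about what a particular set of transcripts might contain, not a fixed standard, so the patterns would need adjusting to the actual files:

```python
import re

def clean_transcript(text: str) -> str:
    """Strip assumed transcript markup, keeping only the spoken content."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.fullmatch(r"\d+", line):  # bare page numbers
            continue
        if re.fullmatch(r"end of recording\.?", line, re.IGNORECASE):
            continue
        # drop hypothetical speaker labels such as 'Interviewer:' or 'Teacher:'
        line = re.sub(r"^(Interviewer|Teacher|Head):\s*", "", line)
        kept.append(line)
    return "\n".join(kept)

raw = ("Interviewer: How did the policy change your teaching?\n"
       "12\n"
       "Teacher: It gave us more planning time.\n"
       "End of recording")
print(clean_transcript(raw))
```

The cleaned output contains only the interviewees' words, ready to be saved as .txt and uploaded to a corpus management tool.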
However, we may even discover that there is a proper corpus available, ready
to be used. Say we want to examine children’s writing development in schools.
Durrant (2019) put together and distributed a corpus of school children’s writing
in England collected between 2015 and 2019. This is how the corpus is described
in the UK Data Service website:
This corpus offers a balanced sample of English schools according to the following
categories:
This is a very interesting corpus design that will facilitate the analysis of children’s
writing across a variety of demographic factors and with a huge potential to offer
insights into writing development from a variety of angles. Table 3.1 offers a
breakdown2 of the number of texts collected, the number of schools involved,
the number of students as well as the percentages of English as an Additional
Language (EAL) students and students who qualify for free school meals in each
year sample.
Table 3.1 GIG corpus data
In total, this corpus includes almost 3,000 writing samples from almost 1,000
children studying in 24 schools across England. This is a potentially hugely useful
corpus resource for educational researchers who want to look at writing development
using either cross-sectional or pseudo-longitudinal designs. This corpus has
particular potential because of its representativeness, covering a wide range of
schools and pupils, and can be used in combination with smaller datasets to offer
a baseline for comparison and further scrutiny. The researchers who put together
this corpus stress, however, that it may be used to better understand ‘how language
can express attitudes towards social relations or to help students better develop the
linguistic resources for expressing such attitudes’; in other words, the corpus was
not collected so as to reveal the attitudes of the students themselves. Table 3.2
shows how to
approach our fifth skill, using an existing corpus.
some of the occurrences of the word fact in George Orwell’s 1984 as displayed by
Laurence Anthony’s AntConc3 following a keyword-in-context (KWIC) concord-
ance format.
As McEnery & Hardie (2012: 35) put it, ‘a concordancer allows us to search
a corpus and retrieve from it a specific sequence of characters of any length –
perhaps a word, part of a word, or a phrase. This is then displayed […] as an
output where the context before and after each example can be clearly seen’.
The word fact appears exactly in the middle of the lines, with both preceding and
following context on the left and right. There are 51 occurrences of fact in Orwell’s
1984. The screenshot shows 35 of those. Figure 3.3 shows the same search using
the online corpus management tool Sketch Engine.
Concordance software such as AntConc or Sketch Engine will let you (up)load
your data, search it and export your search results in different formats, which gives
researchers the opportunity to work with their data in different ways. While some
researchers may prefer to have their results in spreadsheets, others will prefer to
store them in text files or even in pdf format. What is crucial, however, is the
fact that concordance lines allow us to examine all the occurrences of a word (or
words) or a lemma (or lemmas) (see Table 3.3) in either a corpus or a single text in
the context in which they occurred.
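The core mechanics of a KWIC display are easy to sketch in a few lines of Python. This toy function is only an illustration of the idea, not of how AntConc or Sketch Engine are actually implemented:

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`,
    with the node centred and `width` characters of context per side."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group()} {right:<{width}}")
    return lines

sample = "It was a fact. The fact that he knew it was, in fact, enough."
for line in kwic(sample, "fact"):
    print(line)
```

Sorting the returned lines, for instance by the text to the right of the node, reproduces the kinds of orderings discussed later in this chapter.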
Sinclair (1991) has stressed that the examination of naturally occurring language
embodies the idea that any aspect of language use depends on its surrounding
context (1991: 5): ‘The details of choice shown in any segment of a text depend –
some of them – on choices made elsewhere in the text'. Sinclair argues that when
we examine uses of the language, we are in fact looking at the 'constraints that
determine the precise relationship of any fragment of text with the surrounding
text' (1991: 6).
44 Approaches to understanding language use
Table 3.3 Essential terminology: lemma
We can therefore think beyond the word unit and become aware of
the connections at the phrasal, sentence and discourse levels, as it is in these units
where constraints at the lexico-grammatical level operate.
Concordance lines can be ordered in different ways. They can be ordered
alphabetically, typically to the right of the central word-form. Sinclair (1991: 33)
argues that this ordering ‘highlights phrases and other patterns that begin with
the central word’. Another convenient ordering is reverse alphabetisation to the
left of the central form, which provides a useful clue to the topic of the passage
if the form is a verb (Sinclair, 1991: 34) or, as in Figure 3.3, this ordering reveals
the syntactic function and meanings of the noun phrase in which fact is found in
1984. In the case of fact in Figure 3.3, by examining concordance lines it is rela-
tively straightforward to identify uses of this noun as in in fact and, maybe, focus
on phrases where determiners such as the or a are used by Orwell. For example, the
fact occurs 17 times in the novel. Table 3.4 shows some of the key instances that
emerge from an examination of the concordance lines.
Most of these uses of fact seem to suggest that the word was used by Orwell
to facilitate apposition in noun phrases. As you can see, we have grouped these
occurrences using formal linguistic criteria, but this is not the only way to proceed.
Instead, we suggest setting aside any assumptions and following a step-by-step
procedure that situates our research inquiry at the centre of the process.
Table 3.5 Essential terminology: types/tokens
and inclusion in (a) The Times during the period 2015–18 and (b) The Guardian
during 2018. The two corpora can, potentially, be useful in helping researchers
understand how these news outlets communicate news that deal with education
and inclusion. The Times corpus is made up of 64 different texts and contains
174,459 words (tokens), of which 25,758 are unique words (types) (see Table 3.5).
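The token/type distinction can be reproduced in miniature with Python's standard library. The tokenisation rule below is a simplifying assumption, which is partly why different tools report slightly different figures for the same corpus:

```python
from collections import Counter
import re

def word_list(text):
    """A very simple word list: lowercase alphabetic tokens and their counts.
    Real corpus tools differ in how they tokenise (punctuation, hyphens,
    case), so their exact figures will differ from this sketch."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return tokens, Counter(tokens)

tokens, counts = word_list(
    "Inclusion matters. Inclusion in education matters to everyone."
)
print(len(tokens))  # token count (running words)
print(len(counts))  # type count (unique words)
```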
Word lists can be simple and offer a ranking of every word in the corpus, from
the most frequent to those that occur only once. They can also offer specific infor-
mation about particular word classes (only nouns, only verbs, etc.), or we can
form lists of lemmas or even POS tags. In the case of The Times corpus, the most
frequent token is to, which is somewhat unexpected, as the article the tends to be the most
frequent word in most English corpora. In this case, this can be explained by the
inclusion of the New Year’s Honours list4 in The Times across four years where the
expression service to is used very frequently. Figure 3.4 consists of a screenshot of
the word list function in Sketch Engine.
The figure shows the interface that Sketch Engine users will see when they run
the word list function. The results can be exported and kept in a variety of
formats, e.g. pdf, but other formats may be more interesting to researchers:
• CSV. A comma-separated values file is a text file that uses a comma to sep-
arate values and stores tabular data. It can be used across a variety of software
(spreadsheets, notepads, etc.) and platforms (Windows, Mac, Linux). In general
terms, CSV files preserve the essential information and can be used in almost
any computer you can think of.
• XLS. A spreadsheet file that can be read by Microsoft Excel and other spread-
sheet software.
• XML. An Extensible Markup Language file offers structured informa-
tion that can be reused in other applications. In the case of a word list, the
XML file offers information about the corpus and subcorpus names as
well as the words and their frequencies enclosed in structural elements
(<corpus></corpus>, <word list></word list>, etc.).
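As an illustration of why CSV is so portable, a word-list export in that format can be read with Python's standard csv module alone. The two-column word/frequency layout is an assumption for this sketch; real exports vary by tool:

```python
import csv
import io

# A stand-in for an exported word-list file; a "word,frequency" layout
# is assumed here for illustration.
csv_data = io.StringIO("word,frequency\nthe,5000\nto,5200\nservice,900\n")

rows = list(csv.DictReader(csv_data))
# Rank the word list from most to least frequent
ranking = sorted(rows, key=lambda r: int(r["frequency"]), reverse=True)
print(ranking[0]["word"])  # the most frequent token
```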
Once we have a word list of the words (or lemmas, tags, etc.) in a corpus or a text,
we can identify candidates for our initial search. Say we want to explore the use of
inclusion and mental. The former occurs 105 times in the corpus, the latter 55 times.
Following Sinclair (2003), Pérez-Paredes, Sánchez-Tornel, Calero & Jiménez
(2011) and Pérez-Paredes, Sánchez-Tornel & Calero (2012), the analysis of con-
cordance lines can be seen as a procedure that follows well-established steps that
start with the selection of a search term known as the node. Let us consider them.
• Step 1: initiate. The researcher observes the words to the left and to the
right of the node. The goal of this step is to come up with a selection of
sequence candidates where typically one or two sequences will be stronger
than the rest in terms of the evidence provided. Sinclair suggests that if the
same words occur in more than half the instances in the sample, it is sensible
to think that the link between these words is pretty strong. I would avoid, how-
ever, being this specific in a context where datasets may vary greatly in length.
If there is not one single word that stands out either to the left or to the right
of the node, Sinclair suggests that we look at word classes instead (is it a noun?
Is it a verb? Or adverb?).
Let us begin with a search on The Guardian corpus of education and inclusion,
a corpus of 68 texts published by this newspaper in 2018. A search of the term
inclusion returns 89 hits. What we want to know is whether the term is used in
specific ways and whether those ways can be identified through the examination
of concordance lines. We will follow the procedure above, trying to illustrate the
outcomes at every stage. We discuss this model further in chapter 7.
Step 1: initiate
We have 89 occurrences of inclusion in the corpus. We can start by sorting the
words to the left of the node alphabetically. This is what we find:
Step 2: interpret
It seems that inclusion tends to be used in adjacency or in coordination with diversity
and when premodified by an adjective it is social.
Step 3: consolidate
Diversity and inclusion occur 11 times in the entire corpus, that is, 12% of the
occurrences of inclusion are found in this coordinated phrase. Diversity occurs 55
times, so 20% of the times this word occurs with inclusion in coordinated phrases.
When the context is expanded, we look further to the left of the node and we can
develop a better sense of how patterns are used. The following three excerpts
from The Guardian corpus shown in Table 3.6 capture the width of meanings
represented across the 11 concordance lines examined.
(A) represents the use of diversity and inclusion as part of the role of an
authority; (B) stands for cases where diversity and inclusion appear to be
neglected by a group of people or an organisation; finally, (C) represents uses
where diversity and inclusion is used in noun phrases (diversity and inclusion task
force) to signal groups of people and institutions working towards achieving
inclusion in society. This is obviously an interpretation of the contexts where
diversity and inclusion occur. However, it is an interesting one where evidence of
usage is provided. Furthermore, every single instance of use in the corpus has
been examined.
Step 4: report
After steps 1, 2 and 3 we have noted that the contexts to the left of the word
inclusion seem to favour the use of the word in coordinated phrases together with
diversity.
Step 5: recycle
So far, we have looked at the context to the left of inclusion. Let us now examine
what happens to the right of the node. Let us first order the concordance lines
alphabetically and then go over steps 1, 2 and 3 again. It appears that the following
trends emerge:
Step 6: results
The use of inclusion in The Guardian 2018 corpus suggests that the term tends to
be used in conjunction with words such as diversity and equity when it is used to
Our third corpus, Corpus del Español includes Spanish texts from 1200 to
1900 and it was compiled as a diachronic corpus where researchers track the
evolution of the Spanish language over eight centuries. Finally, our fourth corpus,
French Web 2012, is a massive crawled corpus. It contains almost 10 billion words
extracted from websites in French. The TenTen Family10 of corpora have been
collected following the same criteria and can be regarded as comparable cor-
pora. In general, corpora can be situated on a continuum that ranges from those
that are small, carefully designed and well documented to those that are massive
and highly informative but poorly structured in terms of the different genres
represented (e.g. news, academic publications, fiction, blogs, forums, etc.).
Let us go back to The Guardian 2018 corpus. When we load the corpus file on
AntConc and run a word list search, we obtain the size of the corpus. Figure 3.6
shows the screen shot of this search.
The corpus has 73,259 words (tokens) and 8,967 types (different words).
Figures 3.5 and 3.6 are extremely relevant in our work with corpora. We know
that 73,259 is the absolute token size of the corpus, and the word education, for
example, occurs 357 times in this corpus. Is this high frequency? Is it not? There
is no way we can know. We need to normalise this frequency count in order to
understand its true significance. According to Brezina (2018: 43), relative fre-
quency is ‘the mean […] of the frequencies of the word in hypothetical samples of
x tokens from the corpus, where x is the basis for normalisation’. To calculate the
relative frequency of education we divide its absolute frequency (357) by the number
of tokens in the corpus (73,259) and multiply it by a basis for normalisation. For
example, if we choose 100,000 as our basis for normalisation, the formula will
be (357/73,259) x 100,000 and the relative frequency will be 487 per 100,000
words. If we chose a different basis, for example 10,000, the relative frequency
of education in The Guardian 2018 corpus would be about 49 per 10,000 (or, simi-
larly, about 4.9 per 1,000 words). Using relative frequencies is absolutely essential to compare fre-
quencies across different corpora. The most usual bases are 10,000, 100,000 and
1,000,000. If we are working with small corpora, smaller bases for normalisation
are more appropriate (Brezina, 2018), but which base to use is ultimately the deci-
sion of the researcher.
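The normalisation just described is a one-line calculation. A sketch in Python, using the figures for education in The Guardian 2018 corpus:

```python
def relative_frequency(absolute_freq, corpus_tokens, basis=100_000):
    """Normalised (relative) frequency: occurrences per `basis` tokens."""
    return absolute_freq / corpus_tokens * basis

# education occurs 357 times in a corpus of 73,259 tokens
print(round(relative_frequency(357, 73_259)))             # per 100,000 words
print(round(relative_frequency(357, 73_259, 10_000), 1))  # per 10,000 words
```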
We can also run a frequency test that looks at lemmas rather than words. This
is probably a good idea if we are not interested in differences between the sin-
gular or plural forms of a noun, or in the different tense inflections of a verb. In
AntConc, we can activate this option and load a lemma list. Figure 3.7 shows the
three steps required to do this.
We need to load a lemma list that AntConc can use to perform this analysis.
There is a wealth of online resources that will meet most of your needs, so don’t
worry too much if you are new to corpus analysis. We suggest that you visit Mike
Scot’s website and download one of the lemma lists there.11 You will need to load
this lemma list and then click Apply. Once you have done this, you can go to
Word List and run a new analysis. The results will appear in the form shown in
Figure 3.8.
On the left-hand side, we can see the lemma overall counts. For example,
the lemma school occurs 672 times in the corpus. On the right, we can find the
breakdown of the different words (tokens) that are part of this lemma and their
corresponding count (school, schooled, schooling, schools). These forms were
defined in the lemma list that we downloaded, and they may not necessarily
coincide with how we want to parse our lemmas, so it is worth exploring other
alternatives or even compiling our own list of lemmas. Although most lemma
lists will meet our needs, it is necessary that we fully understand the range of
forms that are attributed to each lemma stem (i.e. researcher is not a form in the
lemma research in the aforementioned list). At the top of Figure 3.8 we can see that
the frequency-related information provided is lemma sensitive: while the lemma
tokens are the same as the number of word tokens, the lemma types are lower
than the word types (7,000 vs. 8,967). In Sketch Engine, the information about
corpus size has a dedicated window. The tokens in Sketch Engine include punctu-
ation, so you should expect to find differences between corpus management tools.
We will come back to this idea later.
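Applying a lemma list is, at bottom, a lookup from word form to lemma before counting. A minimal sketch, in which the tiny lemma list is invented for illustration (real lists, such as those distributed for WordSmith Tools, are far larger):

```python
from collections import Counter

# A toy lemma list mapping word forms to their lemma. Its groupings may
# not match your own analysis, so always inspect the list you use.
lemma_list = {
    "school": "school", "schools": "school",
    "schooled": "school", "schooling": "school",
}

tokens = ["schools", "schooling", "school", "teacher"]
lemmas = [lemma_list.get(t, t) for t in tokens]  # unknown forms stay as-is
print(Counter(lemmas))
```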
When discussing the frequency of a word or a lemma, it is necessary that we
consider the concept of the dispersion of that word or lemma across our corpus.
Brezina has defined it in the following way:
AntConc has a tool called ‘concordance plot’ that will let us explore visually where
in the corpus a particular feature is more frequent. If a corpus is made up of
many individual texts, it is a good idea to upload the individual files to AntConc
(see Figure 3.6, left side of the screenshot) so that the exploration of dispersion is
truly relevant. If all the texts are concatenated into one single file, the use of such
a tool is discouraged. Table 3.8 summarises what is involved in skill seven, hand-
ling frequencies.
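If the corpus is kept as individual files, a rough dispersion check is straightforward: count the hits per file and see how evenly they spread. A sketch with invented texts and file names:

```python
from collections import Counter

# Per-file hit counts for a node word across a corpus of individual texts.
# The texts and file names below are invented for illustration.
corpus = {
    "guardian_01.txt": "inclusion and diversity in schools",
    "guardian_02.txt": "funding cuts hit schools hard",
    "guardian_03.txt": "social inclusion remains a priority for inclusion officers",
}

hits = Counter()
for filename, text in corpus.items():
    hits[filename] = text.lower().split().count("inclusion")

print(dict(hits))
# Files with zero hits matter too: a word concentrated in one file is
# poorly dispersed even if its overall frequency is high.
```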
3.4 Collocations
The term collocation is used in corpus linguistics to denote the idea that ‘important
aspects of the meaning of a word [or a lemma or other linguistic unit] are not
contained within the word itself [...] but rather subsist in the characteristic asso-
ciations that the word participates in, alongside other words or structures with
which it frequently co-occurs’ (McEnery & Hardie, 2012: 122–123). This idea has
important implications. In theme analysis and other methods, the semantics of
the units analysed is rarely discussed or problematised, as if word meanings were
obvious and their identification straightforward. Evert (2007: 4) defined colloca-
tion as follows: ‘a combination of two words that exhibit a tendency to occur near
each other in natural language, i.e. to cooccur […] The term ‘word pair’ is used
to refer to such combination of two words […]’.
Sinclair (1991: 113) argued that the core meaning of a word is not a de-lexical
one and that ‘frequent words have less of a clear and independent meaning’.
Despite the limitations of an overemphasis on collocation (McEnery & Hardie,
2012), the analysis of individual word meaning in isolation may misrepresent how
language is actually used. Collocates (Table 3.9) are the words that co-occur with
node words in a corpus.
There are at least two major ways to identify collocations. One of those is to ask
native speakers of a language to identify them. However, this methodology may
be flawed as our intuitions about language are affected by so many variables, e.g.
our own memory, retrieval routines, previous experience with different domains
of the language, etc. One of the variables that affects our intuitions as native
speakers is the frequency of occurrence of words in the language. Evert (2007)
describes these collocations as a phraseological, theoretical notion. Siyanova-
Chanturia & Spina (2015) looked at the intuitions about collocation frequency
in L1 and L2 Italian (80 noun-adjective collocations). These researchers found
that ‘both native speakers and (advanced and intermediate) non-natives were
Table 3.9 Essential terminology: collocate
Figure 3.9 AntConc collocate window
As expected, low frequency words tend to appear at the top of the ranking, as all
ten words occur only once (see Figure 3.9). So what does this analysis tell us? The
MI score reveals that these words tend to collocate with inclusion, that is, when they
are used in The Guardian 2018 corpus they tend to appear in the vicinity of the
node word.
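The MI score itself is simple to compute once the frequencies are known. The sketch below uses one common formulation (observed versus expected co-occurrence within the collocation window) with invented figures; individual tools differ slightly in how they correct for span size:

```python
import math

def mi_score(f_node, f_collocate, f_pair, corpus_size, span=10):
    """Mutual information for a node/collocate pair:
    log2(observed / expected), where the expected co-occurrence
    frequency assumes the two words are independent within the
    collocation window (here, 5 words either side = span of 10).
    Tools vary in the exact span correction they apply."""
    expected = f_node * f_collocate * span / corpus_size
    return math.log2(f_pair / expected)

# Invented figures: the node occurs 89 times, the collocate 5 times,
# and they co-occur 4 times in a 73,259-token corpus.
print(round(mi_score(89, 5, 4, 73_259), 2))
```

A high score indicates that the pair co-occurs far more often than chance would predict, which is why rare words co-occurring even once or twice can top an MI ranking.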
Once we have obtained a list of collocation candidates like the one in Figure 3.9,
we need to examine it and consider each of these collocates in isolation (Hunston,
2002). This is very much qualitative analysis and involves three steps: (a) explor-
ation of the context(s) in which the collocate appears, (b) building a hypothesis
about the meaning of the extended unit (node + collocate) and (c) reporting of
our finding(s). The first step involves a careful examination of the context in
which the collocate appears. In most software tools, this involves clicking on the
word (‘2’ in Figure 3.9) and considering the span size of our search (five words to
the left and five words to the right in this case). After clicking on warmth, we will
obtain the following: OU’s great strengths of warmth and inclusion in an academic com-
munity. Obviously, we need more context to interpret this. We click again, and we
will have access to the precise point where this micro context comes from in the
corpus. In this case, this collocate is found in one of three letters to The Guardian
published together under the headline Our Open University has become a daydream.15
This is the paragraph where warmth is found:
As an OU tutor in Northern Ireland in the early 70s I saw how the tutorial
system, regular meetings at study centres and summer schools were a vital and
intellectually stimulating part of the students’ experience. Many were studying
for the first time, others to increase existing qualifications, and some for the
sheer pleasure of learning. All had something to teach each other. The chilling
new proposals for an OU based on service centres and media platforms oper-
ating in the cloud, are bound to kill off the OU’s great strengths of warmth
and inclusion in an academic community. I remember Peter Horrocks when
he was head of TV news at the BBC and I was a daily newscaster. He was a
man of dry intellectual brilliance, but his admitted shyness revealed a crippling
lack of social skills. He is not the visionary needed to lead and encourage the
Open University community to grow and prosper in the 21st century.
(Letters, The Guardian, 14 January 2018)
Now we have a clear picture of how the use of inclusion is operationalised by one of
the letter writers, a former OU tutor. Service centres and cloud media platforms
are positioned against traditional values of the OU: community, inclusion and
warmth. In terms of analysis, we can now potentially move to a different collo-
cate. Normally, one looks at the significant collocates but limits the exploration to
a certain number of them (the 10, 20 and up to 100 most frequent; only nouns,
only verbs, etc.). It is absolutely essential, though, that we understand the extent
to which these collocates co-occur significantly with a node in the context of the
corpus being queried. In the case of the example we are using here, this is a corpus
of news articles and features published in The Guardian during 2018 and which
featured both education and inclusion in their texts. Different research projects will
make use of corpora that are instrumental in understanding how language is used
across a wide range of different contexts.
Now consider Table 3.11, which lists the top 10 collocates in The Guardian 2018
corpus using T-scores.
We can appreciate that most of the top ten collocates are function words such
as articles (the, a, an) and prepositions (of, in). We should not forget that this is
totally expected as T-score tests favour frequency of occurrence, and function
words are more frequent in language use than lexical word classes such as nouns,
verbs, adjectives and adverbs. In spoken English, seven out of ten words are function
words and five out of ten in academic language (Biber, Johansson, Leech, Conrad
& Finegan, 1999). When using the T-score we therefore need to bear in mind that
it favours the co-selection of function words. If we
widen our scope of collocates, we will find equality (top 12, z = 2.9), education (top
18, z = 2.2), issues (top 19, z = 2.2) and social (top 21, z = 2.1). The following list
contains the five co-occurrences of issues and inclusion:
1 ‘[…] authoritarianism and a detraction from more serious issues such as dis-
cipline and inclusion. Olivier […]’
2 ‘[…] Paris Agreement, but also the social and inclusion issues that have so
clearly impacted the […]’
3 ‘[…] work is continuing. we’re looking at issues of diversity, inclusion, in all
of our […]’
4 ‘[…] and self-esteem, peer relationships and social inclusion. But on many
issues more parents were […]’
5 ‘[…] universities should demonstrate how they are addressing issues of race,
equality and inclusion. This is […]’
We note that all of these uses of inclusion are vaguely linked to the mainstream idea
found in dictionaries that ‘everyone should be able to use the same facilities, take
part in the same activities, and enjoy the same experiences, including people who
have a disability or other disadvantage’.16 However, the uses in the five contexts
above go beyond this mainstream notion in different ways and, interestingly, these
concordance lines provide the language evidence of how inclusion is used in the
vicinity of issue(s) in our corpus.
Using a combination of MI and T-scores can be effective in understanding how
language is used by a group of people or in a set of documents, whether these are
interviews with school counsellors, policy drafts or statutes. Collocations that are
Notes
1 https://lexically.net/wordsmith/
2 http://reshare.ukdataservice.ac.uk/853809/33/SummaryCorpusContents.pdf
3 AntConc version 3.5.8 – www.laurenceanthony.net/software/antconc/
4 According to Gov.uk website, the New Year Honours list recognises the achievements
and service of extraordinary people across the United Kingdom. The complete 2019
Honours list is a 119-page pdf document.
5 Labour politician. Minister of State for Education in HM Government from 2005
to 2007.
6 www.english-corpora.org/bnc/
7 www.english-corpora.org/coca/
8 www.corpusdelespanol.org
9 www.sketchengine.eu/frtenten-french-corpus/
10 According to the designers, TenTen corpora are built using technology specialised in
collecting only linguistically valuable web content; www.sketchengine.eu/documentation/tenten-corpora/
11 https://lexically.net/wordsmith/support/lemma_lists.html
12 This is known as span size.
13 Another measure is Z-score. This test looks at the observed frequency, that is, the
actual frequency of the collocation candidates, with the frequency expected, that is,
occurrence of w1 and w2 by chance. A high z score indicates a greater degree of
collocability of an item with the node.
14 More information at: www.sketchengine.eu/my_keywords/logdice/
15 www.theguardian.com/education/2018/jan/14/our-open-university-has-become-a-daydream
16 https://dictionary.cambridge.org/dictionary/english/inclusion
References
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. (1999). Longman grammar of
written and spoken English. Harlow: Longman.
Brezina, V. (2018). Statistics in corpus linguistics. Cambridge: Cambridge University Press.
Bybee, J. (2007). Frequency of use and the organisation of language. Oxford: Oxford University Press.
Davies, M. (2010). The corpus of contemporary American English as the first reliable
monitor corpus of English. Literary and Linguistic Computing, 25(4), 447–464.
Durrant, P. (2019). Growth in grammar corpus 2015–2019. [data collection]. UK Data
Service. SN: 853809, http://doi.org/10.5255/UKDA-SN-853809
Stubbs, M. (2007). On texts, corpora and models of language. In Stubbs, M., Hoey,
M., Teubert, W. & Mahlberg, M. (Eds.) Text, discourse and corpora: theory and analysis.
London: Continuum, 127–162.
Viana, V., Zyngier, S. & Barnbrook, G. (Eds.). (2011). Perspectives on corpus linguistics.
Amsterdam: John Benjamins Publishing.
Villares, R. (2019). The role of language policy documents in the internationalisation of
multilingual higher education: An exploratory corpus-based study. Languages, 4(3), 56.
Chapter 4
For explorations that are designed to capture all the senses of a particular
word or set of words, as in building a dictionary, then the corpus needs to
Using your own corpus 65
So, focusing on (large) corpus size is probably not the best, or at least not the
main, thing to do at the design stage. What does need careful consideration,
however, is how we want to integrate corpus data into our research. In chapter 1,
we proposed two approaches to corpus analysis in educational research. While the
first approach uses a corpus as primary data, the second approach uses a corpus as
secondary data (see Figure 1.4). The choice of approach has important methodo-
logical implications. When we are using a corpus as primary data, we advocate the
use of corpus linguistics as our main research methodology. When we use a corpus
as secondary data, however, we are using corpora and CL methods as part of a
larger research methodology framework, most likely mixed methods and pragma-
tism. In short, the educational researcher has essentially two options when it comes
to integrating CL methods in their own research. For the sake of clarity, we will
broadly discuss and exemplify these two options by exploring two research case studies
that examine education policies. The first research project is set in Australia and
looks at how the National Quality Framework is represented and constructed in the
media. The second project looks at financial literacy education policy in Canada.
The researcher in this project uses CL as a complementary method. These two
studies are just examples of how CL methods can be used to research education
policies but, understandably, offer just a glimpse of the many potential uses.
Fenech & Wilkins researched printed Australian media and their mediatising of
early childhood education (ECE) policy. In particular, the researchers looked at
the ‘representations of, and claims made about, the National Quality Framework
(NQF) –Australia’s system of regulation and quality assurance of ECE and child-
care services –in newspaper media’. (Fenech & Wilkins, 2019: 749). According to
the Australian government1 the NQF sets out ‘to raise quality and drive continuous
improvement and consistency in children’s education and care services’. The
researchers wanted to examine the role of Australian media as a ‘potential discursive
influence on parents’ childcare choice’. Their research questions were the following:
1 What key propositions and claims about the NQF are proffered in the
Australian print media?
2 Are these claims and purported impacts consistent across media organisations
(Fairfax vs. NewsCorp) and newspaper types (‘broadsheet’ vs. ‘tabloid’)?
3 Whose voices and agendas are being heard?
As their interest was to analyse Australian printed media, the researchers had to
come up with a corpus design that could be representative of the whole country.
They included three states (New South Wales, Queensland and Victoria), which
arguably are mostly representative of the eastern part of the country, as northern
Australia and the state of Western Australia were not sampled2. The researchers
then moved on to deciding on the newspapers to examine. They chose the two
papers with the largest circulation in each of the three states from 1 August 2013
to 31 May 2015. In total, 801 newspaper articles that had a focus on childcare
were selected and included in their corpus. Note that the researchers did not set
any target size in order to validate the suitability of their corpus. Instead, they
drew up a set of criteria that they thought would be necessary to meet in order
to obtain the right dataset that could help them answer their research questions.
They used the corpus management tool Wordsmith Tools3 to do the following:
1 Identify which of the initial pool of 801 articles made specific references to the
National Quality Framework or aspects of the National Quality Framework.
2 Discover the keywords that were specific to the group of texts that discussed
the National Quality Framework.
3 Study the evolution of the topics discussed between 2013 and 2015.
As a result, Fenech & Wilkins found that 121 articles had some focus on the NQF,
60% from Fairfax papers (broadsheets) and 40% from News Corp (tabloid) papers.
The authors also identified a set of somewhat expected keywords such as quality
standards, report, qualifications and ratios as well as some other less expected keywords
that emerged from the texts. Their analysis of the topics and voices appearing con-
tinuously during the almost two years’ worth of data allowed the researchers to
focus their analyses on ideas and stakeholders of ‘continuing prominence’ (Fenech
& Wilkins, 2019: 756). In terms of the two main media sources, Fairfax and News
Corp papers, the authors found that most of the keywords identified were ‘dis-
tinct to one of the two media corporations’ (Fenech & Wilkins, 2019: 757). These
differences were discovered by using quantitative methods4 and point to the fact
that different media adopt and promote different positionings of the National
Quality Framework. It was the keyword analysis performed by Fenech & Wilkins
that identified in a precise way the nature and extent of these (lexical) differences.
In short, Fairfax articles seemed to focus on quality whereas News Corp laid
emphasis on care. This is where the power of CL methods lies: statistical analysis
can be used to explore language use and inform further research methods that
could be used by researchers to explore data qualitatively. In this research paper,
the authors carried out a content analysis of those articles that explicitly discussed
the NQF, only 13 of the 121 in the corpus. This is an excellent example of how to
use CL methods and other data analysis methods to explore a research question.
Fenech & Wilkins concluded that most aspects of the educational policy
analysed are actually mediated by journalists and media groups. The two media
companies represented the NQF and positioned themselves in radically different
ways, thus presumably affecting the impact of the implementation of the NQF
agenda as seen by, for example, for-profit and not-for-profit education groups.
In terms of the design of the corpus, researchers need to reflect on a variety
of aspects in what Nelson (2010) has called an initial planning document. We
are offering here a template (Table 4.1) that can be used to help you think about
the planning and subsequent building of your corpus. The questions included
will make you reflect on your research question(s) and other alternative research
methods for data collection and analysis (A, B), the source of your data (C) as well
as the feasibility of the data collection process (E, F).
Once we are sure that a corpus of texts5 is necessary as part of our research, we
need to devise a strategy for data collection. We are assuming that the corpus we
seek to build is not already available, and that we will need to put it together. To
do this we must consider the implications of modelling the types and the nature
of texts to be included in our corpus so that it can answer our research question.
We have just seen how Fenech & Wilkins (2019) made a series of decisions that
sought to ensure that their overarching question could be answered: ‘What key
propositions and claims about the NQF are proffered in the Australian print
media?’ The researchers need to operationalise all of the arguments within their
research question so that they can be dealt with methodologically. Let us illustrate
this process by breaking down Fenech & Wilkins’ overarching question into three
arguments A, B and C.
We need to understand how each of the arguments as set out in Figure 4.1 will
impact our corpus design and data collection. Starting with C, the researchers will
need to devise strategies to collect data in a digital format. If the textual data is
printed, then the researchers will have to come up with an OCR (optical character
recognition) or transcription strategy so that their texts are stored in a machine-
readable format (.txt, .docx, .rtf, or even .pdf). Transcription will be dealt with in
chapter 5 so let us assume that we can find a digital version of our data on the
Internet.
If the target data can be accessed online, the researchers will need to come up
with a plan to make sure that their texts have been cleaned up and can be processed
by a corpus management tool.

Figure 4.1 Breaking down a research question (Fenech & Wilkins, 2019) into
arguments: A = Domain of analysis (what is being talked about); B = Register/
genre; C = Geographical location of data.

By clean texts we mean files that, clean of HTML
formatting or PDF coding, etc., contain the language, and only the language, that
needs to be examined in our research. This follows the general guidelines that
‘the safest policy is to keep the text as it is, unprocessed and clean of other codes’
(Sinclair 1991: 21). Let us break down this process in two parts: obtaining the data
and preparing the data for corpus analysis.
Depending on the scope of our corpus, we may want to obtain the data either
by visiting the online provider website or by using specialised services that can
speed up the process. In the former scenario, we can use Google or any other
search site to locate the information for us. For example, if we wanted to search
for news features that discuss the Australian National Quality Framework (NQF)
in The Age,6 we could do one of the following:
1 Search for ‘National Quality Framework’ in the search box of the paper. Then
click on every article and copy the page or the text. As we obtained 887 results,
this will be a time-consuming task and, probably, not very efficient. The old-
school way to do this would be to save the page first as HTML and then extract
the text. This can be done in many different ways. We could open the HTML file
with a simple text editor. No matter what you use, it will look ugly. You will find
in the file the navigation structure of the site and much more noise that we are
not interested in. However, you can also find some of the metadata that you
may want to keep (Name of the author, date of publication, original URL,
etc.), so it is not a bad idea to screen every single file manually and decide how
you want to save and store your data. Using an Excel file to document this
process is usually a good idea. Alternatively, you can select the text, copy it and
paste it into Notepad (Windows) or TextEdit (Mac). Make sure you save this file
as plain text (.txt extension) with Unicode UTF-8 encoding.7 If you have
some basic knowledge of Python programming, or somebody in your team has
it, web scraping is a great option. All you need is a Python library and to get
familiar with how it pulls out data from html pages. ‘Requests’ and ‘Beautiful
Soup’ are easy to use and will most likely meet your basic needs.
2 Alternatively, we can use a third-party service such as LexisNexis or Factiva
to fetch the texts for us. It is a good idea to double-check whether your
institution or university has access to these services. Factiva8 is an inter-
national news database with over 32,000 sources from 200 countries in 28
languages. The potential to locate text-based sources is, as you can see, huge.
The databases include national, international and regional newspapers,
magazines, journals, newswires (i.e. Reuters), TV or radio podcasts (e.g. BBC,
CNN, ABC, CBS, NBC, Fox, etc.), news and business information websites,
blogs, company reports and the EUR-Lex website. LexisNexis9 is an excel-
lent service that will let you collect all sorts of law-related texts, including
annotation on whether and when an Act was repealed and information on
additional provisions and savings. LexisNexis will also let you search for news
content and select many sources or just one in particular. This type of service
will let you specify the requirements of your search, including search words,
search time span and type of source. A search of ‘mental health’ and ‘educa-
tion’ in blogs in North America during the last year (2018–2019) in Factiva
will return 74 results. These results can be emailed or saved either as pdf or
text files and reused for private, academic purposes. LexisNexis will not let
you download all texts in one single file, but you will have the opportunity to
search within a single source, for example, TES, and then examine every hit
as clean text on screen. Another advantage of these services (Factiva or
LexisNexis) is that we can discover new sources of potential data.10
3 Attention must be given to copyright issues and the anonymisation of the data.
In the European Union, the General Data Protection Regulation imposes strict rules
that may potentially affect how metadata is collected and stored electronically.
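The text-extraction step described in option 1 above can be sketched with the standard library alone; in practice, Requests and Beautiful Soup (named above) are more robust for fetching and parsing real pages. The function below is an illustrative sketch only: it strips a saved HTML page down to its visible text, skipping the script and style content that counts as noise for a corpus.

```python
# A minimal sketch of stripping HTML down to its visible text using only
# the Python standard library. For real pages, Requests + Beautiful Soup
# (mentioned above) are more robust; this just illustrates the basic idea.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

# A toy page for illustration; a real saved article would go here.
page = "<html><head><style>p{}</style></head><body><p>National Quality Framework</p></body></html>"
print(html_to_text(page))  # -> National Quality Framework
```

The cleaned output can then be saved as UTF-8 plain text, ready to upload to a corpus tool.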
Pinto (2013) researched the status of financial literacy education in Canada after
the 2008 crisis. She used narrative policy analysis (NPA) and corpus linguis-
tics methods to examine the narratives around financial literacy education and
uncover the ‘assumptions underlying public policy […] embedded within rhet-
orical devices’. Rather than extending her inquiry to the whole country, Pinto
examined education policies in Ontario as education is under the provincial
jurisdiction in Canada. NPA is interpretive ‘in that it regards a certain form of
storytelling as analytic entrée into policy-relevant experiences with emphasis on
the identification of schemes and tropes, and especially policy metaphors’ (Pinto,
2013: 100). Pinto describes her data collection procedure in detail in the paper.
She examined inductively each of the individual texts collected and decided
whether the narratives favoured financial literacy education (LE) by promoting
an idea of LE as a solution to looming economic uncertainty (the dominant
narrative) or by promoting suspicion about government interest in LE (the counter
narrative). After this interpretive analysis, Pinto used collocation analysis to iden-
tify ‘unique, recurrent semantic devices’ in each subcorpus of texts:
I further ran all text files through a corpus linguistics research software tool,
AntConc 3.2.4, to triangulate my interpretation as the software would identify
collocations that I may have missed. I also searched for schemes and tropes
operating within the data sources with particular attention to the trope of
metaphor as a rhetorical device shaping each of the narratives […] Metaphor
is especially valuable for identifying underlying themes and revealing power
dynamics within policy […] especially given policies are often understood
in symbolic terms […] Certainly, ‘those who will control the metaphors will
ultimately control the action: and those who change the metaphors will ultim-
ately change the action’ [Monin & Monin, 1997: 57].
(Pinto, 2013: 102)
The use of two subcorpora that can be contrasted is a common research design
in CL. This approach was adopted by Pinto, who analysed pro-literacy education
narratives from two different camps by isolating the linguistic devices used by
those defending them. The dominant narrative used the metaphor of a morally
superior crusade that used ‘a relatively neutral language and tone from a rhetorical
standpoint’; the counter narrative used ‘passionate and emotional language to cast
doubt on the true intentions of the crusaders’ (Pinto, 2013: 110).
The type of research in Pinto (2013) differs from the one in the first case study
in this section in that (1) the corpus size and collection criteria are more loosely
defined and (2) the CL methods are subordinate to the overarching research meth-
odology embraced by the researcher. Figure 4.2 illustrates how NPA was used to
elucidate the main types of narrative surrounding financial literacy education in
Canada after 2008.
Note that CL methods were not used to tell those two types of narratives
apart. What is important to understand is that the overarching research question
in Figure 4.2 is not contingent on the application of CL methods. What the
researcher did instead was to submit the texts classified either as dominant or
counter narratives to collocation analysis, and thus triangulate the results. Table 4.2
sums up how the two research projects in this section approached corpus building.
As in most research projects that do not aim at recording linguistic use, data
size per se is not important or, to put it another way, should not determine
the overall quality of the corpus. Despite the literature devoted to the time-
consuming nature of putting a corpus together (Clancy, 2010), we believe that,
well into the 21st century, this can no longer be a defining criterion for what
should count as a good corpus. The availability of online data, processing tools
and corpus management software make CL methods available to every researcher
With around 90% of global economic growth in the next five years expected
to originate outside the European Union, forging a new role for the United
Kingdom on the world stage starts with rising to the exporting challenge –of
which this strategy and the education sector will form a key part. Working
together, we can help UK education reach its full, global exporting potential.
(DfE & DIT, 2019)
The document outlines the objectives for UK higher education, as well as the role
of the government:
The objective of the policy mentions both incoming and outgoing students, as
well as the wellbeing of all the students involved:
low TTR. The reason is simple: in English the most frequent 2,000 words account
for 87.4% of fiction books and 90.3% of spoken communication, which reinforces
the idea that it takes massive corpus data to cover a wide range of types.
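The type–token ratio (TTR) itself is a one-line calculation: the number of distinct word forms (types) divided by the number of running words (tokens). What matters is text length, since frequent words are repeated as a text grows and TTR falls. A quick sketch with invented toy data:

```python
# Type-token ratio: distinct word forms (types) divided by running
# words (tokens). Longer texts repeat their frequent words, so TTR
# drops as token count rises. The sentences below are toy examples.

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

short = "the cat sat on the mat".split()
longer = short * 50  # the same text repeated: no new types, many more tokens

print(ttr(short))   # 5 types / 6 tokens, about 0.83
print(ttr(longer))  # 5 types / 300 tokens, about 0.017
```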
By default, every corpus uploaded to Sketch Engine is part-of-speech tagged,
which means that every word in the corpus is annotated with morphological infor-
mation. This gives researchers the possibility to run sophisticated searches that
combine POS tags and different types of unit (lemmas and words). There are
different POS tagging services and tagsets that can be used, their main differences
being that the tags used display different levels of depth. Sketch Engine uses by
default the English TreeTagger PoS tagset with Sketch Engine modifications, but
other services, and of course other languages, will use other software and tagsets.
The software that performs the analysis of the language and tags every token as
a part of speech is called a ‘tagger’. There are freely available solutions13 such as
the Stanford Part of Speech Tagger14 or the CLAWS free online service15 at the
University of Lancaster (only 100,000 words).
A sentence from the UK policy document such as ‘Our higher education
institutions are amongst the most renowned and prestigious in the world’ will look
like this once it has been POS-tagged on Sketch Engine:
<s>
Our           PPZ    our-d
higher        JJR    high-j
education     NN     education-n
institutions  NNS    institution-n
are           VBP    be-v
amongst       IN     amongst-i
the           DT     the-x
most          RBS    most-a
renowned      JJ     renowned-j
and           CC     and-c
prestigious   JJ     prestigious-j
in            IN     in-i
the           DT     the-x
world         NN     world-n
.             SENT   .-x
</s>
On the left, we can see our sentence in vertical format and next to it the actual tag
that was assigned by the tagger. Our has been tagged as PPZ (possessive pronoun),
higher as JJR (comparative adjective), education as NN (singular noun) and so on.
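If you ever need to process such vertical files outside Sketch Engine, a few lines of code are enough to recover the (word, tag, lemma) triples. The sketch below assumes the three-column, one-token-per-line layout shown above; real vertical files may carry extra columns or attributes.

```python
# Parse a Sketch-Engine-style vertical file: one token per line with
# word, POS tag and lemma columns; <s> ... </s> mark sentence boundaries.
# Assumes exactly three whitespace-separated columns, as shown above.

def parse_vertical(vert):
    tokens = []
    for line in vert.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("<"):  # skip <s> and </s> markers
            continue
        word, tag, lemma = line.split()
        tokens.append((word, tag, lemma))
    return tokens

sample = """<s>
Our PPZ our-d
higher JJR high-j
education NN education-n
</s>"""

for word, tag, lemma in parse_vertical(sample):
    print(word, tag, lemma)
```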
The Sketch Engine English TreeTagger PoS tagset contains 55 tags. CLAWS
tagset 7 contains 137 tags, allowing for further discrimination
between word categories such as adjectives or adverbs, for example. The same
sentence tagged by the CLAWS free online service will return the following:
Our has been tagged as APPGE (prenominal possessive pronoun), higher as JJR
(general comparative adjective), education as NN1 (singular common noun) and
so on. As you can see, the tags are not the same but, in principle, the fact that
different tagsets co-exist should not be something that is of primary importance
in the context of our research. We need to be aware that while different taggers
and tagsets16 will use different categories, the fundamental, broad morphological
categories of analysis will be present in most software.
In practical terms, there are two ways to POS-tag a corpus. If we are using
services such as Sketch Engine, the POS annotation will remain invisible to the
researcher although the search interface will allow us to carry out searches that use
POS tags. The following screenshot from Sketch Engine (Figure 4.3) shows how
we can search nouns, adjectives, verbs, etc. in our UK policy document.
This type of search can be performed only because our corpus has been POS-
tagged. If we are using stand-alone software like AntConc, we will need to upload
a corpus that has already been POS-tagged. This is a different process altogether.
Fortunately, AntConc is highly customisable. Among other things, we can decide
(1) whether we want to see tags or not and (2) what can be considered as a tag (start
and end tag symbol). Figure 4.4 shows the Global Settings window where we can
set up our preferences.
There are two major types of tags: non-embedded tags and embedded tags.
Non-embedded tags are independent of the text being annotated. The following
is an example of a poem by William Blake from the Text Encoding Initiative
(TEI) website17 where an introduction to Extensible Markup Language (XML) is
offered. This example shows how non-embedded tags can be used to annotate
structure and structure elements. Note that every tag, for example <poem>, is
followed at some point by a closing tag </poem>. So, in the following example
we have five different types of tags, both opening and closing tags: <anthology>,
<poem>, <heading>, <stanza>, and <line>.
<anthology>
<poem>
<heading>The SICK ROSE</heading>
<stanza>
<line>O Rose thou art sick.</line>
<line>The invisible worm,</line>
<line>That flies in the night</line>
<line>In the howling storm:</line>
</stanza>
<stanza>
<line>Has found out thy bed</line>
<line>Of crimson joy:</line>
<line>And his dark secret love</line>
<line>Does thy life destroy.</line>
</stanza>
</poem>
</anthology>
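Because every opening tag has a matching closing tag, annotation of this kind can be processed mechanically. As a sketch, the standard xml.etree module in Python can parse the Blake example and count its structural units:

```python
# Count structural elements in the TEI-style Blake example using the
# standard library's XML parser.
import xml.etree.ElementTree as ET

poem = """<anthology>
<poem>
<heading>The SICK ROSE</heading>
<stanza>
<line>O Rose thou art sick.</line>
<line>The invisible worm,</line>
<line>That flies in the night</line>
<line>In the howling storm:</line>
</stanza>
<stanza>
<line>Has found out thy bed</line>
<line>Of crimson joy:</line>
<line>And his dark secret love</line>
<line>Does thy life destroy.</line>
</stanza>
</poem>
</anthology>"""

root = ET.fromstring(poem)
print(len(root.findall(".//stanza")))  # 2
print(len(root.findall(".//line")))    # 8
```

The same approach scales to any well-formed annotated corpus file: once the structure is parsed, stanzas, lines, utterances or turns can be counted and extracted programmatically.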
Figure 4.5 TagAnt interface
Once a corpus has been POS-tagged, we are ready to query our texts by exam-
ining how the lexical items and the grammatical categories are related. The
corpus size will need to be reported and considered when calculating the relative
frequency of occurrence of the linguistic items explored in the corpus. Where
to start? A word list of the most frequent nouns, or verbs, provides us with a first
glimpse of the lexical items used in the two policy documents. Table 4.4 shows the
20 most frequent nouns in both policy documents and their relative frequencies
per 1,000 words.
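Normalising a raw count to a relative frequency is a one-line calculation. The counts below are illustrative assumptions, not the actual figures from either policy document:

```python
# Relative frequency: a raw count scaled to occurrences per 1,000 words,
# so corpora or documents of different sizes can be compared directly.

def per_thousand(raw_count, corpus_size):
    return raw_count / corpus_size * 1000

# Hypothetical figures for illustration only.
print(per_thousand(78, 12000))  # 6.5 occurrences per 1,000 words
```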
The information presented in Table 4.4 shows us how some nouns are more
frequent in each of the two policies. For example, while in the UK policy there
seems to be a high use of trade, export and market, in the NZ policy there seem to
be slightly more frequent references to market, quality and region. This is an
excellent point of departure for our exploration of both policy documents. We can
either focus on the lexical items that are identical in both documents or, alterna-
tively, look at what is unique, more frequent or not frequent at all, in one of the
documents. Using this list as suggested in previous chapters, we are now ready to
examine the contexts of use of some of these items in every dataset. For example,
the use of trade in the UK policy document is almost exclusively linked to the
activities of the Department for International Trade, and in 33 of the 78
concordance lines analysed it is used with the auxiliary verb will, followed by work
ten times and encourage nine times. In the latter case, most of these uses appear in
an action subsection
of the document. These are the nine concordance lines (Table 4.5) where encourage
follows The Department for International Trade will in the document.
Concordance lines 6–9 are repeated in the document as the will predictions are
revisited in terms of the timeline for their implementation. Although the analysis
of nouns will be discussed in chapter 6, even with the somewhat limited insight we
have gained from examining some of the top nouns in the UK document, we can
note how trade and related concepts play a substantial role. A further colligational
analysis of trade as a noun will reveal the following:
None of these analyses could have been carried out if the datasets had not been
POS-tagged. In terms of comparison between the two policy documents, we
could use either descriptive or inferential statistics. Descriptive statistics will give
us a measure of the quantity of word classes used. In Table 4.6, we can see the
number of raw lemmas (in brackets) and the relative frequency per 1,000 lemmas.
Note that we needed the total of lemmas in the UK and NZ policy documents,
2,012 and 1,227, respectively, to calculate the relative frequencies. What these fre-
quencies tell us is that both policy documents used a very similar range of word cat-
egories, which is expected given their similar nature and functions. However, this is
1 […] this it needs reliable information on where the best opportunities are for
different types of providers. The Department for International Trade will
encourage the growth of the early years market by sharing more intelligence
with the sector about the scale and scope of […]
2 […] growth, including in the European market where we are seeing growing
demand for UK schools. The Department for International Trade will
encourage independent schools to access international opportunities, using
improved education exports data to […]
3 […] bodies across the UK, a number that is forecast to increase. The
Department for International Trade will encourage a greater proportion
of UK skills organisations to consider taking their offer internationally, where
[…]
4 […] new and existing providers, and to improve the overall evidence base around
best practice and impact. The Department for International Trade will
encourage the sector to grow TNE by engaging in dialogue with countries with
recognised export potential.
5 […] international objectives. It is this physical presence that the UK government
can help facilitate. The Department for International Trade will
encourage the EdTech and educational supplies sector to engage with buyers
both in the UK and overseas.
6 […] basis given the differences in demand from different parts of the world.
Completion Spring 2020 Action 11. The Department for International
Trade will encourage the growth of the early years market by sharing more
intelligence with the sector about the scale and scope of […]
7 […] that want to expand their offer to find the best export opportunities for them.
Completion Spring 2020 Action 12. The Department for International
Trade will encourage independent schools to access international
opportunities, using improved education exports data to […]
8 […] to raise awareness of the ELT offer for the benefit of the UK education sector.
Ongoing: review Spring 2020 Action 17. The Department for International
Trade will encourage a greater proportion of UK skills organisations to
consider taking their offer internationally, where […]
9 […] will focus on countries of particular interest and opportunity for the sector.
Completion Spring 2020 Action 22. The Department for International
Trade will encourage the EdTech and educational supplies sector to engage
with buyers both in the UK and overseas.
• Modal verbs (in particular will and can) are statistically more frequent (11.49)
in the UK policy document. The LogRatio, which measures the effect size, is
0.61. These are five of the 197 concordance lines where will is used in the UK
document:
s strategy and the education sector will form a key part. Working together,
that only government can give. We will seek to grow education exports and i
ses to Grow on the World Stage. We will seek to use the opportunities presen
across the UK education sector. We will champion the breadth and diversity
ondem with the education sector, we will provide the practical solutions and
• The preposition for is statistically more frequent in the UK document (11.40).
The LogRatio is 0.58. Most of these uses involve a prepositional phrase that is
complementing a noun (reputation, opportunities, processes). These are five of the
293 concordance lines where for is used in the UK document.
Setting the foundations for global success. The UK’s global
and embrace our ambitious objectives for the education sector. The
Government
The UK has a global reputation for education, characterised by excellen
first by international students for student experience across several mea
the UK, we are the European leaders for education technology. Our cultural
Other relevant POS tags that were statistically more frequent in the UK docu-
ment were the use of plural nouns expressing time (years), the use of existential there
(there is) or the use of more (see Table 4.8 for information on significance). These
three POS tags display a LogRatio of around 1.85, which suggests that they are
over three times as frequent in the UK policy document. Table 4.7 summarises what
is involved in comparing two corpora, and Table 4.8 summarises each of the stat-
istical tests we have mentioned so far.
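The LogRatio statistic is the binary logarithm of the ratio between the two relative frequencies, so a value of 1 means an item is twice as frequent in one corpus, 2 means four times as frequent, and so on. A minimal sketch, using invented counts rather than the actual figures from the two documents:

```python
import math

# LogRatio: log2 of the ratio of relative frequencies in two corpora.
# LogRatio 1 = twice as frequent in corpus 1; 2 = four times; etc.
# Real implementations add smoothing for zero frequencies; this sketch
# assumes both counts are non-zero.

def log_ratio(freq1, size1, freq2, size2):
    rel1 = freq1 / size1
    rel2 = freq2 / size2
    return math.log2(rel1 / rel2)

# Illustrative counts only (not the actual figures from the documents):
print(log_ratio(20, 10000, 10, 10000))            # 1.0 -> twice as frequent
print(round(log_ratio(36, 10000, 10, 10000), 2))  # ~1.85 -> about 3.6 times
```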
4.3.1 Chapter 1
In this chapter, we discussed the use of corpora as tools that can generate insights
into language usage. We examined some corpora that are freely available to
researchers and that have been widely used to examine language use.
Visit Mark Davies’ website: english-corpora.org
How do you think we can use Mark Davies’ TV corpus in our research? And
the Corpus of Historical American English (COHA)? Visit the Google Books
corpus and the Hansard Corpus. Check out Table 4.9 and try to come up with
some of the potential uses of these corpora for educational research.
Table 4.8 Skill 11: statistical tests
Choose one of the corpora above. See Table 1.4, Skill 1. In which ways is fre-
quency represented there? Why is it of relevance? See Table 1.5, Skill 2. How can
you make sense of the normalised and the raw frequencies in this corpus? What
do they tell you?
4.3.2 Chapter 2
See Table 2.1, Skill 3. Can you define what a register is? Can you think of the range
of registers that you use on an everyday basis? Can you identify at least three?
Now go to Table 2.2, Skill 4. Go through the list of linguistic features enumerated.
Can you think how those three registers you have just identified in your everyday
life differ in terms of the frequency of use of some of these features?
4.3.3 Chapter 3
Go to Table 3.2, Skill 5. Think about your own research interests. Can you think
of a project in which a corpus would have been useful? In which ways? What sort
of corpus would be ideal?
See Table 3.7, Skill 6. What do concordance lines reveal? How do we generally
approach the analysis of concordance lines?
Go to Table 3.8, Skill 7. Why is dispersion of interest? How are frequency
counts and dispersion related?
Now go to Table 3.12, Skill 8. How can we interpret collocations? What do they
tell us about a node and its collocates? How are they calculated?
4.3.4 Chapter 4
Go to Table 4.3, Skill 9. Outline three areas at least that need to be considered
before designing your own corpus. Why are they relevant? In what ways can they
impact your data collection and future analysis of the data?
See Table 4.6, Skill 10 and 4.7, Skill 11. Before actually comparing two cor-
pora, what needs to be done? Revisit Figure 1.2. How do the two alternatives there
affect your comparison of the two datasets? What are the implications?
Notes
1 www.acecqa.gov.au/nqf/about
2 The authors acknowledged this bias in their paper and justified it because two of the
partner childcare organisations supporting their research are based in these three states.
3 The current version of WordSmith Tools is version 7; www.lexically.net/wordsmith/
4 We will deal with keyword analysis in chapter 6.
5 The term ‘text’ is used in this book to denote both spoken and written language.
6 A daily newspaper published in Melbourne, Victoria, Australia, owned by Fairfax.
7 By using UTF-8 we make sure that all Latin-script alphabets, Greek, Cyrillic, Coptic,
Armenian, Hebrew, Arabic, Chinese, Japanese and Korean characters, mathemat-
ical symbols, and emojis can be read and interpreted by our machine. Source: http://
unicode.org/main.html
8 https://professional.dowjones.com/factiva/
9 www.lexisnexis.com
10 Note that most education databases are excellent sources of textual data, but they are
primarily academic or research oriented. Education Abstracts will let you search in
magazines and periodicals, while ERIC will let you search in reports and PhD theses.
11 https://assets.publishing.service.gov.uk/government/uploads/system/uploads/
attachment_data/file/799349/International_Education_Strategy_Accessible.pdf
12 https://enz.govt.nz/assets/Uploads/International-Education-Strategy-2018-2030.pdf
13 Martin Weisser maintains a website with detailed information on tagging solutions for
different languages; http://martinweisser.org/corpora_site/taggers.html
14 https://nlp.stanford.edu/software/tagger.shtml
15 http://ucrel-api.lancaster.ac.uk/claws/free.html
16 You can find a description of the Sketch English adaptation of Tree Tagger at www.
sketchengine.eu/english-treetagger-pipeline-2/ and CLAWS tagset 7 at http://ucrel.
lancs.ac.uk/claws7tags.html
17 https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
18 www.laurenceanthony.net/software/tagant/
19 The UK’s Department for International Trade (DIT) helps businesses export, drives
inward and outward investment, negotiates market access and trade deals, and
champions free trade.
20 https://ucrel-wmatrix4.lancaster.ac.uk/wmatrix4.html
References
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University Press.
Brezina, V. (2018). Statistics in corpus linguistics. Cambridge: Cambridge University Press.
Clancy, B. (2010). Building a corpus to represent a variety of a language. In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge, 80–92.
Cohen, L., Manion, L. & Morrison, K. (2011). Research methods in education. London:
Taylor & Francis.
Department for Education and Department for International Trade [DfE & DIT] (2019).
International education strategy: global potential, global growth. London: DfE & DIT.
Fenech, M. & Wilkins, D.P. (2019). The representation of the national quality framework
in the Australian print media: silences and slants in the mediatisation of early childhood
education policy. Journal of Education Policy, 34(6), 748–770.
Nelson, M. (2010). Building a written corpus: What are the basics? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge, 53–65.
New Zealand Government. (2018). International education strategy 2018–2030. Wellington:
New Zealand Government.
Pinto, L.E. (2013). When politics trumps evidence: Financial literacy education narratives
following the global financial crisis. Journal of Education Policy, 28(1), 95–120.
Rayson, P. (2005). Wmatrix. Lancaster University. www.comp.lancs.ac.uk/ucrel/wmatrix.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus
Linguistics, 13(4), 519–549.
Reppen, R. (2010). Building a corpus: what are the key considerations? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge, 31–37.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Chapter 5
Interview data
Transcription and annotation
A benefit of written interviews is that you can avoid the monotonous tran-
scription process […]. It is important to clearly label and mark up your tran-
script so that it is in a form that is easy to analyse—for example, the use of
bold and italic fonts consistently (e.g., to mark when the researcher is talking,
or when the participant emphasised something). By going the extra mile
during data preparation, the process of data analysis and interpretation is
made easier. Transcription is a tedious process and ample time for it should
be built into the research design. […] If you’re applying for funding for your
research, you may consider budgeting for a transcriber.
(Leavy, 2017: 142)
Reppen (2010) argues that, depending on the level of detail included in the tran-
scription, it may take up to 15 hours to transcribe and annotate one hour of
spoken language. This is certainly a laborious activity. King, Horrocks & Brooks
(2019: 46) in their second edition of Interviews in Qualitative Research discuss tran-
scription as a ‘demanding task […] often contracted out to people with the essen-
tial skills […] realistically time constraints may mean you need to employ others to
do this task’. Note that the implication here may be that transcriptions are pretty
straightforward to do and, most likely, do not need any type of data-sensitive
coding or markup.
Cohen, Manion & Morrison (2018) discuss transcription in more depth than
authors of other educational research handbooks. They acknowledge conflicting
views on how to conduct the interview and the role of the interviewer/researcher.
They argue that ‘the problem with much transcription is that it becomes solely
a record of data rather than a record of a social encounter’. They caution ‘the
researcher against believing that they [can] catch everything that happened in
the interview’ and suggest a list of data that happen in an interview and which,
depending on the research aim, need to be recorded in a transcript:
The issue here is that it is often inadequate to transcribe only spoken words;
other data are important. Of course, as soon as other data are noted, this
becomes a matter of interpretation (what is a long pause, what is a short
pause, was the respondent happy or was it just a ‘front’, what gave rise to
such-and-such a question or response, why did the speaker suddenly burst
into tears?). […] interviewees’ statements are not simply collected by the
interviewer, they are, in reality, co-authored.
(Cohen, Manion & Morrison, 2018: 524)
From a linguistic perspective, there are at least two main types of transcription:1
orthographic and prosodic. Orthographic transcription renders spoken data using
standard orthographic conventions, which, despite being the most straightforward
type of transcription, brings up all sorts of challenges, especially if what is being
transcribed is not a monologue and the language shows a high degree of involve-
ment and interaction. Orthographic transcription of spoken language involves an
act of interpretation that needs to be acknowledged and reflected upon. Prosodic
transcription adds prosodic marking to orthographic transcripts (i.e. intonation).
Reppen (2010: 33–35) has put together a set of issues that researchers need to
address before transcribing language:
• How will reduced forms (e.g. wanna, gonna, cuz) be transcribed? Complete
form? Reduced form? Double coding?
• What will be transcribed when it is difficult to understand what was said?
• How will overlapping speech be treated?
• How will conversational facilitators (uh, mmm, hum, etc.) be transcribed?
• How will repetitions (I I I I I I I don’t think) be treated?
• What about pauses? Will pauses be ‘transcribed’? Will they be timed?
• What about laughter? Shall we transcribe this?
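One practical way to keep such decisions explicit, and shareable across transcribers, is to record them in a small machine-readable file kept alongside the corpus. The keys and values below are illustrative assumptions for a hypothetical project, not a prescribed scheme:

```python
import json

# Hypothetical transcription guideline record; every value is a project
# decision, not a standard. One answer per question in Reppen's checklist.
guidelines = {
    "reduced_forms": "double",   # keep "gonna" and also record "going to"
    "unclear_speech": "xxx",     # unintelligible stretches marked as xxx
    "overlaps": "bracketed",     # overlapping speech enclosed in brackets
    "facilitators": "keep",      # uh, mmm, hum transcribed as heard
    "repetitions": "keep",       # "I I I don't think" transcribed verbatim
    "pauses": "timed",           # pauses transcribed and timed
    "laughter": "mark",          # laughter noted with a dedicated mark
}

# Printing (or saving) the record keeps every transcriber working to the
# same rules across the whole dataset.
print(json.dumps(guidelines, indent=2))
```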
These decisions will impact how the linguistic data will be treated when uploaded
either to Sketch Engine or AntConc. When transcribing an interview or a focus
group, we need to keep a record of these decisions. In due time, this record or
set of guidelines will help us understand our own interpretation of the findings
against the backdrop of those decisions. Consider, for example, the number of
times personal pronouns are repeated during a conversation. If we decide not
to reflect these repetitions in the orthographic transcription, we are deliberately
adopting an approach that privileges an oversimplified rendering of
spoken data over the complexity of spoken communication. If, on the other hand,
we decide to represent spoken disfluency phenomena such as repetition and hesi-
tation, we are adding extra complexity to our analysis.
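The consequence of this decision is easy to see with a toy example built on Reppen's repetition case: keeping or collapsing repetitions changes raw frequency counts, and therefore any statistic built on them. This is a minimal sketch with an invented utterance:

```python
from collections import Counter

# Verbatim rendering of a repeated-pronoun utterance.
verbatim = "I I I I I I I don't think".lower().split()

# The same utterance with immediate repetitions collapsed to one token.
collapsed = []
for token in verbatim:
    if not collapsed or collapsed[-1] != token:
        collapsed.append(token)

print(Counter(verbatim)["i"])   # 7 occurrences of "I" kept verbatim
print(Counter(collapsed)["i"])  # 1 occurrence once repetitions are removed
```

The frequency of *I* differs by a factor of seven between the two renderings, which is exactly the kind of distortion the transcription guidelines need to anticipate.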
The Child Language Data Exchange System (CHILDES) project2 has
developed over the decades essential know-how to approach the transcription of
spoken language data in a robust way. Although their focus is the emergence and
use of language in children, the tools and standards developed over the
years are of interest to anyone thinking about transcribing spoken data. Brian
MacWhinney, the original researcher behind CHILDES, distinguishes between
transcription and coding. In his view, the former is the production of a written
record that tries to represent, quite often unsuccessfully, the original spoken
interaction. Coding, however, is much more than that.

Figure 5.1 Recording (interviews, focus groups, etc.) – Transcription (level of granularity) – Coding (ad hoc coding, markup); .txt files are the working format at each stage
Researchers need to work out how they want to approach their transcription pro-
ject, what phenomena they want to code or annotate, and how they want to do
this within the context of their research methodology.
5.2 Transcription basics
Before starting our transcription, we need to plan what needs to be transcribed
and coded. The process is summarised in Figure 5.1. Each stage (Recording –
Transcription – Coding) requires careful planning and decisions that will
impact the final data format of the interviews used in our analysis. We
suggest that researchers work with plain text (.txt) files as they offer maximum
flexibility and guarantee compatibility with a wide range of analysis tools.
Transcription plays a central role in research methodologies that use interviews
or other types of spoken data to gain access to the experiences and opinions of
informants. Bailey (2008) has pointed out that, unfortunately, the role of the
transcriber in qualitative research is often neglected.3
Transcribing is an essential part of the research flow and those in this role need
a proper understanding of the implications of achieving robust transcriptions
that are consistent throughout the entire dataset. This is particularly important
if more than one transcriber is involved and, as is often the case, transcription
is subcontracted. If our entire dataset has not been transcribed using the same
guidelines in a consistent way, the validity of our findings will be compromised.
Similarly, if we have not coded our interviews, it will be impossible to perform
sophisticated searches on our data.
There are different desktop solutions that can help us with our transcription
project. Inscribe4 and the EXMARaLDA Partitur Editor5, for instance, are
cross-platform tools for transcribing and annotating digital audio and video files.
Inscribe is commercial software that can be used with a USB foot pedal to control
media playback. Subtitled video files can be exported, which is a great feature across
different educational projects. With the EXMARaLDA Partitur Editor, digital
audio or video recordings can be transcribed and aligned (transcription and multi-
media), a great feature for most research projects. Partitur is not only freeware; it
can also process transcriptions according to different transcription conventions and
styles (HIAT, CHAT), output transcription data in different layouts (score format
or line-for-line) and in different document formats (HTML, MS Word) and with
multimedia links to audio or video. Transcription data can be exchanged with other
systems such as Praat6 or ELAN7. Partitur can be used with the EXMARaLDA
Corpus-Manager (Coma), a tool that allows us to link EXMARaLDA transcriptions
with metadata in order to compile them into corpora. There are many other tools
that can be of use. The UAM Corpus Tool8 is a great freeware solution if we
already have a transcription and wish to annotate our data and create our own
annotation layers or taxonomies. FoLiA is an XML-based annotation format for
the representation of linguistically annotated language resources. Its developers
note that their aim is to ‘introduce a single rich format that can accommodate a
wide variety of linguistic annotation types through a single generalised paradigm.
We do not commit to any label set, language or linguistic theory’.9 The FoLiA
developers have created resources10 that can be useful for understanding different annotation types
and XML-related tags and structures.
There are plenty of web services that can help us with our transcription, ran-
ging from the pretty basic but essential functionalities of Otranscribe11 to the
more complex and automated functionalities of Transcribe,12 which can automat-
ically transcribe audio files that, eventually, will have to be edited and corrected
by a human transcriber. This subscription web service gives you the chance to
work in a web browser, slow down your audio, use a foot pedal and
define your own acronyms for frequently used words and phrases, which will be
expanded to their full form as you type along. One of the advantages of these
services is that you can use a wider range of devices (e.g. tablets) to carry out the
transcription. Brat13 is a great option if you have some basic server infrastructure
and basic natural language processing (NLP) expertise or support. Annotating is
easy and intuitive: very much a drag-and-drop experience.
As for orthographic transcription, there are numerous transcription guidelines
available that we can use as a starting point. The Louvain International Database
of Spoken English Interlanguage (LINDSEI) (Gilquin, De Cock & Granger, 2010)
contains oral data from advanced learners of English from several mother tongue
backgrounds. Each L1 group (French, German, Dutch, Spanish, etc.) includes
50 interviews made up of three tasks: a set topic, a free discussion and a picture
description. The LINDSEI guidelines14 are an excellent way to start a conversation
about what needs to be considered even before we transcribe the first word
of an interview. These guidelines contain elements from both the transcription as
well as the markup stages in Figure 5.1. Table 5.1 shows a summary of some of
the areas that need attention and how LINDSEI researchers actually proceeded.
Qualitative research experts such as Silverman (1993) and King, Horrocks &
Brooks (2019) have devised transcription systems that can offer researchers a con-
sistent way to transcribe interviews. King, Horrocks & Brooks (2019: 194) point
out that there is ‘less standardisation among the simplest forms of transcription’,
which is only natural as transcription systems tend to be project-driven and,
accordingly, unique in different ways. For these authors, it is crucial that researchers
use a transcription system that captures every aspect of speech ‘that might indi-
cate something about the way verbal interaction operates and what it achieves’.
They suggest the following:
corpus as this approach will very likely distort the contents of the text in terms
of their analysis. A corpus annotated with regular brackets and text will contain
words that are not part of the corpus strictly speaking, so the resulting text will
not only be larger than it should be, but it may also compromise the consistency of
the data. On top of this, it will not be straightforward to account for these
annotations in the overall analysis of the corpus. Certainly, it
will be very difficult to do in Sketch Engine, and in AntConc it will require setting
brackets as default tag symbols. Using regular tags (<tag>) will do the job across
the board more cleanly.
The tagged annotations discussed above can be tracked down in AntConc
effortlessly. For example, if we want to search for the foreign words in the Spanish
Lou Burnard, one of the editors of the Text Encoding Initiative (TEI) (see Table
5.3) and responsible for the digital infrastructure of the British National Corpus,
highlights the role of metadata in corpus linguistics research.
Burnard has in mind here a big corpus of language such as the BNC (100 million
words), which contains thousands of sources, writers, contexts of use and so
on. While most educational researchers will not necessarily need a large corpus
and a sophisticated corpus design, it is a fact that a well-annotated corpus will
increase the efficiency of the searches and consequently the validity of their
findings.
If your research project involves different interviews, it is recommended that you
keep them in separate files. Having all our interviews in individual files, instead
of one single large file, will provide us with the opportunity to perform complex
searches and know more about the distribution of certain features across the
corpus (see chapter 4). Corpus management tools such as Sketch Engine16 can
recognise structures and parts in a corpus:
[…] a corpus has to be equipped with marks or labels indicating the beginnings
and ends of such parts. These marks or labels are called structure tags and the
parts of a corpus they mark are called structures. The most typical parts are
files, paragraphs and sentences.
(www.sketchengine.eu)
Corpus management software generally does not prescribe (and neither does
Sketch Engine) what structures should be included in the corpus and what
they should look like. It is, however, advisable to include at least the basic set
of structures.
Sketch Engine will add a basic structure to our files. A file will become a <doc>,
paragraphs will be enclosed in <p></p> elements, while sentences will be
annotated as <s></s>. However, we can modify or improve this structure. The
way we can do this is by using angle brackets where the closing tag must have a
slash. This is the point where we, as researchers, need to translate our research
questions to effective coding that can increase the efficiency of our searches in
the corpus. Typically, these will be related to our dependent and independent
variables. The good news is that Sketch Engine converts all metadata to ‘Text
Types’. Let us see how this works. For example, CHILDES corpora are made
up of transcripts of child language from spontaneous conversational interactions.
The speakers involved are young children speaking with their parents or family.
The corpus is annotated to reflect the following features:
• sex
• L1
• languages spoken during the interaction
• age group
• participant role
• date.
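A filtered search of this kind can be approximated outside Sketch Engine as well. The sketch below builds a toy two-document corpus annotated at the `<doc>` level; the attribute names (`sex`, `age_group`, `role`) and the sentence content are illustrative assumptions, not the official CHILDES or Sketch Engine scheme:

```python
import xml.etree.ElementTree as ET

# A toy corpus of two transcripts with doc-level metadata attributes.
# Attribute names and utterances are invented for illustration only.
corpus = """<corpus>
<doc sex="f" age_group="4-6" role="mother">
<s>you never take this cat to school</s>
</doc>
<doc sex="m" age_group="7-9" role="father">
<s>shall we talk about it</s>
</doc>
</corpus>"""

root = ET.fromstring(corpus)
# Keep only documents whose metadata matches the search criteria,
# mimicking a Text Types filter on age group and participant role.
hits = [s.text
        for doc in root.findall("doc")
        if doc.get("age_group") == "4-6" and doc.get("role") == "mother"
        for s in doc.findall("s")]
print(hits)  # ['you never take this cat to school']
```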
This means we could search for all occurrences of, for example, the second person
pronoun you in conversations with children between four and six years of age involving
mothers, fathers, carers, etc. Each utterance is inserted in a <s></s> structure.
The following is taken from one of those conversations where the verb talk is used
(there are over 12,000 uses of talk within the frame of our search):
Note that in the above transcription dots stand for pauses and xxx stands for
unintelligible speech,18 but this annotation can be redefined: you are free to
define your own transcription conventions as you deem appropriate.
Adding metadata to our transcriptions is no longer optional. Before the spread
of electronic textual data, metadata was typically kept separate in reference
manuals and was not generally included as part of the transcription. Including
metadata in electronic format is not only good practice, but an essential practice to
exchange and distribute our data and maximise the opportunities to interact with
our dataset in digital research contexts. Depending on the scope of our research
project and the resources available, we may want to adopt different strategies
towards annotation. For big projects with plenty of resources and support, it is
necessary to develop an annotation strategy that makes use of encoding standards.
However, finding the right strategy and coding system will take time and some
trial and error. Scott & Tribble (2006) have noted that there is no such thing
as a perfect markup system that caters for all researchers.
We need to start somewhere, though. The Text Encoding Initiative (TEI) guidelines
are one of the standards that should be considered by researchers who work
with relatively large and complex datasets. We will explore TEI in the rest of
the chapter. If this is of interest to your project, do not hesitate to read more about
it at https://tei-c.org. The TEI guidelines19 suggest that an electronic text should
include the following metadata:20
This information is included in the TEI header before the transcription itself. The
following is a minimal header structure recommended by the TEI consortium
taken from their official website:21
<teiHeader>
<fileDesc>
<titleStmt>
<title>A Title is given here</title>
<respStmt>
<name>A name is given here</name>
</respStmt>
</titleStmt>
<publicationStmt>
<distributor>This can be your institution or you</distributor>
</publicationStmt>
<sourceDesc>A description of the source of the data</sourceDesc>
</fileDesc>
</teiHeader>
And Table 5.4 contains some of the main tags from the example above.
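Once a header like this is in place, its metadata can also be read programmatically when cataloguing interviews. The sketch below uses Python's standard library; a simplified, well-formed version of the header is embedded in the script for illustration:

```python
import xml.etree.ElementTree as ET

# A simplified TEI header, embedded so the example is self-contained.
header = """<teiHeader>
<fileDesc>
<titleStmt>
<title>A Title is given here</title>
<respStmt><name>A name is given here</name></respStmt>
</titleStmt>
<publicationStmt>
<distributor>This can be your institution or you</distributor>
</publicationStmt>
<sourceDesc>A description of the source of the data</sourceDesc>
</fileDesc>
</teiHeader>"""

root = ET.fromstring(header)
# Pull out the metadata fields we may want to index interviews by.
title = root.findtext("fileDesc/titleStmt/title")
distributor = root.findtext("fileDesc/publicationStmt/distributor")
print(title)        # A Title is given here
print(distributor)  # This can be your institution or you
```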
Note that it is not compulsory to use all the tags that can potentially be used
in the header: only those that are necessary for (a) the description of the file,22
(b) the distribution of the resource and (c) the coding of features that will be
used when querying the data. For example, in
the title statement (<titleStmt>) we may specify, among others, the following:
More information on these and other elements and how to use them can be found
at https://tei-c.org/guidelines/p5/23
Some TEI elements are unique to spoken texts.24 The most important is <u>, a
spoken element analogous to a paragraph in written texts. U stands for utterance,
and <u></u> encloses speech usually preceded and followed by silence or by
a change of speaker. Look at this transcription from the tei-c.org website25 and
annotated using TEI guidelines:
<u who="#mar">you never <pause/> take this cat for show and tell
<pause/> meow meow</u>
<incident>
<desc>toy cat has bell in tail which continues to make a tinkling sound</desc>
</incident>
<vocal who="#mar">
<desc>meows</desc>
</vocal>
Let us go deeper into some of the coding used for this transcription. Every <u>
</u> element contains an utterance said by a speaker who is also identified
in the transcription, for example:
We know that it was Ros who spoke this thanks to the who attribute in the
opening <u> tag. Note that her words follow the who attribute and are transcribed
before the closing element </u>. We can use <incident> elements to describe
what is going on during the interview:
<incident>
<desc>toy cat has bell in tail which continues to make a tinkling
sound</desc>
</incident>
The <desc> </desc> tags enclose this description. While the example may
seem irrelevant to your research, a taxonomy of such incidents developed in
the context of your research project may help you locate important infor-
mation about the circumstances in which the interviews or the focus groups
took place. The elements <kinesic></kinesic> enclose gestures, frowning,
nodding, etc. The elements <vocal></vocal> enclose non-lexical vocalisations
such as whistles. These are interesting if we want to examine language and non-
verbal communication.
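Annotations of this kind can later be exploited programmatically. A minimal sketch using Python's standard library, with the spoken extract from above wrapped in a root element purely so that it parses as a fragment:

```python
import xml.etree.ElementTree as ET

# The spoken extract from above, wrapped in a <div> so it parses.
extract = """<div>
<u who="#mar">you never <pause/> take this cat for show and tell
<pause/> meow meow</u>
<incident>
<desc>toy cat has bell in tail which continues to make a tinkling sound</desc>
</incident>
<vocal who="#mar">
<desc>meows</desc>
</vocal>
</div>"""

root = ET.fromstring(extract)
# Collect all non-verbal events (incidents and vocalisations) with their
# descriptions, ready for a taxonomy-based analysis.
events = [(e.tag, e.findtext("desc"))
          for e in root if e.tag in ("incident", "vocal")]
for tag, desc in events:
    print(tag, "->", desc)
```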
The <choice></choice> elements are useful for recording both the exact words as
heard on the audio or video file and the regularised form of the word(s) in
conventional spelling. This is an example:
<choice>
<orig>bout</orig>
<reg>about</reg>
</choice>
This is a more efficient way of representing such variation than ad hoc devices
like capital letters, as we can build complex searches that include both lexical
items and annotation. Another interesting part of the code provided above is the
declaration of the speakers involved.
<listPerson>
<person xml:id="ros">
<!-- ... -->
</person>
<person xml:id="fat">
<!-- ... -->
</person>
</listPerson>
As exemplified above, when transcribing our data we can use the attribute
who="#ros" to identify the speaker.
While the purpose of this book is not to provide a detailed account of how to
transcribe spoken language using TEI guidelines, an appreciation of its useful-
ness and possibilities may open up new ways to look at how transcriptions can
support our insight into the language used during interviews. As for our own
annotation of the corpus, we have different options at our disposal: (1) we can
use annotation tags straightaway in our transcription or (2) we can develop an
annotation taxonomy that will be formally declared using TEI elements. The
former is quick and useful in terms of our searches; the latter is more time-
consuming, but it will offer us more sophisticated search options. Let us have a
look at them.
style but of their understanding and diagnosis of the school’s needs and their
application of clearly articulated, organisationally shared educational values
through multiple combinations and accumulations of time and context sen-
sitive strategies that are ‘layered’ and progressively embedded in the school’s
work, culture, and achievements.
(Day, Gu & Sammons, 2016: 222)
The authors stress that mixed-methods research designs ‘provide finer grained,
more nuanced evidence based understandings of the leadership roles and
behaviours of principals who achieve and sustain educational outcomes in
schools than single lens quantitative analyses, meta-analyses, or purely qualitative
approaches’ (Day, Gu & Sammons, 2016: 222). In this context, the use of corpus
linguistics methods can inform our understanding of the stakeholders’ positioning
towards leadership, improvement and, among others, success in schools. We have
selected one of the interviews from the dataset in order to illustrate how we can
use interview data in corpus analysis. In total, the whole project is made up of
68 interviews from four schools. First, we will describe the role of each of the
participants in the interview. We will use the <pers> element and the attribute role
in the following way:
Note that the words used by the speaker appear immediately before the closing tag
</pers>. Now, we will do the same with the interviewee. The following is just an
extract from the reply to the opening question:
As above, the interviewee’s reply appears before the closing </pers> tag. Now we
are ready to add some annotation to the text. Depending on their research methods
(theme analysis, content analysis, etc.), the researchers will come up with a different
tagset or taxonomy. In the example below we just show how we could assign a
simple, non-hierarchical annotation scheme to this ICT leader’s words. Just for
illustration purposes, we have identified the following themes in the interview:
• appraisal
• action
• impact
• personal evaluation
• risks involved.
Each theme is annotated with its own pair of tags:
• <Appraisal> </Appraisal>
• <Action> </Action>
• <Impact> </Impact>
• <PersonalEvaluation> </PersonalEvaluation>
• <RisksInvolved> </RisksInvolved>
This is what the interview extract will look like once annotated:
Note that in the annotation above we have adopted the sentence as our unit of
annotation, although sometimes we have included more than one sentence in
some of the annotations. Some sentences have no annotation at all, though. Also
note that some sentences contain more than one annotation:
Make sure the closing tags are inserted following the general rule that the first tag
opened is the last one to be closed.
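This is the usual XML well-formedness rule, and a quick way to check an annotated extract is simply to parse it. A minimal sketch with invented sentence content, using the document's own Appraisal and Action tags:

```python
import xml.etree.ElementTree as ET

# Correct nesting: <Appraisal> opens first, so it closes last.
good = "<s><Appraisal><Action>we took a decision</Action></Appraisal></s>"
# Crossed nesting: </Appraisal> appears before <Action> is closed.
bad = "<s><Appraisal><Action>we took a decision</Appraisal></Action></s>"

def well_formed(fragment):
    # Return True if the fragment parses as XML, False otherwise.
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

Running such a check over every transcript before uploading it will catch crossed or unclosed tags early, while they are still easy to fix.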
Once we have uploaded our text to Sketch Engine, we will see that the only
structures that can be searched from the Text Types dialogue box are those
annotated in the <doc> element, so there is no trace of our annotation there. We
will look at this in the next section. What we can do, however, is search within
the tags. For example:
• [lemma="we"] within <Appraisal/> will return all the instances where the
speaker uses we. We could search for [lemma="I"] within <Appraisal/> and
[lemma="they"] within <Appraisal/> to see how different personal pronouns
are used to express agency in the sentences annotated as conveying appraisal.
• Compare the results obtained when we search for [lemma="performance"]
in the entire text or corpus and for [lemma="performance"] within
<PersonalEvaluation/>.
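Outside Sketch Engine, the same "within" logic can be approximated directly on the annotated .txt files. The rough sketch below uses Python's re module on a made-up annotated sentence; note that a real CQL query also matches inflected forms via lemmatisation, which this word-form sketch does not attempt:

```python
import re

# An invented annotated stretch of interview text.
text = ("<Appraisal>we reviewed our performance and we set new targets"
        "</Appraisal> outside the tag we also met "
        "<PersonalEvaluation>I think our performance improved"
        "</PersonalEvaluation>")

def within(tag, word, annotated):
    # Collect the stretches enclosed by <tag>...</tag>, then count the word
    # inside those stretches only (exact word forms, not lemmas).
    spans = re.findall(rf"<{tag}>(.*?)</{tag}>", annotated, flags=re.S)
    return sum(len(re.findall(rf"\b{re.escape(word)}\b", span))
               for span in spans)

print(within("Appraisal", "we", text))                    # 2
print(within("PersonalEvaluation", "performance", text))  # 1
print(within("Appraisal", "performance", text))           # 1
```

The occurrence of we outside the Appraisal tags is deliberately ignored, which is exactly what the "within" restriction buys us.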
This search power will become more and more evident as we gather more
interviews in our corpus and we combine these searches with the values of
the attributes identified in the Text Types. For example, we could combine
[lemma="performance"] within <PersonalEvaluation/> with the four schools
featured in Gu (2015) and the role of the person being interviewed. In the next
section we will show how to use our annotation in a TEI scheme.
These [labels] are called metadata or structure attributes. For example, the
structure might carry information about the year of publication, the genre,
the dialect, the style, author, source, simply anything that the author wants
to include. If the corpus author is unsure as to what to include, it is best
to include all available metadata. A corpus without metadata can still be
used for many tasks but metadata (information about the text) cannot be
used in searches or analysis. Metadata (or structure values) are automatically
processed by Sketch Engine into text types, making it easy to set search cri-
teria or build subcorpora from texts belonging to the same category.
(www.sketchengine.eu/corpus-annotation-and-structures/)
In practical terms, this means that we can add structure information and meta-
data to our transcription so as to improve the depth of granularity of our searches.
We can do this at the document level (in other words, at the individual interview
level) by inserting just a few attributes and values. Examine the following.
In Figure 5.4 we have declared (i.e. we have inserted) some attributes to the
interview (<doc>) in question. We have specified the year when it was conducted
(Year), the language in which it was conducted (Lang), the name of the school
where the interview took place (School), the country where it was conducted
(Country) and the role of the person interviewed (StakeholderRole). This may
not seem terribly interesting if you just happen to be working with one or two
interviews, but it is absolutely essential when you have 68 interviews to analyse.
We can also specify whether the text was uttered by the interviewer or by the inter-
viewee by defining different roles, as shown in Figure 5.5.
By annotating attributes and values like those in Figures 5.4 and 5.5, we can
enable more sophisticated searches on Sketch Engine, filtering our searches
through the attributes and values specified during the transcription and anno-
tation of the interviews. For example, we can now obtain the parts of the inter-
view that were contributed by the different roles involved, across different schools,
years, countries and languages used. Note that when some of these categories are
filtered out, we will obtain more focused and fine-grained results, allowing us the
opportunity to explore independent variables such as school types, countries or
management roles. And this is the really interesting bit: we can create subcorpora
that can be compared and analysed.
We suggest developing a basic TEI transcription template for spoken data based
on Pérez-Paredes & Alcaraz-Calero (2009) that can be used to gather the meta-
data for each and every interview of our corpus. Pérez-Paredes & Alcaraz-Calero
(2009: 68) have noted that TEI offers ‘extensibility, interoperability and standard-
ization, three characteristics […] of the utmost importance for the re-usability of
our annotated corpora’. The proposed template is divided in two parts: Metadata
and Body. In the former we include relevant metadata, while in the body section
we provide the transcription. This is what the template looks like:
Figure 5.5 Adding person roles
This template can be used to draft all the necessary information before it is
actually coded, which will make both our transcription and coding more system-
atic. This is what the actual TEI coding may look like:26
<teiCorpus>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Colebrook-ICT Middle Leader-Interview 1</title>
<author xml:id="Initials">Name</author>
<respStmt>
<name xml:id="Initials">Name</name>
<resp>transcription</resp>
<resp>annotation</resp>
</respStmt>
<sponsor>Name of sponsor</sponsor>
<funder>
<address>
<addrLine>Address line 1</addrLine>
<addrLine>Address line 2</addrLine>
</address>
<email>email here</email>
</funder>
</titleStmt>
<publicationStmt>
<publisher>Name of the publisher, usually the institution</publisher>
<distributor>Name of the distributor, either the institution or a repository
name</distributor>
<availability status="free">
<p>Published under a <ref target="http://creativecommons.org/
licenses/by-sa/3.0/">Creative Commons Attribution ShareAlike 3.0
License</ref>.</p>
</availability>
<date when="2014-01-01">1 January 2014</date>
</publicationStmt>
<sourceDesc>
<p>A description of the source of the data</p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<editorialDecl>
<normalization method="markup" source="http://www.oed.com/">
<p>Spelling has been modernised using the <gi>orig</gi>/<gi>reg
</gi> elements, wrapped in a <gi>choice</gi> element.</p>
</normalization>
<interpretation>
<p>Thematic annotation follows the SchoolInterviews taxonomy declared
below</p>
</interpretation>
</editorialDecl>
<classDecl>
<taxonomy xml:id="SchoolInterviews">
<category xml:id="SchoolInterviews.Appraisal">
<catDesc>Annotating appraisal</catDesc>
<category xml:id="SchoolInterviews.Appraisal.one">
<catDesc>Appraisal feature 1</catDesc>
</category>
<category xml:id="SchoolInterviews.Appraisal.two">
<catDesc>Appraisal feature 2</catDesc>
<category xml:id="SchoolInterviews.Appraisal.two.a">
<catDesc>Appraisal feature 2 type A</catDesc>
</category>
<category xml:id="SchoolInterviews.Appraisal.two.b">
<catDesc>Appraisal feature 2 type B</catDesc>
</category>
</category>
</category>
<category xml:id="SchoolInterviews.Action">
<catDesc>Annotating Action</catDesc>
</category>
<category xml:id="SchoolInterviews.Impact">
<catDesc>Annotating Impact</catDesc>
</category>
</taxonomy>
</classDecl>
</encodingDesc>
<revisionDesc>
<change when="2016-01-01" who="#Initials">What was done</change>
<change when="2015-01-01" who="#Initials">What was done</change>
</revisionDesc>
</teiHeader>
<text>
<!-- Transcription and annotation here -->
</text>
</teiCorpus>
Alternatively, we can choose the relevant attributes and include them
in the <doc attribute="value"> string. The classification schemes used must
be defined in the <classDecl> subsection of the encoding description in the
header. Each classification scheme should be identified by means of the xml:id
attribute of a <taxonomy> element. Such taxonomy declarations can define
their own classification categories inside specific <category> elements. The
category descriptions describe the category in a <catDesc> element. We can
use the <taxonomy> element and develop classification categories that can be
defined in separate <category> elements, each with their own xml:id code.
As is the norm in XML, the category is described in a <catDesc> element.
A great advantage is that classification categories can be nested, which means
that we can develop a hierarchical classification system in no time. Let us con-
sider the following coding. We have defined an annotation taxonomy called
SchoolInterviews that includes, just for illustration purposes, three subcat-
egories: Appraisal, Action and Impact.
<taxonomy xml:id="SchoolInterviews">
<category xml:id="SchoolInterviews.Appraisal">
<catDesc>Annotating appraisal</catDesc>
<category xml:id="SchoolInterviews.Appraisal.one">
<catDesc>Appraisal feature 1</catDesc>
</category>
<category xml:id="SchoolInterviews.Appraisal.two">
<catDesc>Appraisal feature 2</catDesc>
<category xml:id="SchoolInterviews.Appraisal.two.a">
<catDesc>Appraisal feature 2 type A</catDesc>
</category>
<category xml:id="SchoolInterviews.Appraisal.two.b">
<catDesc>Appraisal feature 2 type B</catDesc>
</category>
</category>
</category>
<category xml:id="SchoolInterviews.Action">
<catDesc>Annotating Action</catDesc>
</category>
<category xml:id="SchoolInterviews.Impact">
<catDesc>Annotating Impact</catDesc>
</category>
</taxonomy>
Note that the hierarchy in the example above is taxonomy > category >
subcategory > sub-subcategory. So the taxonomy has one major category that
includes different subcategories or labels that can be used to describe, annotate,
code or classify relevant parts of the interview. These annotated sections can be
searched, retrieved and analysed in an efficient way. Table 5.5 shows a breakdown
of the annotation taxonomy and Table 5.6 summarises how to use your own tax-
onomy to annotate and query your data.
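Once declared, such a taxonomy can also be read programmatically to list the labels available to annotators. A minimal sketch with Python's standard library, run on a trimmed, well-formed version of the SchoolInterviews taxonomy:

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the SchoolInterviews taxonomy.
taxonomy = """<taxonomy xml:id="SchoolInterviews">
<category xml:id="SchoolInterviews.Appraisal">
<catDesc>Annotating appraisal</catDesc>
<category xml:id="SchoolInterviews.Appraisal.one">
<catDesc>Appraisal feature 1</catDesc>
</category>
</category>
<category xml:id="SchoolInterviews.Action">
<catDesc>Annotating Action</catDesc>
</category>
</taxonomy>"""

# xml:id lives in the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(taxonomy)
# Walk every category, nested or not, and pair its id with its description.
labels = [(cat.get(XML_ID), cat.findtext("catDesc"))
          for cat in root.iter("category")]
for cat_id, desc in labels:
    print(cat_id, "-", desc)
```

A listing like this doubles as a codebook for the annotation team, since every label in use comes with its catDesc description.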
Table 5.5 Annotation taxonomy
Notes
1 There are many more types of annotation: syntactic annotation (parsing), semantic
annotation, pragmatic annotation, discourse annotation, stylistic annotation, lexical
annotation, etc. Parsing and semantic annotation can be carried out automatically.
2 https://childes.talkbank.org/
3 She also argues that the limited space in research journals excludes a more in-depth dis-
cussion of the transcription decisions adopted. We wonder whether this is also having
a negative impact on younger researchers who very rarely see that transcription itself is
part of the data gathering and analysis process.
4 www.inqscribe.com
5 https://exmaralda.org/en/release-version/
6 www.fon.hum.uva.nl/praat/
7 https://tla.mpi.nl/tools/tla-tools/elan/
8 www.corpustool.com
9 https://proycon.github.io/folia/
10 https://folia.readthedocs.io/en/latest/introduction.html
11 https://otranscribe.com/
12 https://transcribe.wreally.com/
13 http://brat.nlplab.org/
14 https://uclouvain.be/en/research-institutes/ilc/cecl/transcription-guidelines.html
15 http://users.ox.ac.uk/~lou/wip/metadata.html
16 www.sketchengine.eu/corpus-annotation-and-structures/
17 www.sketchengine.eu/corpus-annotation-and-structures/
18 https://talkbank.org/manuals/CHAT.pdf
19 Recent TEI guidelines suggest the use of a container element (xenoData) if metadata
from non-TEI schemes is used in the document.
20 http://users.ox.ac.uk/~lou/wip/metadata.html
21 www.tei-c.org/release/doc/tei-p5-doc/en/html/examples-teiHeader.html
22 The <fileDesc> element is compulsory and so are <titleStmt>, <publicationStmt>,
and <sourceDesc>.
23 At the time of writing, the latest TEI guidelines version was 3.6.0, updated 16
June 2019.
24 https://tei-c.org/Vault/P5/3.6.0/doc/tei-p5-doc/en/html/TS.html
25 https://tei-c.org/Vault/P5/3.6.0/doc/tei-p5-doc/en/html/TS.html
26 This is based on the guidelines at: https://teibyexample.org/
27 See the BERA Ethical Guidelines for Educational Research: www.bera.ac.uk/publication/ethical-guidelines-for-educational-research-2018
References
Bailey, J. (2008). First steps in qualitative data analysis: transcribing. Family Practice, 25(2),
127–131.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor & Francis.
Day, C., Gu, Q. & Sammons, P. (2016). The impact of leadership on student outcomes: How
successful school leaders use transformational and instructional strategies to make a
difference. Educational Administration Quarterly, 52(2), 221–258.
Gilquin, G., De Cock, S. & Granger, S. (2010). Louvain international database of spoken English
interlanguage (CD-ROM + handbook). Louvain-la-Neuve, BE: Presses Universitaires de
Louvain.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Gu, Q. (2015). Interviews at four secondary case study schools. [data collection]. UK Data
Service. SN: 851579, http://doi.org/10.5255/UKDA-SN-851579
King, N., Horrocks, C. & Brooks, J. (2019). Interviews in qualitative research. 2nd edition.
London: Sage Publishing Company.
Leavy, P. (2017). Research design: Quantitative, qualitative, mixed methods, arts-based, and community-
based participatory research approaches. New York: Guilford Publications.
MacWhinney, B. (2019). Tools for Analyzing Talk. Part 1: The CHAT Transcription Format.
https://childes.talkbank.org/
Pérez-Paredes, P. & Alcaraz-Calero, J. (2009). Developing annotation solutions for online
data driven learning. ReCALL, 21(1).
Reppen, R. (2010). Building a corpus: what are the key considerations? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge, 31–37.
Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Silverman, D. (1993). Interpreting qualitative data: methods for analysing talk, text and interaction.
London: Sage Publishing Company.
Chapter 6
Examining lexis
Analysing peace treaties and
children’s literature
6.1 Examining lexis
In chapter 5, we looked at the role of transcription critically, trying to go beyond
some standard practices that see transcription as a mere recording of the words
used in interviews. This chapter will examine how CL methods can provide us
with insights into how vocabulary is used in a corpus. Building on the discussions
in previous chapters, we will offer useful insights into the vocabulary found in a
corpus and discuss ways in which the lexical component of a corpus can con-
tribute to our understanding of how speakers have used vocabulary in distinct
ways. We will use a corpus of peace treaties to examine the relevance of education
in these treaties and try to illuminate how education and possible related concepts
are conceptualised in these documents. We will also look at children’s fiction and
try to showcase some CL methods that can be used to research textual data.
In this chapter we will first look at keyword analysis as a powerful method that
can reveal aspects of language use that might go unnoticed without the use of stat-
istical inference. As Baker (2004) put it, the examination of keywords can reveal,
among other things, aspects of ideology and embedded discourse:
Keywords […] will direct the researcher to important concepts in a text (in
relation to other texts) that may help to highlight the existence of types of
(embedded) discourse or ideology. Examining how such keywords occur in
context and which grammatical categories they appear in, and looking at
their common patterns of co-occurrence should therefore be revealing.
(Baker, 2004: 347)
that have been set up to describe a resource. For example, when we read a piece
of news in an online paper such as The Guardian, the readers can actually see and
read the URL of the resource and the title1 of the piece on their screens. This
type of information is defined in the <title> of the resource in the following way:
In this case, the author and the name of the publication are also displayed and
can be read by the users. However, an online article is also defined by a set of
keywords2 that have been identified either by the author or the editorial team of
the newspaper in the following way:
These keywords are not immediately visible to readers as their main ‘readers’ are
search engines such as Google or Bing, enabling them to locate this resource and
‘know’ what it is about. Shari Thurow (2010) puts it this way:
Many search engine optimization (SEO) professionals feel that a web page’s
aboutness is communicated simply by keyword repetition. If you use keywords
many times on a web page, then clearly the page is focused on those keyword
phrases, right? I wish it were that simple. First of all, search engines haven’t
measured keyword density as a ranking factor for a very long time. However,
that doesn’t mean that web pages (and graphic images and multimedia files)
shouldn’t contain keywords. Keywords are essential for communicating
aboutness. But keywords should be placed judiciously so that the aboutness of
the page is clear to both search engines and web searchers.
(Thurow, 2010)
So, in the context of search engine optimisation, keywords are meta tags that
describe what the resource is about. The choice of these meta tags is intentional
and carried out by individuals in order to provide a description of a text or a
resource. In this chapter, we are referring to keywords as items that are identified
exclusively through quantitative methods (Scott & Tribble, 2006), and not to the
notion of keywords as used by SEO experts or as used in sociocultural studies as
glossed by Pérez-Paredes (2017):
O’Halloran (2010) and Taylor (2017) have suggested that two keyword con-
ceptualization traditions have co-existed in the past. One is influenced by
cultural studies traditions and sees these words as the body of meanings of
the practices that are central to our societies and institutions. The second
tradition is embodied by corpus linguistics research methodology, one of
its empirical principles being that ‘repeated events are significant’ (Stubbs
2007: 130). In this light, the clustering of lexical items reveals different co-
textual environments that are built upon co-collocation and colligation (Pace-
Sigge 2013) […] Keywords, and, particularly, keyness (Scott & Tribble, 2006),
identify the lexical items that characterize a text or a whole corpus.
(Pérez-Paredes, 2017: 163)
Scott & Tribble (2006) report that 57% of the keywords from 1,000 randomly
selected texts from the BNC are nouns, determiners, prepositions and pronouns.
Specifically, pronouns, proper nouns, possessive ’s, the verb ‘be’ and common
nouns are the most likely sources of keywords in the English language (Culpeper
& Demmen, 2015). In a study of English news language, Fuentes (2015) reports
that over 70% of the keywords are nouns.
Keywords convey aboutness and, more specifically, they represent keyness. Scott
& Tribble (2006) have defined keyness as:
[…] a quality words may have in a given text or set of texts, suggesting that
they are important, they reflect what the text is really about, avoiding trivia
and insignificant detail. What the text ‘boils down to’ is its keyness, once we
have steamed off the verbiage, the adornment, the blah blah blah […]
(Scott & Tribble, 2006: 55–56)
The basic principle is that a word-form which is repeated a lot within the text
in question will be more likely to be key in it. A recipe for a cake may well
have several mentions of eggs, sugar, flour, cake. In our case, it is simple ver-
batim repetition, allied to a statistical estimate of likelihood. The method uses
words, not sentences or propositions, and relies on a simple decision as to what
constitutes a ‘word’, namely the presence of space or punctuation at each end
of a candidate string […]
(Scott & Tribble, 2006: 58)
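The method Scott & Tribble describe can be sketched in Python: treat any run of letters between spaces and punctuation as a word, count word-forms in a target and a reference corpus, and rank candidates with a statistical estimate of likelihood (here, Dunning's log-likelihood, the test favoured by Rayson later in this chapter). The two toy 'corpora' are invented for illustration; real studies would use full corpora and a tool such as WordSmith or Sketch Engine:

```python
# A minimal keyword-extraction sketch. The two toy corpora are invented;
# the log-likelihood formula is the standard two-way Dunning statistic.
import math
import re
from collections import Counter

def words(text):
    # A 'word' is any run of letters delimited by spaces or punctuation.
    return re.findall(r"[a-z]+", text.lower())

def log_likelihood(a, b, total_a, total_b):
    """Dunning's log-likelihood for a word occurring a/total_a vs b/total_b times."""
    e1 = total_a * (a + b) / (total_a + total_b)  # expected freq in target
    e2 = total_b * (a + b) / (total_a + total_b)  # expected freq in reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

target = Counter(words("eggs sugar flour cake. beat the eggs, add the sugar."))
reference = Counter(words("the cat sat on the mat. the dog sat too."))
n_t, n_r = sum(target.values()), sum(reference.values())

keyness = {w: log_likelihood(target[w], reference[w], n_t, n_r) for w in target}
for w, score in sorted(keyness.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w:10s} {score:6.2f}")
```

Even on this toy recipe, the repeated content words (eggs, sugar) outrank function words such as the, which occur at similar rates in both corpora: this is the 'verbatim repetition, allied to a statistical estimate of likelihood' described in the quotation.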
identify more keywords. However, a reference corpus whose domain and register
are far away from the ones of our target corpus will result in the identification of
keywords that do not necessarily reflect the propositional content of the corpus.
Let us focus on some methodological aspects in the following sections.
and […] how frequent phrases in Jane Austen’s novels contribute to the con-
struction of characters, and places such as the city of Bath.
(Culpeper & Demmen, 2015: 95)
Note that the identification of multiword keywords presents a direct way to under-
stand how lexical units convey aboutness while simultaneously revealing discoursal
features of the use of the language. So, a combination of single word keyword and
multiword keyword analysis can be extremely useful when identifying the con-
struction of discourse in a given text or corpus.
Our research project in this chapter involves the examination of how educa-
tion is conceptualised in peace treaties worldwide (Cremin, 2016). We will now
use keyword analysis and multiword keyword analysis to identify the aboutness
and propositional content of examples of these documents. We have gathered five
peace agreements that contain references to education as per the Peace Accords
Matrix search interface5 provided by the Kroc Institute for International Peace
Studies and the University of Notre Dame. The corpus contains 38,618 tokens
(words) and 4,870 types (different words). This analysis may inform areas in peace-
keeping, peace-making and peace-building in the context of peace education
(Cremin & Bevington, 2017). The five peace agreements selected are listed in
Table 6.1 with their United Nations descriptions.
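As a quick illustration of the token/type distinction used in these figures, a toy sentence can be counted as follows (the treaty corpus itself is not reproduced here):

```python
# Tokens are running words; types are the distinct word-forms among them.
import re

text = "Peace education supports peace, and education supports peace."
tokens = re.findall(r"[a-z]+", text.lower())
types = set(tokens)

print(len(tokens), len(types))  # 8 tokens, 4 types
```

The same counting logic, applied to the five agreements, yields the 38,618 tokens and 4,870 types reported above (tools differ slightly in how they tokenise hyphens, numbers and apostrophes, so counts may vary a little between programs).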
Table 6.1 Cont.
than the study corpus yields a similar number of keywords to reference corpora
that are up to 100 times larger than the study corpus’, so the five-times rule seems
a reasonable way forward, although we recommend using a myriad of corpora
and triangulating the results.
To exemplify scenario A in Figure 6.1, we chose the British Law Reports
Corpus,7 an 8.8 million-word corpus of judicial decisions made between 2008 and
2010 by British courts, available on Sketch Engine. After running a keyword ana-
lysis, we find that the keywords listed in Table 6.2 yield the highest keyness scores:
Note that Table 6.2 only shows the top 10 strongest keywords in the corpus.
As expected, the strongest candidates are all proper nouns (Tajikistan, RUF, Sierra,
[Figure 6.1 (cont.): reference corpus scenarios. B: target corpus = our corpus; reference corpus = a corpus reflecting the language used widely in the language of the target corpus. C: target corpus = the individual documents analysed; reference corpus = the entire target corpus (except the file analysed).]
Rayson (2003), evaluating various statistical tests for data involving low fre-
quencies and different corpus sizes, favors the log-likelihood test ‘in general’
and, moreover, a 0.01% significance level ‘if a statistically significant result is
required for a particular item’ (2003: 155).
(Culpeper & Demmen, 2015: 98)
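Rayson's recommendation can be operationalised by comparing a keyword's log-likelihood score against the chi-squared critical value for one degree of freedom; 15.13 corresponds to the 0.01% (p < 0.0001) level cited above. A minimal sketch:

```python
# Chi-squared critical values (1 degree of freedom) for common alpha levels.
# A log-likelihood score at or above the threshold is significant at that level.
CRITICAL = {0.05: 3.84, 0.01: 6.63, 0.001: 10.83, 0.0001: 15.13}

def is_significant(ll_score, alpha=0.0001):
    """Apply Rayson's suggested 0.01% cut-off (or a looser one) to a LL score."""
    return ll_score >= CRITICAL[alpha]

print(is_significant(16.2))              # clears the 0.01% threshold
print(is_significant(9.5))               # fails at 0.01% ...
print(is_significant(9.5, alpha=0.01))   # ... but passes at 1%
```

The practical point is that the stricter the level chosen, the shorter (and more trustworthy) the resulting keyword list; researchers comparing keyword lists across studies should always report which cut-off they used.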
Table 6.4 Essential terminology: keyness scores
• The following subjects shall be added in the No. 3 of the function of the
Council-(1) Vocational training; (2) Primary education in mother tongue;
(3) Secondary education.
• […] No. 3 of the function of the Council-(1) Vocational training; (2) Primary
education in mother tongue; (3) Secondary education. b. The words ‘or
protected’ placed in sub-section 6 (b) of the function of the Council in the first
schedule shall be […]
• […] the educational institution. The govt shall provide necessary scholarships
for research works and receiving higher education in abroad. 11.The govt
and elected representative shall make efforts to maintain separate culture and
tradition of […]
Figure 6.2 Keywords in the Chittagong Hill Tracts Peace Accord
So, the use of education in this agreement is subordinate to the legal competencies
that will be retained by the government. This lack of engagement with the notion
of education as a driving force for change would, therefore, seem like a missed
opportunity for education to play a bigger role in social justice and for education
and government policies to help ‘to bridge the growing gap between the rich and
the poor’ (Cremin, 2016: 5). Repeating this process of analysis with the other
documents would give researchers a fuller picture of how education is constructed
in peace agreements.
So far, we have looked at single word keywords. Multiword keywords will give
researchers a closer look at how discourse is built through larger units, typic-
ally but not exclusively, through noun phrases. Those researchers interested in
understanding how people understand reality will find that multiword keyword
tests can give them an excellent opportunity to start their analysis by examining
what is unique or different in an interview or in a corpus. This is so because
multiword keywords exemplify how speakers construe a range of relations through
their choice of words. For example, noun phrases modified by adjectives can offer
a robust picture of the external reference used by a speaker or a writer and their
stance towards their propositions. Pre- and post-modification in noun phrases
are extremely useful linguistic tools to do so. Table 6.5 gives the picture of the
top 40 multiword keywords that are found in the Chittagong Hill Tracts Peace
Accord. In this analysis, the reference corpus is the peace treaties corpus minus
the Chittagong text.
Note that multiword keywords can be made up of two or more words, such
as normal life, general amnesty and determination of electoral constituency, although most
keywords tend to be two-word clusters. Only 16 of the keywords occur more than
twice in the corpus, which is only natural given the sizes of both the target and the
reference corpora. In the Chittagong agreement, there seems to be an emphasis
on amnesty, the distinction between tribal and non-tribal stakeholders, and land. The
references to primary education, secondary education and educational institutions
all make it into the top 100 keywords. The use of educational as in educational insti-
tution is relevant in terms of policy making and is not found in the rest of the
agreements:
Until development equal to other region of the country the govt shall continue
reservation of quota system in govt services and educational institutions for
the tribals. With an aim to this purpose, the govt shall grant more scholarships
for the tribal students in the educational institution.
(Chittagong Hill Tracts Peace Accord, 1997, https://peaceaccords.nd.edu/)
Now look at Table 6.6 to compare the top 50 multiword keywords in the entire
corpus of agreements (five documents), using the British National Corpus as a
reference corpus.
Note how we have shifted our interest from a more focused examination of
the circumstances of the Chittagong Hill Tracts Peace Accord (Table 6.5) to the
Table 6.6 Cont.
• Some keywords are very frequent in the reference corpus. However, given
their different sizes, they still qualify as significant keywords in the corpus.
This group includes national reconciliation, armed conflict and, interestingly, educa-
tional system. The exploration of the patterns of use of these keywords in the
two corpora will give researchers a better insight into how language is being
used in the target corpus and will contribute to drawing conclusions in terms
of the research questions in their projects.
• A second group of keywords is made up of terms that are not found in the ref-
erence corpus. This is extraordinary as the reference corpus used here is the
100-million-word BNC. This group includes terms such as delivery of humani-
tarian assistance, geo-political structure or democratic restructuring. Again, a closer
examination of the cotexts and the contexts of use will be necessary to gain
further understanding of what the implications are.
[Figure: keyword analysis workflow. A: keyword analysis → identification of significant keywords. B: concordance lines → identification of the left and right contexts. C: colligational analysis → exploration of the local grammars.]
The Lomé agreement is unique in stating explicitly the need and pledge to pro-
mote human rights education:
References to educational sectors (primary, secondary, etc.) are only found in the
Chittagong agreement and in the agreement between the Government of the
Republic of the Philippines (GRP) and the Moro National Liberation Front (MNLF),
where there is an explicit reference to how education will look once the agreement
has been signed, and a full description of the new educational system is provided:
It shall develop the total spiritual, intellectual, social, cultural, scientific and
physical aspects of the Bangsamoro people to make them Godfearing, pro-
ductive, patriotic citizens, conscious of their Filipino and Islamic values and
Islamic cultural heritage under the aegis of a just and equitable society. The
Structure of Education System. The elementary level shall follow the basic
national structure and shall primarily be concerned with providing basic
education; the secondary level will correspond to four (4) years of high
school, and the tertiary level shall be one year to three (3) years for non-degree
courses and four (4) to eight (8) years for degree courses, as the case may be
in accordance with existing laws. Curriculum. The Regional Autonomous
Government educational system will adopt the basic core courses for all
Filipino children as well as the minimum required learnings and orientations
provided by the national government, including the subject areas and their
daily time allotment.
(Mindanao Final Agreement, 1996, https://peaceaccords.nd.edu/)
For example, the gay keywords sweat, smelly, beer, football, duty, army, and mili-
tary all contributed toward a discourse of hypermasculinity within the gay
narratives. Some of these keywords have semantic links –for example, army
and military, but it is only by looking at their overall functions in the texts
that stronger links can be made between them (e.g., there is no immediately
obvious link between the words smelly and military). Only through a con-
cordance-based analysis of these words was it made clear that smelly was
consistently used in a way to construct hypermasculine identities in the gay
texts.15
(Baker, 2004: 352)
Coming back to our analysis of the keyword education in the corpus of peace
agreements texts, we find that the low frequency of occurrence of the noun
itself and the use of a limited range of collocates conspire to present few gram-
matical relations with a very limited set of collocates, as reflected in Figure 6.6.
Particularly, the role of education in clauses either as a subject or as an object is
marginal, even more so when one examines the low frequency of such words (see
Figure 6.6) in the corpus.
Clearly, the term education16 is not a priority in the discourse of peace treaties.
If the researcher wishes to abandon the keyword analysis path for a moment, it
may be useful to check whether a collocation analysis will return some different
results. This is not always successful, but it will typically offer a slightly
wider range of results that can be tested. In order to test our claim that education
is largely irrelevant in the corpus, we will need to approach the analysis of colliga-
tion patterns using the multiword Sketch function.
The collocations are only extracted from sentences which contain the col-
location (phrase) in question. In other words, the collocates only come from
contexts where the collocation (phrase) is used. Contexts where the members
of the phrase are used on their own are excluded. This makes it possible to
only display collocates related to a particular word sense or subject.
(www.sketchengine.eu)
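The restriction described in the Sketch Engine excerpt can be sketched as follows: collocates are counted only in sentences that contain the phrase itself, so that contexts where the phrase's members occur on their own never contribute. The sentences below are invented for illustration:

```python
# Collocates of a phrase, counted only in sentences containing the phrase.
# The example sentences are invented; a real analysis would iterate over a
# sentence-split corpus.
from collections import Counter

PHRASE = "educational system"
sentences = [
    "the regional educational system will adopt the core courses",
    "the national educational system shall be reformed",
    "education is a priority for the government",   # no phrase: excluded
]

collocates = Counter()
for s in sentences:
    if PHRASE in s:                      # keep only sentences with the phrase
        for w in s.split():
            if w not in PHRASE.split():  # don't count the phrase's own members
                collocates[w] += 1

print(collocates.most_common(3))
```

Note how the third sentence contributes nothing, even though it contains education: this is exactly why the collocates in a multiword sketch relate to the particular phrase rather than to its parts.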
Educational system occurs relatively often in our target corpus (13 times) and in the BNC (217 times) when compared with the rest of the multiword keywords. Figure 6.7 shows the
colligational patterns of educational system in the target corpus of peace agreements.
When compared with Figure 6.6, one can see that the number of collocates has
gone down. This can be explained by the fact that the larger the unit of analysis, the fewer occurrences and colligational patterns emerge in the analysis. In Figure 6.7, we note the following: (1) the words that premodify educational system tend to describe its scope (national, regional) in a matter-of-fact,
non-specific way; (2) this educational system plays no substantial role either as sub-
ject or object in clauses. In other words, its impact on the discourse in the corpus
is very limited. This contrasts with the usage observed in the British National
Corpus (Figure 6.8). This is a larger corpus and, quite naturally, will give us an
accurate insight into how other speakers across a wider set of contexts have used
this particular multiword keyword (educational system).
It is unsurprising that, given the wider range of texts included in the BNC, we can
find more lexical variation in this multiword sketch. The range of adjectives that pre-
modify our keyword is extraordinary, from bourgeois to needs-oriented and over-competitive.
As represented in Figure 6.8, the colligational pattern scenario is dominated by pre-
modification, which takes up almost 80% of the entire set of collocates. Equally
interesting is the fact that there is a wide range of collocates where educational system
is either a subject or an object in the clause, as shown in Table 6.9.
This picture of lexical diversity contrasts with what we found in the peace
agreements corpus. Exploring the colligational patterns in a larger, representative
corpus of language use (scenario B in Figure 6.1) can give us a measure of what
is not present in our target corpus. In other words, contrasting usage in different
corpora can show us what is frequent and what is not at all frequent in
the data. Baker (2004) suggests that comparing and triangulating data is essential:
Carrying out comparisons among three or more sets of data, grouping infre-
quent keywords according to similar meaning or function, showing awareness
of keyword dispersion across multiple files by using key keywords, carrying
Contrasting keywords and words within the same corpus can be illuminating in
many ways. A Word Sketch Difference search between education and peace can be
revealing of the way discourse is built around these two ideas in the corpus. Thus,
as we already knew, education shows a very restricted set of collocations in the corpus (health, programme, Leone, student, and so on). Peace, on the contrary,
offers a more complex discursive treatment.
Table 6.10 summarises how to research such nouns and noun phrases.
Since 2012, we’ve sent every story we’ve received to Oxford University Press.
These scholarly superstars have now collected 658,477 stories since 2012.
That’s over 328 million words! Our entrants have provided them with the
biggest collection of children’s writing in the world. Why does that matter?
Well, these stories help them to create dictionaries, to understand the lan-
guage children are using and how it’s developing over time. It helps them
work out what kids are interested in: from politics, world events, celebrities
to football. The results from this are taught in seminars and lectures around
the world and help leading figures in Education to improve the way English
is taught in schools.
(www.bbc.co.uk/programmes/articles/1hvt2rmlxVfHyLXhJgDb58B/
your-words-help-us-understand-childrens-language)
One of these applications is the study Oxford University Press makes every year
of the lexis used by these young writers. In 2019, use of Brexit increased by 464%
when compared to 2018. According to this study,23 children showed an interest
in politics and political language was more widely used than in previous years.
European Union was the main cluster, but trade deal or backstop also made their way
into the most used words.
Thompson & Sealey (2007) drew on the British National Corpus to create a
corpus of fiction written for children and compared it with a larger reference
corpus of fiction written for an adult audience. They contrasted these two corpora
not only to examine the similarities and differences in the overall frequencies of
words, parts of speech, etc., but also to examine whether specific lexical items are
used in particular ways when representing the world to child readers. This type
of exploration is really of potential interest to educational researchers looking at
specific texts or corpora in their own projects. Thompson & Sealey’s (2007) work
is an interesting showcase of what can be done with the range of tools we have
discussed in this book (concordance lines, word lists, collocations, keywords, etc.).
Interestingly, they carried out automatic semantic analysis. To do this, they used
Paul Rayson’s Wmatrix24 (Rayson, 2008). They found that the children’s books
corpus was characterised by the following topics (in decreasing order of statis-
tical significance): living creatures, personal names, food, plants, objects, com-
munication, future, size, sight, fear and speed. In contrast, the adult corpus was
categorised semantically by intimate and sexual relationships, drinks, life, law and
order, anatomy, medicines, strong vs. weak characterisation and thoughts and
beliefs.
N-grams are multiword units that occur frequently in a corpus. Essentially, n-
grams are sequences of n words (most typically three, four and five words) that
are parsed, counted and extracted from a corpus. There is substantial research
that shows that the clustering of words is a feature that varies across registers,
characterising them lexically (Gries, Newman, Shaoul & Dilts, 2009). However,
the applications of n-gram analyses go beyond linguistic analysis. Juola (2013) has
used n-grams to analyse cultural complexity in the US, and Stamatatos (2009) to
study plagiarism detection.
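N-gram extraction itself is straightforward to sketch: slide a window of n words across the tokenised text and count each sequence. The toy text below stands in for a real corpus such as CLiC:

```python
# Extract and count 4-grams by sliding a 4-word window over the token list.
# The toy text is invented for illustration.
from collections import Counter

def ngrams(tokens, n):
    # zip over n staggered copies of the token list -> consecutive n-tuples
    return zip(*(tokens[i:] for i in range(n)))

tokens = "at the end of the day at the end of the week".split()
counts = Counter(" ".join(g) for g in ngrams(tokens, 4))

print(counts.most_common(2))
```

On this toy text, *at the end of* and *the end of the* each occur twice, while the remaining 4-grams occur once; on a full corpus the same counting, followed by a frequency cut-off, produces lists like the one discussed below.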
The extraction of the most frequent 4-grams from the CLiC corpus of 19th
century children’s fiction returns a list of multiword sequences that can serve as a
starting point to understand the specific narrative techniques and concerns of that
period. This is the list of the 20 most frequent 4-grams and their raw frequency
in the corpus:
Notions of space, time and quantity seem to dominate the choice of words in this
list, which tentatively confirms the presence of a well-referenced shared context
in these stories. However, an n-gram analysis (as set out in Table 6.11) requires a
thorough examination of all relevant n-grams as well as their distribution in the
corpus and an exploration of the concordance lines where they occur. Mahlberg
(2013) argues that n-grams can provide a fine-grained description of the language
used in fiction:
A list of n-grams can give researchers some powerful insights into the frequent
multiword lexical items used in the corpus and into what Mahlberg has called
local textual functions (Mahlberg, 2013). McEnery & Hardie (2012) have stressed
how different areas of research are using CL methods to triangulate their results.
The use of n-grams is particularly relevant when looking at formulaicity and how
language is represented in our brains:
[…] the research by Ellis and Simpson-Vlach (2009) on the status of n-grams
as psychological units triangulates by incorporating corpus data within a
series of experimental investigations [and] responses by instructors of English
for Academic Purposes expressing their opinions on the formulaicity, cohe-
siveness and educational importance of the n-grams under study. The three-
way methodological multiplicity allows Ellis and Simpson-Vlach to conclude
(2009: 73) that ‘formulaic sequences, statistically defined and extracted from
large corpora of usage, have clear educational and psycholinguistic validity’.
(McEnery & Hardie, 2012: 209)
Note that keyword analysis provides a different route into the content of a corpus,
not necessarily rooted in frequency. When looking at n-grams we are actually
evaluating effects of absolute frequency (see chapter 1). If there are too many
concordance lines to be examined, it may be feasible to obtain a random sample
that can be analysed in a realistic timeframe. Sketch Engine offers such a function.
Table 6.11, meanwhile, offers a summary of how to use n-grams.
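Drawing a random sample of concordance (KWIC) lines, as suggested above, can be sketched as follows; the toy corpus, node word and sample size are invented for illustration:

```python
# Build keyword-in-context (KWIC) lines for a node word, then read a random
# sample of them rather than every hit.
import random

def concordance(tokens, node, width=3):
    """Return one 'left [node] right' line per occurrence of the node word."""
    lines = []
    for i, w in enumerate(tokens):
        if w == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

tokens = ("peace education supports lasting peace and peace "
          "agreements often mention education and peace").split()
lines = concordance(tokens, "peace")

random.seed(0)                      # fix the seed so the sample is reproducible
sample = random.sample(lines, 2)    # examine a random subset, not all hits
for line in sample:
    print(line)
```

Fixing the random seed, as above, is worth doing in research projects: it makes the sampled concordance lines reproducible, so another researcher can verify exactly which lines were read.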
Notes
1 www.theguardian.com/commentisfree/2019/dec/27/brexit-end-english-official-eu-language-uk-brussels
2 www.theguardian.com/commentisfree/2019/dec/27/brexit-end-english-official-eu-language-uk-brussels
3 www.lexically.net/wordsmith/
4 Target corpora are also referred to as focus corpora or corpora of interest.
5 https://peaceaccords.nd.edu/search
6 This will involve cleaning up our data and saving .txt files free from noise and unwanted
coding. See chapter 4 for further guidelines.
7 http://flax.nzdl.org/greenstone3/flax?a=fp&sa=collAbout&c=BlaRC&if=
8 www.sketchengine.eu/documentation/simple-maths/
9 Paul Rayson has set up a resource that offers the possibility to calculate log-likelihood and effect size online: http://ucrel.lancs.ac.uk/llwizard.html
10 The AntConc website offers some useful lists, mainly BNC related: www.laurenceanthony.net/software/antconc/
11 You can find further details on how to do this at: www.sketchengine.eu/guide/create-a-subcorpus
12 British National Corpus used as reference corpus.
13 Sinclair (1991) showed how for many speakers of English back is a part of our body; yet,
in actual usage, it is just a residual sense.
14 More information on logDice and other Sketch Engine statistics can be found at: www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/
15 Baker (2004: 353) also warns us that this process is not quantitative per se: ‘[…] one
problem with combining words into conceptual groups is that it is a subjective process.
Some groups may suggest themselves more clearly to the researcher than others, and
it may be difficult to know how to specify a cut-off point. Carrying out concordance-
based analyses of individual keywords should ensure that the researcher first has an
understanding of what such words are used to achieve in a text, before erroneously
combining words that may appear similar at face value. Like many other forms of
linguistic analysis, researchers are required to develop skills of interpretation, which
suggests that corpus-based research is not a merely quantitative form of analysis.’
16 A thorough analysis would include all keywords (or words if our analysis is preliminary)
that in some way or another are related with this notion.
17 Developed as part of the AHRC-funded project: “CLiC Dickens – Characterisation in the representation of speech and body language from a corpus linguistic perspective” (Arts and Humanities Research Council grant reference AH/K005146/1); www.clarin.ac.uk/clic
18 www.clarin.ac.uk/clic
19 https://github.com/birmingham-ccr/corpora/tree/master/ChiLit
20 www.sketchengine.eu/oxford-childrens-corpus/
21 www.bbc.co.uk/programmes/p00rfvk1?region=uk
22 www.bbc.co.uk/programmes/articles/3Xk91WDG700VjPYNGMYBrzK/a-life-sentence
23 http://fdslive.oup.com/www.oup.com/oxed/children/500-words/Brexit_Children’sWordoftheYear_Infographic_500Words2019.pdf
24 Wmatrix 4 URL: https://ucrel-wmatrix4.lancaster.ac.uk/
25 https://en.wikipedia.org/wiki/Chittagong_Hill_Tracts_Peace_Accord
References
Baker, P. (2004). Querying keywords: questions of difference, frequency, and sense in
keywords analysis. Journal of English Linguistics, 32(4), 346–359.
Berber-Sardinha, T. (2000). Comparing corpora with WordSmith Tools: How large must
the reference corpus be? In Proceedings of the workshop on Comparing corpora. Association for
Computational Linguistics, 7–13.
148 Examining lexis
Chapter 7
Analysing talk: complex searches
In this chapter we will shift our focus to spoken language and interviews. Readers
will examine spoken data and will integrate complex searches into their inquiry
process. A section of the chapter will be devoted to a review of skills 12–17.
The Backbone corpora are freely available2 for learning and academic purposes3.
Figure 7.1 shows a screenshot of the search interface of Backbone corpus of
English as a Lingua Franca.
The corpus consists of 50 interviews containing 81,607 tokens and 5,817 types.
The adults interviewed come from different sectors including education, research,
engineering and art. They speak English as a Lingua Franca (ELF) and during the
interviews they discuss a variety of topics, from environmental issues to work and
family balance, from culture to politics. In this chapter, we will query a text version
of the corpus that was transcribed and annotated following the TEI guidelines
discussed in chapter 5. The corpus was cleaned up and, among other things, references
to entities such as the project and the staff involved in the transcription and anno-
tation were removed.
From a linguistic perspective, spoken language shows distinctive features when
compared with the written form. Staples (2015) has summarised some of the areas
where spoken English language4 behaves in unique ways.
1 N-grams are more common in spoken than in written English. Formulaic lan-
guage is a distinctive feature of spoken communication.
2 Stance features are much more common in spoken than in written language.
Sometimes stance devices (adverbs, adverbials, verb + that clause and, among
others, modal verbs) are used frequently for both epistemic and interactional,
strategic purposes.
3 Discourse markers are much more common in spoken language.
4 Spoken discourse is characterised by vagueness.
When working with interview data (King, Horrocks & Brooks, 2019), it is necessary to understand that the social
world that the researcher is trying to capture is constrained by the characteristics
of spoken communication. Most educational research drawing on qualitative
methodology tries to understand people as constructing their own meanings of
situations. Cohen, Manion & Morrison (2018) frame this constructionist approach
in the following way:
People are deliberate, intentional and creative in their actions, and meaning
arises out of social situations, interactions and negotiations, and is handled
through the interpretive processes of the humans involved. Meanings used by
participants to interpret situations are culture- and context-bound, and there
are multiple realities, not single truths in interpreting a situation.
(Cohen, Manion & Morrison, 2018: 288)
CL methods offer practical ways to unpack the constructed reality. For example,
Locke (2004) has captured relevant areas of Fairclough’s5 text analysis that can be
used by researchers when examining the linguistic fabric of discourse:
• vocabulary (individual words)
• word meaning: explores the meaning potential of words and the changes
involved in accommodating (new) discourses
• wording: the ways in which extralinguistic referents are coded into words
are a manifestation of interdiscursivity
• metaphors: the figures of speech used to structure the way we express our
systems of beliefs and knowledge
• grammar (phrases and clauses)
• modality: evaluates how speakers see propositions, from certainty to
possibility
• transitivity: the ideational dimension of grammar. It explores verb
valency to understand relational meanings (be, have, etc.), actions (and
the arguments involved, such as agents), events and mental verbs
• cohesion (clause and sentence linking)
• connectives and argumentation: different types of argumentation have
cultural and ideological significance
• text structure (organisational properties of the text)
• interactional control: explores turn-taking and topic shift, among others.
These areas are worth exploring in the data and, most certainly, can be used to
comprehend how language is interwoven with how humans code their experiences.
Thompson (2013) notes how language uses these resources:
While this may seem obvious to many of us, it offers a good starting point to
classify experiences in terms of participants, processes and circumstances. The
analysis of verbs in clauses has advanced the study of transitivity (Locke, 2004), that
is, the analysis of the relationship between processes and participants. In short,
processes can be classified in the following ways.
The actual language used in interviews provides evidence of how these processes
are present in our data. In this chapter, we will adopt a case study strategy to
explore three open research questions:
typed between inverted commas. Before the lemma, we then specify that we
want an adjective. We do this by selecting the appropriate POS tag. As seen in
Figure 7.2, Sketch Engine offers a list of the tags available once we click on the box
TAGS on the right-hand side. Table 7.2 shows some of the most common tags in
a corpus tagged by Sketch Engine.
Note that some tags give us specific information about a word’s class. For
example, in the case of adjectives we can choose between comparative (JJR) or
superlative forms (JJS). As for nouns, we can examine common nouns in the
singular (NN), common nouns in the plural (NNS), singular proper nouns (NP) or
plural proper nouns (NPS). Let us go back to our initial search: [tag="J.*"][lemma="city"].
This will return all the instances in the corpus where the lemma city is preceded by an
adjective. Table 7.3 shows a breakdown of the lexical items in the [tag="J.*"]
[lemma="city"] search.
Note that [tag="J.*"] includes ".*" after J. What we are doing here is asking
Sketch Engine to retrieve all tags that start with a "J", including JJ, JJR and JJS.
This is a flexible option that can be used to narrow or widen the range
of results we want to examine. For example, [tag="V.*"] includes the following
verb POS tags: VV, VVD, VVG, VVN, VVP and VVZ (see Table 7.2). However,
this type of search returns such a wide variety of POS tags and lexical items that
the results will be very challenging to process. How do we make sense of all of this
information? We can either examine the range of POS tags and/or the lexical
elements that fill the POS tag slots. Let us see how this can be done.
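The matching logic behind such tag queries can be sketched in a few lines of Python. This is a toy illustration of how regex-based token constraints work, not Sketch Engine's implementation; the mini-corpus and the `cql_match` helper are invented for the example:

```python
import re

# Toy tagged corpus: each token carries word, lemma and POS tag
# (tags follow the Table 7.2 inventory; the sentence is invented).
tokens = [
    {"word": "It", "lemma": "it", "tag": "PP"},
    {"word": "is", "lemma": "be", "tag": "VBZ"},
    {"word": "a", "lemma": "a", "tag": "DT"},
    {"word": "big", "lemma": "big", "tag": "JJ"},
    {"word": "city", "lemma": "city", "tag": "NN"},
    {"word": "and", "lemma": "and", "tag": "CC"},
    {"word": "bigger", "lemma": "big", "tag": "JJR"},
    {"word": "cities", "lemma": "city", "tag": "NNS"},
]

def cql_match(tokens, query):
    """Return the word sequences where each consecutive token satisfies
    the corresponding constraint, e.g. [{"tag": "J.*"}, {"lemma": "city"}]."""
    hits = []
    for i in range(len(tokens) - len(query) + 1):
        window = tokens[i:i + len(query)]
        if all(re.fullmatch(regex, tok[attr])
               for tok, constraint in zip(window, query)
               for attr, regex in constraint.items()):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits

# [tag="J.*"][lemma="city"]: an adjective followed by the lemma "city"
print(cql_match(tokens, [{"tag": "J.*"}, {"lemma": "city"}]))
# ['big city', 'bigger cities']
```

Because "J.*" is matched against the whole tag, it captures JJ, JJR and JJS alike, which is exactly the zooming behaviour described above.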
Figure 7.3 The top 10 verb tags in the corpus by raw frequency: VVP 2,928; VV 2,900; VBZ 2,307; VBP 1,288; VVG 1,131; VHP 854; VVD 732; VVN 719; VVZ 527; VBD 421.
We would first like to know how many of the possible V tags are present in the
corpus. To do this we look at the distribution of these POS tags in the corpus.
Figure 7.3 shows the top 10 most frequent verb tags in the corpus and their raw
frequencies.
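Once a tagged corpus has been exported as a token stream, ranking the verb tags behind a figure like this takes only a few lines. A minimal sketch (the tag list is invented; real counts would come from the exported corpus):

```python
from collections import Counter

# A made-up stream of POS tags standing in for the exported corpus.
tags = ["VVP", "NN", "VV", "VBZ", "VVP", "JJ", "VVP", "VVD", "VV", "VBZ"]

# Keep only verb tags (anything starting with V) and rank by raw frequency,
# which is what Figure 7.3 plots for the real corpus.
verb_freq = Counter(t for t in tags if t.startswith("V"))
for tag, freq in verb_freq.most_common():
    print(tag, freq)
```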
The most frequent tags are, in decreasing order of frequency, inflected pre-
sent tense forms other than the third person singular (I think, we hope, I want);
base forms of verbs (I must say, difficult to understand); simple present forms of the verb
to be, both third person singular (is) and first and second person (am, are); and gerund
forms (talking, working). These tags can be explored individually so as to map out
the functions they carry out in the interviews. We can use the tag VVP to
explore some mainly ideational functions of the language used to express opinions on
what it feels like to live in cities or villages. Contrasting VVP and VVD tags seems
like a good way into the data. To do this, we can use [tag="VVP|VVD"], where
the vertical line | is used to ask Sketch Engine to retrieve either of the two tags.
The results will be easy to sort out by selecting the Show frequency option and
the KWIC (keyword-in-context) POS tag (see Table 7.4). From there the con-
cordance lines for each of the tags can be shown.
The new concordance line results can be further sorted, and a rank of verbs
obtained. Figure 7.4 shows the top 20 most frequent verbs used in the present
tense (VVP) in the corpus.
From this screen we can access new concordance lines, for example we may
want to explore think, and we can ask Sketch Engine to return its collocates. The
results will include adjectives such as good (10.21),6 important (10.07), interesting (9.15),
European (9.05) and different (8.67). These are the adjectives that appear to con-
centrate the expression of opinion in the dataset. However, our main focus is
to explore opinions about living in the city. We could approach this search in
different ways. One of them is to use the following CQL:
Note that the search needs to be inserted in brackets. What we are doing here
is asking Sketch Engine to give us all the concordance lines where a verb in the
present tense is found five slots to the left or to the right of either city or village. We
could include an adjective in our search instead and increase our search to eight
slots to the left and the right:
This is a powerful method to identify language uses that can help us answer our
research questions. Note that the concordance lines obtained from the results
of our search can be exported to a spreadsheet where we can filter out results,
examine the evidence in the data and come up with new hypotheses and target
language patterns. In chapter 3, we presented a step-by-step procedure to read
and interpret concordance lines. This procedure (summarised in Table 3.7) is
based on the notion of constant recycling of our findings and hypotheses.
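The spreadsheet step can also be scripted. The sketch below (with an invented utterance) builds simple keyword-in-context lines and writes them to CSV, ready for filtering:

```python
import csv
import io

words = "I think the city is great but I think the village is quiet".split()

def kwic(words, node, span=3):
    """Build (left context, node, right context) triples for every hit."""
    lines = []
    for i, w in enumerate(words):
        if w.lower() == node:
            lines.append((" ".join(words[max(0, i - span):i]),
                          w,
                          " ".join(words[i + 1:i + 1 + span])))
    return lines

# Write the concordance to CSV so it can be filtered in a spreadsheet.
buffer = io.StringIO()  # a file path would be used in practice
writer = csv.writer(buffer)
writer.writerow(["left", "node", "right"])
writer.writerows(kwic(words, "think"))
print(buffer.getvalue())
```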
Figure 7.5 captures the dynamic nature of this process and acknowledges the
agency of the researcher while assessing the usage that characterises the lived
experiences of those interviewed. We suggest that researchers start the exploration of
the language in the texts with a word list (either a general word list or a word list of
nouns, verbs or adjectives) and then move on to explore specific items in the corpus.
An alternative way in is to start with those lexical items that will be key to interpreting
the research questions in the project. The two circles in Figure 7.5 exemplify how CL
methods (grey arrows) can guide our inquiry (black arrow, inner circle).
Let us explore now our second question.
Figure 7.5 The cyclical examination of concordance lines (the labels in the diagram include: first hypothesis, results, collocations).
[lemma="difference"] !within <u who="#Interviewer"/>
A search like this will return all the concordance lines where the lemma difference
has been used by speakers other than the interviewer:
Note that we are searching within structures in a corpus that had previously been
marked up by either transcribers or annotators (see chapter 5). The structures of
a corpus can be found in their XML structure or, if we are using Sketch Engine,
in Corpus Information > Structures and Attributes. This bit from the previous
search:

!within <u who="#Interviewer"/>

is asking Sketch Engine to retrieve all the instances of the lemma difference not used
by speakers marked up as interviewers. ‘!’ is an operator that means ‘not’, hence
not within the structure <u who="#Interviewer"/>. This will let us focus our searches
on those speakers that are of interest. Now consider how useful this would be if,
for instance, you were working with a corpus of classroom talk where we can iden-
tify everyone in the room in terms of their role (teachers, learners, etc.), gender,
age, ID or name. As we saw in previous chapters, the main analytical
tool for corpus users is the comparison between corpora or datasets. Hence the
importance of isolating parts of a corpus based on specific criteria. Once this is
done, a subcorpus can be created. We can then perform specific searches within it
or compare this subcorpus against another dataset.
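Because the Backbone transcriptions are TEI XML, the same speaker restriction can be reproduced on the raw files with standard XML tooling. A minimal Python sketch, with an invented fragment standing in for a real transcript:

```python
import xml.etree.ElementTree as ET

# Invented TEI-style fragment: <u> utterances whose "who" attribute
# identifies the speaker, as in the Backbone mark-up.
xml = """<body>
  <u who="#Interviewer">what is the main difference for you</u>
  <u who="#Speaker1">the difference is mainly the weather probably</u>
  <u who="#Speaker1">but people are friendly</u>
</body>"""

root = ET.fromstring(xml)

# Keep only utterances NOT produced by the interviewer -- the XML analogue
# of the CQL restriction !within <u who="#Interviewer"/>.
answers = [u.text for u in root.iter("u") if u.get("who") != "#Interviewer"]
hits = [text for text in answers if "difference" in text]
print(hits)  # ['the difference is mainly the weather probably']
```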
A search like this one:
[lemma="similarity|difference|culture|cultural"] !within
<u who="#Interviewer"/>
will return all the concordance lines where those interviewed discuss the differences
that they found when living or working in a different country or region. Some of
the significant collocations with the lemma difference include countries (12.19), prob-
ably (11.78) and but (11.23). A first hypothesis indicates that the interviewees tend
to hedge their opinions considerably, smoothing out their judgements as regards
other cultures. Using the procedure in Figure 7.5 we can refine and fine-tune our
hypotheses. The next section explores how family life is impacted by work in our
corpus.
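The association scores quoted here are Sketch Engine measures; its word sketches use logDice (documented on the statistics page linked in the notes), which is easy to compute directly. The frequencies below are invented for illustration:

```python
from math import log2

def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2(2 * f_xy / (f_x + f_y)). The score peaks at 14
    (words that only ever occur together) and does not depend on corpus size."""
    return 14 + log2(2 * f_xy / (f_x + f_y))

# Invented frequencies: node "difference" (f_x) and collocate
# "countries" (f_y) co-occurring f_xy times.
print(round(log_dice(f_xy=25, f_x=120, f_y=80), 2))  # 12.0
```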
far too many concordance lines. We may want to reduce our focus by looking at
uses of work as a noun (and not as verb). To do this we need to specify that we
are only interested in results that have been tagged as nouns. We can do it in the
following way:
[lemma="family|life|work" & tag="N.*"]

We have here specified that we are only interested in these lemmas when they are
tagged as nouns (which, for work, excludes its uses as a verb), using the ampersand
symbol to let Sketch Engine know that this is an additional condition that has to
be met: & tag="N.*".
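The effect of the ampersand conjunction — every condition inside the brackets must hold for a token to match — can be sketched as a simple filter over (lemma, tag) pairs (the token list is invented):

```python
# Tokens as (lemma, tag) pairs; a sketch of a CQL conjunction such as
# [lemma="family|life|work" & tag="N.*"]: both conditions must hold.
tokens = [("work", "NN"), ("work", "VVP"), ("family", "NN"),
          ("life", "NN"), ("live", "VV")]

lemmas = {"family", "life", "work"}
matches = [(lemma, tag) for lemma, tag in tokens
           if lemma in lemmas and tag.startswith("N")]
print(matches)  # [('work', 'NN'), ('family', 'NN'), ('life', 'NN')]
```

Note how the verbal use of work — ("work", "VVP") — is excluded even though its lemma is on the list.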
If the corpus has been previously annotated or coded, we may want to use this
annotation to search within structures that are likely to contribute to our explor-
ation of the data. In the case of our TEI corpus, the coding team decided to
include this information at the <div> level (as the corpus was segmented into
thematic sections) and within the decls attribute. The following is an example of a
complex search that integrates various lemmas and a specific part of the corpus
annotated as particularly relevant for the study of the world of work:
Given the complexity of the annotation of this corpus, with different overlapping
topics or tags annotated simultaneously by a team of international educators, the
search above returns exclusively those parts of the corpus that were annotated
with #worldofwork as their only tag. A simple examination of the different decls
attributes in Sketch Engine reveals that the corpus was annotated with 332
different combinations of tags. This, no doubt, enriched the search experience
of the users of the corpus and increased the flexibility of the annotators when
making decisions in terms of the coding scheme. Another option is to search for
our lemmas within any annotated part of the corpus. To do this we can use the
following CQL:
This is an alternative way into the data. This search string retrieves all annotated
parts or sections (div) of the corpus. Note that we are asking Sketch Engine to
retrieve any alphabetical characters either in upper or lower case that follow the
hashtag symbol ([A-Za-z]) plus any other characters that may follow them (.*). The
information symbol on the left of each concordance line specifies all the associated
metadata. As we saw in chapter 5,7 using an ad hoc corpus gives researchers
plenty of freedom and flexibility in terms of how they want to annotate their
texts. A customised annotation scheme will dramatically facilitate the investigation of
our research questions. Remember: a corpus is just an instrument to look at the
data that we deem relevant to our research questions.
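The regular expression in that search can be tried out on its own. A small sketch with invented decls values (only #worldofwork is attested in the corpus description above):

```python
import re

# Example decls values (invented, apart from #worldofwork) like those
# attached to <div> sections in the annotated corpus, where each topic
# tag starts with a hash.
decls_values = ["#worldofwork", "#worldofwork #environment", "untagged", ""]

# Mirrors the CQL attribute match "#[A-Za-z].*": a hash followed by a
# letter and then any other characters.
pattern = re.compile(r"#[A-Za-z].*")
annotated = [v for v in decls_values if pattern.fullmatch(v)]
print(annotated)  # ['#worldofwork', '#worldofwork #environment']
```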
A collocational analysis of the results of this search will provide us with a long
and potentially interesting list of collocates that we may want to explore. Just in
the top 80 of collocates we find, among others, the following: company (10.12),
different (10.07), but (9.88), together (9.58), people (9.58), they (9.55), balance (9.32), better
(9.20), hard (9.11), job (9.03), good (9.01), always (8.86) and enterprises (8.79). An
analysis of just company and balance alone lends evidence to the following:
Brindle (2016) compares his Stormfront corpus with a larger corpus of English, a
comparison that is instrumental in identifying the most relevant topics in white
supremacist discourse. Then, collocation analysis is used to
refine his understanding of how these words are used. Brindle (2016: 22) explains
that concordance lines ‘are then employed to facilitate the observation of the
actual contexts of these words in order to comprehend their meanings’. Brindle
categorises the Stormfront keywords into three groups – sexuality, race and evalu-
ation – which in many ways resemble the kind of result that might derive from
theme analysis. However, the ways in which this finding is arrived at are widely
different, both ontologically and epistemologically.
An analysis of collocates can provide us with even more fine-grained results.
Brindle has used collocation networks to understand these relationships. Plotting
these collocates on a network graph can be truly illuminating. Figure 7.6 shows
the relationships of the top 20 keywords in the corpus. Note how some of the
words attract more words than others and, similarly, how some words are more
unidimensional, and their association power is limited to one or two other
keywords.
Brindle (2016) interpreted some of these relationships in the light of the links
in Figure 7.6.
Figure 7.6 Collocation network of the top 20 keywords (nodes include pedophilia, homosexuality, pedophiles, perverts, homos, homosexual, queer, homo, heterosexual, perversion, Jew, homosexuals, gay and gays).
While the visual representation of these relationships is not crucial in the analysis,
we can certainly rely on a diagram to spot these links more easily. #LancsBox10
(Brezina, Timperley & McEnery, 2018) is a multiplatform software package that can
generate such visual network representations. Using a diagram can helpfully stimulate
our understanding of how a particular phenomenon is represented in a dataset.
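Under the hood, a collocation network is just a graph linking each word to the collocates it attracts; node degree separates the hubs from the more 'unidimensional' words. A minimal sketch, using a hand-picked subset of the Figure 7.6 labels rather than Brindle's actual data:

```python
from collections import defaultdict

# Invented collocate pairs, e.g. extracted from windowed co-occurrence
# counts above some association-score threshold.
pairs = [("homosexuality", "pedophilia"), ("homosexuality", "perversion"),
         ("gay", "gays"), ("homosexuality", "gay")]

# Build an undirected network: each word maps to the set of words it attracts.
network = defaultdict(set)
for a, b in pairs:
    network[a].add(b)
    network[b].add(a)

# Words with larger neighbourhoods are the multi-dimensional hubs of the graph.
hubs = sorted(network, key=lambda w: len(network[w]), reverse=True)
print(hubs[0], sorted(network[hubs[0]]))
```

Tools such as #LancsBox compute and lay out such graphs automatically; the point here is only that the underlying structure is a simple adjacency map.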
Table 7.5 gives a breakdown of how to conduct the complex searches we have
dealt with in this section.
7.3.2 Chapter 6
In this chapter we explored how keyword analysis (see Table 6.8) can be used
to explore corpora of policies and long documents. Revisit Figure 6.1 and think
about the implications of your choice of target and reference corpora. Why is
your choice so important?
What do keyness scores tell you about the lexical items in a corpus or in a
document?
You have decided to use keyword analysis to investigate your data. Go to
Figure 6.3 and try to outline step-by-step guidelines to look at your data. Can you
envisage the outcome of your analysis?
What are the differences between keyword analysis and n-gram analysis
(Table 6.11)?
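A common keyness statistic is log-likelihood (used, for example, by Wmatrix and AntConc; other tools use different scores). It compares a word's frequency in the target corpus with its frequency in the reference corpus; the counts below are invented:

```python
from math import log

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness for a single word: compares observed
    frequencies in target and reference corpora with expected ones."""
    total = freq_target + freq_ref
    expected_t = size_target * total / (size_target + size_ref)
    expected_r = size_ref * total / (size_target + size_ref)
    ll = 0.0
    if freq_target:
        ll += freq_target * log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * log(freq_ref / expected_r)
    return 2 * ll

# Invented counts: a word occurring 150 times in a 100k-token target corpus
# and 200 times in a 1M-token reference corpus.
print(round(log_likelihood(150, 100_000, 200, 1_000_000), 2))
```

With one degree of freedom, a score above 6.63 is significant at p < 0.01, so the word in this invented case would surface as a very strong keyword.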
7.3.3 Chapter 7
Again, focus on a research project that you would like to do in the future. The
project involves the transcription and the analysis of interviews. Try to identify
some of the themes that you expect to find in the interviews. How would you go
about investigating them? Can you outline a strategy to examine these themes in
your corpus? Where will you start? Which of the linguistic elements in 7.1 may
be relevant?
Ideally, your corpus has been annotated and you can search within attributes
and structures (see chapter 5). Can you envisage the outcome of your searches?
How does a plain transcription differ from an annotated corpus? How can com-
plex searches (Table 7.5) help you retrieve results from your dataset?
Notes
1 The entire XML TEI annotated corpus: http://webapps.ael.uni-tuebingen.de/
backbone-search/corpora/backbone_english_as_lingua_franca.xml
2 http://webapps.ael.uni-tuebingen.de/backbone-search/
References
Brezina, V., Timperley, M. & McEnery, T. (2018). #LancsBox v. 4.x [software]. Available
at: http://corpora.lancs.ac.uk/lancsbox.
Brindle, A. (2016). The language of hate: A corpus linguistics analysis of white supremacist language.
London: Routledge.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor
& Francis.
Ellis, N. (2020). Usage-based theories of construction grammar: Triangulating corpus lin-
guistics and psycholinguistics. In Egbert, J. & Baker, P. (Eds.) Using corpus methods to triangu-
late linguistic analysis. London: Routledge, 239–267.
King, N., Horrocks, C. & Brooks, J. (2019). Interviews in qualitative research. 2nd edition.
London: Sage Publishing Company.
Kohn, K. (2012). Pedagogic corpora for content and language integrated learning. Insights
from the BACKBONE Project. The Eurocall Review, 20(2), 3–22.
Locke, T. (2004). Critical discourse analysis. London: Bloomsbury.
Staples, S. H. (2015). Spoken discourse. In Biber, D. & Reppen, R. (Eds.) The Cambridge
handbook of English corpus linguistics. Cambridge: Cambridge University Press, 271–291.
Thompson, G. (2013). Introducing functional grammar. London: Routledge.
Chapter 8
Conclusion
This book has provided a practical introduction to the use of corpus linguis-
tics methods in education research. We have presented an entry-level research
guide to the world of corpus linguistics for those who do not necessarily have
a linguistic background or have not used corpora before. Our discussion aimed
to bridge the gap between educational research, admittedly a massively diverse
area of practice, and corpus linguistics, a very specific area with a huge poten-
tial to become a useful set of research methods across a variety of disciplines.
Despite the methods’ potential, however, we find two major challenges to this
collaboration.
The first of these challenges is inherent to the notion of interdisciplinary work.
McEnery, Brezina, Gablasova & Banerjee (2019) have stressed that, despite the
promise (of collaboration), working across disciplinary boundaries has long been
acknowledged to be difficult. This difficulty is expressed in the absence of corpora
in major publications on research methods in education (e.g. Cohen, Manion &
Morrison, 2018; Gray, 2018). The fact that corpus methods may be either ignored
or seen as out-of-bounds in education research reinforces the perceived distance
between those practising educational research in the inner circle and those con-
tributing from the outside.
The second challenge is more elusive and difficult to address. Corpus linguistics
methods are seen as essentially quantitative, at least in linguistics research and,
particularly, in research involving lexicography and the building of representative
corpora. The use of representative corpora in CL research has received much
scholarly attention and, arguably, most linguists make use of corpora as proxies
for usage based on the sophisticated design features of their corpus of choice.
However, CL-inspired research that exploits some of the CL research methods
with more modest corpora is also relevant across educational contexts, and par-
ticularly in second language education.
The use of corpus linguistics as a complementary research method to inform
the qualitative analysis of language has rarely been discussed in education where
its use in research-methods design has been limited. The perception that corpus
linguistics is a quantitative methodology may put off education researchers who
use an interpretive paradigm and who may feel that corpus methods, as post-
positivist practice, may be useless in their context.
In this book we have tried to provide an account that takes stock of the main-
stream quantitative tradition in corpus linguistics, while presenting opportun-
ities for collaboration and mixed-methods research to emerge. Using Egbert &
Baker’s (2020) classification of when corpus methods are used in triangulation,
our approach can be described either as sequential or cyclical. Sequential CL
methods were mainly used in this book when corpora were devised as the main
research data collection method. On the other hand, most of the discussions in
this book seem to favour the use of CL methods as cyclical. Our model for the
examination of concordance lines (Figure 7.5) is an example of this approach.
We need more reflection and conversations with educational researchers in order
to understand better how these two approaches can be used in their research
designs and how they can contribute to the use of mixed-methods or CL-only
approaches.
Apart from these challenges, we want to emphasise the many opportunities that
lie ahead for educational researchers who are interested in corpus linguistics. CL
methods can help them with the triangulation and validation of their research: tri-
angulation of methods and also of datasets, as well as validation of results and an
evaluation of researcher bias. We note that many of the data collection methods
used in education such as interviews or focus groups are likely to be explored
either automatically or semi-automatically by means of CL methods such as col-
location or keyword analysis. The use of policy documents and media texts is
absolutely central to existing CL work in other areas of research such as the social
sciences. Besides, this book has provided plenty of examples where CL methods
have already been used in education.
We believe that the best applications of CL methods to educational research are
yet to come. We hope that this book will contribute to extending the popularity of
corpus linguistics outside its area of specialisation.
Table 8.1 offers a summary of our crucial, final skill: remaining critical.
References
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor
& Francis.
Egbert, J., & Baker, P. (Eds.). (2020). Using corpus methods to triangulate linguistic analysis.
London: Routledge.
Gray, D.E. (2018). Doing research in the real world. 4th Edition. London: Sage Publications
Limited.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: method, theory and practice. Cambridge:
Cambridge University Press.
McEnery, T., Brezina, V., Gablasova, D. & Banerjee, J. (2019). Corpus linguistics, learner
corpora, and SLA: Employing technology to analyse language use. Annual Review of
Applied Linguistics, 39, 74–92.
Index
Note: Page numbers in italic denote figures and in bold denote tables.
annotation 90–91, 91, 92–113, 96, 98, British Academic Written English Corpus
101, 108, 109, 110, 114–115; see also (BAWE) 3–4, 5, 16–17
part-of-speech (POS) tagging British Journal of Educational Technology
AntConc software 23, 43, 43, 52–53, (BJET) 23
53, 54, 55, 58; annotation 95–96, 96; British Law Reports Corpus 124–125, 124
comparison 76, 77, 78, 79, 80; keyword British National Corpus (BNC) 10, 15,
analysis 120, 125–126; n-grams 146 16, 17, 51, 97, 119, 125, 125, 129–133,
Anthony, Laurence 43, 78 131–132, 139–140, 141, 141, 144
association measures 57 Brown Corpus 1
Aston, Guy 27–28 Burnard, Lou 96–97
Atkinson, J.M. 25 Bybee, J. 11, 12
Australia, early childhood education (ECE)
policy 65–69, 68, 71 Callies, M. 28
automated content analysis 23 Canada, financial literacy education policy
automated transcription 88, 92–93 69–71, 71, 71
average reduced frequency (ARF) 85 Cheung, L. 4, 5
Child Language Data Exchange System
Backbone Corpus of English as a Lingua (CHILDES) project 90–91, 99–100
Franca 150–151, 151, 154–163, 154, children’s literature, keyword analysis
156, 157, 159 142–146, 143
Bailey, J. 91–92 chi-square test 120, 125, 126
Baker, Paul 12–14, 117, 133, 134, 138, CLAWS tagset 75–76
140–141, 169 CLiC (Corpus Linguistics in Cheshire)
Barroso, J. 20 142–143, 143, 144–145
Bednarek, M. 3 Coates, J. K. 23–24
Berber-Sardinha, T 121–124 coding 90–91, 91, 92–113, 96, 98, 101,
Bergmann, J.R. 24–25 108, 109, 110, 114–115; in content
Bergström, G. 25 analysis 20–22
Biber, Douglas 4, 7, 16, 29, 29, Cohen, L. 8, 9, 9, 20, 21, 89–90,
30–32, 30, 73 152, 169
BNC see British National Corpus (BNC) colligational analysis 1 33, 134, 135–142,
Bond, M. 23 135, 136, 137, 140, 141, 141, 142
Brat software 93 collocation analysis 28, 56–61, 56, 57, 58,
Brenchley, M. 35 59, 60, 70–71, 72, 85, 139, 163, 164
Brezina, Vaclav 52, 55, 85, 117, 126, 134 comparison 73–83, 77, 79, 80, 81,
Brindle, A. 163–165, 164 82, 84, 85