
Corpus Linguistics for Education

Corpus Linguistics for Education provides a practical and comprehensive introduction
to the use of corpus research methods in the field of education. Taking a hands-on
approach to showcase the applications of corpora in the exploration of
educationally relevant topics, this book:

• covers 18 key skills including corpus building, the role of frequency, different
corpus methods, transcription and annotation;
• demonstrates the use of available corpora and desktop and online corpus
analysis tools to conduct original analyses;
• features case studies and step-​by-​step guides within each chapter;
• emphasises the use of interview data in research projects.

Corpus Linguistics for Education is an essential guide for students and researchers
studying or conducting their own corpus-​based research in education.

Pascual Pérez-Paredes is Professor of Linguistics and Applied Linguistics in
the English Department, Universidad de Murcia, Spain. At the time of writing
this book he was Lecturer of Research in Second Language Education (2015–​19)
and Overall Coordinator of the MEd Research Methods Strand (2016–​19) in the
Faculty of Education at the University of Cambridge, UK. His main research
interests are learner language variation, the use of corpora in language education
and corpus-​assisted discourse analysis. He has published research in journals such
as Computer Assisted Language Learning; Discourse & Society; English for Specific Purposes;
Journal of Pragmatics; Language Learning & Technology; System; ReCALL; and the
International Journal of Corpus Linguistics, and is an Assistant Editor of the CUP
journal ReCALL.

Routledge Corpus Linguistics Guides

Series consultant: Michael McCarthy


Michael McCarthy is Emeritus Professor of Applied Linguistics at the University of
Nottingham, UK, Adjunct Professor of Applied Linguistics at the University of Limerick,
Ireland, and Visiting Professor in Applied Linguistics at Newcastle University, UK. He is
co-​editor of the Routledge Handbook of Corpus Linguistics, editor of the Routledge Domains of
Discourse series and co-​editor of the Routledge Applied Corpus Linguistics series.

Series consultant: Anne O’Keeffe


Anne O’Keeffe is Senior Lecturer in Applied Linguistics and Director of the Inter-​
Varietal Applied Corpus Studies (IVACS) Research Centre at Mary Immaculate College,
University of Limerick, Ireland. She is co-​editor of the Routledge Handbook of Corpus Linguistics
and co-​editor of the Routledge Applied Corpus Linguistics series.

Series co-founder: Ronald Carter


Ronald Carter (1947–​2018) was Research Professor of Modern English Language in
the School of English at the University of Nottingham, UK. He was also the co-​editor
of the Routledge Applied Corpus Linguistics series, Routledge Introductions to Applied
Linguistics series and Routledge English Language Introductions series.

Routledge Corpus Linguistics Guides provide accessible and practical introductions to using
corpus linguistic methods in key sub-​fields within linguistics. Corpus linguistics is one of
the most dynamic and rapidly developing areas in the field of language studies, and use
of corpora is an important part of modern linguistic research. Books in this series provide
the ideal guide for students and researchers using corpus data for research and study in a
variety of subject areas.

Other titles in this series

Corpus Linguistics for World Englishes


Claudia Lange and Sven Leuckert

Corpus Linguistics for Education


Pascual Pérez-​Paredes

More information about this series can be found at www.routledge.com/​series/​RCLG


Corpus Linguistics
for Education

A Guide for Research

Pascual Pérez-Paredes


First published 2021
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2021 Pascual Pérez-​Paredes
The right of Pascual Pérez-​Paredes to be identified as author of this work has been
asserted by him in accordance with sections 77 and 78 of the Copyright, Designs
and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilized
in any form or by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying and recording, or in any information
storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks,
and are used only for identification and explanation without intent to infringe.
British Library Cataloguing-​in-​Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-​in-​Publication Data
Names: Pérez-Paredes, Pascual, author.
Title: Corpus linguistics for education : a guide for research / Pascual Pérez-Paredes.
Description: Abingdon, Oxon; New York, NY: Routledge, 2020. |
Series: Routledge corpus linguistics guides |
Includes bibliographical references and index.
Identifiers: LCCN 2020007487 (print) | LCCN 2020007488 (ebook) |
ISBN 9780367198435 (paperback) | ISBN 9780367198442 (hardcover) |
ISBN 9780429243615 (ebook)
Subjects: LCSH: Education–Research–Methodology. |
Applied linguistics–Research–Methodology. |
Corpora (Linguistics) | Linguistic analysis (Linguistics) |
Language and education. | Education–Terminology.
Classification: LCC LB1028 .P346 2020 (print) |
LCC LB1028 (ebook) | DDC 370.72–dc23
LC record available at https://lccn.loc.gov/2020007487
LC ebook record available at https://lccn.loc.gov/2020007488
ISBN: 978-0-367-19844-2 (hbk)
ISBN: 978-0-367-19843-5 (pbk)
ISBN: 978-0-429-24361-5 (ebk)
Typeset in Baskerville
by Newgen Publishing UK
For Nani, Alicia and Arturo
Contents

List of figures  x
List of tables  xii
Preface  xv
Acknowledgements  xviii

1 Introduction: corpus linguistics and education research  1


1.1 What is corpus linguistics?  1
1.2 The role of corpus linguistics research methods in education research  8
1.3 Understanding the role of frequency  11
1.3.1 Frequency in L1 learning and use  11
1.3.2 Frequency in public discourse  12
1.3.3 Frequency in texts or groups of texts  14
1.3.4 Why frequency matters  15
1.3.5 How to interpret frequency  16
References  18

2 Analysing text  20
2.1 Different approaches to text analysis  20
2.2 Text as register  27
2.2.1 Corpus linguistics and the analysis of register  27
References  32

3 Corpus linguistics approaches to understanding language use  35


3.1 Understanding and researching language use: discovering patterns  35
3.1.1 Corpus linguistics outside linguistics?  35
3.1.2 Case study 1. Examining interviews: qualitative versus
CL methods  38
3.1.3 Case study 2. Examining policies: combining content analysis
and corpus methods  39
3.1.4 Using an existing corpus  40

3.2 Reading concordance lines  42


3.2.1 How to read concordance lines  45
3.3 Handling frequencies  51
3.3.1 Corpus size and relative frequencies  51
3.4 Collocations  56
References  61

4 Researching education policies: using your own corpus  64


4.1 Basic corpus design features  64
4.1.1 Designing corpora  64
4.2 Comparison basics and significance testing  73
4.2.1 Comparison basics and part-of-speech (POS) tagging  74
4.3 Reviewing skills 1–​11  83
4.3.1 Chapter 1  84
4.3.2 Chapter 2  85
4.3.3 Chapter 3  86
4.3.4 Chapter 4  86
References  87

5 Interview data: transcription and annotation  88


5.1 Transcription: so much more than a monotonous task  88
5.2 Transcription basics  91
5.3 Adding structure and metadata to a corpus  96
5.3.1 Annotating a corpus using our own tags  104
5.3.2 Annotating a corpus using standard XML guidelines  108
References  116

6 Examining lexis: analysing peace treaties and children’s literature  117


6.1 Examining lexis  117
6.2 Researching the lexicon: keywords  117
6.2.1 Introducing keyword analysis  120
6.2.2 Keyword analysis: a step-​by-​step guide  121
6.3 Researching nouns and noun phrases  133
6.3.1 Exploring individual nouns  135
6.3.2 Exploring multiword units  139
6.4 Analysing children’s literature: the lexicon of fiction  142
References  147

7 Analysing talk: complex searches  150


7.1 Examining talk: a linguistic perspective  150
7.2 Complex searches  153
7.2.1 Living in a city  154
7.2.2 Understanding cultural differences  160
7.2.3 How is their family life impacted by work?  161

7.3 Putting it all together: reviewing skills 12–​17  166


7.3.1 Chapter 5  166
7.3.2 Chapter 6  166
7.3.3 Chapter 7  166
References  167

8 Conclusion  168
References  170

Index  171
Figures

1.1 Corpora as a research method  2


1.2 Using corpora to examine textual data from data elicitation
methods such as interviews or focus groups  6
1.3 Research in education  8
1.4 Using corpora as primary or secondary data  15
2.1 Adding metadata to codes in MAXQDA  22
2.2 Situational characteristics of registers and genres  30
3.1 Language as genre–​activity continuum  37
3.2 Concordance lines in AntConc  43
3.3 Concordance lines in Sketch Engine  44
3.4 Word list from The Times Online corpus as displayed in
Sketch Engine  46
3.5 Bullying in the COCA corpus  52
3.6 AntConc word list screenshot  53
3.7 AntConc tool preferences. Activating lemma forms  54
3.8 Lemma list in AntConc  54
3.9 AntConc collocate window  58
4.1 Breaking down a research question (Fenech & Wilkins, 2019)
into arguments  68
4.2 Pinto’s (2013) analysis of narratives  71
4.3 Word list options in Sketch Engine  77
4.4 AntConc tag settings  77
4.5 TagAnt interface  79
4.6 AntConc concordance option (tags visible)  79
4.7 AntConc concordance option (tags hidden)  80
5.1 Planning the transcription of your corpus  91
5.2 Searching for tagged annotations in AntConc  96
5.3 CQL search using tags  108
5.4 Adding some attributes and values to an interview  109
5.5 Adding person roles  110
6.1 Different reference corpora in keyword analysis  124
6.2 Keywords in the Chittagong Hill Tracts Peace Accord  128

6.3 Working with keywords: a suggested pathway  134


6.4 Sketch Engine Word Sketch basic interface  135
6.5 Sketch Engine Word Sketch advanced interface  136
6.6 Visualisation of Word Sketch for education in the peace
agreements corpus  137
6.7 Visualisation of the grammatical relations of educational system
in the peace agreements corpus  140
6.8 Visualisation of the grammatical relations of educational system
in the BNC  141
6.9 Multiword keywords in the CLiC corpus of 19th century
children’s literature  143
7.1 Interviews in the Backbone corpus of English as a Lingua Franca  151
7.2 Advanced query interface in Sketch Engine  154
7.3 Verb POS tags in the corpus analysed  157
7.4 Most frequent verbs used in the present tense (VVP)  159
7.5 A model for the examination of concordance lines  160
7.6 Collocational network of first 20 keywords  164
Tables

1.1 Connecting research questions, methods and data collection
and analysis  3
1.2 Total accountability in different corpus designs  5
1.3 Differences between positivism and phenomenology paradigms  9
1.4 Skill 1: why frequency matters  17
1.5 Skill 2: how to interpret frequency  18
2.1 Skill 3: understanding text types: register basics  29
2.2 Skill 4: understanding textual features and textual data  31
3.1 GIG corpus data  42
3.2 Skill 5: using an existing corpus  42
3.3 Essential terminology: lemma  44
3.4 Patterns emerging from the concordance lines of fact in
Orwell’s 1984  45
3.5 Essential terminology: types/tokens  46
3.6 Extended node contexts  49
3.7 Skill 6: reading concordance lines  50
3.8 Skill 7: handling frequencies  55
3.9 Essential terminology: collocate  56
3.10 Top 10 collocates of inclusion (MI score)  57
3.11 AntConc collocates of inclusion in The Guardian 2018 corpus  59
3.12 Skill 8: understanding collocations  60
4.1 Before designing your own corpus  67
4.2 Corpus building in Fenech & Wilkins (2019) and Pinto (2013)  71
4.3 Skill 9: basic corpus design features  72
4.4 Top 20 most frequent nouns in two HE policy documents  81
4.5 Concordance lines from UK ‘International Education
Strategy: global potential, global growth’  82
4.6 Relative frequencies of lemmas per 1,000  82
4.7 Skill 10: comparing two corpora  84
4.8 Skill 11: statistical tests  85
4.9 Reflecting on existing corpora  85
5.1 LINDSEI transcription guidelines  94

5.2 Skill 12: transcribing your data  97


5.3 Essential terminology: Text Encoding Initiative (TEI)  98
5.4 Tags in the TEI header  101
5.5 Annotation taxonomy  114
5.6 Skill 13: annotating and querying your data: using your
own annotation taxonomy  114
6.1 Peace agreements included in our corpus  122
6.2 Top 10 keywords using British Law Reports Corpus as
reference corpus  124
6.3 Top 10 keywords using the British National Corpus as
reference corpus  125
6.4 Essential terminology: keyness scores  126
6.5 Multiword keywords in the Chittagong Hill Tracts Peace Accord  130
6.6 Multiword keywords in the peace agreements corpus
(five treaties)  131
6.7 Education-​related multiword keywords in a corpus of peace
agreements  132
6.8 Skill 14: understanding keywords  134
6.9 Collocates of educational system acting as subject or object in
clauses in the BNC  141
6.10 Skill 15: researching nouns and noun phrases  142
6.11 Skill 16: looking at the lexicon of a corpus: n-grams  146
7.1 A selection of multiword keywords in the Backbone corpus
of English as a Lingua Franca  154
7.2 List of some POS tags used by Sketch Engine  155
7.3 [tag=“J.*”][lemma=“city”] search in the Backbone corpus
of English as a Lingua Franca  156
7.4 Essential terminology: KWIC (keyword-in-context)  158
7.5 Skill 17: complex searches  165
8.1 Skill 18: remaining critical  170
Preface

This book provides a practical introduction to using corpus linguistics methods
in education research. It offers a detailed guide on how corpora can be used to
explore sample key research questions, data types and a range of educationally
relevant topics. With this book, researchers in education will acquire basic skills
that will eventually allow them to become independent corpus users and apply
corpus linguistics methods in their research projects.
In preparing this book, we decided not to assume any previous knowledge
of corpus linguistics on the part of readers. The language used is direct and accessible
to everyone with some understanding of research methods in the social sciences
and education. We use a number of tables and figures in every chapter to facili-
tate the readers’ engagement with the methods discussed. In particular, we have
foregrounded the most relevant skills needed to use corpus research methods. In
every chapter the reader will find some of these skills. More often than not, we see
them as useful guidelines that corpus users may think about when considering the
use of corpus research methods in their own projects. The book also offers some
useful tables with essential terminology; we have tried to present these terms in
ways that are easy to understand without hiding the complexities involved in their
definition and uptake.
The book is divided into seven main chapters and a conclusion. In chapter 1, we
will explore how corpus linguistics can be used within the wide range of education
research. This introductory chapter provides a useful rationale to situate corpus
linguistics in the context of prevailing research practices in education. In this con-
text, we advance two different approaches to the application of corpus methods
in education research, exploring their impact on our research projects: in the first
approach, corpora are used as the main research instrument; in the second, the
researcher collects data from a wealth of different sources such as interviews or
focus groups and submits these texts to analysis with the aid of corpus methods.
This is a much needed approach that acknowledges the fact that corpus-​based
research methods can be used successfully for purposes other than the analysis of
language in linguistics-​related disciplines.
Chapter  2 discusses how texts can be analysed using different data analysis
methods and how they impact our findings. Far from presenting a detailed analysis
of different approaches to text analysis, this chapter seeks to develop an awareness
of the ways in which textual features and register specifics can be looked at by
different researchers using methods such as discourse analysis or content ana-
lysis. We present a discussion of how a register perspective can contribute to our
understanding of how language is used across a variety of communicative situ-
ations. Chapter 3 offers a practical discussion of the most widely used methods
in corpus linguistics. The readers will be introduced to the processes involved
in reading concordances, interpreting frequencies and collocational analysis for
research purposes. Likewise, they will be presented with the basic skills needed to
evaluate already existing corpora as well as to design their own corpus for research
purposes.
Chapter  4 examines different education policies through the comparison of
corpora. The analysis of educational policies and related documents is usually,
although not exclusively, approached in education research by means of grounded
theory. In this chapter, corpus methods will be examined as an alternative to
purely qualitative methods. Building on the skills developed so far, this chapter will
allow education researchers to gain the essential skills to explore the comparative
methodology in corpus linguistics. A subsection of the chapter will be devoted to
a review of the skills discussed so far, an important reflection and exercise for the
forthcoming chapters.
In chapter 5, readers will familiarise themselves with the use of corpus methods
in one of the most extensively used research instruments in education: interviews.
Transcription and annotation skills will be discussed in great detail, and the
readers will learn how to approach their own corpus projects by considering the
types of transcription and annotation that they want to implement. This chapter
will cater for the needs of those wishing to add no structure at all to their data, as
well as those that consider making use of XML (Extensible Markup Language) in
their projects.
Chapter  6 presents readers with the opportunity to learn how to examine a
corpus by looking at its lexicon. We will discuss how keywords can be used to know
more about the ‘aboutness’ of texts and corpora. We will discuss how to conduct
keyword analyses and multiword keyword analyses in great detail, and the impli-
cation of the results in the context of a corpus-​informed research methodology.
Special attention is paid to the analysis of nouns in the corpus. A final section of
the chapter will examine children’s literature and the ways in which we can tap
into the type of lexis used in texts for children.
In chapter 7, we will shift our focus to spoken language. Readers will examine
interview data and will integrate complex searches into their inquiry process. The
conclusion in chapter 8 will present an evaluation of the aims of the book as well
as our last skill: remaining critical.
Throughout the book, we encourage the use of desktop and online corpus
applications. In our analyses and discussions, we will use both AntConc and
Sketch Engine. The reader will note that we use different versions of AntConc
running on different systems (Mac and Windows). As for Sketch Engine, we use
the latest interface, which is expected to remain stable for quite some time.
Other excellent software packages are available to those wishing to familiarise
themselves with corpus work and, of course, readers will be able to use their
software of preference while following our discussion, or at least most of it. We
use the terms corpora and datasets interchangeably, as no difference is implied
between them in the context of this publication.
Our intention is to present a stimulating and informative discussion with plenty
of visuals and relevant information that can be accessed readily by the reader. We
have broken down a selection of corpus linguistics methods into 18 skills that can
serve as guidance to educational researchers wishing to explore the usefulness of
corpus linguistics in their projects. These skills have been distributed across all the
chapters in the book. This is the list of skills that we have put together:

Table number Skill


1.4 Skill 1: why frequency matters
1.5 Skill 2: how to interpret frequency
2.1 Skill 3: understanding text types: register basics
2.2 Skill 4: understanding textual features and textual data
3.2 Skill 5: using an existing corpus
3.7 Skill 6: reading concordance lines
3.8 Skill 7: handling frequencies
3.12 Skill 8: understanding collocations
4.3 Skill 9: basic corpus design features
4.7 Skill 10: comparing two corpora
4.8 Skill 11: statistical tests
5.2 Skill 12: transcribing your data
5.6 Skill 13: annotating and querying your data: using your own annotation
taxonomy
6.8 Skill 14: understanding keywords
6.10 Skill 15: researching nouns and noun phrases
6.11 Skill 16: looking at the lexicon of a corpus: n-​grams
7.5 Skill 17: complex searches
8.1 Skill 18: remaining critical

In chapters 4 and 7, the readers will find some activities that seek to encourage
self-​reflection and further thinking about how to implement corpus linguistics
research methods.
We sincerely hope that the book can stimulate the use of corpus linguistics in
education research. We are confident that future contributions from readers of
this book will enrich our understanding of how language and education can
collaborate to offer a profound understanding of the discourses, practices and
lives of increasingly complex societies.

Acknowledgements

I would like to thank Michael McCarthy and Anne O’Keeffe for their advice
and for encouraging me to write this book. Special thanks to Anne O’Keeffe and
Geraldine Mark: I learn so much from you whenever we discuss corpus analysis,
not to mention how much fun we have. Thanks to my colleagues at the Faculty of
Education at the University of Cambridge for inspiring me in so many ways, and
Adam Woods and Lizzie Cox for their patience and guidance. I would also like to
thank my PhD students at the University of Cambridge: you are such a fabulous
bunch. Also, thanks to David, Diane, Encarna and Fernando. You were always
there when I most needed you.
Thank you, too, to Laurence Anthony and Sketch Engine for their permission
to use screenshots from AntConc and Sketch Engine, respectively; to Mark Davies
for his permission to use screenshots from www.english-​corpora.org/​; and to Phil
Durrant for his permission to use the GIG corpus.
Chapter 1

Introduction
Corpus linguistics and education research

1.1  What is corpus linguistics?


If you expect to find a definition of corpus linguistics in this opening paragraph,
you will not be disappointed. Actually, you will find two. One is short; the second
is a bit longer. Corpus linguistics (CL) studies language usage empirically. That
was the first definition. It is inspired by McEnery and Wilson (1996: 1): ‘CL is the
study of language based on examples of real-​life language use’. And this is the
slightly longer definition: CL studies the usage of language by examining how rep-
resentative texts of a given genre reflect the discursive practices of actual language
users. Do not worry if the second definition is a bit difficult to process now, or if
you just do not seem to find how this may be relevant in education research. The
aim of this book is precisely to show you how you can use CL research methods
in your area of investigation. We will come back to these definitions throughout
this chapter.
Corpus linguists have always been concerned with actual usage. This interest has
run parallel in the past decades with an interest in describing language perform-
ance, that is, what people actually say or write. Linguist Geoffrey Leech noted that
it was Randolph Quirk’s Survey of English Usage in 1959 and Nelson Francis’s
collection of the Brown Corpus in 1962 that contributed to the development of
CL applications before the massive, widespread use of computers. Leech observed
(Viana, Zyngier & Barnbrook, 2011:  155) that ‘both [linguists] hit on the idea
of collecting a large body of texts (and transcriptions) wide-​ranging enough to
represent, to a reasonable extent, the contemporary English language’. One of the
early applications of CL was lexicography. Not so long ago, most dictionary entries
contained examples of use made up by lexicographers and a selection of entries
based on their expert insight. Before the use of CL methods, lexicographers had
consistently tried to portray the meanings of words in the most accurate way, but
only in more recent times have they begun to rely on descriptions of language use
based on attested uses contributed by a community of speakers and users of the
language.
In CL a large body of texts is known as a corpus, hence the name corpus linguistics.
A corpus is used to model usage and we can think of a corpus as a proxy for usage.

Figure 1.1 Corpora as a research method: research questions are addressed by
querying corpora

In this view, a corpus is an instrument, a method, that researchers use to answer
research questions. Linguists regularly use corpora (plural form of corpus) to inves-
tigate questions concerning the characterisation of usage (Figure 1.1).
CL research methods offer us a means to understand how language is used by
a group of individuals while engaged in communication. For example, you may
want to know how speakers of English in both hemispheres use language and you
need massive amounts of data that can illustrate your query. The iWeb corpus1
contains 14 billion words compiled from websites in English-​speaking countries. It
is an excellent resource to look at how language is currently used across national
varieties (US, UK, NZ, etc.) and different types of  text.
Let’s now turn our attention to three concrete research questions and how cor-
pora can be used to help researchers answer them:

Question A: What language is used in TV shows?


Question B:  What characterises Higher Education (HE) student writing in
the UK?
Question C:  What characterises dentists’ communication in professional
contexts?

Before we look at how these questions are examined by means of a corpus, can
you think of other research methods that can be used to answer these questions?
How will your data be collected? How will it be analysed?
Arguably, these questions can be answered by drawing on different research
methodologies and methods, as suggested in Table 1.1. However, corpus linguis-
tics will put more emphasis on the notion of usage and the need to use a repre-
sentative body of textual evidence. The three questions above can be answered
by putting together and querying different corpora that can be used as proxies of
the phenomena under investigation. In the case of the first research question (A),
we may want to use the English-​Corpora.org TV Corpus. This corpus contains
325 million words from 75,000 episodes of TV shows dating from the 1950s
to 2018 produced and recorded in, among other countries, the US, the UK,
Australia and New Zealand. Depending on your area of interest, you may want
to narrow down your focus and examine only talk shows or, for example, all
soaps or just one specific genre (i.e. comedy). Bednarek (2018) used the Sydney
Corpus of Television Dialogue (SydTV)2 to investigate dialogues in American
TV series and developed a categorisation of their functions, namely, narrative-
related functions (progressing the plot or filling out character) and medium-
related functions (endorsing products or engaging audience emotions).

Table 1.1 Connecting research questions, methods and data collection and analysis

Discussion: research methods that can be used to answer these questions

A. What language is used in TV shows?
   • Data collection and analysis: ……………………………….
   • Data collection and analysis: ……………………………….

B. What does Higher Education (HE) student writing look like in the UK?
   • Data collection and analysis: ……………………………….
   • Data collection and analysis: ……………………………….

C. What characterises dentists' communication in professional contexts?
   • Data collection and analysis: ……………………………….
   • Data collection and analysis: ……………………………….
Question B calls for the compilation of a specific collection of texts represen-
tative of student writing in HE. The British Academic Written English Corpus
(BAWE)3 seems like a good fit when approaching this research question. The
BAWE contains university-​level student writing: around 3,000 good-​standard stu-
dent assignments totalling 6,506,995 words. The corpus showcases different types
of text (essays, critiques, explanation, literature reviews, etc.) across different dis-
ciplines (agriculture, economics, biological sciences, business, classics, engineering,
etc.). Nesi & Gardner (2018) have discussed how the genres in this corpus can be
linked to different social purposes:

• demonstrating knowledge and understanding


• developing powers of independent reasoning
• building research skills
• preparing for professional practice
• writing for oneself and others.

For each of these purposes, Nesi & Gardner have analysed an inventory of
subgenres and have developed specific materials that can be used to teach HE
writing across different levels of expertise and disciplines. These findings are
solely based on the evidence provided by the texts included in the British Academic
Written English Corpus.
Question C represents one of the areas where corpus linguistics has been
most productive in the last decades: the analysis of specialised languages (Bhatia,
Sánchez Hernández & Pérez-​Paredes, 2011). The use of professional registers has
attracted the attention of applied linguists who have found in corpora an oppor-
tunity to examine evidence of how specialised discourse is used and applications
of evidence-​based knowledge in education. Thus, Biber & Conrad (2009) have
stressed the educational potential of the analysis of corpora:

Text varieties and the differences among them constantly affect people’s daily
lives. Proficiency with these varieties affects not only success as a student, but
also as a practitioner of any profession, from engineering to creative writing
to teaching. Receptive mastery of different text varieties increases access to
information, while productive mastery increases the ability to participate in
varying communities. And if you cannot analyze a variety that is new to you,
you cannot help yourself or others learn to master it.
(Biber & Conrad, 2009: 4)

The use of language is, therefore, constrained by the communities of users
where those uses are meaningful, either because they see them as part of a dis-
cursive practice or a non-​discursive one. In the case of the language used by den-
tistry professionals, Crosthwaite & Cheung (2019) have identified register features
from three subgenres:  published experimental research articles, case reports, and
novice and professional research reports within the Dental Public Health domain.
Crosthwaite & Cheung (2019) decided to use subgenres that undergraduates will
encounter shortly after graduation. Awareness of the differences among these
subgenres at all levels (lexical, syntactic, phraseological, etc.) is key to understanding
communication within this area of practice. Biber & Conrad (2009:  3), among
others, have stressed the educational needs of students ‘to control and interpret the
language of different varieties’ as a vital factor to succeed at school and in their
careers.
We have just seen how researchers use corpora as a method to answer questions
such as A, B and C above. Corpora, therefore, become the central research
instrument for corpus linguists when trying to answer a wide range of different
questions. McEnery & Hardie (2012: 15) have stressed the idea that most corpus
designs follow the principle of total accountability in that the researcher tries to
avoid ‘conscious selection of data’. This principle applies to corpus designs that
represent language usage with no hypothesis in mind, allowing multiple queries
in the corpus and, potentially, the testing of an infinite number of hypotheses
about usage. The principle of total accountability guarantees that the researcher
avoids confirmation bias and, in simple terms, text cherry-picking. Let us revisit
our three questions and exemplify how the principle of total accountability works
(Table 1.2).

Table 1.2 Total accountability in different corpus designs

Research question A
   Corpus: Mark Davies' English-Corpora.org TV Corpus
   How is total accountability approached? The corpus includes 75,000 TV episodes
   spanning over 6 decades of television across different English-speaking countries.

Research question B
   Corpus: The British Academic Written English Corpus (BAWE)
   How is total accountability approached? The corpus includes 2,858 texts written
   by 1,039 different students at 4 different university levels across 4 major areas
   of knowledge (arts & humanities, life sciences, physical sciences and social
   sciences).

Research question C
   Corpus: Dentistry corpus (Crosthwaite & Cheung, 2019)
   How is total accountability approached? The corpus includes a variety of
   subgenres of relevance to dentists.
Corpora are finite and, inescapably, represent a selection of data. McEnery &
Hardie (2012: 15) have noted that researchers ‘can only seek total accountability
relative to the dataset that [they] are using, not to the entirety of language itself ’.
This is a key point that we need to bear in mind as researchers. When knowledge
claims are made in terms of usage, we need to be aware that our claims will neces-
sarily be constrained by the instrument that was designed and compiled to extract
our results and findings (Figure 1.1).
McEnery & Hardie (2012) have suggested that it is essential that results are
replicated, and their consistency tested across different methods, that is, across
different corpora. While the use of Mark Davies’ TV corpus for question A above
appears quite robust in terms of replicability (other corpora will necessarily include
most of the TV shows already in this corpus), questions B and C will benefit
from further scrutiny with other datasets, that is, with other students writing other
texts, in the case of question B, and with other texts and subgenres in the case of
question C. McEnery & Hardie have noted that corpus designers and researchers
need to engage with the notion of validity:

Total accountability to the data at hand ensures that our claims meet the
standard of falsifiability; total accountability to other data in the process of
checking and rechecking ensures that they meet the standard of replicability;
and the combination of falsifiability and replication can make us increas-
ingly confident in the validity of corpus linguistics as an empirical, scientific
enterprise.
(2012: 16)

So far, corpus linguistics has been successfully used in different areas of linguistics
and applied linguistics such as contrastive linguistics, discourse analysis, language
learning and teaching, lexicography, pragmatics, semantics and sociolinguistics.
However, we must never forget that corpora are proxies. As McEnery & Hardie
(2012: 26) have observed, ‘corpora allow us to observe language, but they are not
language itself ’. With this caveat in mind, and with a good grasp of the prin-
ciple of accountability, we are better equipped as researchers to make the most of
corpus linguistics methods even in areas outside linguistics.
You may think that you are not interested in describing language use exclusively
from a linguistic perspective. However, CL is not just used by linguists researching
linguistics. Corpus linguistics has been used in anthropology, economics, history,
law and, among other areas, sociology. If, as a researcher, you are interested in
the implications of language usage within a community of users, then there may
be something in CL for you. In education research, you may perhaps use data
collection methods such as interviews or focus groups – in other words, you will
be looking at textual data and, possibly, discourse or discourses. These texts can
also be explored by means of the corpus research methods that we will discuss
throughout this book. Figure 1.2 offers a visual rendering of the role of bigger,
representative corpora in our use of corpus research methods to process and
analyse data from interviews and other qualitative instruments.

Figure 1.2 Using corpora to examine textual data from data elicitation methods
such as interviews or focus groups

For some research questions all we need is a representative corpus of the
domains, registers or language users that are the focus of our research. As we have
seen, questions A, B and C can be explored using such instruments. If, for example,
you are analysing language policy, CL research methods may be of similar
interest. Let us take the UK government's Education Act 2011.4 According to its
introduction, this Act makes provision about education, childcare, apprenticeships,
training, schools and the school workforce, institutions within the further education
sector and academies, the Office of Qualifications and Examinations Regulation
(Ofqual) and about student loans and fees. After some text clean-​up (which we will
explain in detail in the pages that follow), we can automatically obtain the following
frequency-​related information:

1 A word list that gives us the raw frequency of all the words in the text. This
is usually a great starting point to get the gist of the lexical items used in a
text and therefore its overall content (a short code sketch of this procedure
follows this list). We can always read such a list either in
descending order of frequency or increasing order of frequency. In the latter,
we start with the words that are only used once in the text. In the former, quite
unsurprisingly, the majority of the top ten most frequent words are function
words such as the, of, in, a, to, and, for, by and or. The tenth most frequent word
is section. If we zoom out, we will start to notice that the more frequent items
among, say, the top 30 most frequent words, are those related to legal jargon
such as annotation, force, commencement, paragraph, person, substitute or provision. It is
in the top 100 most frequent ranking where we will start to find lexical items
that we may want to explore further such as staff, corporation or teacher. For each
of these words, we can then explore the concordance lines and contexts where
they occur. We will learn how to do this in the forthcoming chapter.
2 A list can include specific items only. For example, we can generate a list of the
top verbs in the Education Act 2011 and acquire a sense of those actions that
have been specified by the legislator. A list of adjectives will give us insight into
what is considered as serious or interim, for example, within the Education Act.
3 A more sophisticated keyword analysis list using inferential statistics will reveal
that the multiword terms that characterise the Education Act 2011 are transfer
scheme, alternative provision, education corporation and service provider (all in the top 15).
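
To make point 1 above concrete, here is a minimal sketch, in Python, of how
such a raw-frequency word list can be produced. It is illustrative only: the
filename education_act_2011.txt is hypothetical, and the tokenisation is
deliberately crude, whereas tools such as AntConc or Sketch Engine handle
tokenisation and clean-up far more carefully.

    # A minimal word-list sketch (illustrative only). It assumes the cleaned
    # text of the Act is saved as 'education_act_2011.txt', a hypothetical file.
    import re
    from collections import Counter

    with open('education_act_2011.txt', encoding='utf-8') as f:
        text = f.read().lower()

    # A deliberately simple tokeniser: runs of alphabetic characters only.
    tokens = re.findall(r'[a-z]+', text)
    freq = Counter(tokens)

    # Print the word list in descending order of raw frequency.
    for word, count in freq.most_common(30):
        print(count, word)

Reading the output from the bottom up gives the increasing order of frequency
described above; widening the argument of most_common() extends the ranking
from the top 30 to the top 100 and beyond.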

A keyword analysis (discussed in more detail in chapter 6) is one of the methods


that benefits from the approach in Figure  1.2. Although our main focus is the
analysis of the Education Act 2011, we are using other, larger, representative cor-
pora to establish comparisons between language usage in the Act and the usage
in, for example, general communication (all-​purpose communication such as the
one represented in the British National Corpus) or legal language across a wide
range of Acts (a representative corpus of legal language). When we do this, we
are querying the larger corpus looking for an understanding of how the usage of
these same lexical items differs in a broader, usually representative, dataset. Scott
& Tribble (2006) and Biber & Conrad (2009) have emphasised the nature of the
comparative textual data, specifically the choice and type of the reference, larger
corpus. We will come back to this method in chapter 6.
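
As a brief preview of that discussion, the comparison that drives keyword
analysis can be sketched in code. One widely used keyness measure is the
log-likelihood statistic, which asks whether a word occurs more often in the
study corpus than its frequency in the reference corpus would lead us to
expect. The sketch below is a minimal illustration, and every frequency in the
example is invented.

    # A sketch of log-likelihood keyness, one common keyness measure.
    # All frequencies below are invented for illustration.
    import math

    def log_likelihood(freq_study, size_study, freq_ref, size_ref):
        # Expected frequencies if the word were equally likely in both corpora.
        total = size_study + size_ref
        expected_study = size_study * (freq_study + freq_ref) / total
        expected_ref = size_ref * (freq_study + freq_ref) / total
        ll = 0.0
        if freq_study > 0:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref > 0:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # e.g. a word with 120 hits in a 50,000-word policy text versus 2,400 hits
    # in a 100-million-word reference corpus (made-up figures).
    print(round(log_likelihood(120, 50000, 2400, 100000000), 1))

The higher the score, the stronger the evidence that the word is key to the
study corpus rather than an accident of sampling; corpus tools such as AntConc
and Sketch Engine compute scores of this kind for every word automatically.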
Before looking more closely at why frequency is relevant in CL, let us consider
in the following section how education research can be characterised and how we
can understand the potential of CL in this area.

1.2  The role of corpus linguistics research methods in education research
Admittedly, defining educational research is daunting. Cohen, Manion &
Morrison (2018: 3) have said that it is a ‘deliberative, complex, subtle, challenging,
thoughtful activity’. And it is one where more than one dominant assumption
about the nature of reality is in place. Pring (2004) has suggested that a feature
of educational research is its variety and the abundance of different approaches
and paradigms. Gray (2004) has noted that some of the terminology applied to
these assumptions about reality and their epistemologies is often inconsistent, so
it is crucial to understand how research methodology is affected by philosophical
assumptions. Let’s examine Figure 1.3, which is based on Pring (2004) and Gray’s
(2004) adaptation of the work of Crotty (1998).

Figure 1.3 Research in education: epistemology (paradigms), theoretical
perspectives, research methodology and research methods

You can see that, on the one hand, epistemology and theoretical perspectives
remain separated. On the other hand, we find methodology as well as methods.
Methods are usually seen in education research as instruments for data collection
that are the result of ‘deliberative process in which the key is the application of
the notion of fitness for purpose’ (Cohen, Manion & Morrison, 2018: 469). Pring
(2004: 56) notes that different research methods will provide different explanations
of reality as they will be dependent on different methodologies, theoretical
perspectives and research paradigms:  ‘Understanding human beings, and thus
researching into what they do and how they behave, calls upon many different
methods, each making complex assumptions about human beings’.
The two dominant paradigms in educational research have been the scientific
and the constructivist paradigms.5 The scientific paradigm assumes that there is
a sort of objective reality out there that is independent of the researchers and
which is made up of objects that interact with each other. Different researchers
can therefore observe the same objects and their interactions and contribute to a
body of knowledge that explains reality. In contrast, the constructivist paradigm
(also known as the phenomenological paradigm) holds that all of us live in a world
of ideas (Pring, 2004) that ultimately construct our own reality. Therefore, there is
not one true reality out there waiting to be discovered by researchers: there are 'as
many realities as there are conceptions of it' (Pring, 2004: 50). Choosing a
methodology and a method without a detailed understanding of the epistemology that
supports it will create unnecessary tensions between how we analyse and interpret
our findings. For example, if we adopt positivism as our theoretical stance and
the scientific paradigm as our epistemology, we will most likely choose a research
methodology that uses scientific observation and controlled empirical inquiry. If
we adopt the constructivist paradigm and interpretivism as our epistemology, we
will most likely use a research methodology that looks at the lived realities of those
people being researched and methods that examine how meanings emerge out of
social interaction, and may want to use, among other options, phenomenological
or ethnographic methodologies. Table 1.3 explores the differences between
positivism and phenomenological education research.

Table 1.3 Differences between positivism and phenomenology paradigms

Epistemology
   Positivism: reality is external and objective
   Phenomenology: reality is socially constructed

Researchers
   Positivism: explore causality between variables
   Phenomenology: construct theories and models from the data

Methods
   Positivism: mainly quantitative methods; concepts are operationalised and
   measured; use of large samples and generalisation
   Phenomenology: mainly qualitative methods; phenomena are complex and
   multiple methods are required; use of small samples researched in depth

Based on Cohen, Manion & Morrison, 2018
Research methodologies are sensitive to our paradigm inclination as
researchers. If we situate our research within the positivist paradigm, an experi-
mental or survey methodology will seem fit for purpose. If, on the contrary, we
situate ourselves as researchers within the constructivist paradigm, case studies,
ethnography or grounded theory are good candidates as methods for our research
project. Methodologically speaking, a researcher that perceives reality as objective
will, very likely, use mathematical models and quantitative analysis to operation-
alise an abstraction of reality. Those educational researchers that perceive reality
as subjective will rely on methods that look at the ‘representation of reality for
purposes of comparison’ (Cohen, Manion & Morrison, 2018: 7), very frequently
by analysing language and meaning. Despite this interest in analysing language,
CL methodology and methods do not feature in major accounts of educational
research such as Gray (2004) or Cohen, Manion & Morrison (2018).
There are many reasons why CL methods are not widely used in education
research. One reason is the fact that CL is a relatively recent discipline that, even
in linguistics, has met lots of criticism. Timmis (2015) has pointed out that corpora
can only reflect usage, not other areas of language-related phenomena that are
private or which the users are not willing to share. While this is a fair criticism, this
concern applies also to other research instruments such as interviews or diaries.
Eventually, these instruments can only record language that is voluntarily shared
by speakers. Large, representative corpora are, admittedly, different. Most of the
language in the 100 million-​word British National Corpus (BNC) was not elicited
to be part of the corpus, that is, the texts in the BNC were used as a result of a
design process that deemed particular types of texts appropriate. According to
the BNC reference guide,6 published written texts were selected partly at random
from Whitaker's Books in Print 1992 and partly according to selection features such
as domain (subject field), time (within certain dates) and medium (book, periodical,
etc.). Another
criticism is that, even with robust, big corpora, the language in each corpus will
always be a partial representation of usage. As we have seen, corpus triangulation
can provide researchers with optimised results. We will see in the forthcoming
pages how this can be achieved. A  third area of criticism revealed by Timmis
(2015: 184) is that most big corpora only give us basic textual information. In the
case of spoken language this may be an important limitation as we are not, as
yet, given access to recorded files and annotated prosodic and paralinguistic oral
features. This may change in the forthcoming years but there is no doubt that the
compilation of written data is favoured as it presents the researchers with fewer
challenges in terms of the ethical and logistical issues involved.
We propose in this book that corpus linguistics be seen and conceptualised both
as a research methodology and as a set of research methods. For example, the use of
keyword analysis can be massively useful to analyse the content of texts. Culpeper
& Demmen (2015:105) have noted that ‘the full potential of [keyword] analyses
across the humanities and social sciences has yet to be realised’. We believe that
this full potential can only be achieved if we further clarify the different roles of
CL when seen either as a methodology or as a set of methods. Mautner (2019: 9)
believes researchers should be cautious when it comes to interpreting results when
using corpus methods to examine either individual texts or corpora:

Whether you use a [corpus-assisted discourse studies] CADS approach or
traditional, qualitative methods without computer support, your claims about
the data must be commensurate with how representative – in other words,
how typical – they are of the wider universe of texts 'out there'. An analysis
of a single text – such as a politician's speech about eradicating poverty, or a
newspaper editorial – is just that: an analysis of one text. Obviously, the more
representative your corpus is, the more legitimate it is to generalise, and the
bolder your claims can become. Although most researchers are in principle
aware of these limitations, it is not uncommon for written-​up research to con-
tain tell-​tale slippage into unwarranted and over-​confident generalisations.
(Mautner, 2019: 9)

Let us turn our attention now to the role of frequency in language and in corpus
analysis.

1.3  Understanding the role of frequency


CL methods give us information about the frequency of different items in a given
corpus or in a text or a collection of texts. The question is, why should we bother
about frequency? Why do corpus linguists pay so much attention to the frequency
of language items in a corpus? The question begs some attention and the answer
is far from simple.
In the following paragraphs, we will look at three areas where the frequency of
linguistic units plays a significant role: in L1 language learning and use; in (public)
discourse; and in individual texts or groups of texts.

1.3.1  Frequency in L1 learning and use


Frequency matters because frequency is the organising principle of language use.
At least, it is one of them. Bybee (2007) has noted that humans are sensitive to
both prototypical linguistic structures as well as to individual patterns. This com-
bination of foci explains how humans are able to both generalise and come up
with ‘original’ language, and remember and reuse prefabricated, off-​the-​shelf
language chunks. Similarity and frequency in experience determine how we, as
humans, form and develop language-​related categories out of experience. Bybee
(2007) has noted the following:

[…] repetition of actions brings about the formation of structures; thus in lan-
guage, too, we see that repetition is a necessary component of grammar forma-
tion […] The reason frequency or repetition plays a role in grammar formation
is that the mind is sensitive to repetition. This is a domain general principle;
that is, it does not apply just to language but to other cognitive domains as well.
(Bybee, 2007: 8)

Bybee (2007) stresses the fact that, according to psycholinguistic and cognitive accounts
of language learning and cognition, repetition strengthens memory representations
for linguistic forms. This fact makes highly frequent forms more accessible from a
cognitive perspective and more likely to be used by more and more members of a
community of speakers. Put simply, those speakers that use language while enacting
certain types of register (conversation, academic language, etc.) will tend to notice
and use (and reuse) highly frequent items (or constructions as noted as such in the
specialised literature). Not only do language constructions become entrenched in our
minds as we learn our first language(s), they also become entrenched for a commu-
nity of users of that language. Interestingly, frequency plays a major role in how we
learn language. Ellis (2002) has summarised some of these effects:

[…] the recognition and production of words is a function of their frequency
of occurrence in the language. For written language, high-frequency words
are named more rapidly than low-​frequency ones […] they are more rapidly
judged to be words in lexical decision tasks […], and they are spelled more
accurately […]. Auditory word recognition is better for high-​frequency than
low-​frequency words […]
(Ellis, 2002)

Bybee (2010) has proposed that grammar is the cognitive organisation of one’s
experience with language and Ellis (2019) has recently defined language as the quint-
essence of distributed cognition. Frequency, according to usage-​based perspectives,
functions both as the main factor impacting the cognitive representation of lan-
guage and also usage: ‘each instance of language use impacts representation’ (Bybee,
2010: 9). Corpora offer evidence of such representations as used by speakers when
engaging in communication. As such, corpora offer researchers a fertile ground to
test how repetition of items triggers chunking, that is, linguistic units of meaning of
varied size that are easily stored and retrieved from our memory.
So far, we have seen that the frequency of occurrence of different language
units (from morphemes to complex constructions) has an impact on our learning
and use of those very same units and beyond. We note that frequency affects how
we acquire language and how we use language, which in turn affects how others
learn and use language.

1.3.2  Frequency in public discourse


Bybee (2007, 2010) and Ellis (2019) have shown evidence that frequency, even a
small number of repetitions in our experience, has a cognitive effect. But there
are other effects of repetition that have been studied by researchers interested in
discourse and its use in society. Baker (2006) has noted that discourse has an incre-
mental effect and that the analysis of language use presents opportunities to uncover
the typicality of hegemonic discourse. He has put it this way:

An association between two words, occurring repetitively in naturally
occurring language, is much better evidence for an underlying hegemonic
discourse […] For example, consider the sentence taken from the British
magazine Outdoor Action: 'Diana, herself a keen sailor despite being confined to a
wheelchair for the last 45 years, hopes the boat will encourage more disabled
people onto the water’ […] there are a couple of aspects of language use
here which raise questions –​the use of the phrase confined to a wheelchair, and
the way that the coordinator despite prompts the reader to infer that disabled
people are not normally expected to be keen sailors. There are certainly
traces of different types of discourses within this sentence, but are they typical
or unusual? Which discourse, if any, represents the more hegemonic variety?
(Baker, 2006: 13)

Baker goes on to suggest that the use of large representative corpora can give us
access to what is considered to be normal or usual in a given community of users.

After consulting the frequency of use and the collocational behaviour of confine
and wheelchair in the BNC, Baker concluded the following:

[…] one discourse of wheelchair users constructs them as being deficient in
a range of ways, and it is therefore of note when they manage to be cheerful,
prosperous or active in church life! […] every time we read or hear a phrase
like wheelchair bound or despite being in a wheelchair, our perceptions of wheelchair
users are influenced in a certain way. At some stage, we may even repro-
duce such discourses ourselves, thereby contributing to the incremental effect
without realizing it.
(Baker, 2006: 14)
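
The collocational behaviour that Baker consulted can itself be quantified.
One common measure, the Mutual Information (MI) score (which reappears in
Table 3.10 later in this book), compares how often two words actually co-occur
with how often they would co-occur by chance. The sketch below is a minimal
illustration, and the BNC-style frequencies in the example are invented.

    # A sketch of the Mutual Information (MI) collocation score:
    # MI = log2(observed co-occurrence / expected co-occurrence).
    # All frequencies below are invented for illustration.
    import math

    def mi_score(freq_node, freq_collocate, freq_pair, corpus_size):
        # Co-occurrences expected by chance if the words were independent.
        expected = freq_node * freq_collocate / corpus_size
        return math.log2(freq_pair / expected)

    # e.g. confined (2,000 hits) and wheelchair (1,500 hits) co-occurring
    # 150 times in a 100-million-word corpus (made-up figures).
    print(round(mi_score(2000, 1500, 150, 100000000), 2))

An MI score of 3 or above is conventionally taken as evidence of a meaningful
collocation, which is why repeated pairings such as confined to a wheelchair
stand out so clearly against the background of a large corpus.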

Tognini-​Bonelli (2010: 19) has noted that, in corpora, the significant elements are
‘the patterns of repetition and patterns of co-​selection’. She holds that it is the fre-
quency of occurrence that ‘takes pride of place’. By using corpora, researchers can
study the repeated patterns of usage and establish whether these patterns show
evidence of hegemonic discourses, majority common-​sense ways of viewing the
world or even resistant discourses (Baker, 2006). These analyses start with observing
the frequencies of occurrence of linguistic units. Then, the researchers can move
on to qualitative examinations of the contexts and situations of use. Let’s look at
some examples. Corpus-​assisted discourse analysis (CADA) uses corpora to ana-
lyse, among other areas, how minorities are represented. An important study in
this area is Baker, Gabrielatos & McEnery (2013), who looked at the representa-
tion of Muslims and Islam in the British press between 1998 and 2009. In their
research, these authors used a corpus of 143 million words and included everything
published in papers as diverse as The Guardian and The Sun that dealt with Muslims
or Islam during that time. In total, the corpus includes over 200,000 articles. The
authors concluded that while overt negative representation of Muslims was carefully
avoided, a number of strategies were identified that favoured a distorted represen-
tation of Muslims, in particular in right-​wing tabloids. Based on the evidence
provided, left-​leaning broadsheets were found to present a more balanced reporting
of Muslims and were ‘more likely to give voices to Muslims and reflect on issues to
do with terminology or representation’ (Baker, Gabrielatos & McEnery, 2013: 259).
It is impossible to present a fair summary of this study in just a few lines. For the time being, let us simply note that the words Islam, Islamic, Islamism, Islamist and Islamists all
collocated with the words terror and terrorism. In simple terms this means that, in the
context of written media in the UK, it is very likely that these words occur together.
The authors wonder whether this fact contributes to the emergence and perpetu-
ation of a public discourse that tends to pigeonhole Muslims as a violent community
against other British citizens. This is evidenced by the fact that almost 50% of the
topic indicators in the corpus are concerned with the idea of conflict:

The ‘incremental effect’ (Baker, 2006: 13–14) of the lexical co-occurrence of Islam and terror reflects a particular discourse – in the sense of ‘a set of meanings, metaphors, representations, images, stories, statements and so on that in some way together produce a particular version of events’ (Burr 1995: 48).
(Baker, Gabrielatos & McEnery, 2013)

This type of study has been used to understand the representation of other minor-
ities such as migrants and asylum seekers, gay men and members of the LGBT
community as well as people with diseases such as cancer or mental health issues.
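
To make the notion of collocation used in these studies more concrete, here is a minimal sketch of window-based co-occurrence counting, the raw observation on which collocation statistics are built. The sentence and the node/collocate pair are invented for illustration; studies such as Baker, Gabrielatos & McEnery (2013) work with millions of words and apply association measures rather than raw counts.

```python
# Minimal sketch: how often does `collocate` occur near `node`?
def cooccurrences(tokens, node, collocate, window=4):
    count = 0
    for i, token in enumerate(tokens):
        if token == node:
            span = tokens[max(0, i - window): i + window + 1]
            count += span.count(collocate)
    return count

# Invented example sentence (lower-cased, whitespace-tokenised)
tokens = "the report linked islam to terror and terror again dominated the headlines".split()
print(cooccurrences(tokens, "islam", "terror"))  # 2 hits within 4 tokens of the node
```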

1.3.3  Frequency in texts or groups of texts


As we have just seen, Baker, Gabrielatos & McEnery (2013) compiled a massive
corpus of articles published in the UK over a decade. The main advantage of this
approach is that they enacted the principle of total accountability by including
in their corpus every single article published where the word Muslim or Islam
appeared. Other studies, however, use corpora in different ways. Instead of using
a corpus as the main, primary source of data, we can turn our attention to one
particular text, or groups of texts, and use corpora as a secondary source of data
mainly for comparative purposes.
Pérez-​Paredes (2017) looked at the UK Higher Education (HE) Green Paper
in order to reveal the discourse of power underlying this text. The author used
part-of-speech (POS) keyword analysis (discussed in more detail in chapter 6) to
uncover the linguistic mechanisms that were in place to manipulate the readers’
perceptions of this green paper. The author found that three groups of POS tags
were used statistically more frequently in the green paper than in a larger con-
trol group:  tags including common nouns, tags concerned with the expression
of purpose (infinitives, e.g. to do, to ensure, etc.) and modality (different modal
verbs, e.g. should, would). These groups of POS tags were interpreted by means
of a qualitative analysis and the author concluded that the prevalence of these
linguistic items in the HE green paper contributes to the endorsement of a con-
servative, neoliberal policy that presents a view of HE that puts economics before
other concerns. It is interesting to note that in this type of research, a bigger
corpus is used to compare systematically how some features characterise a text
(Rayson, 2008).
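
As a rough illustration of the logic behind this kind of comparison (and not a reconstruction of Pérez-Paredes' actual procedure), the sketch below POS-tags a short invented sentence with NLTK and normalises the tag counts so that they could be set against the rates of the same tags in a reference corpus. A full keyword analysis would also apply a statistical test to the differences (Rayson, 2008).

```python
# Sketch only: per-million rates of POS tags in a target text.
# Requires: pip install nltk, plus NLTK's tokeniser and POS tagger resources.
from collections import Counter
import nltk

text = "The government should act to ensure that universities deliver value."
tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]

total = len(tags)
for tag, n in Counter(tags).most_common():
    print(f"{tag}: {n} tokens, {n / total * 1_000_000:.0f} per million words")
# Each rate could then be compared with the same tag's rate in a larger
# control corpus to see which word classes are over-represented here.
```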
Figure 1.4 represents the two approaches discussed so far. When we use a rep-
resentative corpus as primary data (see also Figure  1.1), the corpus is actually
the main research method and corpus linguistics is the research methodology
advocated. In other words, we are querying a corpus that we, as researchers, con-
sider fit for purpose in terms of answering our research questions. However, it
is the second approach, depicted on the right of Figure  1.4, that will be most
useful for educational researchers (see also Figure 1.2). In this second approach,
the researchers have collected textual data through a variety of methods (e.g.
interviews, texts, open-​ended questionnaires, etc.) and have a text or a group of
texts (T) that can be explored using CL data analysis methods. In the context of
this approach, a larger, representative corpus will be queried as a source of usage
information that will inform the researchers’ analysis and interpretation of their
main data source (T). Coming back to our long definition of CL provided in the
first paragraph of this chapter (CL studies the usage of language by examining how repre-
sentative texts of a given genre reflect the discursive practices of actual language users), we are
now better equipped to understand that while our interest is the ‘T’ in Figure 1.4,
it is the ‘C’ that will be used as a baseline to judge how our data differs from what
can be understood as ‘expected’, ‘normal’ or ‘usual’ in usage.

Figure 1.4 Using corpora as primary or secondary data (left panel: ‘Corpus as primary data’, a corpus C; right panel: ‘Corpus as secondary data’, a corpus C alongside a text or group of texts T)

Let us turn our attention now to the first two skills that need to be mastered in
the context of this book.

1.3.4  Why frequency matters


Frequency of occurrence provides insight into the status of linguistic units within
a corpus. This way, we can understand better how language is used in the context
of actual usage contributed by a community of speakers. Some of the best-​known
corpora were designed precisely to provide such insight. The BNC, for example,
was designed to be representative of the language used in the UK in the late 1980s
and early 1990s. It shows usage in the written as well as in the spoken medium,
although the representation has an imbalance in that 90% of the language in the
BNC comes from the written medium. According to the information provided
by the Oxford Text Archive,7 the written language part of the corpus includes
extracts from both regional and national newspapers, periodicals and journals
for all ages and interests, academic books, popular fiction, published and unpub-
lished letters as well as school and university essays. The original 1990s spoken
part includes unscripted informal conversations of people of different ages, region
and social class. It also features spoken language from different contexts such as
business and government meetings, and radio shows including phone-​ins.
Corpora can offer a detailed account of the frequency of different linguistic
items and identify those units that are most frequently used. Potentially, most
corpus management software can offer frequencies of the following:

• the total number of words in a corpus (tokens)


• the total number of different words in a corpus (types)
• the number of specific word classes in a corpus (nouns, verbs, adjectives,
prepositions, etc.)
• the number of sequences of n members in the corpus (n-​grams)
• the frequency of any complex structure that may be of interest to the
researcher (e.g. verbs followed by noun phrases with a postmodifying relative
clause).
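
As a first approximation of what corpus software computes for the first, second and fourth of these counts, the toy sketch below derives tokens, types and bigrams from a short invented text. The tokenisation is deliberately naive (lower-casing and splitting on whitespace); tools such as AntConc or Sketch Engine apply more careful tokenisation rules.

```python
# Naive token, type and n-gram counts on a toy text.
text = "Education matters and education policy matters even more"
tokens = text.lower().split()            # running words (tokens)
types = set(tokens)                      # distinct word forms (types)
bigrams = list(zip(tokens, tokens[1:]))  # sequences of n=2 members (2-grams)

print(len(tokens))   # 8 tokens
print(len(types))    # 6 types: 'education' and 'matters' occur twice
print(bigrams[:2])   # [('education', 'matters'), ('matters', 'and')]
```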

As you can see, there is some variety in terms of the units that can be counted
in a corpus. This fact gives researchers plenty of freedom in terms of searching
for a diverse range of items. If, for instance, our interest lies in the use of nouns, the BNC can give us interesting insights into how frequent
nouns are in British English. Overall, it seems that 21% of English words in
the BNC are nouns. However, there are important differences across registers.
While in spoken English nouns account for 14% of all words, in academic
English nouns are 25% of all words while in fiction the figure decreases to 17%.
This shows evidence of how naturally occurring language is affected by the
medium (spoken vs. written) and the type of register through which commu-
nication happens (conversation, academic language, fiction, news, etc.). While
all texts use language, different registers tend to reflect different distribution
of linguistic elements. One of the tasks of the linguist is to interpret these
differences in terms of linguistic theory, for example Biber’s functional linguis-
tics approach (Biber, 1988; Biber & Conrad, 2009; Biber, Johansson, Leech,
Conrad & Finegan, 1999). It is the task of the educational researcher to inter-
pret different frequency patterns in the light of their specific inquiry. Table 1.4
summarises why frequency matters.

Table 1.4 Skill 1: why frequency matters

Skill 1
• Frequency-related information is essential to understand usage.
• Usage is affected by frequency of occurrence and occurrence is affected by our perceptions of usage.
• We can look at the frequencies of different linguistic units (from morphemes to words or strings of words) from different analytical angles.
• Frequency and distribution of the frequency of discrete units (words, word classes, etc.) affect how registers are formed linguistically.
• Repeated patterns of usage frame the linguistic boundaries of communication.

1.3.5  How to interpret frequency

In section 1.3 we looked at some of the reasons why frequency is relevant across different areas of research and everyday communication. Frequency count is the most basic statistical measure, ‘a simple tallying of the number of instances of something that occur in a corpus’ (McEnery & Hardie, 2012: 49). In the British Academic Written English Corpus (BAWE), the word education occurs 1,648 times. This is the raw frequency of an item in a corpus or in a text. Raw frequency per se is of little value. For comparison reasons, we would prefer to normalise the frequency of linguistic units in a corpus. According to McEnery & Hardie (2012: 49),
normalised frequency ‘answers the question “how often might we assume we will
see the word per x words of running text?” ’ Normalised frequencies are calculated
using this easy formula:

Normalised frequency = (number of occurrences of the word in the corpus / size of corpus) x base of normalisation

In the case of the BAWE we know that the corpus is made up of 6,968,089
words.8 So the normalised frequency of education in the BAWE is 236.5
per million words:

Normalised frequency = (1,648 / 6,968,089) x 1,000,000

236.5 per million words reflects how often education would be expected to occur
on average in each million words of the BAWE. In the case of the BNC, we find
that education occurs 25,947 times in the total 96,134,547 words. The normalised
frequency of education in the BNC is the following:

Normalised frequency = (25,947 / 96,134,547) x 1,000,000

269.9 per million words reflects how often education would be expected to occur
on average in each million words of the BNC. With two normalised frequencies
calculated using the same base of normalisation, we can now start to compare the
frequency of use across the BAWE and the BNC. Of course, we could decide to
use a different base of normalisation, for example 1,000 words. Our normalised
frequencies would be 0.23 per 1,000 words in the BAWE and 0.26 per 1,000
words in the BNC.
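
These calculations are easy to reproduce. The short sketch below wraps the normalisation formula in a function and checks the two per-million figures just given for education in the BAWE and the BNC.

```python
def normalised_frequency(occurrences, corpus_size, base=1_000_000):
    """Occurrences of an item per `base` words of running text."""
    return occurrences / corpus_size * base

# education in the BAWE and the BNC (raw figures from the text above)
print(round(normalised_frequency(1_648, 6_968_089), 1))    # 236.5 per million words
print(round(normalised_frequency(25_947, 96_134_547), 1))  # 269.9 per million words
```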
Table  1.5 gives a useful breakdown of our second skill, how to interpret
frequency.

Table 1.5 Skill 2: how to interpret frequency

Skill 2
• The raw frequency of a word in a corpus or a text is the number of times that it occurs.
• Raw frequencies are of little value. It is more interesting to use normalised frequency and establish comparisons across texts or corpora.
• We can use different bases of normalisation to calculate normalised frequency, the most common ones being per 1 million words, per 10,000 words and per 1,000 words.

Notes
1 www.english-​corpora.org/​iweb/​
2 www.syd-​tv.com/​
3 www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-written-english-corpus-bawe/
4 www.legislation.gov.uk/​ukpga/​2011/​21/​contents/​enacted
5 Cohen, Manion & Morrison (2018) go as far as to identify six paradigms: empirical-​
analytic, pragmatic, interpretive, critical, post-​structuralist and transcendental.
6 www.natcorp.ox.ac.uk/​docs/​URG/​BNCdes.html#BNCpurp
7 www.natcorp.ox.ac.uk/​corpus/​index.xml
8 These totals are calculated using the Corpus Info sections of these corpora on Sketch
Engine: www.sketchengine.eu/​

References
Baker, P. (2006). Using corpora in discourse analysis. London: Continuum.
Baker, P., Gabrielatos, C. & McEnery, T. (2013). Discourse analysis and media attitudes:  The
representation of Islam in the British press. Cambridge: Cambridge University Press.
Bednarek, M. (2018). Language and television series. A linguistic approach to TV dialogue. Cambridge:
Cambridge University Press.
Bhatia, V., Sánchez Hernández, P. & Pérez-​Paredes, P. (Eds.) (2011). Researching specialised
languages. Amsterdam: John Benjamins Publishing.
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University  Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. (1999). Longman grammar of
written and spoken English. Harlow: Longman.
Bybee, J. (2007). Frequency of use and the organisation of language. Oxford: Oxford University  Press.
Bybee, J. (2010). Language, usage and cognition. Cambridge: Cambridge University Press.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor & Francis.
Crosthwaite, P. & Cheung, L. (2019). Learning the Language of Dentistry: Disciplinary corpora in
the teaching of English for Specific Academic Purposes. Amsterdam: John Benjamins Publishing.
Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research process.
London: Sage Publications Limited.
Culpeper, J. & Demmen, J. (2015). Keywords. In Biber, D. & Reppen, R. (Eds.) The Cambridge handbook of English corpus linguistics. Cambridge: Cambridge University Press, 90–105.
Ellis, N. (2002). Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in Second Language Acquisition,
24(2), 143–​188.
Ellis, N. C. (2019). Essentials of a theory of language cognition. The Modern Language Journal,
103, 39–60.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Mautner, G. (2019). A research note on corpora and discourse: Points to ponder in research
design. Journal of Corpora and Discourse Studies, 2, 2–​13.
McEnery, A.M. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University
Press.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
Nesi, H. & Gardner, S. (2018). The BAWE corpus and genre families classification of
assessed student writing. Assessing Writing, 38,  51–​55.
Pérez-​Paredes, P. (2017). A keyword analysis of the 2015 UK Higher Education Green
Paper and the Twitter debate. In Orts, M. A., Breeze, R. & Gotti, M. (Eds.) Power, persua-
sion and manipulation in specialised genres: providing keys to the rhetoric of professional communities.
Bern: Peter Lang, 161–​191.
Pring, R. (2004). The philosophy of education. London: Bloomsbury.
Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. London: Routledge.
Tognini-​ Bonelli, E. (2010). Theoretical overview of the evolution of corpus linguis-
tics. In O’Keeffe, A. & McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics.
London: Routledge,  42–​56.
Viana, V., Zyngier, S. & Barnbrook, G. (Eds.). (2011). Perspectives on corpus linguistics.
Amsterdam: John Benjamins Publishing.
Chapter 2

Analysing text

In the following paragraphs, we will discuss different approaches to text analysis in the context of education research, including content analysis, theme analysis, con-
versational analysis, discourse analysis and corpus linguistics analysis of register.
Our aim is to showcase the wide range of operationalisations of the term text
analysis and to inform the reader about the implications that each may have for
their own research projects.

2.1  Different approaches to text analysis


Educational and social science research uses text as its main source of data
(Boréus & Bergström, 2017). Given the complexity and the variety of epistemo-
logical stances in these fields, there is a wealth of different approaches to choose
from. Irrespective of the approach selected, two stages need to be considered
when looking at texts: analysis and interpretation.
One of the approaches most widely used by educational researchers is content
analysis. Cohen, Manion & Morrison (2018: 668) see it as a procedure to reduce
‘copious amounts of data to manageable and comprehensible proportions’.
Reduction is achieved by means of coding, that is, the construction and appli-
cation of an analytical instrument that helps researchers decide what needs to
‘be noted and counted in the material’ (Boréus & Bergström, 2017: 27). Content
analysis and theme analysis are often used interchangeably in research (Braun &
Clarke, 2006; Vaismoradi, Turunen & Bondas, 2013) as they both can be seen as
qualitative descriptive approaches that use a relatively low level of interpretation,
whereas grounded theory or hermeneutic phenomenology rely more heavily on
the interpretive skills of researchers. Sandelowski & Barroso (2003) have argued
that theme analysis is often referred to as a method for identifying themes and
reporting patterns in the data. In content analysis, however, the coding process
is paramount. While a code is ‘a name or a label that the researcher gives to a
piece of text which contains an idea or a piece of information’ (Cohen, Manion &
Morrison, 2018: 668), coding can be conceptualised as the process of ascribing
such a label to different units of analysis in the data.
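
Although coding is usually carried out in dedicated software (discussed below), the underlying data structure is simple. In the hypothetical sketch that follows, coded interview segments are stored as (segment, code) pairs and the codes are tallied; this tallying is essentially the data-reduction step that content analysis relies on. The segments and code labels are invented.

```python
# Hypothetical coded segments: each unit of analysis carries one code label.
from collections import Counter

coded_segments = [
    ("We revise fractions every Monday morning.", "routine"),
    ("I never have enough time to give written feedback.", "workload"),
    ("Marking takes up most of my evenings.", "workload"),
]

code_counts = Counter(code for _, code in coded_segments)
print(code_counts.most_common())  # [('workload', 2), ('routine', 1)]
```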

Kellams (1975) is a classic example of content analysis in educational research.


The author explored research projects focused on Higher Education (HE) in the
US and Canada in 1972 that were included in a published inventory of 1,130
abstracts. Kellams carried out a systematic sampling of 25% of those abstracts
and retained a total of 279 abstracts for analysis. Each of the 279 abstracts was
manually annotated with codes that the author developed so as to categorise the
particular types of research (descriptive, theoretical, policy research, develop-
mental research), researchers (professors, etc.), funding bodies and anticipated
modes of publication (report, book, paper, etc.). Once the coding was finalised,
Kellams carried out correlation analyses involving all the coded units in his study,
which allowed him to gain a better understanding of the research fields across HE
studies. This is an instance of how a coding scheme can be applied to a set of data
in order to facilitate data reduction and the identification of relationships within a
body of texts. According to Cohen, Manion & Morrison, codes

[…] enable the researcher to identify similar information. The researcher can
search, retrieve and assemble the data in terms of those items that bear the
same code. Codes can be regarded as an indexing or categorizing system, like
the index in a book […] and the data can be stored under the same code, with
an indexed entry for that code.
(Cohen, Manion & Morrison, 2018: 669)

Content analysis is, therefore, oriented to a system of categories that relies on pre-​
formulated models. In addition, it is theory-​dependent as the coding is theoretically
underpinned (Sandelowski & Barroso, 2003). An example of this is the T-​SEDA1
project, which seeks to give teachers and researchers tools to reflect on the quality
of educational dialogue across a wealth of subjects and school contexts. The
authors developed a coding framework (Hennessy, Rojas-​Drummond, Higham
et al., 2016) based on educational dialogue theory that includes top-​level concepts
such as invitation to build on ideas, challenging students, making reasoning explicit,
inviting students to reason, coordination of ideas and agreement, reflection on the
activity, or, among others, expressing ideas. Turns that include teacher’s talk like
the following:

• What do you mean? Tell me more…


• Can anyone add to that? Can you give an example of what you said?
• Is your idea similar to Manuel’s? What do you think about Maria’s idea? Do
you agree with what Chris just said?
• What other information do we need?2

were coded as invitation to build on ideas. Vrikki, Kershner, Calcagni et  al. (2019)
have revisited this coding scheme by grouping the most relevant codes into eight
clusters and proposed an interesting approach that compares situated live coding

versus traditional coding based on the analysis of classroom dialogues (transcript + audio). Whatever the detailed approach (live coding vs. desktop analysis), con-
tent analysis offers a rationale for the coding of phenomena that are susceptible
of systematic observation and reduction. In a further example, Flynn (2007)
audio recorded three lessons by three different teachers with Year 2 students in
the Literacy Hour in a primary school in the UK. The author listened to the
recordings and developed a taxonomy of teacher talk that was used to code the
data. This taxonomy included categories such as recap, reading, explanation,
directions, questions, discussion/​ interaction, behaviour management, pupil
activity and housekeeping. For example, the discussion/​interaction category is
subdivided into the following codes: talk with individual, with pair, with group,
with class and involved in role play.
Manual, paper-​based analysis is frequent even these days, but there is also a
plethora of software that researchers can use to code texts, such as NVivo3 and
MAXQDA.4 The main advantage to using these packages is that researchers can
develop and share their coding schemes with other researchers in an electronic
format that can be easily reused across projects. Texts can be arranged in groups
and codes that can be easily expanded or modified by a research team. Codes
can be customised (by using colour, names, etc.) and searched. Coders can create
memos (Figure 2.1) where they specify meta information concerning the nature
of the codes used in a research project, which increases both the validity and reli-
ability of the methodology. Some software can even automatise the coding process
itself.

Figure 2.1 Adding metadata to codes in MAXQDA


Source: www.maxqda.com/​products/​maxqda-​standard

Although not originally developed to carry out content or theme analyses,


AntConc5 can also be used for this purpose. We will learn how to do this later (see
chapter 3).
In the last decade, we have witnessed ways to carry out content analysis that
are automatically performed by software. Computer-​assisted content analysis has
been used to examine large amounts of text in particular. An application of this
approach is Bond, Zawacki-​Richter & Nichols (2019), a content analysis of 50 years
of research published in the British Journal of Educational Technology (BJET). They
looked at 1,777 research articles from the period 1977–2018 using Leximancer,6 software that identifies concepts and produces concept maps showing their fre-
quency and how the themes identified are connected. The authors found that the
most frequent themes in the journal were the following:

• the evolution of teaching and learning in distance education


• the emergence of instructional design
• misunderstanding between practitioners and learning designers
• issues of pre-​and in-​service teacher education and technology uptake by
educators and students
• the technology skills of educators and students
• lack of institutional support to provide space and time for training and inte-
gration to occur (Bond, Zawacki-​Richter & Nichols, 2019: 12).

Automatic content analysis enabled Bond, Zawacki-Richter & Nichols (2019) to mine, process and derive the main content present in 50 years of research in
BJET, a task that would have required massive resources were it to be carried out
manually. It is interesting that, while this process allows researchers to find the
most relevant content in the texts analysed, researchers can similarly spot what is
missing in the set of concepts identified. In this case, Bond, Zawacki-​Richter &
Nichols (2019) found that some fundamental educational topics did not make it
into the research published in the BJET.
Theme analysis is popular in educational research, especially when looking at
small datasets. An instance of this approach is Coates & Pimlott-​Wilson (2019),
who examined 33 UK primary school children’s experiences in a Forest School
(FS) programme that provides children with opportunities to learn outside the
classroom in a natural environment. According to the literature, the FS approach
facilitates children’s engagement with play-​focused learning activities that seek to
develop problem-​solving skills, cooperation, confidence, self-​motivation and self-​
esteem. However, the authors (p. 24) felt that there was little evidence ‘to support
the use of FS as a tool for facilitating learning’, and in particular there was a
lack of understanding of the idiographic experiences of children in FS (Coates & Pimlott-Wilson, 2019: 24). The authors used a phenomenological qualitative research
design where open interviews looked at the children’s outdoor engagement
through word-​association tasks and, among others, photographs in two different
schools in the East Midlands. The interviews lasted a mean of 45.5 minutes and

were audio recorded and transcribed. The thematic analysis approach used by the
researchers follows a five-stage process:

1 the researchers first familiarise themselves with the data by reading the
transcriptions and listening to the children’s actual voices;
2 initial codes are generated by both researchers independently and consistency
is discussed;
3 codes are organised into themes;
4 themes are reviewed and sub-​themes are grouped;
5 themes are defined and named, and a thematic map is developed so as to
represent the relationships between themes and sub-​themes.

In the case of this research, three main themes were identified: break from rou-
tine, learning through play and collaboration and teamwork. The authors claim
that this analysis has facilitated their understanding of how ‘specific facets of play
create meaning in the learning journeys of children’ (Coates & Pimlott-Wilson, 2019: 35).
Sometimes grounded theory is used when approaching theme analysis of phenom-
enological data. In short, as opposed to grand theories, grounded theory emerges
from the data. It is a bottom-​up theory that looks at the data as defining and
shaping the phenomenon of concern. Hadley (2014:  11) notes that ‘grounded
theorists adopt a stance of ‘theoretical agnosticism’ […] meaning that although
there is an area of specific interest that has motivated them to begin, they reflex-
ively recognize that their own sets of personal constructs could limit what they see
and hear’. Hadley used grounded theory to examine the lived experiences and
practices of English for Academic Purposes (EAP) teachers, HE administrators
and students in neoliberal universities in the UK and the US.
Conversational analysis (CA) is a type of discourse analysis that is primarily
focused on spoken discourse. Mercer (2010) has pointed out that CA deserves to
be described as a methodology rather than a method. According to Bergmann
(2004: 296), CA examines social interaction ‘as a continuing process of produ-
cing and securing meaningful social order’. Originally based on ethnomethod-
ology and interaction analysis (Bergmann, 2004), CA looks at reality as created
in situated contexts where people attribute and interpret meanings by using the
language conventionalised by a community of speakers, hence the interest in
language structure and, in particular, turn-​taking and pragmatics. In CA, talk is
considered as an intersubjective phenomenon. Wooffitt (2005: 73) highlights the
usefulness of CA when exploring interaction as ‘social action is accomplished
through the participants’ use of tacit, practical reasoning skills and competen-
cies’. Conversational analysts strive to preserve the observed phenomena as
fully as possible and that is why so much attention is given to the recording
(Wooffitt, 2005), its transcription (Jefferson, 2004) and the sequence of turns
(Schegloff, 2007). Bergmann (2004: 297–299) outlines the following analytical
maxims in CA:

A. Transcriptions of conversations or interactions should preserve communication as it happened, including hesitations, disfluency phenomena, overlapping,
intonation, etc.
B. The preservation of the intact spoken interaction intends to offer researchers
the opportunity to observe every single element of the situation as a potential
contributor to meaning. This reinforces the notion in CA that it is best not to
stick to preordained questions or objects that may prevent the investigation of
possible elements of order (a story, an utterance, body movement, etc.) in the
data before data analysis. As Atkinson & Heritage (1984: 4) put it, ‘nothing
that occurs in interaction can be ruled out, a priori, as random, insignificant, or
irrelevant’.
C. Searching for similar manifestations of such objects in the data is essential in
order to understand the social organisation of interaction.

Despite the openness to data and analysis, Wilkinson & Kitzinger (2019: 556–​557)
note that CA has in particular provided considerable insight into the practices of
talk interaction in six areas: turn-​taking, sequence organisation, action-​formation,
repair, word-​selection and overall structural organisation of talk. In schools, CA
has been used extensively to research classroom talk (Sinclair & Coulthard, 1975;
Seedhouse & Walsh, 2010). However, Mercer (2010) has noted that using CA to
analyse classroom talk may not be convenient when handling large sets of data.
According to his estimate, transcribing and analysing one hour of classroom may
take between five and 12 hours of research time. According to our own experience,
transcribing a 13-​minute dialogue, with some minimal annotation for pauses, turn-​
taking and dysfluency phenomena, the process may even take longer. Mercer has
also noted that making convincing generalisations may prove extremely challenging
as only specific illustrative examples can be offered in standard research publications.
Discourse analysis (DA) is a different approach to textual data analysis that has
been used extensively to look at changes in ideas and viewpoints over time. DA is
useful to track down how groups of people or concepts have been construed in
discourse. Rogers (2011: 1) maintains that three areas, at least, justify the use of
critical discourse analysis in educational research. The first is the communicative
nature of texts, talk and other semiotic interactions that are found in learning
and education; second, DA is particularly sensitive to sociocultural theory (SCT)
and some of its tenets, mainly, the fact that discourse constructs and reflects the
social world through a myriad of sign systems; finally, discourse and educational
research are ‘both socially committed paradigms that address problems through a
range of theoretical perspectives’.
Bergström, Ekström & Boréus (2017) note that the delimitation of the discourses
to analyse is paramount and stress that it is crucial that researchers discuss and
justify their choices in an explicit way. A  tenet of DA is that social relations in
discourse are revealed through language. Parker (2004) argues that, because
discourse is organised around patterns and structures that fix the meaning of

symbolic material, researchers can study the ‘ideological force of language’ (Parker,
2004:  310) in discourse and, accordingly, understand how entities and concepts
are defined. Although there are many approaches to DA (e.g. James Gee, Norman
Fairclough, Gunther Kress), Parker (2004) suggests that, in practical terms, DA can
be done following a set of steps such as itemising the objects in the text by looking
at the nouns, keeping a distance from the text so that the text is seen as one more
object in the context of the wider research project, itemising the subjects in the
text, reconstructing the rights and the responsibilities of the subjects and mapping
the networks of relationships into patterns. For Parker, discourses are ‘located in
relations of ideology, power and institutions’ (Parker, 2004: 311), so it seems that
DA is most suitable when one of these areas is the object of our inquiry. Discourse,
when seen as social practice, process and product, needs to be considered critically
as a sort of battlefield where meanings are invented, negotiated, used and, often,
imposed. In particular, critical discourse analysis (CDA) has examined the study
of power in society, social structures and individuals by looking at different foci
(Wodak & Meyer, 2009): power as the result of specific resources of actors; power
as an attribute of interactions; and power as a systemic element of society. Rogers
(2011) argues that educational practices are suitable to be examined by CDA as
interactions and practices are constructed across time and contexts in education:

Discourse studies provide a particular way of conceptualizing interactions that is compatible with sociocultural perspectives in educational research […]
discourse reflects and constructs the social world through many different sign
systems. Because systems of meaning are caught up in political, social, racial,
economic, religious, and cultural formations which are linked to socially
defined practices that carry more or less privilege and value in society, they
cannot be considered neutral […] discourse studies and educational research
are both socially committed paradigms that address problems through a
range of theoretical perspectives.
(Rogers, 2011: 1)

In Woodside-Jiron (2011: 158) we find an application of DA in educational research. The author looks at ‘legislated policy documents, official state educa-
tion agency documents, professional listserves [mailing lists] and private corres-
pondence, newspaper articles, and documents from popular media sources with
high circulation rates such as Time and Newsweek’ in order to understand the state
of California’s reading policies between 1995 and 1997. The author examined
the ‘process of naturalisation in policy development, policy communication, and
policy implementation’ (Woodside-​Jiron, 2011: 178) in order to expose the ideo-
logical stance of the procedures and practices as manifested in the texts analysed.
To do this, Woodside-​Jiron used a combination of methods that included the ana-
lysis of theme–​rheme structures and the good reason principle in argumentation.
The author concluded that the policies analysed exclude some of the main actors
in educational processes:

In this naturalization processes or shaping of cultural models, some norms are brought to the center and others are pushed to the margin. […] select
policy players and policy informants took center stage while parents, teachers,
administrators, taxpayers, and students were pushed to the margin.
(Woodside-​Jiron, 2011: 178)

We can see this analysis as a way to give visibility to the otherwise invisible process
that, in discourse and through discourse, affects social structures, social relations,
and agendas.

2.2  Text as register


Corpus linguistics research uses a relatively small set of methods that are useful to
analyse language use by adopting a different perspective towards the analysis of
discourse. We have used in this section the notion of text as register to capture the
essence of the work that is undertaken by corpus linguists who, in general terms,
do not use the term discourse in their discourse analysis.

2.2.1  Corpus linguistics and the analysis of register


Corpus linguistics studies texts from a different angle. John Sinclair’s linguistics-​
driven paradigm (Sinclair, 1991) situated corpus linguistics in a positivist research
paradigm (Gray, 2018) that assumes that (1) there is a reality that is available to
the senses; (2)  inquiry should be based on scientific observation and empirical
inquiry, and (3) natural and human sciences share methodological principles that
deal with facts and not with values. Such a conceptualisation of corpus linguistics
situates its ontology, quite paradoxically for the beginner corpus user, on purely
epistemological foundations. The game here is all about the how. No wonder, then, that McEnery & Hardie (2012: 1) describe corpus linguistics as essentially ‘a
set of […] methods [that has facilitated the] exploration of new theories of lan-
guage and theories which draw their inspiration from attested language use and
the findings drawn from it’. The following extract quotes linguist John Swales
reflecting on the personal transformation and transitions that corpus linguists
undergo:

When I first started getting involved in Corpus Linguistics around 1997, I thought it was a science, a new empirical sub-branch of the language
sciences relying heavily on quantitative methods […] Somewhat later […]
I came around to the view that CL was a methodology, by which I mean a
way of looking at large bodies of language data for a wide variety of purposes
(historical, critical, pedagogic, etc.) rather than as a new branch of linguistics
with its concern with a circumscribed area of content […] One strength is its
capacity for making generalizations about language use […]
(Viana, Zyngier & Barnbrook, 2011: 222)

Swales’ reflection brings home some of the frictions that CL users experience when
using research methods: corpus-​based vs. corpus-​driven approaches, and theory-​
driven vs. data-driven research designs are just some of the tensions we face
when reading and engaging with other colleagues’ research. Guy Aston (Viana,
Zyngier & Barnbrook, 2011) notes that CL is both a methodology and a science
and that it is only an emphasis on applications that swings the pendulum towards
CL as a set of methods.
Corpus linguistics uses both qualitative7 and quantitative methods to derive
knowledge from observed, attested uses of language. Among the former we find
concordance lines, while among the latter we find collocation or keyword analyses.
What characterises corpus linguistics is its emphasis on the study of data that
has been produced while the users are engaged in communication and therefore
communicating within the boundaries of a language register (Biber & Conrad,
2009). Indeed, an emphasis on usage and the blurring of the lexical and gram-
matical (formal) distinctions are the blueprint of most corpus linguistics research.
However, most of us in the field of corpus linguistics will agree that corpus
research methods are characterised by the fact that, in most research designs,
a control corpus from a reference-​variety is compared against an experimental
corpus (often the researched area or question) by examining normalised frequency
counts, applying statistical tests and procedures or by manually coding more com-
plex patterns of  usage.
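
One widely used way of operationalising such a comparison is the log-likelihood statistic (Rayson, 2008). The sketch below implements the standard two-corpus log-likelihood calculation for a single item; the frequencies and corpus sizes are invented for illustration, and values above 3.84 are conventionally read as significant at p < 0.05.

```python
# Two-corpus log-likelihood for one item (invented figures).
from math import log

def log_likelihood(freq_a, size_a, freq_b, size_b):
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed:  # 0 * log(0) is taken as 0
            ll += observed * log(observed / expected)
    return 2 * ll

# 150 hits in a 100,000-word experimental corpus vs. 600 hits
# in a 1,000,000-word control corpus
print(round(log_likelihood(150, 100_000, 600, 1_000_000), 1))  # ≈ 83.1
```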
Callies (2015) has defined corpus data as hard, quantitative data that can be
identified, quantified, classified and is subject to refined statistical analysis. Callies
notes that corpus linguistics research methods make findings more generalisable
and our research more easily replicable. Unfortunately, replication studies are not
frequent in corpus linguistics, and caution is required before generalising in most
corpus studies and, in particular, in corpus research involving language learners.
Callies (2015: 36) has defined mainstream corpus research methodology as
follows: ‘The research methodology that underlies the quantitative analyses […]
is primarily deductive, product-​oriented and designed to test a specific hypoth-
esis, which can then be confirmed or rejected, or refined and re-​tested’. Certainly,
an emphasis on hypothesis testing and a lack of explicitness about the research
subject in language research have dominated the first waves of CL research during the last decades of the 20th century and the first of the 21st century.
The notion of representativeness has been a central topic in the design of cor-
pora, predominantly in the field of descriptive linguistics. It is so entrenched in
corpus linguistics that we hardly stop to think what the implications are for our
research ontology and epistemology. Corpus linguistics is based on two empirical
principles according to Stubbs (2007: 130):

1 the observer must not influence what is observed;


2 repeated events are significant.

So, where do we look? What do we look at? Hunston (2002: 14) distinguishes different
uses of corpora. One of them is the use of ‘general corpora […] to establish norms
of frequency and usage against which individual texts can be measured’. This is
an excellent instance of the assumed epistemology which also infuses the study of
usage:  objectivist epistemology (Gray, 2018). Objectivism has met huge criticism
in the social sciences (Gray, 2018), and we cannot surely escape the fact that there
is some danger in believing that, because an objectivist paradigm is in place, our
research necessarily presents ‘objective facts and established truths’ (Gray, 2004: 18).
In this book, we have tried to adopt an approach that is aware of the challenges
that affect how contrast, frequency and representativeness are impacted by the
corpora analysed, as well as by the inquiry methods employed –​among others, the
tagger, the corpus management tools, or the very finiteness of the corpora used. In
doing so, we set out to strengthen our findings and claims by presenting a critical
perspective on how researchers set out to investigate our research field, and how
we try to distance ourselves from our own observations on usage.
When we speak or write, we unconsciously use language as the vehicle to
express our ideas. When a friend calls, or when we reply to an email, we use the
language that we deem fit for the purpose of maintaining communication. Corpus
linguistics has provided evidence that the language we use when speaking on the
phone, or writing an email, adapts to different situations in the way that each
and every register displays a distinctive frequency and distribution of linguistic
features. Biber & Conrad (2009: 47) have suggested that registers can be studied by
‘describing the situational characteristics of the register; [by] analyzing the [most
common] linguistic characteristics of the register; and [by] identifying the func-
tional forces that help to explain why those linguistic features tend to be associated
with those situational characteristics’. Table 2.1 offers guidance in understanding
the basics of register.

Table 2.1 Skill 3: understanding text types: register basics

Skill 3
• Humans accomplish different communicative functions across a variety of situations.
• Registers can be understood as textual sites where communicative functions and situations of context meet.
• A register can be very broad, such as fiction, or very specific, for example transhuman science fiction in the first decade of the 21st century.
• Frequency and distribution of the frequency of words and word classes affect how registers are formed linguistically.
• Biber & Conrad (2009: 23) note that ‘an understanding of how linguistic features are used in patterned ways across text varieties is of central importance for both the description of particular languages and the development of cross-linguistic theories of language use’.

Figure 2.2 Situational characteristics of registers and genres (based on Biber & Conrad, 2009). The figure lists seven situational characteristics: participants; relations among participants; topic; communicative purposes; channel of communication; setting in time and place; and production circumstances.

A register can be defined, then, as a group of texts that displays distinctive linguistic features when used in a situation to achieve a set of communicative
functions. Biber & Conrad have put it this way:

The register perspective combines an analysis of linguistic characteristics that are common in a text variety with analysis of the situation of use of the
variety. The underlying assumption of the register perspective is that core
linguistic features like pronouns and verbs are functional, and, as a result,
particular features are commonly used in association with the communicative
purposes and situational context of  texts.
(Biber & Conrad, 2009: 2)

In other words, all phone conversations, emails, textbooks, laws, texts, service
encounters and so on share a common set of formal, linguistic features that can
be analysed through CL methods. Even within the same broad register category,
differences tend to be significant. For example, the linguistic differences between
a textbook and a research paper can be reduced and quantified in terms of the
features used in those texts. In order to understand how a register is constituted,
we need to explore the situation in which it is used, paying special attention to a
myriad of factors that are based on both systemic functional linguistics and func-
tional linguistics (Halliday, 1978; Biber, 1988).
Register analysis seeks to explain linguistic data in the light of language vari-
ation across different dimensions of use, that is, different sets of co-​occurring
linguistic features that display distinct functional underpinnings (see Figure 2.2,
based on Biber & Conrad, 2009). Douglas Biber’s seminal work in this area (Biber,
1988) identified five broad dimensions of use that explain the underlying motiv-
ations of speakers when using their language. These dimensions are:

• Dimension 1: involved vs. informational production. Involved texts display a higher frequency of second person pronouns, verbs of opinion and that-deletion (i.e. omitting ‘that’ from a sentence such as ‘I think [that] you
are right’). Texts that score high on the other end of the continuum typically
display dense informational structures in noun phrases, with multiple nouns
premodifying noun headwords in noun phrases.
• Dimension 2:  narrative vs. non-​ narrative concerns. Texts that display a
narrative orientation, such as different types of fiction, novels, etc. show a
high frequency of past tenses, third person pronouns, perfect aspect tenses
and opinion verbs.
• Dimension 3:  explicit versus situation-​dependent reference. Explicit texts
tend to be written registers while the latter tend to be spoken. Texts that score
high on explicit features display a high frequency of wh-​relative clauses in
object positions, wh-​relative clauses in subject positions, phrasal coordination
and nominalisations.
• Dimension 4:  overt expression of persuasion. This dimension is associated
with the expression of our point of view and/​or with the use of argumen-
tation to persuade the interlocutor. Texts that score high on this dimension
display a high frequency of infinitives, prediction modals, persuasion verbs,
conditional subordination and necessity modals.
• Dimension 5:  abstract vs. non-​abstract information. Texts scoring high on
this dimension are highly abstract and display a high frequency of conjuncts,
agentless passives and by-​passives.

These dimensions explain how the frequency and distribution of formal linguistic features affect usage. Although we will not approach the study of register using Biber’s multidimensional analysis methodology in this book, an understanding of the theoretical foundations of the register-related linguistic theory will let us explore how situational differences correspond to systematic linguistic differences. Corpus linguistics often explores the linguistic properties of texts against the backdrop of the register where they belong, which provides opportunities to home in on usage across users or other comparable corpora. This awareness of the differences in language use will make us more appreciative of how registers can constrain usage. Table 2.2 gives a breakdown of textual features and textual data that are likely to be examined in register analysis.

Table 2.2 Skill 4: understanding textual features and textual data

Skill 4
• In register analysis, linguistic features can be identified using corpus analysis and part-of-speech (POS) tags.
• POS tags capture morphosyntactic properties of lexical items.
• A set of linguistic features chosen by the researchers can be analysed in order to evaluate the situational and functional characteristics of texts.
• In general terms, we can evaluate whether a given linguistic feature is very common, common or not common at all in corpus A when compared with corpus B.
• An examination of linguistic features may include an evaluation of the frequency of, among others, the following:
  • nouns and nominalisations
  • premodifying and postmodifying slots in noun phrases
  • attributives
  • pronouns and cohesion
  • use of question tags
  • tenses
  • modal verbs
  • adverbials
  • linking adverbials
  • stance adverbs
  • uses of relative clauses
  • different types of verbs (opinion, action, etc.)

Notes
1 www.educ.cam.ac.uk/​research/​projects/​tseda/​index.html
2 www.educ.cam.ac.uk/​research/​projects/​tseda/​Information%20for%20teachers%20
T-​SEDA%20180618.pdf
3 www.qsrinternational.com/​nvivo/​home
4 www.maxqda.com
5 www.laurenceanthony.net/​software/​antconc/​
6 https://​info.leximancer.com
7 Mike Scott, widely known for developing WordSmith Tools and for his research at
Aston University, has noted that the sheer power of the tools and the corpora have
brought about not a simple quantitative change but a qualitative one, too (Viana,
Zyngier & Barnbrook, 2011).

References
Atkinson, J.M. & Heritage, J. (1984). Introduction. In Atkinson, J.M. & Heritage, J. (Eds.),
Structures of Social Action. Cambridge: Cambridge University Press, 1–​15.
Bergmann, J.R. (2004). Conversation analysis. In Flick, U., von Kardoff, E. & Steinke, I.
(Eds.) A companion to qualitative research. London: Sage Publications Limited, 296–​302.
Bergström, G., Ekström, L. & Boréus, K. (2017). Discourse analysis. In Boréus, K. &
Bergström, G. (Eds.) Analysing text and discourse:  Eight approaches for the social sciences.
London: Sage Publications Limited.
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University
Press.
Bond, M., Zawacki-​Richter, O. & Nichols, M. (2019). Revisiting five decades of ­educational
technology research: A content and authorship analysis of the British Journal of Educational
Technology. British Journal of Educational Technology, 50(1), 12–​63.
Boréus, K. & Bergström, G. (2017). Analysing text and discourse: Eight approaches for the social
sciences. London: Sage Publications Limited.
Braun, V. & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in
Psychology, 3, 77–​101.
Callies, M. (2015). Learner corpus methodology. In Granger, S., Gilquin, G. & Meunier, F.
(Eds.) The Cambridge handbook of learner corpus research. Cambridge: Cambridge University Press, 35–55.
Coates, J. K. & Pimlott-​Wilson, H. (2019). Learning while playing: Children’s forest school
experiences in the UK. British Educational Research Journal, 45(1), 21–​40.
Flynn, N. (2007). What do effective teachers of literacy do? Subject knowledge and peda-
gogical choices for literacy. Literacy, 41(3), 137–​146.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Gray, D.E. (2018). Doing research in the real world. 4th Edition. London:  Sage Publications
Limited.
Hadley, G. (2014). English for academic purposes in neoliberal universities: A critical grounded theory.
New York: Springer.
Halliday, M.A.K. (1978). Language as social semiotic:  the social interpretation of language and
meaning. London: Edward Arnold.
Hennessy, S., Rojas-Drummond, S., Higham, R., Torreblanca, O., Barrera, M.J., Marquez, A.M., García Carrión, R., Maine, F. & Ríos, R.M. (2016). Developing a coding scheme for analysing classroom dialogue across educational contexts. Learning, Culture and Social Interaction, 9, 16–44.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. Pragmatics and
Beyond, 125, 13–​34.
Kellams, S. (1975). Research studies on higher education:  A content analysis. Research in
Higher Education, 3(2), 139–​154.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
Mercer, N. (2010). The analysis of classroom talk:  Methods and methodologies. British
Journal of Educational Psychology, 80(1), 1–​14.
Parker, I. (2004). Discourse analysis. In Flick, U., von Kardoff, E. & Steinke, I. (Eds.) A com-
panion to qualitative research. London: Sage Publications Limited, 308–​312.
Rogers, R. (2011). Critical approaches to discourse analysis in educational research. In Rogers,
R. (Ed.) An introduction to critical discourse analysis in education. London: Routledge, 29–​48.
Sandelowski, M. & Barroso, J. (2003). Classifying the findings in qualitative studies.
Qualitative Health Research, 13, 905–​923.
Schegloff, E. (2007). Sequence organisation in interaction:  A primer in conversation analysis.
Cambridge: Cambridge University Press.
Seedhouse, P. & Walsh, S. (2010). Learning a second language through classroom inter-
action. In Seedhouse P., Walsh S. & Jenks C. (Eds.) Conceptualising ‘learning’ in applied lin-
guistics. London: Palgrave Macmillan, 127–​146.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J.M. & Coulthard, M. (1975).  Towards an analysis of discourse:  The English used by
teachers and pupils. Oxford: Oxford University Press.
Stubbs, M. (2007). On Texts, Corpora and Models of Language. In Stubbs, M., Hoey,
M., Teubert, W. & Mahlberg, M. (Eds.) Text, Discourse and Corpora:  Theory and Analysis.
London: Continuum, 127–​162.
Vaismoradi, M., Turunen, H. & Bondas, T. (2013), Qualitative descriptive study. Nursing and
Health Sciences, 15, 398–​405.
Vrikki, M., Kershner, R., Calcagni, E., Hennessy, S., Lee, L., Hernández, F., Estrada,
N. & Ahmed, F. (2019). The teacher scheme for educational dialogue analysis (T-​
SEDA): developing a research-​based observation tool for supporting teacher inquiry into
pupils’ participation in classroom dialogue. International Journal of Research & Method in


Education, 42(2), 185–​203.
Wilkinson, S. J. & Kitzinger, C. (2019). Using conversation analysis in feminist and critical
research. Social and Personality Psychology Compass, 2(2), 555–​573.
Wodak, R. & Meyer, M. (2009). Critical discourse analysis:  history, agenda, theory and
methodology. In Wodak, R. & Meyer, M. (Eds.) Methods of critical discourse analysis.
London: Sage Publishing Company, 1–​33.
Woodside-​Jiron, H. (2011). Language, power, and participation:  using critical discourse
analysis to make sense of public policy. In Rogers, R. (Ed.) An introduction to critical discourse
analysis in education. London: Routledge, 154–​182.
Wooffitt, R. (2005). Conversation analysis and discourse analysis. London:  Sage Publishing
Company.
Chapter 3

Corpus linguistics approaches to understanding language use

3.1  Understanding and researching language use: discovering patterns
Frequency in a corpus or in a text is ‘observable evidence of probability in the
system’, therefore ‘unique events can be described only against the background of
what is normal and expected’ (Stubbs 2007: 130). Gablasova, Brezina & McEnery
(2017) have highlighted that, when looking at second language learners, informa-
tion about the frequency of occurrence or co-​occurrence of a unit or sets of units
can help us uncover language patterns that point to underlying factors in second
language learning, that is, the patterns of use in a dataset can reveal aspects of the
researched phenomena. Durrant & Brenchley (2019) have studied children’s use
of vocabulary in English schools and have suggested that, given the high repeti-
tion of high frequency verbs and adjectives in lower forms, lexical sophistication
is conceptually inseparable from lexical diversity. McEnery, Brezina, Gablasova
& Banerjee (2019) have noted that word associations (i.e. collocations) are crucial to understanding discourse. Two examples are metaphorical connections between words and the promotion of social evaluation. However, the unwavering role of fre-
quency in CL is not necessarily well understood in other areas of language edu-
cation research, let alone other disciplines not remotely connected with language
or linguistics.
McEnery & Hardie (2012: 1) have noted that ‘corpus linguistics is not a mono-
lithic, consensually agreed set of methods and procedures for the exploration
of language […] corpus linguistics is a heterogeneous field’. This is a relevant
assertion in the context of a young discipline that is subject to a process of crit-
ical inquiry and witnessing a rich debate in terms of methodological foundations.
In the following paragraphs, we will consider some of the principles behind the exploration of language use and what they imply for the deployment of CL
methods in education research.

3.1.1 Corpus linguistics outside linguistics?

Do you need to be a linguist to use corpus linguistics methods? Mike Scott, the
developer of WordSmith Tools,1 one of the best tools available to explore corpora,
and the co-​author of a book about corpus analysis in language education (Scott &
Tribble, 2006) has stated the following on this point:

No discourse analyst needs to know anything about doing Corpus Linguistics. What
they need (for either) is an open mind, a willingness to learn, to take risks, to
make mistakes, to ask for help or find it for themselves. There is not just one
way of slicing bread, and the CL ways of slicing it are not necessarily superior
to non-​CL ways.
(Viana, Zyngier & Barnbrook, 2011: 218)

We could not agree more. Corpus linguistics methods can be used by any researcher
interested in exploring how language is used in textual data, whether it is a text
originally intended to be printed or published electronically, or a transcription of
an interview or a conversation. In CL methods, we can see a kind of reduction effort that, by examining or comparing large datasets, extracts regularities
and patterns. Scott & Tribble have described this process almost in terms of what
a cook does when reducing a sauce or stock. A researcher will eventually boil down
language and keep a refined extract:

[…] all the effort of a concordancer or a word-​listing application goes into
reducing a vast and complex object to a much simpler shape. That is, a set of
100 million words on a confusing wealth of topics in a variety of styles and
produced by innumerable people for a lot of different reasons gets reduced
to a mere list in alphabetical order. A rich chaos of language is reduced, it is
‘boiled down’ to a simpler set. In the vapours that have steamed off are all the
facts about who wrote the texts and what they meant.
(Scott & Tribble, 2006: 5–​6)

Scott & Tribble (2006) argue that corpora can serve different purposes and so
researchers can decide to focus on a myriad of research questions, not necessarily
or exclusively linguistic or grammar-​related. Scott & Tribble have identified four
aspects of words that can be examined with CL methods:

1 Words in texts. The study of sentences, paragraphs, sections, etc.
2 Words in the language. Applications of corpora to lexicography, terminology,
phraseology or the study and description of, for example, standard varieties or
the language used by teachers in STEM subjects.
3 Words in the brain. Associations, memory, the mental lexicon, etc.
4 Words in culture. Class, sociological, political or educational implications of
language use.

Words provide an entry point into different areas of inquiry, education research being one of them. Mercer (2002) and Mercer, Wegerif & Dawes (1999) have provided evidence of how human beings use language (a) to collaborate and solve complex problems, (b) to realise education and (c) to develop understanding. Wegerif & Dawes (1999) found that using exploratory talk helps children to work more
effectively together on problem-​solving tasks and that children who had been
taught to use more exploratory talk made greater gains than those who had not
been explicitly taught to do so. Mercer (2002) has noted that while the concept
of genre is of interest to linguists, educational researchers may prefer to use ideas
such as intermental activity, communities or thinking:

But from an intermental perspective, we can see that language genres are also
related to conventional, collective ways of thinking in particular communi-
ties and societies. People unfamiliar with a community’s ways with words are
likely to be excluded from its activities. Those familiar with its genres know
how to use language to participate, how to work with others to get things
done. Expert members of communities can use language features to recog-
nize when a particular kind of activity is taking place, and this enables them
easily to draw on past experience relevant to the joint intellectual activity they
become engaged in. Genres are templates for interthinking, which, like all
social conventions, both facilitate and constrain what we do.
(Mercer, 2002: 6)

Language use analysis can thus be conceptualised (Figure 3.1) on a continuum
that ranges from the characterisation of the formal features of a genre to an
understanding of the social context in which, for example, students, teachers or
parents use language when engaged in education activities.
Corpus linguists have explored the limits of registers (or genres) by looking
at the frequency of occurrence of a wide range of linguistic features through
sophisticated statistical procedures (e.g. Biber, 1988), the analysis of collocations
(Sinclair, 1991) or patterning (Partington, 1998). The impact of corpus methods
on the analysis of language in society has been relevant in areas such as corpus-​
assisted discourse analysis and the study of classroom discourse. However, we
would like to emphasise that corpus methods can be of use to researchers outside
linguistics. As we saw in ­chapter 1 (see Figure 1.4), corpora can be used either as a
primary or a secondary data source. These two approaches can be situated on the

Figure 3.1 Language as genre–activity continuum (from ‘Formal aspects of language’ at the genre end to ‘Language in society’ at the activity end)
genre–activity continuum above and can, consequently, serve different research purposes.
In the following section we will examine two case studies where researchers
have used corpus linguistics as one of their data analysis methods to discover
hidden patterns of language use.

3.1.2 Case study 1. Examining interviews: qualitative versus CL methods

Fest (2015) collected over six hours of audio material by interviewing 14 German-​
speaking students about an online self-​assessment tool that offers secondary school
‘graduates a possibility to test themselves with regard to their learning skills and
suitability before picking a field of study at university’ (2015: 51). The interviews
elicited the students’ opinions about the usability of the tool both in a mentoring
context and as a stand-​alone tool. The interviews were transcribed and then
coded using 18 emerging themes. These themes were reduced to five major cat-
egories:  affordances of the tool, target users, the tool in the context of career
counselling, usability of the tool and miscellaneous comments. This analysis was
complemented with CL research methods.
The interviews were part-​of-​speech (POS) tagged and then analysed using
a corpus management tool. The researcher retained in her analysis the 44,159
words used by the 14 interviewees and discarded the words of the interviewers –​a
common practice in CL. Fest (2015) concluded that the interviews displayed a set
of linguistic features that could be interpreted functionally (Biber, 1988; Biber &
Conrad, 2009):

1 The frequent use of the first-​person pronoun I together with believe, feel,
think and, particularly, the use of I know with the adverb not suggests that the
interviewees expressed different degrees of certainty by means of different lin-
guistic devices. The less frequent use of we suggests the construction of a
group identity, something which was not expected by the researcher.
2 The use of modal verbs and the use of the subjunctive suggests a range of
attitudes towards the tool analysed that cannot be grasped through theme ana-
lysis. The interviews were conducted in German so these remarks refer to the
peculiarities of the German modal verb system.
3 The use of quantifiers is peculiar in the dataset analysed. Fest (2015:  63)
suggests that her interviews contain implicit comments ‘phrased as suggestions’
(2015: 64) and found a pattern of  use:

When looking at the immediate contexts of these words as given in the con-
cordance lines, the first notable result is that the two most frequent ones, ganz
(pretty) and sehr (very), both co-​occur most frequently with the same adjective,
namely gut (good). There are 20 instances of ganz gut (pretty good) and 17
of sehr gut (very good) in the corpus, which shows a slight tendency towards
stressing positive aspects. Praise for the tool is often emphasized by the inten-
sifier very, whereas the construction ganz gut rather equals an only slightly
better than neutral evaluation, like okay.
(Fest, 2015: 63)

This is a great example of how we can use some of the corpus methods to go
beyond the meanings of words (as found in dictionaries) to examine the meanings
of words in context as used by speakers. The analysis of accumulated concord-
ance lines can provide researchers with the chance to examine data and identify
patterns that are difficult to spot due to the amount of variation in actual usage
(Pérez-​Paredes, 2017). We will come back to this point when we examine skill six,
reading concordances.
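Fest’s tagging step is easy to picture in code. Her interviews were in German, so a faithful replication would need a German tagger; the minimal sketch below uses NLTK’s English models purely to illustrate what POS tagging adds to a transcript, and the sample utterance is invented for illustration.

```python
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-off

utterance = "I think the tool is pretty good, but I do not know if I would use it."
tagged = nltk.pos_tag(nltk.word_tokenize(utterance))
print(tagged)   # [('I', 'PRP'), ('think', 'VBP'), ('the', 'DT'), ('tool', 'NN'), ...]

# Once tagged, patterns like the 'I + believe/feel/think' sequences Fest
# examined become simple to count: keep pairs where 'I' precedes a verb tag.
pairs = [(w1, w2) for (w1, _), (w2, tag2) in zip(tagged, tagged[1:])
         if w1 == "I" and tag2.startswith("VB")]
print(pairs)    # [('I', 'think'), ('I', 'do')]
```

A corpus management tool performs the same operation across thousands of utterances at once; the sketch only shows what the tagger contributes.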

3.1.3 Case study 2. Examining policies: combining content analysis and corpus methods

Non-​Anglophone universities across the world have begun to integrate under-
graduate and graduate English programmes in so-​called English as a Medium of
Instruction (EMI) degrees. Villares (2019) is interested in examining how the lan-
guage policies of universities in Spain interpret and articulate this new linguistic
and pedagogic landscape. To do this, she put together a corpus of official lan-
guage policies across 29 Spanish universities from 2001 to 2018. After examining
this corpus of policy documents, she found that Spanish universities represent
themselves as responsible for fostering the uses of either (1) English or (2) local
languages other than Spanish. This is linguistically marked by the use of adverbs
and different stance markers. In the case of English, universities appear in the
corpus as sensitive agents that listen to the needs of a global world and local soci-
eties demanding a more international outlook. This is seen in the use of modal
verbs expressing obligation. In her analysis, the researcher found that univer-
sities projected an image of international prestige and reputation linked with the
idea of international or global impact. This is seen in the use of adjectives and modifiers in noun phrases that tend to highlight institutional rigour and auctoritas.
The use of content analysis and corpus methods can strengthen the range of
findings of educational researchers. In the above example, the identification of
topics is complemented by analyses of how content is delivered linguistically. These
analyses, as we will see in skill eleven, testing statistical differences (Table 4.7), can
reinforce some of the findings of the content analysis study and thus increase their validity.
In the two case studies above, we have seen that corpus methods can com-
plement the use of interviews and policy documents (data collection methods)
and theme analysis and content analysis (data analysis methods) so as to produce
stronger, more valid findings as well as a stronger link between individuals’ use of language and its meaning and impact on research questions and findings.
3.1.4 Using an existing corpus

Researchers need to decide whether an existing corpus can be used for their own
purposes or whether a new, ad hoc corpus will need to be put together. The latter
will be more time-​consuming and challenging, but it may be absolutely essential.
What is involved in doing this is covered in the next chapter, under skill nine (see
Table 4.3).
If our research target is a (large) well-​known population, the chances are that
we may at least be able to find either a corpus of potential interest or primary data
that can be collated to form a corpus. For example, if we’re interested in looking
at teachers’ perception of violence in schools or leadership teams’ understanding
of educational technology, we can try a wide array of data service websites that
may offer such data. The UK Data Service, funded by the ESRC (Economic and
Social Research Council), offers UK researchers the opportunity to examine pre-
vious research and datasets. They have a three-​tier access policy whereby data
can be open, safeguarded or controlled. Open and safeguarded data have been
anonymised pursuant to the Data Protection Act and Statistics and Registration
Services Act. Safeguarded data, however, carry a residual risk of disclosure, so access requires registration. Nevertheless, obtaining permission to use the data is always
advised (and note that different countries may offer similar services).
Let us examine two different datasets from this website. Say, for example, we are
interested in how school leaders and teachers in secondary schools bridge the gap
between educational policy and educational practices and outcomes. Gu (2015)
examined this area both in the UK and Hong Kong from 2012 to 2014. The
original data can be reached through the UK Data Service website. This is Gu’s
summary of how the data was collected:

The data were collected as part of case studies of four most improved and
effective secondary schools across diversified school populations in different
socio-​economic contexts. These focussed upon the ways in which government
reforms were mediated by principals, senior and middle leaders and teachers in
order to assess the extent to which the primary intentions had been translated
into practice and sustained; and if not, why not. Data were collected over
three phases. Firstly, school principals identified policy initiatives which had
the greatest impact on their schools and explained how they interpreted/​
mediated policy. Second, interviews with senior and middle leaders (n=6–​8)
and classroom teachers (n=6–​8) permitted progressive focusing […] on how
policies were understood, communicated and enacted in each school. Third,
a further visit explored emergent themes with key staff members.
(Gu, 2015)

The original interviews were digitally recorded and transcribed. The interview
protocol, the participant consent form as well as the participation information
sheet can also be found online. The author looked at the transcripts, categorised
and refined emergent themes, and identified topic patterns using grounded
theory to define the emerging variables in the investigation. The transcriptions
of these interviews can be downloaded and further analysed using a wide range
of data analysis methods, including corpus methods. Although the original data
was not intended as a corpus per se, the fact that the transcriptions are available
in text format facilitates our use of this resource as a corpus. Note that when
using resources that have not been designed within our own research project, we
will need to make sure that, if needed, we clean up some of the markup in the
text such as interview boundaries (‘end of recording’), identification of speakers
(‘interviewer’, ‘teacher’, ‘head’), page numbers and footers and headers. Once we
have made sure that only the transcribed interviews are in .txt format, we are
ready to upload our data to a corpus management tool.
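As a minimal sketch of this clean-up step, the following Python script strips speaker labels, recording boundaries and page numbers from a folder of transcripts. The folder names and the markup patterns are hypothetical; they would need adapting to the conventions of the dataset actually downloaded.

```python
import re
from pathlib import Path

def clean_transcript(raw: str) -> str:
    """Strip common transcript markup, keeping only the transcribed speech."""
    raw = re.sub(r"(?im)^\s*(interviewer|teacher|head)\s*:\s*", "", raw)  # speaker labels
    raw = re.sub(r"(?i)\[?end of recording\]?", "", raw)                  # boundaries
    raw = re.sub(r"(?im)^\s*page\s+\d+\s*$", "", raw)                     # page numbers
    return re.sub(r"\n{3,}", "\n\n", raw).strip()                        # tidy blank lines

Path("interviews_clean").mkdir(exist_ok=True)
for path in Path("interviews_raw").glob("*.txt"):     # hypothetical folder
    cleaned = clean_transcript(path.read_text(encoding="utf-8"))
    (Path("interviews_clean") / path.name).write_text(cleaned, encoding="utf-8")
```

Always inspect a sample of the cleaned files by eye before uploading: over-eager patterns can delete speech as well as markup.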
However, we may even discover that there is a proper corpus available ready
to be used. Say we want to examine children’s writing development in schools.
Durrant (2019) put together and distributed a corpus of school children’s writing
in England collected between 2015 and 2019. This is how the corpus is described
in the UK Data Service website:

The Growth in Grammar Corpus (GIGC) is a collection of texts written by
children at schools in England as part of their regular school work. Texts were
mainly sampled from children in years 2, 6, 9 and 11, covering the disciplines
of English, Science and Humanities. There is also a small collection of texts
from year 4. The corpus was collected as part of a project aiming to develop
the first systematic understanding of the distinctive uses of grammar which
mark out student writing across the full range of ages, attainment levels and
text types in English primary and secondary education to age sixteen.
(https://​beta.ukdataservice.ac.uk/​datacatalogue/​studies/​study?id=853809)

This corpus offers a balanced sample of English schools according to the following
categories:

1 School Location I: North vs. South England.
2 School Location II: Urban vs. rural areas.
3 School Demographics I: Greater vs. less than 10% of pupils qualifying for free
school meals.
4 School Demographics II: Below vs. above average ethnic diversity.

This is a very interesting corpus design that will facilitate the analysis of children’s writing across a variety of demographic factors, with great potential to offer insights into writing development from a variety of angles. Table 3.1 offers a breakdown2 of the number of texts collected, the number of schools involved and the number of students, as well as the percentages of English as an Additional Language (EAL) students and of students who qualify for free school meals in each year sample.
Table 3.1 GIG corpus data

          Schools   Texts   Writers   % EAL students   % students qualifying for free school meals
Year 2       6       639      160           28                      20
Year 4       2        49       10            0                      37
Year 6       7       868      185           18                      23
Year 9      12       804      457            2                      29
Year 11      9       538      171            6                      12

Table 3.2 Skill 5: using an existing corpus

Skill 5
• Before jumping into existing data, we need to ask ourselves whether an existing corpus may be relevant in the context of our research. We will need to answer the following:
• Can my research questions be answered by examining this corpus?
• Does my research involve the examination of language used by speakers in interviews, focus groups or essays?
• Who are my target population? Are they well represented in an existing dataset?
• Can I use a corpus of data already available? In which format is this data? Is it feasible to use this data?
• Am I allowed to use the data for research purposes?

In total, this corpus includes almost 3,000 writing samples from almost 1,000
children studying in 24 schools across England. This is a potentially hugely useful
corpus resource for educational researchers who want to look at writing develop-
ment using either cross-​sectional or pseudo-​longitudinal designs. This corpus has
particular potential because of its representativeness, covering a wide range of
schools and pupils, and can be used in combination with smaller datasets to offer
a baseline for comparison and further scrutiny. The researchers who put together
this corpus stress, however, that it may be used to better understand ‘how language
can express attitudes towards social relations or to help students better develop the
linguistic resources for expressing such attitudes’. In other words, the corpus was
not collected so as to reveal the attitudes of the students. Table 3.2 shows how to
approach our fifth skill, using an existing corpus.

3.2 Reading concordance lines

What is a concordance line? What do concordance lines look like? McEnery &
Hardie (2012: 35) have stated that ‘the single most important tool available to the
corpus linguist is the concordancer’. Let us use a well-​known text. Figure 3.2 shows
Figure 3.2 Concordance lines in AntConc

some of the occurrences of the word fact in George Orwell’s 1984 as displayed by
Laurence Anthony’s AntConc3 following a keyword-​in-​context (KWIC) concord-
ance format.
As McEnery & Hardie (2012: 35) put it, ‘a concordancer allows us to search
a corpus and retrieve from it a specific sequence of characters of any length –​
­perhaps a word, part of a word, or a phrase. This is then displayed […] as an
output where the context before and after each example can be clearly seen’.
The word fact appears exactly in the middle of the lines, with both preceding and
following context on the left and right. There are 51 occurrences of fact in Orwell’s
1984. The screenshot shows 35 of those. Figure 3.3 shows the same search using
the online corpus management tool Sketch Engine.
Concordance software such as AntConc or Sketch Engine will let you (up)load
your data, search it and export your search results in different formats, which gives
researchers the opportunity to work with their data in different ways. While some
researchers may prefer to have their results in spreadsheets, others will prefer to
store them in text files or even in pdf format. What is crucial, however, is that concordance lines allow us to examine all the occurrences of a word (or words) or a lemma (or lemmas) (see Table 3.3) in either a corpus or a single text, in the context in which they occurred.
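To make the mechanics concrete, here is a minimal concordancer sketch in Python. The file name is hypothetical, and a dedicated tool such as AntConc or Sketch Engine remains the sensible choice for real analyses; the sketch simply shows that a KWIC display is nothing more than each hit with a window of context either side.

```python
import re

def kwic(text, node, width=40):
    """Return (left context, node, right context) tuples for every hit."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        hits.append((left, m.group(0), right))
    return hits

with open("1984.txt", encoding="utf-8") as f:   # hypothetical file name
    novel = f.read()

# Sorting on the right-hand context surfaces phrases that begin with the node.
for left, node, right in sorted(kwic(novel, "fact"), key=lambda h: h[2].lower()):
    print(f"{left:>40} {node} {right}")
```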
Sinclair (1991) has stressed that the examination of natural occurring language
embodies the idea that any aspect of language use depends on its surrounding
context (1991: 5): ‘The details of choice shown in any segment of a text depend –​
some of them –​on choices made elsewhere in the text’. Sinclair argues that when
we examine uses of the language, we are in fact looking at the ‘constraints that
Figure 3.3 Concordance lines in Sketch Engine

Table 3.3 Essential terminology: lemma

Essential terminology Lemma
A lemma is a unit of meaning that integrates all the forms of a given word. For example, fact and facts are two different words. The lemma fact, however, includes both. When we search for fact as a lemma, we will obtain results for both the singular and the plural forms. If we search for the lemma teach, we will obtain teach, teaches, taught and teaching.

BUT There is no agreement on the procedures used to approach lemmatisation. Take the word teacher. For the WordNet lemmatiser, teacher is both a lemma and a word. For the Lancaster lemmatiser, teacher is a word whose lemma is teach.

determine the precise relationship of any fragment of text with the surrounding
text’ (1991: 6). We can therefore think beyond the word unit and become aware of
the connections at the phrasal, sentence and discourse levels, as it is in these units
where constraints at the lexico-​grammatical level operate.
Concordance lines can be ordered in different ways. They can be ordered
alphabetically, typically to the right of the central word-​form. Sinclair (1991: 33)
argues that this ordering ‘highlights phrases and other patterns that begin with
the central word’. Another convenient ordering is reverse alphabetisation to the
left of the central form, which provides a useful clue to the topic of the passage
Table 3.4 Patterns emerging from the concordance lines of fact in Orwell’s 1984

‘The fact’ followed by of + ing clause
[…] the sort of words that are uttered in the din of battle, not distinguishable individually but restoring confidence by the fact of being spoken. Then the face of Big Brother faded away again, and instead the three slogans of the Party stood out in […]

‘The fact’ followed by that + clause
[…] holds to be the truth, is truth. It is impossible to see reality except by looking through the eyes of the Party. That is the fact that you have got to relearn, Winston. It needs an act of self-destruction, an effort of the will. You must humble […]

‘The fact’ followed by a verb phrase
And when he told her that aeroplanes had been in existence before he was born and long before the Revolution, the fact struck her as totally uninteresting. After all, what did it matter who had invented aeroplanes?

if the form is a verb (Sinclair, 1991: 34) or, as in Figure 3.3, this ordering reveals
the syntactic function and meanings of the noun phrase in which fact is found in
1984. In the case of fact in Figure 3.3, by examining concordance lines it is rela-
tively straightforward to identify uses of this noun as in in fact and, maybe, focus
on phrases where determiners such as the or a are used by Orwell. For example, the
fact occurs 17 times in the novel. Table 3.4 shows some of the key instances that
emerge from an examination of the concordance lines.
Most of these uses of fact seem to suggest that Orwell used the word to facilitate apposition in noun phrases. As you can see, we have grouped these occurrences using formal linguistic criteria, but this is not the only way to proceed. Instead, we suggest setting aside any assumptions and following a step-by-step procedure that situates our research inquiry at the centre of the process.

3.2.1 How to read concordance lines

It may seem obvious, but, when searching a corpus, we need to start somewhere.
We actually need to decide what to search and this is, certainly, no trivial task.
In our own research, we found that language learners (Pérez-​Paredes, Sánchez-​
Tornel, Alcaraz Calero & Jiménez, 2011) often struggle with initiating searches in
a corpus. The reasons are complex: corpus users may lack familiarity with corpus
search interfaces, or they may struggle with the very concept of what a corpus
means and how it can help them find information of interest. So where to start?
Unless we have a very clear idea of what we are looking for, it seems useful to
create a list of the words or lemmas in a text or in a corpus.
A word list is easy to create and is usually a first step that can be useful to
explore the lexical repertoire in a corpus or a text. For purposes of illustration, we
will use in the rest of the chapter two small corpora of UK news about education
Table 3.5 Essential terminology: types/tokens

Essential terminology Types/tokens
When counting the number of words in a corpus, it is useful to distinguish between types and tokens. Tokens are every single word that can be counted in the corpus. For example, in The Times Online corpus mentioned above, we find 174,459 tokens in total. Types are the number of different words in the same corpus. For example, in The Times Online corpus the word children appears 475 times. While children is one single type, we can count up to 475 occurrences of the word. These can be described and counted as 475 tokens. There are 25,758 types in the corpus.

BUT There is no agreement on terminology across different corpus management software. In Sketch Engine, for example, tokens include punctuation, and the term words is used for types.

Figure 3.4 Word list from The Times Online corpus as displayed in Sketch Engine

and inclusion in (a)  The Times during the period 2015–​18 and (b)  The Guardian
during 2018. The two corpora can, potentially, be useful in helping researchers
understand how these news outlets communicate news that deal with education
and inclusion. The Times corpus is made up of 64 different texts and contains
174,459 words (tokens), of which 25,758 are unique words (types) (see Table 3.5).
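A rough sketch of the type/token distinction in Python follows. The file name is hypothetical, and the crude tokeniser in the regular expression is exactly why different tools report slightly different counts for the same corpus.

```python
import re
from collections import Counter

with open("times_education.txt", encoding="utf-8") as f:   # hypothetical file
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", f.read().lower())

counts = Counter(tokens)
print("tokens:", sum(counts.values()))            # every running word
print("types:", len(counts))                      # distinct word forms
print("children:", counts["children"], "tokens of one type")
```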
Word lists can be simple and offer a ranking of every word in the corpus, from
the most frequent to those that occur only once. They can also offer specific infor-
mation about particular word classes (only nouns, only verbs, etc.), or we can form lists of lemmas or even POS tags. In the case of The Times corpus, the most frequent token is to, which is somewhat unexpected, as the article the tends to be the most frequent word in most English corpora. Here, this can be explained by the inclusion of the New Year’s Honours list4 in The Times across four years, in which the expression service to is used very frequently. Figure 3.4 consists of a screenshot of
the word list function in Sketch Engine.
The figure shows the interface that Sketch Engine users will see when they run
the word list function. The results can be exported and kept in a variety of
formats, e.g. pdf, but other formats may be more interesting to researchers:

• CSV. A  comma-​separated values file is a text file that uses a comma to sep-
arate values and stores tabular data. It can be used across a variety of software
(spreadsheets, notepads, etc.) and platforms (Windows, Mac, Linux). In general
terms, CSV files preserve the essential information and can be used in almost
any computer you can think of.
• XLS. A spreadsheet file that can be read by Microsoft Excel and other spreadsheet software.
• XML. An Extensible Markup Language file offers structured information that can be reused in other applications. In the case of a word list, the XML file offers information about the corpus and subcorpus names as well as the words and their frequencies, enclosed in structural elements (<corpus></corpus>, <wordlist></wordlist>, etc.).
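The CSV option is worth singling out because it is trivial to produce and read programmatically. A hedged sketch, with hypothetical file names, of building a word list and saving it as CSV:

```python
import csv
import re
from collections import Counter

with open("guardian_2018.txt", encoding="utf-8") as f:     # hypothetical file
    counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

with open("wordlist.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["rank", "word", "frequency"])         # header row
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        writer.writerow([rank, word, freq])
```

The resulting file opens directly in any spreadsheet software, which makes it a convenient bridge between corpus tools and the rest of a research workflow.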

Once we have a word list of the words (or lemmas, tags, etc.) in a corpus or a text,
we can identify candidates for our initial search. Say we want to explore the use of
inclusion and mental. The former occurs 105 times in the corpus, the latter 55 times.
Following Sinclair (2003), Pérez-Paredes, Sánchez-Tornel, Alcaraz Calero & Jiménez (2011) and Pérez-Paredes, Sánchez-Tornel & Alcaraz Calero (2012), the analysis of con-
cordance lines can be seen as a procedure that follows well-​established steps that
start with the selection of a search term known as the node. Let us consider them.

• Step 1: initiate. The researcher observes the words to the left and to the
right of the node. The goal of this step is to come up with a selection of
sequence candidates where typically one or two sequences will be stronger
than the rest in terms of the evidence provided. Sinclair suggests that if the
same words occur in more than half the instances in the sample, it is sensible
to think that the link between these words is pretty strong. I would avoid, how-
ever, being this specific in a context where datasets may vary greatly in length.
If there is not one single word that stands out either to the left or to the right
of the node, Sinclair suggests that we look at word classes instead (is it a noun?
Is it a verb? Or an adverb?).
• Step 2: interpret. This is where researchers form their hypotheses about
the links between the sequences of words and the pattern(s) observed. Do
they present a coherent meaning? Are all of them nouns or part of a noun
phrase?
• Step 3: consolidate. The hypothesis or hypotheses are tested. Researchers
will consider a wider range of words to the left and to the right and will try
to establish the nature of the link between units beyond the phrase boundary
(noun phrase, verb phrase, etc.). Sinclair (2003:  xvii) suggests that we ‘use
always the criterion of how close they are to coming under the hypothesis that
you have set up, and be prepared to revise and loosen up the hypothesis a little’.
• Step 4:  report. Write your own pre-​final outcome of the hypothesis you
consolidated in step 3. This is a great tip and we suggest that whenever we
come to analyse concordance lines, we make notes of whatever we feel is of
significance.
• Step 5:  recycle. Continue with other pattern candidates, possibly on the
other side of the node.
• Step 6: results. Write your final outcomes and list of the hypotheses that
were tested.

Let us begin with a search on The Guardian corpus of education and inclusion,
a corpus of 68 texts published by this newspaper in 2018. A search of the term
inclusion returns 89 hits. What we want to know is whether the term is used in
specific ways and whether those ways can be identified through the examination
of concordance lines. We will follow the procedure above, trying to illustrate the
outcomes at every stage. We discuss this model further in ­chapter 7.

Step 1: initiate
We have 89 occurrences of inclusion in the corpus. We can start by sorting the
words to the left of the node alphabetically. This is what we find:

• Inclusion is mostly preceded by diversity and equity.
• Inclusion appears frequently in coordinated noun phrases such as tolerance and
inclusion and, again, diversity and inclusion.
• Some of the most repeated adjectives premodifying inclusion are social and
greater.
• The main meaning of inclusion in the concordance lines analysed is that of
being included as a person, rather than incorporation or insertion.

Step 2: interpret
It seems that inclusion tends to be used in adjacency or in coordination with diversity
and when premodified by an adjective it is social.
Table 3.6 Extended node contexts

a […] Assistant Provost for equality, diversity and inclusion at Imperial, says a women’s refuge has carried out an evaluation of all their policies and procedures on sexual harassment.
b Though the rollback does not change the law, Archer says it sends the message that this administration does not think diversity and inclusion are important.
c […] efforts to increase the number of female scientists’ peer reviewing the work of others, and increasing gender diversity in committees, programmes and the honours and recognition process, while a diversity and inclusion task force will provide final recommendations by the end of the year.

Step 3: consolidate
Diversity and inclusion occur 11 times in the entire corpus, that is, 12% of the
occurrences of inclusion are found in this coordinated phrase. Diversity occurs 55
times, so 20% of the time this word occurs with inclusion in coordinated phrases.
When the context is expanded, we look further to the left of the node and we can
develop a better sense of how patterns are used. The following three excerpts
from The Guardian corpus shown in Table 3.6 capture the range of meanings
represented across the 11 concordance lines examined.
(A) represents the use of diversity and inclusion as part of the role of an
authority; (B)  stands for cases where diversity and inclusion appear to be
neglected by a group of people or an organisation; finally, (C) represents uses
where diversity and inclusion is used in noun phrases (diversity and inclusion task
force) to signal groups of people and institutions working towards achieving
inclusion in society. This is obviously an interpretation of the contexts where
diversity and inclusion occur. However, it is an interesting one where evidence of
usage is provided. Furthermore, every single instance of use in the corpus has
been examined.

Step 4: report
After steps 1, 2 and 3 we have noted that the contexts to the left of the word
inclusion seem to favour the use of the word in coordinated phrases together with
diversity.

Step 5: recycle
So far, we have looked at the context to the left of inclusion. Let us now examine
what happens to the right of the node. Let us first order the concordance lines
alphabetically and then go over steps 1, 2 and 3 again. It appears that the following
trends emerge:
• There is a good number of cases where inclusion appears immediately before
a full stop, for example:
• ‘[…] a detraction from more serious issues such as discipline and
inclusion.’
• ‘My personal view is that schools should be places of tolerance and
inclusion.’
• ‘[…] cut back on the availability of courses appropriate to need and
removed the support network integral to the success of inclusion.’
• Inclusion is often postmodified by a prepositional phrase, as in the
following case:
• ‘The strategy emphasises the importance of both quality and inclusion in
higher education.’
• However, most prepositional phrases fronted by of or on tell us about the use
of inclusion as insertion or introduction such as in these cases:
• ‘It was undermined by the dilution of religious education through the
inclusion of all worldviews in an already tight teaching timetable.’
• ‘The education industry is starting to recognise this with greater inclusion
of Aboriginal content and perspectives at all levels of education.’
• ‘The inclusion of graduate prospect statistics in university rankings has
necessitated this […]’
• ‘Andrew Adonis Young5’s inclusion on the board immediately attracted
sustained public controversy […]’

Step 6: results
The use of inclusion in The Guardian 2018 corpus suggests that the term tends to
be used in conjunction with words such as diversity and equity when it is used to

Table 3.7 Skill 6: reading concordance lines

Skill 6
• Concordance lines show how a node behaves in a corpus or in a text.
• We examine concordance lines to discover patterns of use. They reveal how words and meanings are patterned in discourse.
• A node is what you search, whether this is a word, a lemma or a string of words. Nodes are shown right in the middle of concordance lines.
• Concordance lines reveal the surrounding context of a node. They help us move our analysis beyond the word unit.
• Examining concordance lines is time-consuming and very much qualitative.
• An analysis of concordance lines involves the examination of the context to the left and to the right of the node and the formation of hypotheses as to what constitutes strong evidence of use.
discuss the improvement of people’s opportunities and dignity. There is a tendency for it to be used on its own and to be postmodified by a prepositional phrase
when it is used to mean the introduction of something or someone into a group.
This preliminary analysis can be complemented with other CL methods, or even
with other research methods altogether. We will explore some of these options in
the forthcoming chapters. Table 3.7 covers skill six, reading concordance lines.

3.3 Handling frequencies

Frequency is everywhere, even when we don’t consciously think about it. In
­chapter 1 we discussed how frequency impacts the representation of language as
well as its productivity and learnability (Bybee, 2007; Ellis, 2002). In other words,
the language we use is constrained by previous uses of the language and conditions
future uses of the language as well.
Corpus analyses provide different types of frequency measures. In this section, we will look at the most relevant frequency-related information that we need to understand when we analyse corpora.

3.3.1 Corpus size and relative frequencies

In principle, there is no small or big corpus. There are different reasons why
researchers put together a corpus, and its size can be seen as an effect of those motivations. For example, most corpora representative of a language variety contain hundreds of millions of words. Let us examine four different corpora.
The British National Corpus (BNC)6 contains 100  million words; the Corpus
of Contemporary American English (COCA)7 around 500 million words, Mark
Davies’ Corpus del Español8 is above 100 million words, and the 2012 French
Web corpus9 contains almost 10 billion words, exactly 9,889,689,889 words.
These four corpora represent different approaches to data compilation and dis-
tribution. The BNC is a finite size corpus whose designers estimated that this
was just about the right size to offer a balanced representation of the types of
texts written in the UK as well as of spoken language. The fact that only 10%
of the data is spoken, however, says much more about the difficulty of obtaining, transcribing and processing that kind of data than about the importance of spoken
communication.
COCA is a monitor corpus, meaning that it is an ever-growing corpus.
In 2010, the corpus contained 400 million words (Davies, 2010), so it has grown
approximately 25% larger in a decade. This type of corpus allows researchers to
look at the evolution of language use across different genres. In Figure 3.5 we can
see the evolution in use of the word bullying and observe that its use started to pick up in the mid-2000s.
It would be interesting to see what happened in the 2015–​17 period: did new
words take up the space of bullying? Were academics and journalists less interested
in this?
Figure 3.5 Bullying in the COCA corpus

Our third corpus, Corpus del Español, includes Spanish texts from 1200 to 1900 and was compiled as a diachronic corpus that lets researchers track the
evolution of the Spanish language over eight centuries. Finally, our fourth corpus,
French Web 2012, is a massive crawled corpus. It contains almost 10 billion words
extracted from websites in French. The TenTen Family10 of corpora have been
collected following the same criteria and can be regarded as comparable cor-
pora. In general, representative corpora can be situated on a continuum that ranges from those that are smaller, carefully designed and well documented to those that are massive and highly informative but poorly structured in terms of
the different genres represented (e.g. news, academic publications, fiction, blogs,
forums, etc.).
Let us go back to The Guardian 2018 corpus. When we load the corpus file on
AntConc and run a word list search, we obtain the size of the corpus. Figure 3.6 shows a screenshot of this search.
The corpus has 73,259 words (tokens) and 8,967 types (different words).
These two figures are central to our work with corpora. We know
that 73,259 is the absolute token size of the corpus, and the word education, for
example, occurs 357 times in this corpus. Is this high frequency? Is it not? There
is no way we can know. We need to normalise this frequency count in order to
understand its true significance. According to Brezina (2018:  43), relative fre-
quency is ‘the mean […] of the frequencies of the word in hypothetical samples of
x tokens from the corpus, where x is the basis for normalisation’. To calculate the
relative frequency of education we divide its absolute frequency (357) by the number
of tokens in the corpus (73,259) and multiply it by a basis for normalisation. For
example, if we choose 100,000 as our basis for normalisation, the formula will
be (357/​73,259) x 100,000 and the relative frequency will be 487 per 100,000
words. If we chose a different basis, for example 10,000, the relative frequency of education in The Guardian 2018 corpus would be 49 per 10,000 (or, similarly, about 5 per 1,000 words). Using relative frequencies is absolutely essential when comparing frequencies across different corpora. The most usual bases are 10,000, 100,000 and
1,000,000. If we are working with small corpora, smaller bases for normalisation
Figure 3.6 AntConc word list screenshot

are more appropriate (Brezina, 2018), but which base to use is ultimately the deci-
sion of the researcher.
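Since the arithmetic is simple, it is easy to script. A minimal Python sketch of the calculation above (the function name is ours, not part of any tool):

```python
def relative_frequency(raw_count, corpus_tokens, base=100_000):
    """Normalise a raw frequency to occurrences per `base` tokens."""
    return raw_count / corpus_tokens * base

# education in The Guardian 2018 corpus: 357 hits in 73,259 tokens
print(round(relative_frequency(357, 73_259)))           # 487 per 100,000 words
print(round(relative_frequency(357, 73_259, 10_000)))   # 49 per 10,000 words
```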
We can also run a frequency test that looks at lemmas rather than words. This
is probably a good idea if we are not interested in differences between the sin-
gular or plural forms of a noun, or in the different tense inflections of a verb. In
AntConc, we can activate this option and load a lemma list. Figure 3.7 shows the
three steps required to do this.
We need to load a lemma list that AntConc can use to perform this analysis.
There is a wealth of online resources that will meet most of your needs, so don’t
worry too much if you are new to corpus analysis. We suggest that you visit Mike Scott’s website and download one of the lemma lists there.11 You will need to load
this lemma list and then click Apply. Once you have done this, you can go to
Word List and run a new analysis. The results will appear in the form shown in
Figure 3.8.
On the left-​hand side, we can see the lemma overall counts. For example,
the lemma school occurs 672 times in the corpus. On the right, we can find the
breakdown of the different words (tokens) that are part of this lemma and their
corresponding count (school, schooled, schooling, schools). These forms were
defined in the lemma list that we downloaded, and they may not necessarily
coincide with how we want to parse our lemmas, so it is worth exploring other
alternatives or even compiling our own list of lemmas. Although most lemma
lists will meet our needs, it is necessary that we fully understand the range of
forms that are attributed to each lemma stem (i.e. researcher is not a form in the
Figure 3.7 AntConc tool preferences. Activating lemma forms

Figure 3.8 Lemma list in AntConc
lemma research in the aforementioned list). At the top of Figure 3.8 we can see that
the frequency-​related information provided is lemma sensitive: while the lemma
tokens are the same as the number of word tokens, the lemma types are lower
than the word types (7,000 vs. 8,967). In Sketch Engine, the information about
corpus size has a dedicated window. The tokens in Sketch Engine include punctu-
ation, so you should expect to find differences between corpus management tools.
We will come back to this idea later.
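To see the WordNet behaviour described in Table 3.3 for ourselves, here is a minimal sketch using NLTK’s WordNet lemmatiser (it requires a one-off download of the wordnet data; this is only one of several lemmatisation schemes, as the table warns):

```python
from nltk.stem import WordNetLemmatizer
# import nltk; nltk.download("wordnet")     # one-off model download

wnl = WordNetLemmatizer()
print(wnl.lemmatize("schools"))             # 'school'  (default: noun)
print(wnl.lemmatize("taught", pos="v"))     # 'teach'
print(wnl.lemmatize("teacher"))             # 'teacher' - its own lemma in WordNet
```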
When discussing the frequency of a word or a lemma, it is necessary that we
consider the concept of the dispersion of that word or lemma across our corpus.
Brezina has defined it in the following way:

[…] dispersion tells us about the distribution of words or phrases throughout
the corpus. For example, the definite article the is not only a highly frequent
word, it also is fairly evenly distributed in text. […] dispersion directly depends
on our understanding of corpora and their structure (parts) because disper-
sion describes distribution of words and phrases throughout the corpus or
across its different parts.
(Brezina, 2018: 47)

AntConc has a tool called ‘concordance plot’ that will let us explore visually where
in the corpus a particular feature is more frequent. If a corpus is made up of
many individual texts, it is a good idea to upload the individual files to AntConc
(see Figure 3.6, left side of the screenshot) so that the exploration of dispersion is
truly relevant. If all the texts are concatenated into one single file, the use of such
a tool is discouraged. Table 3.8 summarises what is involved in skill seven, hand-
ling frequencies.

Table 3.8 Skill 7: handling frequencies

Skill 7
• There are two types of frequencies: raw/absolute frequencies and relative frequencies.
• Raw frequency tells us how many occurrences of a word or a lemma are found in a corpus.
• Relative frequencies are more useful than raw frequencies as we can compare corpora normalised using the same base.
• Normalised bases include number of words or lemmas per 1,000, 10,000, 100,000 or 1,000,000 words.
• Frequencies of both words and lemmas are of interest.
• Expect that different software shows slightly different information as regards frequency.
• When considering the frequency of an item in a corpus, make sure you understand how it is represented in the entire corpus. Pay attention to dispersion.
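Following up the dispersion discussion above: the raw material behind a concordance-plot style view is simply the position of each hit in the running text. A minimal sketch (the file name is hypothetical, and a real analysis would also compute one of the dispersion statistics discussed by Brezina, 2018):

```python
def hit_positions(tokens, node):
    """Relative positions (0 to 1) of a node across a tokenised text."""
    return [i / len(tokens) for i, tok in enumerate(tokens) if tok == node]

with open("guardian_2018.txt", encoding="utf-8") as f:   # hypothetical file
    tokens = f.read().lower().split()

positions = hit_positions(tokens, "inclusion")
if positions:
    print(f"{len(positions)} hits, spread from "
          f"{positions[0]:.2f} to {positions[-1]:.2f} of the corpus")
```

Hits bunched into a narrow band of positions signal that a word’s overall frequency is driven by a handful of texts rather than by the corpus as a whole.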
3.4 Collocations
The term collocation is used in corpus linguistics to denote the idea that ‘important
aspects of the meaning of a word [or a lemma or other linguistic unit] are not
contained within the word itself [...] but rather subsist in the characteristic asso-
ciations that the word participates in, alongside other words or structures with
which it frequently co-​occurs’ (McEnery & Hardie, 2012: 122–​123). This idea has
important implications. In theme analysis and other methods, the semantics of
the units analysed is rarely discussed or problematised, as if word meanings were
obvious and their identification straightforward. Evert (2007: 4) defined colloca-
tion as follows: ‘a combination of two words that exhibit a tendency to occur near
each other in natural language, i.e. to cooccur […] The term ‘word pair’ is used
to refer to such combination of two words […]’.
Sinclair (1991: 113) argued that the core meaning of a word is not a de-​lexical
one and that ‘frequent words have less of a clear and independent meaning’.
Despite the limitations of an overemphasis on collocation (McEnery & Hardie,
2012), the analysis of individual word meaning in isolation may misrepresent how
language is actually used. Collocates (Table 3.9) are the words that co-​occur with
node words in a corpus.
There are at least two major ways to identify collocations. One of those is to ask
native speakers of a language to identify them. However, this methodology may
be flawed as our intuitions about language are affected by so many variables, e.g.
our own memory, retrieval routines, previous experience with different domains
of the language, etc. One of the variables that affects our intuitions as native
speakers is the frequency of occurrence of words in the language. Evert (2007)
describes these collocations as a phraseological, theoretical notion. Siyanova-​
Chanturia & Spina (2015) looked at the intuitions about collocation frequency
in L1 and L2 Italian (80 noun-​adjective collocations). These researchers found
that ‘both native speakers and (advanced and intermediate) non-​natives were

Table 3.9 Essential terminology: collocate

Essential terminology Collocate
A collocate is a unit of analysis, typically a word, that has been found to co-occur with another unit in a corpus. In CL analysis, the implication is that a node and a collocate tend to co-select each other.

BUT Collocates are identified statistically and, as different tests are used, we should expect different results on the same dataset. Also, we need to frame our results exclusively within the corpus we are using. It would be misleading not to do so.
sensitive to the frequency of collocations […] their judgments were found to be affected by corpus frequency as well as the frequency bands’ (Siyanova-Chanturia & Spina, 2015: 550–552). They concluded that native speakers did not exhibit
‘good intuitions in the case of […] middle frequency bands’. Food for thought.
The second way to identify collocations, empirical collocations as defined by
Evert (2007), is the use of statistical tests on corpora. Halliday & Matthiessen
(2014: 59) note that ‘the measure of collocation is the degree to which the prob-
ability of a word (lexical item) increases given the presence of a certain other word
(the node) within a specified range (the span)’. Using a corpus seems a much more
robust empirical way to identify what constitutes a collocation in a given popula-
tion of users of the language or language register.
There are different association measures that identify collocations (Oakes,
1998), and their use depends on the nature of the dataset as well as on our
research preferences. These measures use either (a) surface proximity, which determines co-occurrence within a certain span, typically plus or minus three
or five words to the left and the right of the node12 (the span), or (b) textual co-​
occurrence, where the span size is more arbitrary and can encompass sentences,
paragraphs or even texts. Two of the best-​known association measures are Mutual
Information (MI) and T-​score.13 MI is an association measure that looks at the
‘observed cooccurrence frequency O by comparison with the expected frequency
E [of words w1 and w2], and calculates an association score as a quantitative
measure for the attraction between two words’ (Evert, 2007: 18). An MI score of 3
or higher shows evidence that two words (or other items) are collocates (Hunston,
2002). The main criticism of MI score is that it is very sensitive to low frequen-
cies; this is precisely why a T-​score is usually chosen to provide evidence for
collocations of very frequent items. According to Kolesnikova (2016), the T-​score
compares the observed co-​occurrence frequency and the expected co-​occurrence
frequency as random variates. A T-​score of 2 or higher shows evidence that the
co-​occurrence of node and collocate is statistically relevant. In practical terms,
we note that both scores will return different collocates for the same node. The
use of Sketch Engine will also give us access to other collocation measures such as logDice,
a measure that is not as sensitive to the size of the corpus.14 Tables 3.10 and 3.11
show the ten most significant collocates of inclusion in The Guardian 2018 corpus
in decreasing order following the MI and the T-​scores. These scores have been
calculated using AntConc.
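Formulations of both measures vary slightly between tools, which is one reason different software returns different collocate lists for the same node. A sketch of one common textbook version, with invented counts purely for illustration:

```python
import math

def expected(f_node, f_coll, corpus_size, span=10):
    """Expected co-occurrences of node and collocate in a window of `span` tokens."""
    return f_node * f_coll * span / corpus_size

def mi_score(observed, f_node, f_coll, corpus_size, span=10):
    """Mutual Information: log2 of observed over expected co-occurrence."""
    return math.log2(observed / expected(f_node, f_coll, corpus_size, span))

def t_score(observed, f_node, f_coll, corpus_size, span=10):
    """T-score: observed minus expected, scaled by the square root of observed."""
    return (observed - expected(f_node, f_coll, corpus_size, span)) / math.sqrt(observed)

# Invented counts: node 89 hits, collocate 11 hits, 5 co-occurrences
# within a 5L-5R window in a 73,259-token corpus.
print(round(mi_score(5, 89, 11, 73_259), 2))   # 5.23 - above the MI >= 3 threshold
print(round(t_score(5, 89, 11, 73_259), 2))    # 2.18 - above the T >= 2 threshold
```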

Table 3.10 Top 10 collocates of inclusion (MI score)

Rank Word & MI score          Rank Word & MI score
1 Warmth (9.6)                6 Pointed (9.6)
2 Tyne (9.6)                  7 Olivier (9.6)
3 Soup (9.6)                  8 Noise (9.6)
4 Prospectuses (9.6)          9 Mounted (9.6)
5 Professionalism (9.6)       10 Integral (9.6)
Figure 3.9 AntConc collocate window

As expected, low frequency words tend to top the rankings, as all ten words
occur only once (see Figure  3.9). So what does this analysis tell us? The MI
score reveals that these words tend to collocate with inclusion, that is, when they
are used in The Guardian 2018 corpus they tend to appear in the vicinity of the
node word.
Once we have obtained a list of collocation candidates like the one in Figure 3.9,
we need to examine it and consider each of these collocates in isolation (Hunston,
2002). This is very much qualitative analysis and involves three steps: (a) explor-
ation of the context(s) in which the collocate appears, (b) building a hypothesis
about the meaning of the extended unit (node + collocate) and (c) reporting our finding(s). The first step involves a careful examination of the context in
which the collocate appears. In most software tools, this involves clicking on the
word (‘2’ in Figure 3.9) and considering the span size of our search (five words to
the left and five words to the right in this case). After clicking on warmth, we will
obtain the following: OU’s great strengths of warmth and inclusion in an academic com-
munity. Obviously, we need more context to interpret this. We click again, and we
will have access to the precise point where this micro context comes from in the
corpus. In this case, this collocate is found in one of three letters to The Guardian
published together under the headline Our Open University has become a daydream.15
This is the paragraph where warmth is found:
As an OU tutor in Northern Ireland in the early 70s I saw how the tutorial
system, regular meetings at study centres and summer schools were a vital and
intellectually stimulating part of the students’ experience. Many were studying
for the first time, others to increase existing qualifications, and some for the
sheer pleasure of learning. All had something to teach each other. The chilling
new proposals for an OU based on service centres and media platforms oper-
ating in the cloud, are bound to kill off the OU’s great strengths of warmth
and inclusion in an academic community. I remember Peter Horrocks when
he was head of TV news at the BBC and I was a daily newscaster. He was a
man of dry intellectual brilliance, but his admitted shyness revealed a crippling
lack of social skills. He is not the visionary needed to lead and encourage the
Open University community to grow and prosper in the 21st century.
(Letters, The Guardian, 14 January 2018)

Now we have a clear picture of how the use of inclusion is operationalised by one of
the letter writers, a former OU tutor. Service centres and cloud media platforms
are positioned against traditional values of the OU:  community, inclusion and
warmth. In terms of analysis, we can now potentially move to a different collo-
cate. Normally, one looks at the significant collocates but limits the exploration to
a certain number of them (the 10, 20 and up to 100 most frequent; only nouns,
only verbs, etc.). It is absolutely essential, though, that we understand the extent
to which these collocates co-​occur significantly with a node in the context of the
corpus being queried. In the case of the example we are using here, this is a corpus
of news articles and features published in The Guardian during 2018 and which
featured both education and inclusion in their texts. Different research projects will
make use of corpora that are instrumental in understanding how language is used
across a wide range of different contexts.
Now consider Table 3.11, which lists the top 10 collocates in The Guardian 2018
corpus using T-​scores.
We can appreciate that most of the top ten collocates are function words such
as articles (the, a, an) and prepositions (of, in). We should not forget that this is
totally expected as T-​score tests favour frequency of occurrence, and function
words are more frequent in language use than lexical word classes such as nouns,
verbs, adjectives and adverbs. In spoken English, seven out of ten words are function

Table 3.11 Top 10 collocates of inclusion in The Guardian 2018 corpus (T-score)

Rank Word & T-​score Rank Word & T-​score


1 The (6.5) 6 For (4.1)
2 Of (6.2) 7 Diversity (3.9)
3 And (6.1) 8 A (3.6)
4 To (4.3) 9 An (3.2)
5 In (4.2) 10 On (3.2)

words and five out of ten in academic language (Biber, Johansson, Leech, Conrad
& Finegan, 1999). When using the T-score, we therefore need to be aware that
the measure favours the co-selection of high-frequency function words. If we
widen our scope of collocates, we will find equality (top 12, z = 2.9), education (top
18, z = 2.2), issues (top 19, z = 2.2) and social (top 21, z = 2.1). The following list
contains the five co-​occurrences of issues and inclusion:

1 ‘[…] authoritarianism and a detraction from more serious issues such as dis-
cipline and inclusion. Olivier […]’
2 ‘[…] Paris Agreement, but also the social and inclusion issues that have so
clearly impacted the […]’
3 ‘[…] work is continuing. we’re looking at issues of diversity, inclusion, in all
of our […]’
4 ‘[…] and self-​esteem, peer relationships and social inclusion. But on many
issues more parents were […]’
5 ‘[…] universities should demonstrate how they are addressing issues of race,
equality and inclusion. This is […]’

We note that all of these uses of inclusion are vaguely linked to the mainstream idea
found in dictionaries that ‘everyone should be able to use the same facilities, take
part in the same activities, and enjoy the same experiences, including people who
have a disability or other disadvantage’.16 However, the uses in the five contexts
above go beyond this mainstream notion in different ways and, interestingly, these
concordance lines provide the language evidence of how inclusion is used in the
vicinity of issue(s) in our corpus.
Using a combination of MI and T-scores can be effective in understanding how
language is used by a group of people or in a set of documents, whether these are
interviews with school counsellors, policy drafts or statutes. Collocations that are
mined by using an MI score will be particularly useful to identify what is unique
and truly specific in a corpus; collocations found by using a T-score will be helpful
when looking at how longer units of text are put together by speakers. Table 3.12
summarises how to approach our eighth skill, understanding collocations.

Table 3.12 Skill 8: understanding collocations

Skill 8
• When we say that word x collocates with word y, we are using a quantitative
  methodology that establishes that such co-occurrence is evidenced in our corpus.
• There are different statistical measures that can be used to find out the
  collocational behaviour of a word (or other items such as lemmas or strings
  of words).
• Collocations reveal aspects of word meanings that go beyond the individual
  semantics of a given word.
• AntConc will let us explore collocations by means of MI score and T-score.
• MI scores of >3 and T-scores of >2 provide strong evidence of collocability
  in a corpus.


Notes
1 https://​lexically.net/​wordsmith/​
2 http://​reshare.ukdataservice.ac.uk/​853809/​33/​SummaryCorpusContents.pdf
3 AntConc version 3.5.8 – www.laurenceanthony.net/software/antconc/
4 According to the Gov.uk website, the New Year Honours list recognises the achievements
and service of extraordinary people across the United Kingdom. The complete 2019
Honours list is a 119-page PDF document.
5 Labour politician. Minister of State for Education in HM Government from 2005
to 2007.
6 www.english-​corpora.org/​bnc/​
7 www.english-​corpora.org/​coca/​
8 www.corpusdelespanol.org
9 www.sketchengine.eu/​frtenten-​french-​corpus/​
10 According to the designers, TenTen corpora are built using technology specialised in
collecting only linguistically valuable web content; www.sketchengine.eu/​documenta-
tion/​tenten-​corpora/​
11 https://​lexically.net/​wordsmith/​support/​lemma_​lists.html
12 This is known as span size.
13 Another measure is Z-​score. This test looks at the observed frequency, that is, the
actual frequency of the collocation candidates, with the frequency expected, that is,
occurrence of w1 and w2 by chance. A  high z score indicates a greater degree of
collocability of an item with the node.
14 More information at: www.sketchengine.eu/​my_​keywords/​logdice/​
15 www.theguardian.com/ ​ e ducation/ ​ 2 018/ ​ j an/ ​ 1 4/ ​ o ur- ​ o pen- ​ u niversity- ​ h as-​
become-​a-​daydream
16 https://​dictionary.cambridge.org/​dictionary/​english/​inclusion

References
Biber, D. (1988). Variation across spoken and written English. Cambridge: Cambridge University
Press.
Biber, D. & Conrad, S. (2009). Register, genre and style. Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. (1999). Longman grammar of
written and spoken English. Harlow: Longman.
Brezina, V. (2018). Statistics in corpus linguistics. Cambridge: Cambridge University Press.
Bybee, J. (2007). Frequency of use and the organisation of language. Oxford: Oxford University Press.
Davies, M. (2010). The Corpus of Contemporary American English as the first reliable
monitor corpus of English. Literary and Linguistic Computing, 25(4), 447–464.
Durrant, P. (2019). Growth in grammar corpus 2015–​2019. [data collection]. UK Data
Service. SN: 853809, http://​doi.org/​10.5255/​UKDA-​SN-​853809

Durrant, P. & Brenchley, M. (2019). Development of vocabulary sophistication across


genres in English children’s writing. Reading and Writing, 32(8), 1927–​1953.
Ellis, N. (2002). Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in Second Language Acquisition,
24(2), 143–​188.
Evert, S. (2007). Corpora and collocations. PhD thesis. University of Osnabrück. www.stefan-​
evert.de/​PUB/​Evert2007HSK_​extended_​manuscript.pdf
Fest, J. (2015). Corpora in the social sciences. How corpus-based approaches can support
qualitative interview analyses. Revista de Lenguas para Fines Específicos, 21(2), 48–69.
Gablasova, D., Brezina, V. & McEnery, T. (2017). Exploring learner language through
corpora: Comparing and interpreting corpus frequency information. Language Learning,
67(S1), 130–​154.
Gu, Q. (2015). Interviews at four secondary case study schools. [data collection]. UK Data
Service. SN: 851579, http://​doi.org/​10.5255/​UKDA-​SN-​851579
Halliday, M.A.K. & Matthiessen, C.M.I.M. (2014). Halliday's introduction to functional grammar.
4th edition. London: Routledge.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Kolesnikova, O. (2016). Survey of word co-​occurrence measures for collocation detection.
Computación y Sistemas, 20(3), 327–​344.
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
McEnery, T., Brezina, V., Gablasova, D. & Banerjee, J. (2019). Corpus linguistics, learner
corpora, and SLA:  Employing technology to analyze language use. Annual Review of
Applied Linguistics, 39, 74–​92.
Mercer, N. (2002). Words and minds: How we use language to think together. London: Routledge.
Mercer, N., Wegerif, R. & Dawes, L. (1999). Children’s talk and the development of
reasoning in the classroom. British Educational Research Journal, 25, 95–​111.
Oakes, M.P. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.
Partington, A. (1998). Patterns and meanings: Using corpora for English language research and teaching.
Amsterdam: John Benjamins Publishing.
Pérez-​Paredes, P. (2017). A keyword analysis of the 2015 UK Higher Education Green
Paper and the Twitter debate. In Orts, M.A., Breeze, R. & Gotti, M. (Eds.) Power, persua-
sion and manipulation in specialised genres: providing keys to the rhetoric of professional communities.
Bern: Peter Lang, 161–​191.
Pérez-​Paredes, P., Sánchez-​Tornel, M., Calero, J.M.A. & Jiménez, P.A. (2011). Tracking
learners’ actual uses of corpora:  guided vs. non-​guided corpus consultation. Computer
Assisted Language Learning, 24(3), 233–​253.
Pérez-​Paredes, P., Sánchez-​Tornel, M. & Calero, J.M.A. (2012). Learners’ search patterns
during corpus-​ based focus-​ on-​form activities:  A study on hands-​ on concordancing.
International Journal of Corpus Linguistics, 17(4), 482–​515.
Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. (2003). Reading concordances. London: Longman.
Siyanova-​Chanturia, A. & Spina, S. (2015). Investigation of native speaker and second
language learner intuition of collocation frequency. Language Learning, 65(3), 533–​562.

Stubbs, M. (2007). On texts, corpora and models of language. In Stubbs, M., Hoey,
M., Teubert, W. & Mahlberg, M. (Eds.) Text, discourse and corpora: theory and analysis.
London: Continuum, 127–162.
Viana, V., Zyngier, S. & Barnbrook, G. (Eds.). (2011). Perspectives on corpus linguistics.
Amsterdam: John Benjamins Publishing.
Villares, R. (2019). The role of language policy documents in the internationalisation of
multilingual higher education: An exploratory corpus-​based study. Languages, 4(3), 56.
Chapter 4

Researching education policies


Using your own corpus

4.1  Basic corpus design features


A substantial part of our research projects will most likely involve the compilation
and use of ad hoc corpora. In practical terms, this means that, instead of using
an existing corpus, we will have to put together our own corpus, something which
may not be as straightforward as it seems. In the following sections we will discuss
some basic principles in corpus design that are relevant to the members of an edu-
cational research team wishing to use CL methods in their project.

4.1.1  Designing corpora


Most of the specialised literature on corpus creation and design has been written
to meet the needs of research in linguistics and applied linguistics. This fact
explains the emphasis on representativeness, understood as capturing the properties
of sample texts as proxies for the language used by speakers of language X
across contexts Y and Z. Whatever our research project
looks like, several areas need to be considered when designing our own corpus.
Reppen (2010: 32) has paid attention to corpus size, representativeness and data
collection. Let us explore these in some detail.

4.1.1.1  Corpus size and data collection


The size of a corpus should be seen as a consequence of the research questions we
have devised and, ultimately, of the kind of corpus needed to explore those questions.
Setting unrealistic or speculative corpus size goals is ineffective and, conceptually,
works against the very raison d'être of a corpus as a research method. More
importantly, we need to ask ourselves what our corpus represents, that is, in which
ways our corpus can serve as a proxy for some (or all) of the data needed in our
research project. As regards size, Reppen has noted the following:

For explorations that are designed to capture all the senses of a particular
word or set of words, as in building a dictionary, then the corpus needs to

be large, very large  –​tens or hundreds of millions of words. However, for


most questions that are pursued by corpus researchers, the question of size
is resolved by two factors: representativeness (have I collected enough texts
(words) to accurately represent the type of language under investigation?) and
practicality (time constraints).
(Reppen, 2010: 32)

So, focusing on (large) corpus size is probably not the best, or at least not the
main, thing to do at the design stage. What needs careful consideration,
however, is how we want to integrate corpus data into our research. In chapter 1,
we proposed two approaches to corpus analysis in educational research. While the
first approach uses a corpus as primary data, the second approach uses a corpus as
secondary data (see Figure 1.4). The choice of approach has important methodo-
logical implications. When we are using a corpus as primary data, we advocate the
use of corpus linguistics as our main research methodology. When we use a corpus
as secondary data, however, we are using corpora and CL methods as part of a
larger research methodology framework, most likely mixed methods and pragma-
tism. In short, the educational researcher has essentially two options when it comes
to integrating CL methods in their own research. For the sake of clarity, we will dis-
cuss and exemplify broadly these two options by exploring two research case studies
that examine education policies. The first research project is set in Australia and
looks at how the National Quality Framework is represented and constructed in the
media. The second project looks at financial literacy education policy in Canada.
The researcher in this project uses CL as a complementary method. These two
studies are just examples of how CL methods can be used to research education
policies but, understandably, offer just a glimpse of the many potential uses.

CASE STUDY 1. FENECH & WILKINS (2019): USING CL AS MAIN RESEARCH METHODOLOGY

Fenech & Wilkins researched printed Australian media and their mediatising of
early childhood education (ECE) policy. In particular, the researchers looked at
the ‘representations of, and claims made about, the National Quality Framework
(NQF) – Australia's system of regulation and quality assurance of ECE and
childcare services – in newspaper media' (Fenech & Wilkins, 2019: 749). According to
the Australian government1 the NQF sets out ‘to raise quality and drive continuous
improvement and consistency in children’s education and care services’. The
researchers wanted to examine the role of Australian media as a ‘potential discursive
influence on parents’ childcare choice’. Their research questions were the following:

1 What key propositions and claims about the NQF are proffered in the
Australian print media?
2 Are these claims and purported impacts consistent across media organisations
(Fairfax vs. NewsCorp) and newspaper types (‘broadsheet’ vs. ‘tabloid’)?
3 Whose voices and agendas are being heard?

As their interest was to analyse Australian printed media, the researchers had to
come up with a corpus design that could be representative of the whole country.
They included three states (New South Wales, Queensland and Victoria), which
arguably are mostly representative of the eastern part of the country, as northern
Australia and the state of Western Australia were not sampled2. The researchers
then moved on to deciding on the newspapers to examine. They chose the two
papers with the largest circulation in each of the three states from 1 August 2013
to 31 May 2015. In total, 801 newspaper articles that had a focus on childcare
were selected and included in their corpus. Note that the researchers did not set
any target size in order to validate the suitability of their corpus. Instead, they
drew up a set of criteria that they thought would be necessary to meet in order
to obtain the right dataset that could help them answer their research questions.
They used the corpus management tool WordSmith Tools3 to do the following:

1 Identify which of the initial pool of 801 articles made specific references to the
National Quality Framework or aspects of the National Quality Framework.
2 Discover the keywords that were specific to the group of texts that discussed
the National Quality Framework.
3 Study the evolution of the topics discussed between 2013 and 2015.

As a result, Fenech & Wilkins found that 121 articles had some focus on the NQF,
60% from Fairfax papers (broadsheets) and 40% from News Corp (tabloid) papers.
The authors also identified a set of somewhat expected keywords such as quality
standards, report, qualifications and ratios as well as some other less expected keywords
that emerged from the texts. Their analysis of the topics and voices appearing con-
tinuously during the almost two years’ worth of data allowed the researchers to
focus their analyses on ideas and stakeholders of ‘continuing prominence’ (Fenech
& Wilkins, 2019: 756). In terms of the two main media sources, Fairfax and News
Corp papers, the authors found that most of the keywords identified were ‘dis-
tinct to one of the two media corporations’ (Fenech & Wilkins, 2019: 757). These
differences were discovered by using quantitative methods4 and point to the fact
that different media adopt and promote different positionings of the National
Quality Framework. It was the keyword analysis performed by Fenech & Wilkins
that identified in a precise way the nature and extent of these (lexical) differences.
In short, Fairfax articles seemed to focus on quality whereas News Corp laid
emphasis on care. This is where the power of CL methods lies: statistical analysis
can be used to explore language use and inform further research methods that
could be used by researchers to explore data qualitatively. In this research paper,
the authors carried out a content analysis of those articles that explicitly discussed
the NQF, only 13 of the 121 in the corpus. This is an excellent example of how to
use CL methods and other data analysis methods to explore a research question.
Fenech & Wilkins concluded that most aspects of the educational policy
analysed are actually mediated by journalists and media groups. The two media
companies represented the NQF and positioned themselves in radically different
ways, thus presumably affecting the impact of the implementation of the NQF
agenda as seen by, for example, for-profit and not-for-profit education groups.
In terms of the design of the corpus, researchers need to reflect on a variety
of aspects in what Nelson (2010) has called an initial planning document. We
are offering here a template (Table 4.1) that can be used to help you think about
the planning and subsequent building of your corpus. The questions included
will make you reflect on your research question(s) and other alternative research
methods for data collection and analysis (A, B), the source of your data and
criteria of inclusion (C, D), as well as the feasibility of the data collection
process (E, F).

Table 4.1 Before designing your own corpus

A  What is my research question? Can it be answered, either partially or
   totally, through CL methods?
B  What other research methods could be used to answer this question?
   What is the main contribution of a corpus in the context of my research
   design?
C  What texts do I need? How many?
D  What are my criteria of inclusion?
E  Is it feasible to collect the data within my timeline?
F  Is it feasible to collect and clean the data (and annotate, if necessary)
   within my timeline?
Once we are sure that a corpus of texts5 is necessary as part of our research, we
need to devise a strategy for data collection. We are assuming that the corpus we
seek to build is not already available, and that we will need to put it together. To
do this we must consider the implications of modelling the types and the nature
of texts to be included in our corpus so that it can answer our research question.
We have just seen how Fenech & Wilkins (2019) made a series of decisions that
sought to ensure that their overarching question could be answered: ‘What key
propositions and claims about the NQF are proffered in the Australian print
media?’ The researchers need to operationalise all of the arguments within their
research question so that they can be dealt with methodologically. Let us illustrate
this process by breaking down Fenech & Wilkins’ overarching question into three
arguments A, B and C.
We need to understand how each of the arguments as set out in Figure 4.1 will
impact our corpus design and data collection. Starting with C, the researchers will
need to devise strategies to collect data in a digital format. If the textual data is
printed, then the researchers will have to come up with an OCR (optical character
recognition) or transcription strategy so that their texts are stored in a machine-​
readable format (.txt, .docx, .rtf, or even .pdf). Transcription will be dealt with in
chapter 5, so let us assume that we can find a digital version of our data on the
Internet.
If the target data can be accessed online, the researchers will need to come up
with a plan to make sure that their texts have been cleaned up and can be processed

by a corpus management tool. By clean texts we mean files that, free of HTML
formatting, PDF coding, etc., contain the language, and only the language, that
needs to be examined in our research. This follows the general guideline that
'the safest policy is to keep the text as it is, unprocessed and clean of other codes'
(Sinclair 1991: 21).

[Figure 4.1 shows the research question 'What key propositions and claims about
the NQF are proffered in the Australian print media?' broken down into three
arguments: (A) 'key propositions and claims' (analysis), (B) 'about the NQF'
(domain: what is being talked about) and (C) 'proffered in the Australian print
media?' (geographical location of data; register/genre).]

Figure 4.1 Breaking down a research question (Fenech & Wilkins, 2019) into arguments

Let us break down this process into two parts: obtaining the data
and preparing the data for corpus analysis.
Depending on the scope of our corpus, we may want to obtain the data either
by visiting the online provider website or by using specialised services that can
speed up the process. In the former scenario, we can use Google or any other
search site to locate the information for us. For example, if we wanted to search
for news features that discuss the Australian National Quality Framework (NQF)
in The Age,6 we could do one of the following:

1 Search for 'National Quality Framework' in the search box of the paper. Then
click on every article and copy the page or the text. As we obtained 887 results,
this will be a time-consuming task and, probably, not very efficient. The old-school
way to do this would be to save each page first as HTML and then extract
the text. This can be done in many different ways. We could open the HTML file
with a simple text editor. No matter what you use, it will look ugly. You will find
in the file the navigation structure of the site and much more noise we are
not interested in. However, you can also find some of the metadata that you
may want to keep (name of the author, date of publication, original URL,
etc.), so it is not a bad idea to screen every single file manually and decide how
you want to save and store your data. Using an Excel file to document this
process is usually a good idea. Alternatively, you can select the text, copy it and
paste it into Notepad (Windows) or TextEdit (Mac). Make sure you save this file
as plain text (.txt extension) and save it as UNICODE UTF-87. If you have
some basic knowledge of Python programming, or somebody in your team has
it, web scraping is a great option. All you need is a Python library and to get
familiar with how it pulls out data from HTML pages. 'Requests' and 'Beautiful
Soup' are easy to use and will most likely meet your basic needs (see the sketch
after this list).
2 Alternatively, we can use a third-​party service such as LexisNexis or Factiva
to fetch the texts for us. It is a good idea to double-​check whether your
institution or university has access to these services. Factiva8 is an inter-
national news database with over 32,000 sources from 200 countries in 28
languages. The potential to locate text-​based sources is, as you can see, huge.
The databases include national, international and regional newspapers,
magazines, journals, newswires (i.e. Reuters), TV or radio podcasts (e.g. BBC,
CNN, ABC, CBS, NBC, Fox, etc.), news and business information websites,
blogs, company reports and the EUR-​Lex website. LexisNexis9 is an excel-
lent service that will let you collect all sorts of law-​related texts, including
annotation on whether and when an Act was repealed and information on
additional provisions and savings. LexisNexis will also let you search for news
contents and select many or just a particular source. This type of service
will let you specify the requirements of your search, including search words,
search time span and type of source. A search of ‘mental health’ and ‘educa-
tion’ in blogs in North America during the last year (2018–​2019) in Factiva
will return 74 results. These results can be emailed or saved either as pdf or
text files and reused for private, academic purposes. LexisNexis will not let
you download all texts in one single file, but you will have the opportunity to
search within a single source, for example, TES, and then examine every hit
as clean text on screen. Another advantage of these services (Factiva or
LexisNexis) is that we can discover new sources of potential data.10
3 Attention must be given to copyright issues and the anonymisation of the data.
In the European Union, the General Data Protection Regulation has strict laws
that may potentially affect how metadata is collected and stored electronically.
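
The following is a minimal sketch of the web-scraping route mentioned in point 1: it fetches one page with 'Requests', strips the noise with 'Beautiful Soup' and saves clean UTF-8 text. The URL and the list of tags to drop are hypothetical placeholders that you will need to adapt to the structure of your target site.

# A minimal sketch of scraping one article and saving it as clean UTF-8 text.
# The URL below is a hypothetical placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/news/sample-article"  # hypothetical URL
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Keep some metadata for your documentation spreadsheet.
title = soup.title.get_text(strip=True) if soup.title else "untitled"

# Drop navigation, scripts and other noise before extracting the text;
# which tags count as noise will vary from site to site.
for tag in soup(["script", "style", "nav", "header", "footer"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)

# Save as plain text, UTF-8, ready for AntConc or Sketch Engine.
with open("article_001.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n" + url + "\n\n" + text)

Running a script like this inside a loop over a list of URLs will produce one clean .txt file per article, which is exactly the input format the corpus tools discussed in this book expect.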

CASE STUDY 2. PINTO (2013): USING EMBEDDED CL RESEARCH METHODS

Pinto (2013) researched the status of financial literacy education in Canada after
the 2008 crisis. She used narrative policy analysis (NPA) and corpus linguis-
tics methods to examine the narratives around financial literacy education and
uncover the ‘assumptions underlying public policy […] embedded within rhet-
orical devices’. Rather than extending her inquiry to the whole country, Pinto
examined education policies in Ontario as education is under the provincial
jurisdiction in Canada. NPA is interpretive ‘in that it regards a certain form of
storytelling as analytic entrée into policy-​relevant experiences with emphasis on
the identification of schemes and tropes, and especially policy metaphors’ (Pinto,
2013: 100). Pinto describes her data collection in the following way:

I collected extensive documentary evidence of financial literacy education


debate in the form of newspaper reports, press releases, position papers,
speech transcripts, transcripts of debates in the Ontario Legislative Assembly
and other reports. Sixty-​eight newspaper articles were analysed, collected
through a search of the keyword ‘financial literacy’ in the Proquest Canadian
Newsstand database and narrowed to include all of those that address finan-
cial literacy education between January 2008 (the start of the period of global
financial crisis) and August 2011. The use of newspaper articles reflects
research that affirms the media’s role as an important conduit for competing
advocacy coalition positions to convey positions and construct narratives
(Shanahan et al. 2008). […] I also collected two government reports pertaining
to financial literacy […] I reviewed transcripts from the Legislative Assembly
of Ontario for the time period studied and identified discussion at the pro-
vincial level having to do with financial literacy education. Finally, I included
three speeches given by Canadian Minister of Finance Jim Flaherty during
the timeframe studied, each of which addressed the issue of financial literacy
education. By drawing on these varied data sources, I was able to triangulate
narratives arising from various policy actor groups and public venues.
(Pinto, 2013: 100–​101)

Pinto examined inductively each of the individual texts collected and decided
whether the narratives favoured financial literacy education (LE) by promoting
an idea of LE as a solution to looming economic uncertainty (the dominant
narrative) or by promoting suspicion about government interest in LE (the counter
narrative). After this interpretive analysis, Pinto used collocation analysis to iden-
tify ‘unique, recurrent semantic devices’ in each subcorpus of texts:

I further ran all text files through a corpus linguistics research software tool,
AntConc 3.2.4, to triangulate my interpretation as the software would identify
collocations that I may have missed. I also searched for schemes and tropes
operating within the data sources with particular attention to the trope of
metaphor as a rhetorical device shaping each of the narratives […] Metaphor
is especially valuable for identifying underlying themes and revealing power
dynamics within policy […] especially given policies are often understood
in symbolic terms […] Certainly, ‘those who will control the metaphors will
ultimately control the action: and those who change the metaphors will ultim-
ately change the action’ [Monin & Monin, 1997: 57].
(Pinto, 2013: 102)

The use of two subcorpora that can be contrasted is a common research design
in CL. This approach was advocated by Pinto, who analysed pro-​literacy educa-
tion narratives from two different camps by isolating the linguistic devices used by
those defending them. The dominant narrative used the metaphor of a morally
superior crusade that used 'a relatively neutral language and tone from a rhetorical
standpoint'; the counter narrative used 'passionate and emotional language to cast
doubt on the true intentions of the crusaders' (Pinto, 2013: 110).

[Figure 4.2 shows the research question 'How do financial literacy education
policies differ?' divided into (A) the dominant narrative and (B) the counter
narrative.]

Figure 4.2 Pinto's (2013) analysis of narratives

Table 4.2 Corpus building in Fenech & Wilkins (2019) and Pinto (2013)

                           Fenech & Wilkins (2019)            Pinto (2013)
Data size                  Not relevant                       Not relevant
Representativeness         Yes                                No
Data collection criteria   Strictly defined                   Loosely defined
Corpus linguistics         Overarching research methodology   Research methods subordinate to NPA
The type of research in Pinto (2013) differs from the one in the first case study
in this section in that (1) the corpus size and collection criteria are more loosely
defined and (2) the CL methods are subordinate to the overarching research meth-
odology embraced by the researcher. Figure 4.2 illustrates how NPA was used to
elucidate the main types of narrative surrounding financial literacy education in
Canada after 2008.
Note that CL methods were not used to tell those two types of narratives
apart. What is important to understand is that the overarching research question
in Figure  4.2 is not contingent on the application of CL methods. What the
researcher did instead was to submit the texts classified either as dominant or
counter narratives to collocation analysis, and thus triangulate the results. Table 4.2
sums up how the two research projects in this section approached corpus building.
As in most research projects that do not aim at recording linguistic use, data
size per se is not important or, to put it differently, should not determine
the overall quality of the corpus. Despite the literature devoted to the time-
consuming nature of putting a corpus together (Clancy, 2010), we believe that,
well into the 21st century, this can no longer be a defining criterion for what
should count as a good corpus. The availability of online data, processing tools
and corpus management software makes CL methods available to every researcher

with an Internet connection and some sound understanding of the methodology.


Establishing data collection criteria, however, is absolutely paramount. These
criteria need to be explicit and accountable, otherwise our research will be meth-
odologically flawed as CL methods display what Habermas has described as a
technical interest in scientific testing, hypothesis testing and quantitative methods
(Cohen, Manion & Morrison, 2011). Table 4.3 provides a summary of how to
design a basic corpus.

Table 4.3 Skill 9: basic corpus design features

Skill 9
• Before putting your own corpus together, you need to make sure (1) that such
  a corpus does not exist already and (2) that you understand how your research
  question(s) can be answered by using CL methods.
• First, you need to establish whether CL is your primary research methodology,
  or if it plays a subordinate role in your research design.
• If you are using CL as a primary research methodology (Figure 1.1), you need
  to make sure that the corpus design is robust enough to answer your research
  question(s) through one of the CL methods available (collocation analysis,
  keyword analysis, etc.).
• You will need to figure out and carefully document the inclusion criteria of the
  texts in the corpus. Remember the total accountability principle. You should
  consider, at least, the following:
  • years when the texts were produced
  • number of years or period of time represented in the corpus
  • type(s) of register included: news, legal language, fiction, academic
    language, etc.
  • domain of the texts: education, mental health, sports, public debate, etc.
• If you are using CL methods as a complementary methodology (Figure 1.2),
  then you need to devise the ways in which the different methods in your
  research design will provide you with different data and data analysis options.
• It is a good idea to think in advance how the design of the corpus can affect
  the range of data to be collected in the first place, and the type of insights
  that can be gained in the second.
• Pilot your design of the corpus by examining how your target texts can be
  submitted to analysis either by collocation, keyword or colligational analysis,
  among others. Are you happy with the range of results? What action needs to
  be taken? Is it the size of the corpus that needs to be modified? Is it the
  nature of the texts?

4.2  Comparison basics and significance testing


Biber & Conrad (2009: 36) have argued that ‘effective register analyses are always
comparative’. Actually, corpus research methodology is more often than not com-
parative, at least ontologically. Most CL methods rely on comparing the frequency
of occurrence of discrete features in corpus A versus corpus B. Even when one
single corpus is analysed, there exists some sort of comparison between expected
versus random occurrence of linguistic items.
When it comes to examining language data in corpora, one of the two following
scenarios is most likely: in the first scenario, two different corpora are compared
so as to try to understand how usage differs across the two datasets; in the second
scenario, our main interest rests with the examination of one corpus of texts,
while the use of a second corpus as a reference corpus is contingent upon the
corpus method used. This is the case of keyword analysis, which will be covered
in chapter 6. In this section, we will offer some basic guidelines to understand how
we can go about comparing features across two corpora. We will assume that the
researcher is equally interested in both datasets, and that a comparison is neces-
sary in order to understand the differences between the type of language use that
is used or represented in either of the two corpora.
In this section, we will examine the policies on international education
published by the governments of the UK and New Zealand. For this analysis, we
will use policy documents publicly available on the websites of both governments.
The UK’s ‘International Education Strategy:  global potential, global growth’,
published in March 2019,11 includes a foreword by both the Secretary of State for
Education and the Secretary of State for International Trade and President of the
Board of Trade. The introductory foreword reinforces the future role of the UK
in a post-​Brexit scenario:

With around 90% of global economic growth in the next five years expected
to originate outside the European Union, forging a new role for the United
Kingdom on the world stage starts with rising to the exporting challenge –​of
which this strategy and the education sector will form a key part. Working
together, we can help UK education reach its full, global exporting potential.
(DfE & DIT, 2019)

The document outlines the objectives for UK higher education, as well as the role
of the government:

Our objective is to drive ambition across the UK education sector. We will


champion the breadth and diversity of the UK’s international education offer,
strengthening our position as the partner and provider of choice for countries
and individuals around the world. Working in tandem with the education
sector, we will provide the practical solutions and tools it needs to harness
its full international potential. We will focus on the role of government in

supporting exports while recognising that government should do only those


things which it alone can do. To make a real difference, the government’s
action must be met by the ambition and activity of the sector.
(DfE & DIT, 2019)

New Zealand’s ‘International Education Strategy 2018–​2030’ was published in


August 2018.12 The foreword by the Minister of Education suggests a different,
more all-​embracing approach to education policy:

International education includes international students coming here to study


among New Zealanders, our own people travelling the world to experience a
global component in their education, and people anywhere, online and inter-
nationally, learning through great products, services and approaches built in
New Zealand.
(New Zealand Government, 2018)

The objective of the policy mentions both incoming and outgoing students, as
well as the wellbeing of all the students involved:

This International Education Strategy aims to create an environment where


international education can thrive and provide economic, social and cultural
benefits for all New Zealand. It builds on New Zealand’s quality education
system and focuses on delivering both good education outcomes for inter-
national students and global opportunities for domestic students and our edu-
cation institutions. The Strategy is underpinned by the International Student
Wellbeing Strategy, and a commitment to maintaining the integrity of New
Zealand’s immigration system.
(New Zealand Government, 2018)

4.2.1  Comparison basics and part-of-speech (POS) tagging


Both policy documents were cleaned and converted to .txt files and uploaded to
Sketch Engine. A first approach to the two documents reveals that the UK policy
document has 14,708 tokens and 2,429 types while the NZ policy document has
6,665 tokens and 1,438 types (see Tables 3.3 and 3.5 to make sure you understand
the difference between tokens and types). Many researchers use the Type Token
Ratio (TTR) as an index of lexical richness or variation. TTR is calculated by
dividing the types by the tokens. In the UK policy document, we find TTR is 0.16
while in the NZ document it is 0.21, which suggests that the latter offers a more
varied vocabulary choice. TTR can be interesting when used comparatively and
rather meaningless if this is not the case. Software such as ATLAS.ti offers TTR
as a lexical richness index, so the index is widely used outside CL. However, we
must note that TTR is highly sensitive to corpus size, that is, it is sensitive to the
number of tokens in the corpus. Put simply, a large corpus will necessarily yield a

low TTR. The reason is simple: in English the most frequent 2,000 words account
for 87.4% of fiction books and 90.3% of spoken communication, which reinforces
the idea that it takes massive corpus data to cover a wide range of  types.
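
As a minimal sketch of the TTR calculation itself: the tokeniser below is deliberately naive, so figures will differ slightly from Sketch Engine's, and the file name is a placeholder.

# A sketch of computing a Type-Token Ratio from a plain-text file.
import re

def type_token_ratio(path):
    with open(path, encoding="utf-8") as f:
        # Naive tokeniser: lowercased alphabetic strings only.
        tokens = re.findall(r"[a-z]+", f.read().lower())
    return len(set(tokens)) / len(tokens)

# With the figures reported above, the UK document gives
# 2,429 / 14,708 ≈ 0.16 and the NZ document 1,438 / 6,665 ≈ 0.21.
print(round(type_token_ratio("uk_policy.txt"), 2))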
By default, every corpus uploaded to Sketch Engine is part-​of-​speech tagged,
which means that every word in the corpus is annotated with morphological infor-
mation. This gives researchers the possibility to run sophisticated searches that
combine POS tags and different types of unit (lemmas and words). There are
different POS tagging services and tagsets that can be used, their main differences
being that the tags used display different levels of depth. Sketch Engine uses by
default the English TreeTagger PoS tagset with Sketch Engine modifications, but
other services, and of course other languages, will use other software and tagsets.
The software that performs the analysis of the language and tags every token as
a part of speech is called a ‘tagger’. There are freely available solutions13 such as
the Stanford Part of Speech Tagger14 or the CLAWS free online service15 at the
University of Lancaster (only 100,000 words).
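
If you prefer to work locally in Python, NLTK offers another freely available tagger. The following is a minimal sketch that tags the example sentence discussed below with NLTK's default Penn Treebank tagger; like the services just mentioned, it uses its own tagset, so the tags differ slightly from the TreeTagger and CLAWS output shown in this section.

# A minimal sketch of POS tagging with NLTK's default (Penn Treebank) tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = ("Our higher education institutions are amongst "
            "the most renowned and prestigious in the world")
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# [('Our', 'PRP$'), ('higher', 'JJR'), ('education', 'NN'), ...]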
A sentence from the UK policy document such as ‘Our higher education
institutions are amongst the most renowned and prestigious in the world’ will look
like this once it has been POS-​tagged on Sketch Engine:

<s>
Our  PPZ   our-​ d
higher JJR    high-​ j
education  NN   education-​ n
institutions  NNS  institution-​ n
are  VBP  be-​ v
amongst  IN  amongst-​ i
the   DT  the-​ x
most  RBS  most-​ a
renowned   JJ  renowned-​ j
and  CC  and-​ c
prestigious   JJ  prestigious-​ j
in    IN     in-​i
the   DT    the-​ x
world  NN  world-​ n
.    SENT .-​ x
</​s>

On the left, we can see our sentence in vertical format and next to it the actual tag
that was assigned by the tagger. Our has been tagged as PPZ (possessive pronoun),
higher as JJR (comparative adjective), education as NN (singular noun) and so on.
The tagset used by the Sketch Engine English TreeTagger PoS tagset contains
55 tags. CLAWS tagset 7 contains 137 tags, allowing for further discrimination
between word categories such as adjectives or adverbs, for example. The same
sentence tagged by the CLAWS free online service will return the following:

0000003 010 Our 93 APPGE


0000003 020 higher 03 [JJR/​100] RRR@/​0
0000003 030 education 93 NN1
0000003 040 institutions 93 NN2
0000003 050 are 93 VBR
0000003 060 amongst 93 II
0000003 070 the 93 AT
0000003 080 most 97 RGT
0000003 090 renowned  97 JJ
0000003 100 and 93 CC
0000003 110 prestigious 93 JJ
0000003 120 in 93 [II/​100] RP@/​0
0000003 130 the  93 AT
0000003 140 world 93 NN1
0000003 141 . 03 .

Our has been tagged as APPGE (prenominal possessive pronoun), higher as JJR
(general comparative adjective), education as NN1 (singular common noun) and
so on. As you can see, the tags are not the same but, in principle, the fact that
different tagsets co-​exist should not be something that is of primary importance
in the context of our research. We need to be aware that while different taggers
and tagsets16 will use different categories, the fundamental, broad morphological
categories of analysis will be present in most software.
In practical terms, there are two ways to POS-​tag a corpus. If we are using
services such as Sketch Engine, the POS annotation will remain invisible to the
researcher although the search interface will allow us to carry out searches that use
POS tags. The following screenshot from Sketch Engine (Figure 4.3) shows how
we can search nouns, adjectives, verbs, etc. in our UK policy document.
This type of search can be performed only because our corpus has been POS-​
tagged. If we are using stand-​alone software like AntConc, we will need to upload
a corpus that has already been POS-​tagged. This is a different process altogether.
Fortunately, AntConc is highly customisable. Among other things, we can decide
(1) whether we want to see tags or not and (2) what can be considered as a tag (start
and end tag symbol). Figure 4.4 shows the Global Settings window where we can
set up our preferences.
There are two major types of tags:  non-​embedded tags and embedded tags.
Non-​embedded tags are independent of the text being annotated. The following
is an example of a poem by William Blake from the Text Encoding Initiative
(TEI) website17 where an introduction to Extensible Markup Language (XML) is
offered. This example shows how non-​embedded tags can be used to annotate
structure and structural elements. Note that every tag, for example <poem>, is
followed at some point by a closing tag </poem>. So, in the following example
we have five different types of tags, each with an opening and a closing form:
<anthology>, <poem>, <heading>, <stanza> and <line>.

Figure 4.3 Word list options in Sketch Engine

Figure 4.4 AntConc tag settings



<anthology>
<poem>
<heading>The SICK ROSE</​heading>
<stanza>
<line>O Rose thou art sick.</​line>
<line>The invisible worm,</​line>
<line>That flies in the night</​line>
<line>In the howling storm:</​line>
</​stanza>
<stanza>
<line>Has found out thy bed</​line>
<line>Of crimson joy:</​line>
<line>And his dark secret love</​line>
<line>Does thy life destroy.</​line>
</​stanza>
</​poem>

Anything between <anthology> and </​anthology> is part of this structure and


so on. We will come back to TEI tags in chapter 6.
Embedded tags are attached to the word. The sentence from the UK policy
document ‘Our higher education institutions are amongst the most renowned and
prestigious in the world’ is POS-​tagged by TagAnt like this:

Our_​PP$ higher_​JJR education_​NN institutions_​NNS are_​VBP amongst_​


IN the_​DT most_​RBS renowned_​JJ and_​CC prestigious_​JJ in_​IN the_​
DT world_​NN  ._​SENT

Although there are many software options available, Laurence Anthony’s


‘TagAnt’18 is a tool that we would recommend for a variety of reasons:  there
are Windows, Mac and Linux versions, it is free to download and use, and new
versions are periodically published. Laurence Anthony’s corpus software is used
worldwide, and there is a whole community of researchers out there that can
always offer their expertise and previous experience with data analysis. Figure 4.5
shows UK ‘International Education Strategy:  global potential, global growth’
tagged in the workspace on the right. Note that every word has been POS-​tagged
and that an embedded tag is shown immediately after.
Once we have tagged our corpus, we can save it (remember to save it as a .txt
file, UTF 8) and open it in AntConc. If we search for the token universities, we will
get the following, as shown in Figure 4.6.
As we can see, the token universities appears either as universities_​NNS or
Universities, the latter being part of an entity. If we do not want to see the
embedded POS tags, we will have to select Hide tags (see Figure 4.4 for how to do
this, and Figure 4.7 for what the AntConc concordance looks like once the tags
are hidden).

Figure 4.7 AntConc concordance option (tags hidden)
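
Embedded tags also make a tagged corpus easy to query outside AntConc with simple pattern matching. As a minimal sketch, using the TagAnt output shown above, the following pulls out every token carrying the _NNS (plural noun) tag:

# A sketch of querying TagAnt-style embedded tags with a regular expression.
import re

tagged = ("Our_PP$ higher_JJR education_NN institutions_NNS are_VBP "
          "amongst_IN the_DT most_RBS renowned_JJ and_CC prestigious_JJ "
          "in_IN the_DT world_NN ._SENT")
# Capture the word form immediately preceding the _NNS tag.
plural_nouns = re.findall(r"(\w+)_NNS\b", tagged)
print(plural_nouns)  # -> ['institutions']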

Figure 4.5 TagAnt interface

Figure 4.6 AntConc concordance option (tags visible)



Once a corpus has been POS-​tagged, we are ready to query our texts by exam-
ining how the lexical items and the grammatical categories are related. The
corpus size will need to be reported and considered when calculating the relative
frequency of occurrence of the linguistic items explored in the corpus. Where
to start? A word list of the most frequent nouns, or verbs, provides us with a first
glimpse of the lexical items used in the two policy documents. Table 4.4 shows the
20 most frequent nouns in both policy documents and their relative frequencies
per 1,000 words.
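
Because the two documents differ considerably in length (14,708 vs 6,665 tokens), the counts in Table 4.4 are normalised rather than raw. As a minimal sketch of the arithmetic, using a hypothetical raw count and the corpus size reported above:

# A sketch of normalising a raw frequency to occurrences per 1,000 words.
raw_count = 78          # hypothetical hits for a lemma in the UK document
corpus_tokens = 14_708  # UK policy document size in tokens
per_thousand = raw_count / corpus_tokens * 1_000
print(round(per_thousand, 1))  # -> 5.3 occurrences per 1,000 words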
The information presented in Table 4.4 shows us how some nouns are more
frequent in each of the two policies. For example, while in the UK policy there
seems to be a high use of trade, export and market, in the NZ policy there seem to
be slightly more frequent references to market, quality and region. This is an excel-
lent departure point to start our exploration of both policy documents. We can
either focus on the lexical items that are identical in both documents or, alterna-
tively, look at what is unique, more frequent or not frequent at all, in one of the
documents. Using this list as suggested in previous chapters, we are now ready to
examine the contexts of use of some of these items in every dataset. For example,
the use of trade in the UK policy document is almost exclusively linked to the activ-
ities of the Department for International Trade; in 33 of the 78 concordance
lines analysed, it is used with the auxiliary verb will, followed by work (ten times)
and encourage (nine times). In the latter case, most of these uses appear in an action
subsection of the document. These are the nine concordance lines (Table 4.5) where encourage
follows The Department for International Trade will in the document.
Concordance lines 6–9 are repeated in the document as the will predictions are
revisited in terms of the timeline for their implementation.

Table 4.4 Top 20 most frequent nouns in two HE policy documents

         UK policy document                      NZ policy document
Ranking  Lemma          Rel. freq. per 1,000     Ranking  Lemma          Rel. freq. per 1,000
1        UK             196                      1        Education      248
2        Education      141                      2        New            224
3        Sector         91                       3        Student        208
4        Student        64                       4        Zealand        205
5        International  63                       5        Sector         50
6        Department     57                       6        International  49
7        Opportunity    56                       7        Education      46
8        Government     49                       8        Provider       45
9        Education      49                       9        Term           41
10       Trade          46                       10       Market         37
11       Export         44                       11       Medium         33
12       Provider       44                       12       Experience     33
13       Market         32                       13       Government     33
14       School         29                       14       Quality        32
15       Country        28                       15       Agency         29
16       World          27                       16       System         29
17       Year           26                       17       Region         28
18       Action         26                       18       Strategy       28
19       Offer          24                       19       Opportunity    25
20       Activity       24                       20       MOE (Ministry of Education)  24

Although the analysis of nouns will be discussed in chapter 6, even with the
somewhat limited insight we have gained from examining some of the top nouns
in the UK document, we can note how trade and related concepts play a substantial
role. A further colligational analysis of trade as a noun will reveal the following:

• Trade is premodified significantly (as calculated by Word Sketch using LogDice;
see the sketch after this list) by free, UK, international and education.
• Trade modifies significantly policy, agreement, mission, and many others.
• When trade is an object, the verbs are champion (as in champions free trade19), bring
and help.
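
As a minimal sketch of the logDice measure behind these Word Sketch rankings (Rychlý's formula: 14 plus the binary log of twice the co-occurrence frequency divided by the sum of the two individual frequencies), with hypothetical counts:

# A sketch of logDice, the measure Word Sketch uses to rank collocates.
import math

def log_dice(f_xy, f_x, f_y):
    # 14 + log2(2 * f(x,y) / (f(x) + f(y))); the theoretical maximum is 14.
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts: 'free' premodifies 'trade' 12 times; 'free' occurs
# 45 times and 'trade' 78 times in the corpus.
print(round(log_dice(12, 45, 78), 2))  # -> 11.64

Unlike MI or the T-score, logDice does not depend on corpus size, which is why Sketch Engine favours it for comparing collocational strength across corpora.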

None of these analyses could have been carried out if the datasets had not been
POS-​tagged. In terms of comparison between the two policy documents, we
could use either descriptive or inferential statistics. Descriptive statistics will give
us a measure of the quantity of word classes used. In Table 4.6, we can see the
number of raw lemmas (in brackets) and the relative frequency per 1,000 lemmas.
Note that we needed the total of lemmas in the UK and NZ policy documents,
2,012 and 1,227, respectively, to calculate the relative frequencies. What these fre-
quencies tell us is that both policy documents used a very similar range of word cat-
egories, which is expected given their similar nature and functions.

Table 4.5 Concordance lines from UK 'International Education Strategy: global potential, global growth'

1 […] this it needs reliable information on where the best opportunities are for
different types of providers. The Department for International Trade will
encourage the growth of the early years market by sharing more intelligence
with the sector about the scale and scope of […]
2 […] growth, including in the European market where we are seeing growing
demand for UK schools. The Department for International Trade will
encourage independent schools to access international opportunities, using
improved education exports data to […]
3 […] bodies across the UK, a number that is forecast to increase. The
Department for International Trade will encourage a greater proportion
of UK skills organisations to consider taking their offer internationally, where
[…]
4 […] new and existing providers, and to improve the overall evidence base around
best practice and impact. The Department for International Trade will
encourage the sector to grow TNE by engaging in dialogue with countries with
recognised export potential.
5 […] international objectives. It is this physical presence that the UK government
can help facilitate. The Department for International Trade will
encourage the EdTech and educational supplies sector to engage with buyers
both in the UK and overseas.
6 […] basis given the differences in demand from different parts of the world.
Completion Spring 2020 Action 11. The Department for International
Trade will encourage the growth of the early years market by sharing more
intelligence with the sector about the scale and scope of […]
7 […] that want to expand their offer to find the best export opportunities for them.
Completion Spring 2020 Action 12. The Department for International
Trade will encourage independent schools to access international
opportunities, using improved education exports data to […]
8 […] to raise awareness of the ELT offer for the benefit of the UK education sector.
Ongoing: review Spring 2020 Action 17. The Department for International
Trade will encourage a greater proportion of UK skills organisations to
consider taking their offer internationally, where […]
9 […] will focus on countries of particular interest and opportunity for the sector.
Completion Spring 2020 Action 22. The Department for International
Trade will encourage the EdTech and educational supplies sector to engage
with buyers both in the UK and overseas.

Table 4.6 Relative frequencies of lemmas per 1,000

Lemmas      UK 'International Education Strategy:   New Zealand 'International Education
            global potential, global growth'        Strategy 2018–2030'
Nouns       514 (1,043)                             491 (602)
Verbs       167 (339)                               171 (210)
Adjectives  152 (308)                               152 (187)
Adverbs     57 (115)                                56 (69)

However, this is a very broad picture. We need a more precise understanding of the differences. We


can try to appreciate these by using a combination of keyword analysis and statis-
tical coefficients that examine how different classes of words are used. Wmatrix20
(Rayson 2005, 2008) is an excellent tool to do this. We can compare the frequen-
cies and the distribution of the POS tags in the two documents and see where
differences are statistically significant. Note that the comparison corpus unit here
could be defined as ‘the national international policy strategy as implemented in an
official document’. In our context, it would not make much sense to compare two
documents that either thematically or linguistically are too divergent.
After running the POS keyword analysis, we find the following:

• Modal verbs (in particular will and can) are statistically more frequent (11.49)
in the UK policy document. The LogRatio, which measures the effect size, is
0.61. These are five of the 197 concordance lines where will is used in the UK
document:
s strategy and the education sector will form a key part. Working together,
that only government can give. We will seek to grow education exports and i
ses to Grow on the World Stage. We will seek to use the opportunities presen
across the UK education sector. We will champion the breadth and diversity
andem with the education sector, we will provide the practical solutions and
• The preposition for is statistically more frequent in the UK document (11.40).
The LogRatio is 0.58. Most of these uses involve a prepositional phrase that is
complementing a noun (reputation, opportunities, processes). These are five of the
293 concordance lines where for is used in the UK document.
Setting the foundations for global success. The UK’s global
and embrace our ambitious objectives for the education sector. The
Government
The UK has a global reputation for education, characterised by excellen
first by international students for student experience across several mea
the UK, we are the European leaders for education technology. Our cultural

Other relevant POS tags that were statistically more frequent in the UK document were the use of plural nouns expressing time (years), the use of existential there (there is) or the use of more (see Table 4.8 for information on significance). These three POS tags display a LogRatio of around 1.85, which indicates that they are used roughly 3.6 times as often (2^1.85 ≈ 3.6) in the UK policy document. Table 4.7 summarises what is involved in comparing two corpora, and Table 4.8 summarises each of the statistical tests we have mentioned so far.
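For readers who want to see what these two statistics compute, here is a minimal sketch in Python. The formulas are the standard ones used in Wmatrix-style keyword analysis (log-likelihood for significance, LogRatio for effect size); the counts in the usage example are invented for illustration and are not the exact figures behind the comparison above.

import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Significance statistic: how surprising the frequency difference is."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

def log_ratio(freq_a, size_a, freq_b, size_b):
    """Effect size: binary log of the ratio of relative frequencies.
    1 = twice as frequent, 2 = four times as frequent, and so on."""
    return math.log2((freq_a / size_a) / (freq_b / size_b))

# Invented counts: 197 hits in a 13,000-token document versus
# 80 hits in an 8,000-token document.
print(round(log_likelihood(197, 13_000, 80, 8_000), 2))  # ~10.37
print(round(log_ratio(197, 13_000, 80, 8_000), 2))       # ~0.6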

4.3  Reviewing skills 1–11


So far, we have discussed eleven skills that are essential for using CL methods in research projects. In the following section we present a range of follow-up questions to facilitate discussion and reflection on some of the ideas we have covered in chapters 1 to 4.

Table 4.7 Skill 10: comparing two corpora


Skill 10 • Make sure that your corpora have been cleaned and that you have ready-to-use .txt files that can be further manipulated.
• You will need to keep a record of how your files have been
cleaned up. These decisions include what to do with, among
others, the following:
• page numbers
• image descriptions
• URLs
• copyright.
• Make sure that the conversion keeps all the text in the original pdf files. A quick look at your converted file may suffice; a more systematic check is to upload your file to Sketch Engine or AntConc and run a word list. This will reveal any issues arising during conversion.
• Your corpora will have to be part-​of-​speech (POS) tagged.
POS tagging can be done using different taggers and
software. Online services such as Sketch Engine include POS
tagging as part of their workflow. Desktop software such
as AntConc will deal with tagged corpora in different ways.
Make sure you read about how this is done. Use TagAnt to
POS-​tag your texts and then upload your tagged corpus.
• If you are using CL methods as a complementary
methodology (Figure 1.2), then you need to devise the ways
in which the different methods in your research design will
provide you with different data and data analysis options.
• ​Run word lists of the major word categories. These lists will
give you insight into the breadth of words used in every
corpus. Sketch Engine will give you the chance to click on
every word and generate the concordance lines where
they occur.
• Make sure you are using the same normalisation benchmark (occurrences per 1 million words, per 10,000 words or per 1,000 words); a short worked example follows this table. Also make sure you are consistent in terms of your units of comparison. If you are looking at lemmas, then it makes no sense to compare lemmas and tokens (words).
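Normalisation itself is a one-line calculation. The sketch below (corpus sizes invented for illustration) shows why the same raw count can mean very different things in corpora of different sizes:

def per_thousand(raw_freq, corpus_size):
    """Relative frequency: occurrences per 1,000 tokens."""
    return raw_freq * 1000 / corpus_size

# The same raw count in two corpora of (invented) different sizes:
print(round(per_thousand(197, 12_000), 1))  # 16.4 per 1,000 tokens
print(round(per_thousand(197, 40_000), 1))  # 4.9 per 1,000 tokens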

4.3.1  Chapter 1
In this chapter, we discussed the use of corpora as tools that can generate insights
into language usage. We examined some corpora that are freely available to
researchers and that have been widely used to examine language use.
Visit Mark Davies’ website: english-​corpora.org
How do you think we can use Mark Davies’ TV corpus in our research? And
the Corpus of Historical American English (COHA)? Visit the Google Books
corpus and the Hansard Corpus. Check out Table 4.9 and try to come up with
some of the potential uses of these corpora for educational research.

Table 4.8 Skill 11: statistical tests

Skill 11 • We can use corpus methods to test whether the differences found between corpora are statistically significant.
• Most of the tests of significance evaluate whether the occurrence of an item is statistically significant in Corpus A when compared with Corpus B.
• Normalised frequency is convenient when calculating just the number of times an item (i.e. a word or a lemma) is found in a corpus, but it fails to capture the dispersion of the item in the corpus.
• Average reduced frequency (ARF) is ‘a measure that combines frequency and dispersion’. ARF is a usage coefficient that examines both frequency and dispersion: ‘the more frequent and evenly distributed the word is, the more prominent it is considered to be’. The measure ‘does not depend on the corpus being physically divided into different parts (subcorpora)’ (Brezina, 2018: 54). All we need is the absolute frequency of the word, the corpus size (total number of tokens) and the positions of the word in the corpus. A useful tool to calculate this is http://corpora.lancs.ac.uk/stats/toolbox.php (a minimal sketch of the computation follows this table).
• Both collocation and keyword analyses use tests to measure statistical significance. The skill tables devoted to collocations and keywords provide detailed information on these tests.
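The ARF calculation can be reproduced outside the Lancaster toolbox as well. The sketch below follows the definition summarised above: compute the average distance v = corpus size / frequency, cap every gap between consecutive occurrences at v, and divide the capped sum by v.

def average_reduced_frequency(positions, corpus_size):
    """ARF equals the raw frequency for a perfectly evenly spread word
    and shrinks towards 1 as the occurrences cluster together."""
    f = len(positions)
    v = corpus_size / f                       # average distance between hits
    positions = sorted(positions)
    # gaps between consecutive occurrences, wrapping around the corpus
    gaps = [positions[0] + corpus_size - positions[-1]]
    gaps += [positions[i] - positions[i - 1] for i in range(1, f)]
    return sum(min(g, v) for g in gaps) / v

print(average_reduced_frequency([100, 200, 300, 400], 400))    # 4.0: even spread
print(round(average_reduced_frequency([1, 2, 3, 4], 400), 2))  # 1.03: one cluster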

Table 4.9 Reflecting on existing corpora

Corpus Representative of Relevant for educational research?


TV corpus
COHA
Google Books corpus
Hansard corpus

Choose one of the corpora above. See Table 1.4, Skill 1. In which ways is fre-
quency represented there? Why is it of relevance? See Table 1.5, Skill 2. How can
you make sense of the normalised and the raw frequencies in this corpus? What
do they tell you?

4.3.2  Chapter 2
See Table 2.1, Skill 3. Can you define what a register is? Can you think of the range
of registers that you use on an everyday basis? Can you identify at least three?

Now go to Table 2.2, Skill 4. Go through the list of linguistic features enumerated.
Can you think how those three registers you have just identified in your everyday
life differ in terms of the frequency of use of some of these features?

4.3.3  Chapter 3
Go to Table 3.2, Skill 5. Think about your own research interests. Can you think
of a project where a corpus would have been useful? In which ways? What sort of corpus would have been ideal?
See Table 3.7, Skill 6. What do concordance lines reveal? How do we generally
approach the analysis of concordance lines?
Go to Table  3.8, Skill 7.  Why is dispersion of interest? How are frequency
counts and dispersion related?
Now go to Table 3.12, Skill 8. How can we interpret collocations? What do they
tell us about a node and its collocates? How is collocation strength calculated?

4.3.4  Chapter 4
Go to Table 4.3, Skill 9. Outline at least three areas that need to be considered
before designing your own corpus. Why are they relevant? In what ways can they
impact your data collection and future analysis of the data?
See Table 4.7, Skill 10 and Table 4.8, Skill 11. Before actually comparing two cor-
pora, what needs to be done? Revisit Figure 1.2. How do the two alternatives there
affect your comparison of the two datasets? What are the implications?

Notes
1 www.acecqa.gov.au/​nqf/​about
2 The authors acknowledged this bias in their paper and justified it because two of the
partner childcare organisations supporting their research are based in these three states.
3 The current version of WordSmith Tools is version 7; www.lexically.net/wordsmith/
4 We will deal with keyword analysis in chapter 6.
5 The term ‘text’ is used in this book to denote both spoken and written language.
6 A daily newspaper published in Melbourne, Victoria, Australia, owned by Fairfax.
7 By using UTF 8 we make sure that all Latin-​script alphabets, Greek, Cyrillic, Coptic,
Armenian, Hebrew, Arabic, Chinese, Japanese and Korean characters, mathemat-
ical symbols, and emojis can be read and interpreted by our machine. Source: http://​
unicode.org/​main.html
8 https://​professional.dowjones.com/​factiva/​
9 www.lexisnexis.com
10 Note that most education databases are excellent sources of textual data, but they are
primarily academic or research oriented. Education Abstracts will let you search in
magazines and periodicals, while ERIC will let you search in reports and PhD theses.
11 https:// ​ a ssets.publishing.service.gov.uk/​ g overnment/ ​ u ploads/ ​ s ystem/ ​ u ploads/​
attachment_​data/​file/​799349/​International_​Education_​Strategy_​Accessible.pdf

12 https://​enz.govt.nz/​assets/​Uploads/​International-​Education-​Strategy-​2018–​2030.
pdf
13 Martin Weisser maintains a website with detailed information on tagging solutions for
different languages; http://​martinweisser.org/​corpora_​site/​taggers.html
14 https://​nlp.stanford.edu/​software/​tagger.shtml
15 http://​ucrel-​api.lancaster.ac.uk/​claws/​free.html
16 You can find a description of the Sketch English adaptation of Tree Tagger at www.
sketchengine.eu/​english-​treetagger-​pipeline-​2/​ and CLAWS tagset 7 at http://​ucrel.
lancs.ac.uk/​claws7tags.html
17 https://​tei-​c.org/​release/​doc/​tei-​p5-​doc/​en/​html/​SG.html
18 www.laurenceanthony.net/​software/​tagant/​
19 The UK’s Department for International Trade (DIT) helps businesses export, drives
inward and outward investment, negotiates market access and trade deals, and
champions free trade.
20 https://​ucrel-​wmatrix4.lancaster.ac.uk/​wmatrix4.html

References
Biber, D. & Conrad, S. (2009). Genre, register and style. Cambridge: Cambridge University  Press.
Brezina, V. (2018). Statistics in corpus linguistics. Cambridge: Cambridge University Press.
Clancy, B. (2010). Building a corpus to represent a variety of a language. In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge,  80–​92.
Cohen, L., Manion, L. & Morrison, K. (2011). Research methods in education. London: Taylor & Francis.
Department for Education and Department for International Trade [DfE & DIT] (2019).
International education strategy: global potential, global growth. London: DfE & DIT.
Fenech, M. & Wilkins, D.P. (2019). The representation of the national quality framework
in the Australian print media: silences and slants in the mediatisation of early childhood
education policy. Journal of Education Policy, 34(6), 748–​770.
Nelson, M. (2010). Building a written corpus: What are the basics? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge,  53–​65.
New Zealand Government. (2018). International education strategy 2018–2030. Wellington: New Zealand Government.
Pinto, L.E. (2013). When politics trumps evidence: Financial literacy education narratives following the global financial crisis. Journal of Education Policy, 28(1), 95–120.
Rayson, P. (2005). Wmatrix. Lancaster University. www.comp.lancs.ac.uk/​ucrel/​wmatrix.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus
Linguistics, 13(4), 519–​549.
Reppen, R. (2010). Building a corpus: what are the key considerations? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge,  31–​37.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Chapter 5

Interview data
Transcription and annotation

5.1  Transcription: so much more than a monotonous task
Although fully automated software-​aided transcription will be a reality at some
point in the not so distant future, current research projects rely on human transcrip-
tion and annotation of language. Most of this chapter is devoted to transcriptions
and coding by humans, with some remarks on software-​aided transcription. Note
that in this chapter we will not cover POS tagging, which in many corpus linguis-
tics publications is treated as annotation.
In some of the research literature, transcribing an interview or a focus group
is presented in a rather unproblematised way. In fact, very little is said about the
process of transcription in some standard research guidelines for educators. Much
more has been written about the cost and the investment of time but, surprisingly,
transcribing as a process remains largely invisible to researchers. Gray (2004), a
popular handbook for educational researchers, merely discusses whether a full
transcription is necessary given its cost and suggests that researchers may wish to
contemplate drawing up summaries of the interviews instead. No further discus-
sion is provided about the very process of transcription. Leavy (2017), in her book
devoted to both quantitative and qualitative research designs in education, uses
the term transcription just four times. Her advice on transcription emphasises the
time-​consuming nature of the process:

A benefit of written interviews is that you can avoid the monotonous tran-
scription process […]. It is important to clearly label and mark up your tran-
script so that it is in a form that is easy to analyse—​for example, the use of
bold and italic fonts consistently (e.g., to mark when the researcher is talking,
or when the participant emphasised something). By going the extra mile
during data preparation, the process of data analysis and interpretation is
made easier. Transcription is a tedious process and ample time for it should
be built into the research design. […] If you’re applying for funding for your
research, you may consider budgeting for a transcriber.
(Leavy, 2017: 142)

Reppen (2010) argues that, depending on the level of detail included in the tran-
scription, it may take up to 15 hours to transcribe and annotate one hour of
spoken language. This is certainly a laborious activity. King, Horrocks & Brooks
(2019: 46) in their second edition of Interviews in Qualitative Research discuss tran-
scription as a ‘demanding task […] often contracted out to people with the essen-
tial skills […] realistically time constraints may mean you need to employ others to
do this task’. Note that the implication here may be that transcriptions are pretty
straightforward to do and, most likely, do not need any type of data-​sensitive
coding or markup.
Cohen, Manion & Morrison (2018) discuss transcription in more depth than
authors of other educational research handbooks. They acknowledge conflicting
views on how to conduct the interview and the role of the interviewer/​researcher.
They argue that ‘the problem with much transcription is that it becomes solely
a record of data rather than a record of a social encounter’. They caution ‘the
researcher against believing that they [can] catch everything that happened in
the interview’ and suggest a list of data that happen in an interview and which,
depending on the research aim, need to be recorded in a transcript:

• what was being said


• the tone of voice of the speaker(s) (e.g. harsh, kindly, encouraging)
• the inflection of the voice (e.g. rising or falling, a question or a statement, a
cadence or a pause, a summarising or exploratory tone, opening or closing a
line of enquiry)
• emphases placed by the speaker
• pauses (short to long) and silences (short to long)
• interruptions
• the mood of the speaker(s) (e.g. excited, angry, resigned, bored, uncomfort-
able, enthusiastic, committed, happy, grudging)
• the speed of the talk (fast to slow, hurried or unhurried, hesitant to confident)
• how many people were speaking simultaneously
• whether a speaker was speaking continuously or in short phrases
• who is speaking to whom
• indecipherable speech
• any other events that were taking place at the same time that the researcher
can recall (Cohen, Manion & Morrison, 2018: 523–524).

Most of this list is an evaluation of suprasegmental features of speech, informa-


tion on turn-​taking plus a recording of the circumstances in which the interview
took place (interruptions, etc.). Cohen, Manion & Morrison note how transcrip-
tion is an act of interpretation itself:

The issue here is that it is often inadequate to transcribe only spoken words;
other data are important. Of course, as soon as other data are noted, this
becomes a matter of interpretation (what is a long pause, what is a short

pause, was the respondent happy or was it just a ‘front’, what gave rise to
such-​and-​such a question or response, why did the speaker suddenly burst
into tears?). […] interviewees’ statements are not simply collected by the
interviewer, they are, in reality, co-​authored.
(Cohen, Manion & Morrison, 2018: 524)

From a linguistic perspective, there are at least two main types of transcription:1
orthographic and prosodic. Orthographic transcription renders spoken data using
standard orthographic conventions, which, despite being the most straightforward
type of transcription, brings up all sorts of challenges, especially if what is being
transcribed is not a monologue and the language shows a high degree of involve-
ment and interaction. Orthographic transcription of spoken language involves an
act of interpretation that needs to be acknowledged and reflected upon. Prosodic
transcription adds prosodic marking to orthographic transcripts (i.e. intonation).
Reppen (2010:  33–​35) has put together a set of issues that researchers need to
address before transcribing language:

• How will reduced forms (e.g. wanna, gonna, cuz) be transcribed? Complete
form? Reduced form? Double coding?
• What will be transcribed when it is difficult to understand what was said?
• How will overlapping speech be treated?
• How will conversational facilitators (uh, mmm, hum, etc.) be transcribed?
• How will repetitions (I I I I I I I don’t think) be treated?
• What about pauses? Will pauses be ‘transcribed’? Will they be timed?
• What about laughter? Shall we transcribe this?

These decisions will impact how the linguistic data will be treated when uploaded
either to Sketch Engine or AntConc. When transcribing an interview or a focus
group, we need to keep a record of these decisions. In due time, this record or
set of guidelines will help us understand our own interpretation of the findings
against the backdrop of those decisions. Consider, for example, the number of
times personal pronouns are repeated during a conversation. If we decide not
to reflect these repetitions in the orthographic transcription, we are deliberately
adopting an approach where we are priming an oversimplified rendering of
spoken data over the complexity of spoken communication. If, on the other hand,
we decide to represent spoken disfluency phenomena such as repetition and hesi-
tation, we are adding extra complexity to our analysis.
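To make this trade-off concrete, here is one invented utterance rendered both ways (the conventions are illustrative, not a published standard):

Disfluencies retained: <B> I I erm .. I don’t think it it worked <laughs/> </B>
Disfluencies removed: <B> I don’t think it worked </B>

The first rendering supports questions about hesitation and involvement; the second yields cleaner frequency counts but erases exactly those phenomena.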
The Child Language Data Exchange System (CHILDES) project2 has
developed over the decades essential know-​how to approach the transcription of
spoken language data in a robust way. Although their focus is the emergence and
use of language in children, the range of tools and the standards developed over
the years are of interest to anyone thinking about transcribing spoken data. Brian
MacWhinney, the original researcher behind CHILDES, distinguishes between
transcription and coding. In his view, the former is the production of a written

Figure 5.1 Planning the transcription of your corpus: Recording (interviews, focus groups, etc.) → Transcription (level of granularity; .txt files) → Coding (ad hoc coding; markup; .txt files)

record that tries to represent, quite often unsuccessfully, the original spoken inter-
action. Coding, however, is so much more:

Coding, on the other hand, is the process of recognising, analysing, and


taking note of phenomena in transcribed speech. Coding can often be done
by referring only to a written transcript. For example, the coding of parts of
speech can be done directly from a transcript without listening to the audio-
tape. For other types of coding, such as speech act coding, it is imperative that
coding be done while watching the original videotape.
(MacWhinney, 2019: 19)

Researchers need to work out how they want to approach their transcription pro-
ject, what phenomena they want to code or annotate, and how they want to do
this within the context of their research methodology.

5.2  Transcription basics
Before starting our transcription, we need to plan what needs to be transcribed
and coded. The process is summarised in Figure  5.1. Each stage (Recording  –​
Transcription –​Coding) requires careful planning and making decisions that will
impact the final data-​format of the interviews that will be used in our analysis. We
suggest that researchers work with text (.txt) files as they offer maximum flexi-
bility and guarantee compatibility with a wider range of analysis tools.
Transcription plays a central role in research methodologies that use interviews
or other types of spoken data to gain access to the experiences and opinions of
informants. Bailey (2008) has pointed out that, unfortunately, the role of the transcriber in qualitative research is often neglected.3

Transcribing is often delegated to a junior researcher […] but this can be a


mistake if the transcriber is inadequately trained or briefed. Transcription

involves close observation of data through repeated careful listening (and/​or


watching), and this is an important first step in data analysis. This familiarity
with data and attention to what is actually there rather than what is expected
can facilitate realisations or ideas which emerge during analysis. Transcribing
takes a long time […] and this should be allowed for in project time plans,
budgeting for researchers’ time if they will be doing the transcribing.
(Bailey, 2008: 129)

Transcribing is an essential part of the research flow and those in this role need
a proper understanding of the implications of achieving robust transcriptions
that are consistent throughout the entire dataset. This is particularly important
if more than one transcriber is involved and, as is often the case, transcription
is subcontracted. If our entire dataset has not been transcribed using the same
guidelines in a consistent way, the validity of our findings will be compromised.
Similarly, if we have not coded our interviews, it will be impossible to perform
sophisticated searches on our data.
There are different desktop solutions that can help us with our transcription project. Tools such as Inscribe4 or the EXMARaLDA Partitur Editor5 are cross-platform applications for transcribing and annotating digital audio and video files.
Inscribe is commercial software that can be used with a USB foot pedal to control
media playback. Subtitled video files can be exported, which is a great feature across
different educational projects. With the EXMARaLDA Partitur Editor, digital
audio or video recordings can be transcribed and aligned (transcription and multi-
media), a great feature for most research projects. Partitur is not only freeware, it
can also process transcriptions according to different transcription conventions and
styles (HIAT, CHAT), output transcription data in different layouts (score format
or line-​for-​line) and in different document formats (HTML, MS Word) and with
multimedia links to audio or video. Transcription data can be exchanged with other
systems such as Praat6 or ELAN7. Partitur can be used with the EXMARaLDA
Corpus-Manager (Coma), a tool that allows researchers to link EXMARaLDA transcriptions
with metadata in order to compile them into corpora. There are many other tools
that can be of use. The UAM Corpus Tool8 is a great freeware solution if we
already have a transcription and wish to annotate our data and create our own
annotation layers or taxonomies. Folia is an XML-​based annotation format for
the representation of linguistically annotated language resources. Their developers
note that their aim is to ‘introduce a single rich format that can accommodate a
wide variety of linguistic annotation types through a single generalised paradigm.
We do not commit to any label set, language or linguistic theory’.9 Folia developers
have created resources10 that can be useful to understand different annotation types
and XML-​related tags and structures.
There are plenty of web services that can help us with our transcription, ran-
ging from the pretty basic but essential functionalities of Otranscribe11 to the
more complex and automated functionalities of Transcribe,12 which can automat-
ically transcribe audio files that, eventually, will have to be edited and corrected
by a human transcriber. This subscription web service gives you the chance to

work on a web browser environment, slow down your audio, use a foot pedal and
define your own acronyms for frequently used words and phrases, which will be
expanded to their full form as you type along. One of the advantages of these
services is that you can use a wider range of devices (e.g. tablets) to carry out the
transcription. Brat13 is a great option if you have some basic server infrastructure
and basic natural language processing (NLP) expertise or support. Annotating is
easy and very much a drag and drop, intuitive experience.
As for orthographic transcription, there are numerous transcription guidelines
available that we can use as a starting point. The Louvain International Database
of Spoken English Interlanguage (LINDSEI) (Gilquin, De Cock & Granger, 2010)
contains oral data from advanced learners of English from several mother tongue
backgrounds. Each L1 group (French, German, Dutch, Spanish, etc.) includes
50 interviews made up of three tasks: a set topic, a free discussion and a picture
description. The LINDSEI guidelines14 are an excellent springboard for a conver-
sation about what needs to be considered even before we transcribe the first word
of an interview. These guidelines contain elements from both the transcription as
well as the markup stages in Figure 5.1. Table 5.1 shows a summary of some of
the areas that need attention and how LINDSEI researchers actually proceeded.
Qualitative research experts such as Silverman (1993) and King, Horrocks &
Brooks (2019) have devised transcription systems that can offer researchers a con-
sistent way to transcribe interviews. King, Horrocks & Brooks (2019: 194) point
out that there is ‘less standardisation among the simplest forms of transcription’,
which is only natural as transcription systems tend to be project-​driven and,
accordingly, unique in different ways. For these authors, it is crucial the researchers
use a transcription system that captures every aspect of speech ‘that might indi-
cate something about the way verbal interaction operates and what it achieves’.
They suggest the following:

• line-​number your transcript


• represent emphasis through capital letters
• use brackets to annotate what is relevant in the context of your research pro-
ject, for example:
• (p) for a short pause and (pause) for longer pauses
• represent interruptions by means of a dash at the end of an interruption
• indicate overlapping speech in brackets: (overlap) and (end overlap)
• indicate laughing, coughing, etc. in brackets: (both laugh), (interviewee
laughs)
• tone of voice: (ironic tone), (angry tone)
• non-​verbal language: (mimics), (stretches arms to indicate size), etc.
• combination of some of the above: (bitter laugh).

This is an effective way to systematise a transcription: ad hoc guidelines are easy


to implement, free and all you need is a text editor (Notepad or Notepad++ for
Windows or TextEdit or BBEdit for Mac machines). However, based on our
experience, we discourage the use of brackets and general text to annotate a corpus.

Table 5.1 LINDSEI transcription guidelines

How to…: Identify an interview
Why? Each interview needs an individual code as part of its metadata.
LINDSEI guidelines: Develop a code like this one: L1 Dutch = DU001, L1 German = GE001, and so on. Use the tags <h> </h> to include this information. Use ‘nr’ to identify the L1 group: <h> nr = “DU005” </h>. ‘nr’ is one of the many attributes that tags can have. Attribute values in XML must always be quoted. We can create as many attributes as we deem necessary (country, school type, region, etc.).

How to…: Transcribe punctuation
Why? A decision needs to be made as to whether we would like to include punctuation.
LINDSEI guidelines: LINDSEI corpora are not punctuated. However, this is not often the case in qualitative research.

How to…: Transcribe pauses
Why? A decision needs to be made as to whether pauses will be transcribed.
LINDSEI guidelines: A 3-tier system is used: one dot for a short pause (< 1 second), two dots for a medium pause (1–3 seconds), three dots for long pauses (> 3 seconds).

How to…: Anonymise
Why? The confidential and anonymous treatment of participants’ data is considered the norm for the conduct of research.27 A decision needs to be made as to how we would like to anonymise our data.
LINDSEI guidelines: Transcribers can use tags like <first name of interviewee>, <first name and full name of interviewer> or <name of professor> to replace names.

How to…: Deal with spelling and capitalisation
Why? What spelling should be followed? Languages like Chinese, Spanish or English display dialectal and regional differences.
LINDSEI guidelines: British English spelling was adopted.

How to…: Deal with acronyms
Why? Shall we transcribe them as letters or as ‘words’?
LINDSEI guidelines: If acronyms are pronounced as sequences of letters, they are transcribed as a series of upper-case letters separated by spaces. If acronyms are pronounced as words, they are transcribed as a series of upper-case letters not separated by spaces.

How to…: Deal with dates and numbers
Why? We know that dates and numbers can be said in different ways across languages. How to solve this in our transcription?
LINDSEI guidelines: Figures are always written out in words.

How to…: Deal with foreign words and pronunciation
Why? Do we want to acknowledge the use of more than one language during the interview?
LINDSEI guidelines: Foreign words are indicated by <foreign> (before the word) and </foreign> (after the word).

How to…: Speaker turns
Why? How do we reflect who is speaking?
LINDSEI guidelines: Speaker turns are displayed in vertical format. The letter ‘A’ enclosed between angle brackets always signifies the interviewer’s turn; the letter ‘B’ between angle brackets indicates the interviewee’s turn. The end of each turn is indicated by either </A> or </B>. e.g. <A> okay so which topic have you chosen </A> <B> the film or play that I thought was particularly good or bad really </B>

How to…: Deal with overlapping speech
Why? During an interview or a focus group, overlapping speech is frequent.
LINDSEI guidelines: The tag <overlap /> (with a space between ‘overlap’ and the slash) is used to indicate the beginning of overlapping speech. It should be indicated in both turns. The end of overlapping speech is not indicated. e.g. <B> yeah I went on a bus to London once and I’ll never <overlap /> do it again </B> <A> <overlap /> that’s even worse </A>

How to…: Contracted forms
Why? Are contracted forms retained?
LINDSEI guidelines: Standard contracted forms (i.e. hasn’t) are retained.

This approach will very likely distort the contents of the text in terms
of their analysis. A corpus annotated with regular brackets and text will contain
words that are not part of the corpus strictly speaking, so the resulting text will
not only be larger than it should be, but it may also mess up the consistency of
the data. On top of this, it will not be straightforward to find a way around how
to account for these annotations in the overall analysis of the corpus. Certainly, it
will be very difficult to do in Sketch Engine, and in AntConc it will require setting
brackets as default tag symbols. Using regular tags (<tag>) will do the job across the board in a much cleaner way, as the contrast below illustrates.
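Here is the same invented utterance annotated in both styles (the tag names are ours, not a published standard):

Bracket style (the bracketed words inflate the word counts):
and then we (p) moved to the new building (both laugh)

Tag style (AntConc and Sketch Engine can hide or target the tags):
and then we <pause/> moved to the new building <laugh who="both"/>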
The tagged annotations discussed above can be tracked down in AntConc
effortlessly. For example, if we want to search for the foreign words in the Spanish

Figure 5.2 Searching for tagged annotations in AntConc

LINDSEI interviews, all we need to do is to run AntConc, go to Settings > Global


Settings and click Show Tags. Then we are ready to search for <foreign> in our
corpus. In this case, we will see something similar to Figure 5.2.
The contents in Figure 5.2 can be exported to a text or Excel file and then fur-
ther analysed. Clicking on any of these nodes will take us to the larger interview
context where that annotation was inserted.
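Outside AntConc, the same search can be scripted in a few lines. The sketch below (in Python; the folder name is hypothetical) pulls every <foreign>…</foreign> span out of a set of .txt transcripts:

import re
from pathlib import Path

# The <foreign>...</foreign> convention follows the LINDSEI guidelines above.
pattern = re.compile(r"<foreign>(.*?)</foreign>", re.DOTALL)

for txt_file in Path("lindsei_spanish").glob("*.txt"):  # hypothetical folder
    for span in pattern.findall(txt_file.read_text(encoding="utf-8")):
        print(txt_file.name, " ".join(span.split()))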
The guidelines discussed in this section can be used as a first port of call when reflecting on how interviews can be transcribed in a robust, systematic way. Adding metadata and coding can enhance our corpus searches provided we observe some basic principles. In the following sections, we will assume that we have access to Sketch Engine and can therefore upload our corpus of interviews to it. First, though, Table 5.2 summarises skill 12, transcribing your data.


Table 5.2 Skill 12: transcribing your data

Skill 12 • Transcribing spoken language is not a straightforward task. Lots of thought and consideration are needed before you, or somebody else, actually sits down and starts with the first transcription.
• It is a good idea to think in advance how your research questions will impact the design of the corpus as well as the range of data to be collected. How will the language collected contribute to your findings? In which ways?
• You need to devise a normalisation strategy that ensures that your transcription is consistent no matter how many interviews you have or how large the corpus is.
• Pilot your transcription and discuss with colleagues the challenges that you faced and the strategies that you adopted. Run a word list and check if anything unexpected comes up (a minimal word-list sketch follows this table).
• Even at the transcription stage, you need to think about how your corpus will be annotated. Work out a plan that maximises your time. What language will be most revealing in terms of your research questions? How can my annotation help me and my colleagues discover the sort of findings I need in my research project?
• Try different transcription tools (desktop or online) and make sure you choose the one that is best for you and your team. Plain text editors are usually great.
• Using .txt files will make your life easier as they are totally compatible with a wide range of tools.
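The word-list pilot check mentioned in the table can also be scripted. A minimal sketch (the folder name is hypothetical and the tokenisation deliberately crude):

import re
from collections import Counter
from pathlib import Path

tokens = []
for txt_file in Path("pilot_transcripts").glob("*.txt"):  # hypothetical folder
    text = txt_file.read_text(encoding="utf-8")
    text = re.sub(r"<[^>]+>", " ", text)      # strip annotation tags
    tokens += re.findall(r"[a-z']+", text.lower())

# Inconsistent renderings (e.g. gonna vs going to, mm vs mmm)
# tend to surface near the top of the list.
for word, freq in Counter(tokens).most_common(30):
    print(freq, word)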

5.3  Adding structure and metadata to a corpus

Although adding metadata is time-consuming, the benefits of working with well-annotated transcriptions in CL are many. Lou Burnard, one of the founding editors of the Text Encoding Initiative (TEI) (see Table 5.3) and responsible for the digital infrastructure of the British National Corpus, highlights the role of metadata in corpus linguistics research:

[…] it is no exaggeration to say that without metadata, corpus linguistics


would be virtually impossible. […] A  typical corpus analysis will therefore
gather together many examples of linguistic usage, each taken out of the
context in which it originally occurred, like a laboratory specimen. Metadata
can restore that context by supplying information about it, thus enabling us
to relate the specimen to its original habitat. Furthermore, since language
corpora are constructed from pre-​existing pieces of language, questions of
accuracy and authenticity are all but inevitable when using them:  without
metadata, the investigator has no way of answering such questions. Without
metadata, the investigator has nothing but disconnected words of unknow-
able provenance or authenticity.
(http://​users.ox.ac.uk/​~lou/​wip/​metadata.html)15

Burnard has in mind here a big corpus of language such as the BNC (100 million
words), which contains thousands of sources, writers, contexts of use and so

Table 5.3 Essential terminology: Text Encoding Initiative (TEI)

Essential terminology TEI


The Text Encoding Initiative (TEI) guidelines offer an
encoding scheme that relies on Extensible Markup Language (XML) standards.
The TEI language defines a tagset of XML elements that
are used to encode texts, along with attributes used to
modify the elements. The TEI Guidelines contain around
500 elements, but in practical terms only a small amount
of tags will be used in your project.
Elements in the TEI tagset can be used (a) to represent
the metadata about the corpus or text (author,
bibliographical information, interview description, revision
history, etc.), and (b) to encode the structural features
of the document itself, such as sections, sentences,
paragraphs, turn-​taking, etc.

BUT Although the use of TEI guidelines is recommended, there


are other ways to mark up your corpus.

on. While most educational researchers will not necessarily need a large corpus
and a sophisticated corpus design, it is a fact that a well-​annotated corpus will
increase the efficiency of the searches and consequently the validity of their
findings.
If your research project involves different interviews, it is recommended that you
keep them in separate files. Having all our interviews in individual files, instead
of one single large file, will provide us with the opportunity to perform complex
searches and know more about the distribution of certain features across the
corpus (see chapter 4). Corpus management tools such as Sketch Engine16 can
recognise structures and parts in a corpus:

[…] a corpus has to be equipped with marks or labels indicating the beginnings
and ends of such parts. These marks or labels are called structure tags and the
parts of a corpus they mark are called structures. The most typical parts are
files, paragraphs and sentences.
(www.sketchengine.eu)

So, if we want to take advantage of this functionality, we need to think about


which structures we want to have in our corpus and which tags we need to use.
Some of this planning will be performed automatically,17 though:

Corpus management software generally does not prescribe (and neither does
Sketch Engine) what structures should be included in the corpus and what
they should look like. It is, however, advisable to include at least the basic set

marking the beginning and end of a document, paragraph and sentence. By


default, Sketch Engine will try to identify these three structures when uploading
content and will supply the corresponding structure tags automatically.
(www.sketchengine.eu)

Sketch Engine will add a basic structure to our files. A file will become a <doc>,
paragraphs will be enclosed in <p></p> elements, while sentences will be
annotated as <s></​s>. However, we can modify or improve this structure. The
way we can do this is by using angle brackets where the closing tag must have a
slash. This is the point where we, as researchers, need to translate our research
questions to effective coding that can increase the efficiency of our searches in
the corpus. Typically, these will be related to our dependent and independent
variables. The good news is that Sketch Engine converts all metadata to ‘Text
Types’. Let us see how this works. For example, CHILDES corpora are made
up of transcripts of child language from spontaneous conversational interactions.
The speakers involved are young children speaking with their parents or family.
The corpus is annotated to reflect the following features:

• sex
• L1
• languages spoken during the interaction
• age group
• participant role
• date.

This means we could search for all occurrences of, for example, the second person
pronoun you in conversations with kids between four and six years of age involving
mothers, fathers, carers, etc. Each utterance is inserted in a <s></​s> structure.
The following is taken from one of those conversations where the verb talk is used
(there are over 12,000 uses of talk within the frame of our search):

<s> do you want to sit down here? want to sit down? </​s>


<s> want talk. </​s>
<s> you want to talk? </​s>
<s>.  </​s>
<s> what do you want to talk about? </​s>
<s> xxx. </​s>
<s> you don’t know? </​s>
<s>.  </​s>
<s> do you want to talk about your new bed? </​s>
<s>.  </​s>
<s> what are we gonna put on your new bed? what did we buy for your new
bed? did we buy things for the new bed? </​s>

Note that in the above transcription dots stand for pauses and xxx stands for unin-
telligible speech,18 but this annotation is totally open to being redefined, and you
can define your own transcription guidelines as you deem appropriate.
Adding metadata to our transcriptions is no longer optional. Before the spread
of electronic textual data, metadata was typically kept separate in reference
manuals and was not generally included as part of the transcription. Including
metadata in electronic format is not only good practice, but an essential practice to
exchange and distribute our data and maximise the opportunities to interact with
our dataset in digital research contexts. Depending on the scope of our research
project and the resources available, we may want to adopt different strategies
towards annotation. For big projects with plenty of resources and support, it is
necessary to develop an annotation strategy that makes use of encoding standards.
However, finding the right strategy and coding system will take time and some
kind of trial and error approach. Scott & Tribble (2006) have noted that there is
no such thing as a perfect markup system that caters for all researchers:

It is worth noting that no system of markup is ever likely to be satisfactory for


all users of a corpus […] because it is never going to be possible to identify all
features of the speaker or writer, their location at the time, their reasons for
writing or speaking, and exactly what and how they wrote or said, etc.
(Scott & Tribble, 2006: 22)

We need to start somewhere, though. The Text Encoding Initiative (TEI) guidelines
are one of those standards that should be considered by researchers who work with relatively large and complex datasets. We will explore TEI in the rest of the chapter. If this is of interest to your project, do not hesitate to read more about it at https://tei-c.org. The TEI guidelines19 suggest that an electronic text should
include the following metadata:20

• A description of the file itself, its authors and publication-​related information.


We can use the tags <fileDesc> </​fileDesc>.
• A description of the encoding, specifying the tags that have been used.
The tags <encodingDesc></​encodingDesc> can be used to include such
description.
• A description of the file or corpus, supplying additional descriptive material
about the file not covered elsewhere, such as its situational parameters, topic
keywords, descriptions of participants in a spoken text etc. We use the tags
<profileDesc> </​profileDesc> to include such description.
• A revision log, listing the modifications made to the transcription and the
annotation. We use the tags <revisionDesc></​revisionDesc> to include the
changes.

This information is included in the TEI header before the transcription itself. The
following is a minimal header structure recommended by the TEI consortium
taken from their official website:21

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Title is given here</title>
      <respStmt>
        <name>A name is given here</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>This can be your institution or you</distributor>
    </publicationStmt>
    <sourceDesc>A description of the source of the data</sourceDesc>
  </fileDesc>
</teiHeader>

And Table 5.4 contains some of the main tags from the example above.
Note that it is not compulsory to use all the tags that can potentially be used in the header: only those that are necessary for (a) the description of the file,22 (b) the distribution of the resource and (c) the coding of features that will be used when querying the data. For example, in
the title statement (<titleStmt>) we may specify, among others, the following:

• <title> the title for any kind of  work


• <author> the name(s) of an author, personal or corporate, of  a work
• <sponsor> name of a sponsoring organisation or institution
• <funder> (funding body) name of an individual, institution, or organisation
responsible for the funding of a project or text
• <principal> (principal researcher) name of the principal researcher respon-
sible for the creation of an electronic text.

More information on these and more elements and how to use them can be found
at https://​tei-​c.org/​guidelines/​p5/​23
Some TEI elements are unique to spoken texts.24 The most important is <u>, a
spoken element analogous to a paragraph in written texts. U stands for utterance

Table 5.4 Tags in the TEI header

Tags in the TEI header Description


<fileDesc>  </​fileDesc> We include here the description of the file
<titleStmt></​titleStmt> We include here the title of the file (Interview
with Sarah)
<respStmt></​respStmt> Name of person or persons responsible
<publicationStmt></​publicationStmt> Details about the publication of the file
(if appropriate)

and the <u> </u> tags enclose speech usually preceded and followed by silence or by a change of speaker. Look at this transcription from the tei-c.org website,25 annotated using TEI guidelines:

<u who=“#mar”>you never <pause/​> take this cat for show and tell
<pause/​> meow meow</​u>

<u who=“#ros”>yeah well I dont want to</​u>

<incident>
  <desc>toy cat has bell in tail which continues to make a tinkling sound</​
desc>
</​incident>

<vocal who=“#mar”>
  <desc>meows</​desc>
</​vocal>

<u who=“#ros”>because it is so old</​u>

<u who=“#mar”>how <choice>


  <orig>bout</​ orig>
  <reg>about</​ reg>
  </​choice>
  <emph>your</​emph> cat <pause/​>yours is <emph>new</​emph>
 <kinesic>
   <desc>shows Father the cat</​desc>
  </​kinesic>
</​u>
<!-​-​ ...  -​-​>
<listPerson>
  <person xml:id=“mar”>
<!-​-​ ...  -​-​>
  </​person>
  <person xml:id=“ros”>
<!-​-​ ...  -​-​>
  </​person>
  <person xml:id=“fat”>
<!-​-​ ...  -​-​>
  </​person>
</​listPerson>

Let us go deeper into some of the coding used for this transcription. Every <u> </u> element contains an utterance said by a speaker who is also identified in the transcription, for example:

<u who=“#ros”>yeah well I dont want to</​u>

We know that it was Ros who spoke this thanks to the who attribute inside the opening <u> tag. Note that her words follow the who attribute and are transcribed before the closing element </u>. We can use <incident> elements to describe
what is going on during the interview:

<incident>
  <desc>toy cat has bell in tail which continues to make a tinkling
sound</​desc>
</​incident>

The <desc> </​desc> tags enclose this description. While the example may
seem irrelevant to your research, a taxonomy of such incidents developed in
the context of your research project may help you locate important infor-
mation about the circumstances in which the interviews or the focus groups
took place. The elements <kinesic></​kinesic> include gestures, frowning,
nodding, etc. The elements <vocal></​vocal> includes non-​lexical voice such
as whistles. These are interesting if we want to examine language and non-​
verbal communication.
The <choice> </​choice> elements are useful to portray the exact words as
heard on the audio or video file and the regular way to represent the word(s) in
conventional spelling. This is an example:

<choice>
  <orig>bout</​orig>
  <reg>about</​reg>
</​choice>

The <orig></​orig> and <reg></​reg> elements will allow us to represent what


was actually said and the standard spelling form. Emphasis is annotated by means
of the <emph></​emph> elements as in this example:

<emph>your</​emph> cat <pause/​>yours is <emph>new</​emph>

This is a more efficient way to represent emphasis than capital letters as we can
build complex searches that include both lexical items and annotation. Another
interesting part of the code provided above is the declaration of the speakers
involved.

<!-​-​ ...  -​-​>


<listPerson>
  <person xml:id=“mar”>
<!-​-​ ...  -​-​>
  </​person>

  <person xml:id=“ros”>
<!-​-​ ...  -​-​>
  </​person>
  <person xml:id=“fat”>
<!-​-​ ...  -​-​>
  </​person>
</​listPerson>

As exemplified above, when transcribing our data we can use the attribute who=“#ros” to identify the speaker:

<u who=“#ros”>yeah well I dont want to</​u>

While the purpose of this book is not to provide a detailed account of how to
transcribe spoken language using TEI guidelines, an appreciation of its useful-
ness and possibilities may open up new ways to look at how transcriptions can
support our insight into the language used during interviews. As for our own
annotation of the corpus, we have different options at our disposal: (1) we can
use annotation tags straightaway in our transcription or (2) we can develop an
annotation taxonomy that will be formally declared using TEI elements. The
former is quick and useful in terms of our searches; the latter is more time-​
consuming, but it will offer us more sophisticated search options. Let us have a
look at them.

5.3.1  Annotating a corpus using our own tags


We will use part of the interview with Colebrook ICT Middle Leader from
the research project Reshaping Educational Practice for Improvement in Hong Kong and
England: How Schools Mediate Government Reforms (Gu, 2015). We will illustrate how
this interview could be annotated so that we can maximise our searches in a corpus
management tool, in our case Sketch Engine. The interview transcripts are open
access and can be downloaded from ukdataservice.ac.uk. This research project sets
out to understand the complex interface between educational policy, educational
practices and outcomes through a comparative analysis of the ways in which the
intended outcomes of such reforms are mediated by school leaders and teachers
in secondary schools in England and Hong Kong. The research design consisted
of eight case studies involving secondary schools (four in England, four in Hong
Kong) in different socio-economic contexts. One of the publications from this research (Day, Gu & Sammons, 2016: 222) concluded the following:

The research provides new empirical evidence of how successful principals


directly and indirectly achieve and sustain improvement over time through
combining both transformational and instructional leadership strategies.
The findings show that schools’ abilities to improve and sustain effectiveness
over the long term are not primarily the result of the principals’ leadership

style but of their understanding and diagnosis of the school’s needs and their
application of clearly articulated, organisationally shared educational values
through multiple combinations and accumulations of time and context sen-
sitive strategies that are ‘layered’ and progressively embedded in the school’s
work, culture, and achievements.
(Day, Gu & Sammons, 2016: 222)

The authors stress that mixed-​methods research designs ‘provide finer grained,
more nuanced evidence based understandings of the leadership roles and
behaviours of principals who achieve and sustain educational outcomes in
schools than single lens quantitative analyses, meta-​analyses, or purely qualitative
approaches’ (Day, Gu & Sammons, 2016: 222). In this context, the use of corpus
linguistics methods can inform our understanding of the stakeholders’ positioning
towards leadership, improvement and, among others, success in schools. We have
selected one of the interviews from the dataset in order to illustrate how we can
use interview data in corpus analysis. In total, the whole project is made up of
68 interviews from four schools. First, we will describe the role of each of the
participants in the interview. We will use the <pers> element and the attribute role
in the following way:

<pers role=‘Interviewer’>What’s the latest with appraisal and Ofsted?


</​pers>

Note that the words used by the speaker appear immediately before the closing tag
</​pers>. Now, we will do the same with the interviewee. The following is just an
extract from the reply to the opening question:

<pers role=‘Interviewee’> Well in terms of appraisal that’s a massive change


at the minute. I know it used to be called performance management but
it was never really that and the funny thing is they’ve changed the name
to appraisal and now it’s more like performance management so, obvi-
ously, that is a big change. </​pers>

As above, the interviewee’s reply appears before the closing </​pers> tag. Now we
are ready to add some annotation to the text. Depending on their research methods
(theme analysis, content analysis, etc.), the researchers will come up with a different
tagset or taxonomy. In the example below we just show how we could assign a
simple, non-​hierarchical annotation scheme to this ICT leader’s words. Just for
illustration purposes, we have identified the following themes in the interview:

• appraisal
• action
• impact
• personal evaluation
• risks involved.

We have then transformed them into tags:

• <Appraisal> </​Appraisal>
• <Action> </​Action>
• <Impact> </​Impact>
• <PersonalEvaluation> </​PersonalEvaluation>
• <RisksInvolved> </​RisksInvolved>

This is how the interview extract looks once annotated:

<pers role=‘Interviewee’><Appraisal>Well in terms of appraisal that’s a


massive change at the minute.</​Appraisal> <Appraisal> I know it used
to be called performance management but it was never really that and the
funny thing is they’ve changed the name to appraisal and now it’s more like
performance management so, obviously, that is a big change. </​Appraisal>
<Action><Appraisal> I mean we implemented a big change this September
in terms of we changed the name to ‘appraisal’ and the way we did it
was slightly different in how we set targets and people were aware that
it became more clearly linked to CPD and they had to become more
responsible for their own CPD and appraisal. </​Appraisal> </​Action>
That was the change that happened this academic year and <Impact>the
next academic year, starting from September 2013, is really where I see
the massive changes happening because it’s going to become explicitly
linked to pay and that is a big change for teachers.</​Impact>
<PersonalEvaluation>I think it’s actually a positive thing and, before
I became a teacher, I worked in industry for ten years and appraisal and
your performance and your pay were all linked so I’m quite used to that
kind of system. I  think it’s a massive shock for most teachers because
the status quo has always been that you move up the pay scale but now
it’s a case of you only move up the pay scale if you meet these targets.
</​ PersonalEvaluation> <Impact>Most people shouldn’t be worried
about it but it is a big change and people might think that they won’t be
paid as much. </​Impact> <PersonalEvaluation> So, generally, I think
it is a positive thing because everybody who does their job properly will
have no reason to worry. The people that I hear worrying are the people
who are the slackers and don’t pull their weight. </​PersonalEvaluation>
<RisksInvolved>I think there is a potential  –​though not here  –​for it to
become difficult in that if a head teacher doesn’t like someone  =  but
that is true anyway of any company and any appraisal system because
there is always the potential for it to be abused by the person in power.
</​RisksInvolved> <RisksInvolved>I think there are significant issues
and with things, for example, like school budget and I was talking to one
of the senior leaders here and we were talking about going to UPS3,
which is the top of the scale, and if we all keep moving towards that and
no staff move on we’ll get to a point where the school cannot afford to
have that many people on UPS3 </​RisksInvolved> </​pers>

Note that in the annotation above we have adopted the sentence as our unit of
annotation, although sometimes we have included more than one sentence in
some of the annotations. Some sentences have no annotation at all, though. Also
note that some sentences contain more than one annotation:

<Action><Appraisal> I mean we implemented a big change this September


in terms of we changed the name to ‘appraisal’ and the way we did
it was slightly different in how we set targets and people were aware
that it became more clearly linked to CPD and they had to become
more responsible for their own CPD and appraisal. </​Appraisal>
</​Action>

Make sure the closing tags are inserted following the general rule that the first tag
is the last one to be closed:

<Action><Appraisal>  </​Appraisal>   </​Action>
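Because this is easy to get wrong by hand, it is worth checking mechanically before uploading. A minimal sketch in Python (it treats self-closing tags such as <pause/> as standalone and assumes tag names contain only word characters):

import re

def check_nesting(text):
    """Report the first violation of 'last opened, first closed' tag order."""
    stack = []
    for m in re.finditer(r"<(/?)(\w+)[^>]*?(/?)>", text):
        closing, name, self_closing = m.groups()
        if self_closing:                      # e.g. <pause/>, <overlap/>
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return f"mismatched </{name}> near character {m.start()}"
    return "well-formed" if not stack else f"unclosed <{stack[-1]}>"

print(check_nesting("<Action><Appraisal>text</Appraisal></Action>"))  # well-formed
print(check_nesting("<Action><Appraisal>text</Action></Appraisal>"))  # mismatched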

Once we have uploaded our text to Sketch Engine, we will see that the only structures that can be searched from the Text Types dialogue box are those annotated in the <doc> element, so there is no trace of our annotation here. We will look at this in the next section. What we can do, however, is search within the tags. How do we do this? Follow these steps:

1 Go to the Concordance function and select Advanced search.


2 Select CQL (corpus query language) search.
3 Build a search using some of the operators provided.

Figure  5.3 is an example of such a search. CQL allows more sophisticated


searches. This type of search will be covered in chapter 7. In this example []
stands for any token, including punctuation, and the operator within can be
used to specify that we are only interested in language found within a particular
element, in this case <Appraisal>. In very simple terms, this search will render
a concordance line for every single token used in the sentences annotated within
<Appraisal></​Appraisal>. We can do the same with the rest of the tags. In
practical terms, we could do the same using [tag!=“Y.*”] within <Appraisal/​>,
where [tag!=“Y.*”] is used to ask Sketch Engine to count every single token in a
corpus or in a structure. Examine these search examples:

• [lemma=“we”] within <Appraisal/> will return all the instances where the
speaker uses we. We could search for [lemma=“I”] within <Appraisal/> and
[lemma=“they”] within <Appraisal/> to see how different personal pronouns
are used to express agency in the sentences annotated as conveying appraisal.
• Compare the results obtained when we search for [lemma=“performance”]
in the entire text or corpus, and for [lemma=“performance”] within
<PersonalEvaluation/>.

Figure 5.3 CQL search using tags

This search power will become more and more evident as we gather more
interviews in our corpus and we combine these searches with the values of
the attributes identified in the Text Types. For example, we could combine
[lemma=“performance”] within <PersonalEvaluation/​> across the four schools
featured in Gu (2015) and the role of the person being interviewed. In the next
section we will show how to use our annotation in a TEI scheme.
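
As a sketch of what such combined searches might look like once the document-level
attributes of section 5.3.2 are in place (the attribute names School and
StakeholderRole and the value ‘Colebrook’ are illustrative and must match
whatever is declared in each interview’s <doc> element):

[lemma=“performance”] within <PersonalEvaluation/>
[lemma=“performance”] within <doc School=“Colebrook”/>
[lemma=“performance”] within <doc StakeholderRole=“interviewee”/>

The first query restricts the search to our own annotation; the other two filter by
the metadata attached to each interview, which can also be achieved through the
Text Types dialogue box.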

5.3.2  Annotating a corpus using standard XML guidelines


In Sketch Engine, structures can have additional labels giving more specific infor-
mation about the structure:

These [labels] are called metadata or structure attributes. For example, the
structure might carry information about the year of publication, the genre,
the dialect, the style, author, source, simply anything that the author wants
to include. If the corpus author is unsure as to what to include, it is best
to include all available metadata. A  corpus without metadata can still be
used for many tasks but metadata (information about the text) cannot be
used in searches or analysis. Metadata (or structure values) are automatically
processed by Sketch Engine into text types, making it easy to set search cri-
teria or build subcorpora from texts belonging to the same category.
(www.sketchengine.eu/corpus-annotation-and-structures/)

Figure 5.4 Adding some attributes and values to an interview

In practical terms, this means that we can add structure information and meta-
data to our transcription so as to improve the depth of granularity of our searches.
We can do this at the document level (in other words, at the individual interview
level) by inserting just a few attributes and values. Examine the following.
In Figure  5.4 we have declared (i.e. we have inserted) some attributes to the
interview (<doc>) in question. We have specified the year when it was conducted
(Year), the language in which it was conducted (Lang), the name of the school
where the interview took place (School), the country where it was conducted
(Country) and the role of the person interviewed (StakeholderRole). This may
not seem terribly interesting if you just happen to be working with one or two
interviews, but it is absolutely essential when you have 68 interviews to analyse.
We can also specify whether the text was uttered by the interviewer or by the inter-
viewee by defining different roles, as shown in Figure 5.5.
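As a minimal sketch, the mark-up behind Figures 5.4 and 5.5 might look as
follows. All the values are invented for illustration, and the element used here for
speaker turns (<person>) must match whatever structure is actually declared for
the corpus:

<doc Year=“2013” Lang=“EN” School=“Colebrook” Country=“UK” StakeholderRole=“MiddleLeader”>
<person role=“interviewer”> … </person>
<person role=“interviewee”> … </person>
</doc>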
By annotating attributes and values like those in Figures 5.4 and 5.5, we can
enable more sophisticated searches on Sketch Engine, filtering our searches
through the attributes and values specified during the transcription and anno-
tation of the interviews. For example, we can now obtain the parts of the inter-
view that were contributed by the different roles involved, across different schools,
years, countries and languages used. Note that when some of these categories are
filtered out, we will obtain more focused and fine-​grained results, allowing us the
opportunity to explore independent variables such as school types, countries or
management roles. And this is the really interesting bit: we can create subcorpora
that can be compared and analysed.
We suggest developing a basic TEI transcription template for spoken data based
on Pérez-​Paredes & Alcaraz-​Calero (2009) that can be used to gather the meta-
data for each and every interview of our corpus. Pérez-​Paredes & Alcaraz-​Calero
(2009: 68) have noted that TEI offers ‘extensibility, interoperability and standard-
ization, three characteristics […] of the utmost importance for the re-​usability of
our annotated corpora’. The proposed template is divided in two parts: Metadata
and Body. In the former we include relevant metadata, while in the body section
we provide the transcription. This is what the template looks like:

Figure 5.5 Adding person roles

1 [METADATA] #Indicates that the METADATA section begins
2 Title: #Title or short description of the Interview
3 Date Recording: #Date of the Interview Recording yyyy-​mm-​dd
4 Date Transcription: #Date of the Interview Transcription yyyy-​mm-​dd
5 Locale: #Where the interview took place
7 Principal Investigator: #Principal investigator
8 Researcher: #Researcher’s name or ID
9 Transcriber: #Transcriber’s name or ID
10 Editor: #Editor’s name or ID
11 Authority: #Authority entity: i.e. Institution
12 ID: #UNIQUE ID for the interview
13 Language: #Language of the Interview (TWO LETTER ID) EN
14 MediaFileName: #UNIQUE NAME of the multimedia file (if relevant)
15 Participants: #Participant (for each participant, block 16–21 will be repeated)
16 person: #ID of the participant
17 name: #FULL NAME of the participant (ethical guidelines apply)
18 role: #Role of the participant in the interview (‘interviewer’ or ‘interviewee’)
19 sex:  #Sex of the participant (M for Male and F for Female) (ethical
guidelines apply)
20 age: #Age of the participant (with digits) (ethical guidelines apply)
21 Description: #A small description of the participant (if relevant)
22 [/​METADATA] #Indicates that the metadata section ends

23 [BODY] #The transcription section starts
24 Person 1: #An example of the transcription of ‘Person 1’
25 [/​BODY] #The transcription section ends
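
To make the template concrete, a filled-in version for one of our interviews might
look like the following sketch. Every value is invented for illustration, and ethical
guidelines apply to names and demographic details:

[METADATA]
Title: Colebrook ICT middle leader interview on appraisal
Date Recording: 2013-05-20
Date Transcription: 2013-06-02
Locale: Colebrook School, UK
Principal Investigator: [name]
Researcher: R01
Transcriber: T01
Editor: E01
Authority: [institution]
ID: COLEBROOK-INT-001
Language: EN
MediaFileName: colebrook-int-001.mp3
Participants:
person: P1
name: [anonymised]
role: interviewer
sex: F
age: 41
Description: researcher
person: P2
name: [anonymised]
role: interviewee
sex: M
age: 38
Description: ICT middle leader
[/METADATA]
[BODY]
Person 2: I mean we implemented a big change this September …
[/BODY]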

This template can be used to draft all the necessary information before it is
actually coded, which will make both our transcription and coding more system-
atic. This is what the actual TEI coding may look like:26

<teiCorpus>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Colebrook-​ICT Middle Leader-​Interview 1 </​title>
<author xml:id=“Initials”>Name</​author>
<respStmt>
<name xml:id=“Initials”>Name</​name>
<resp>transcription</​resp>
<resp>annotation</​resp>
</​respStmt>
<sponsor>Name of sponsor</​sponsor>
<funder>
<address>
<addrLine>Address line 1</addrLine>
<addrLine>Address line 2</​addrLine>
</​address>
<email>email here</​email>
</​funder>
</​titleStmt>
<publicationStmt>
<publisher>Name of the publisher, usually the institution</​publisher>
<distributor>Name of the distributor, either the institution or a repository
name</​distributor>
<availability status=“free”>
<p>Published under a <ref target=“http://​creativecommons.org/​
licenses/​by-​sa/​3.0/​”>Creative Commons Attribution ShareAlike 3.0
License</​ref>.</​p>
</​availability>
<date when=“2014-​01-​01”>1 January 2014</​date>
</​publicationStmt>
</​fileDesc>
<encodingDesc>
<editorialDecl>
<normalisation method=“markup” source=“http://​www.oed.com/​”>
<p>Spelling has been modernised using the <gi>orig</gi>/<gi>reg</gi> elements, wrapped in a <gi>choice</gi> element.</p>
</​normalisation>
<interpretation>
<p>Thematic analysis added, studying the main motifs.</p>
<p>Names and dates are marked.</​p>
</​interpretation>
</​editorialDecl>
<projectDesc>
<p>Narrative here</​p>
</​projectDesc>
<classDecl>

<taxonomy xml:id=“SchoolInterviews”>
<category xml:id=“SchoolInterviews.Appraisal”>
<catDesc>Annotating appraisal</catDesc>
<category xml:id=“SchoolInterviews.Appraisal.one”>
<catDesc>Appraisal feature 1</catDesc>
</category>
<category xml:id=“SchoolInterviews.Appraisal.two”>
<catDesc>Appraisal feature 2</catDesc>
<category xml:id=“SchoolInterviews.Appraisal.two.a”>
<catDesc>Appraisal feature 2 type A</catDesc>
</category>
<category xml:id=“SchoolInterviews.Appraisal.two.b”>
<catDesc>Appraisal feature 2 type B</catDesc>
</category>
</category>
</category>
<category xml:id=“SchoolInterviews.Action”>
<catDesc>Annotating Action</catDesc>
</category>
<category xml:id=“SchoolInterviews.Impact”>
<catDesc>Annotating Impact</catDesc>
</category>
</taxonomy>

</​classDecl>
</​encodingDesc>
<revisionDesc>
<change when=“2016-01-01” who=“#Initials”>What was done</change>
<change when=“2015-​01-​01” who=“#Initials”>What was done</​change>
</​revisionDesc>

</​teiHeader>
<text>
<!-- Transcription and annotation here -->
</​text>
</​teiCorpus>

Alternatively, we can choose the attributes that are relevant and include them
in the <doc attribute=“value”> string. The classification schemes used must
be defined in the <classDecl> subsection of the encoding description in the
header. Each classification scheme should be identified by means of the xml:id
attribute of a <taxonomy> element. Such taxonomy declarations can define
their own classification categories inside specific <category> elements. The
category descriptions describe the category in a <catDesc> element. We can
use the <taxonomy> element and develop classification categories that can be
defined in separate <category> elements, each with their own xml:id code.
As is the norm in XML, the category is described in a <catDesc> element.
A great advantage is that classification categories can be nested, which means
that we can develop a hierarchical classification system in no time. Let us con-
sider the following coding. We have defined an annotation taxonomy called
SchoolInterviews that includes, just for illustration purposes, three subcat-
egories: Appraisal, Action and Impact.

<taxonomy xml:id=“SchoolInterviews”>
<category xml:id=“SchoolInterviews.Appraisal”>
<catDesc>Annotating appraisal</catDesc>
<category xml:id=“SchoolInterviews.Appraisal.one”>
<catDesc>Appraisal feature 1</catDesc>
</category>
<category xml:id=“SchoolInterviews.Appraisal.two”>
<catDesc>Appraisal feature 2</catDesc>
<category xml:id=“SchoolInterviews.Appraisal.two.a”>
<catDesc>Appraisal feature 2 type A</catDesc>
</category>
<category xml:id=“SchoolInterviews.Appraisal.two.b”>
<catDesc>Appraisal feature 2 type B</catDesc>
</category>
</category>
</category>
<category xml:id=“SchoolInterviews.Action”>
<catDesc>Annotating Action</catDesc>
</category>
<category xml:id=“SchoolInterviews.Impact”>
<catDesc>Annotating Impact</catDesc>
</category>
</taxonomy>

Note that the hierarchy in the example above is taxonomy > Category >
Subcategory > Sub-​subcategory. So the taxonomy has one major category that
includes different sub-categories or labels that can be used to describe, annotate,
code or classify relevant parts of the interview. These annotated sections can be
searched, retrieved and analysed in an efficient way. Table 5.5 shows a breakdown
of the annotation taxonomy and Table 5.6 summarises how to use your own tax-
onomy to annotate and query your data.
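One way, among several that the TEI guidelines allow, of connecting the
transcription to this taxonomy is to point from the text to a category through
the ana attribute. A minimal sketch, assuming the taxonomy above has been
declared in the header, is:

<s ana=“#SchoolInterviews.Appraisal”>I mean we implemented a big change this September …</s>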

Table 5.5 Annotation taxonomy

Level  Tag                                                    Description
1      <taxonomy xml:id=‘SchoolInterviews’>                   Annotation system
2      <category xml:id=‘SchoolInterviews.Appraisal’>         Annotation of appraisal
3      <category xml:id=‘SchoolInterviews.Appraisal.one’>     Annotation of appraisal, feature 1
3      <category xml:id=‘SchoolInterviews.Appraisal.two’>     Annotation of appraisal, feature 2
4      <category xml:id=‘SchoolInterviews.Appraisal.two.a’>   Annotation of appraisal, feature 2, subfeature 1
4      <category xml:id=‘SchoolInterviews.Appraisal.two.b’>   Annotation of appraisal, feature 2, subfeature 2

Table 5.6 Skill 13: annotating and querying your data: using your own annotation taxonomy

Skill 13
• Before transcribing and annotating our interviews, we need to devise a strategy to include metadata and our own annotation in our transcription.
• Typical metadata includes:
  • when the interviews were held
  • roles involved (interviewer, interviewee, etc.)
  • where the interviews were held (if appropriate)
  • description of the research project and funding (if appropriate)
  • demographic data about the participants (gender, age, L1, etc.)
  • any other metadata that is relevant to our research project.
• Annotating our data will give more flexibility to our searches, allowing us to combine the language used by our interviewees, POS tags, structures and our qualitative analysis of the interviews.
• Transcription and annotation can be done in at least two ways: (1) we can insert our annotation tags <tag></tag> in the text straightaway, or (2) we can define an annotation taxonomy in the TEI header of our file.
• Inserting our annotation tags will be preceded by a period of data analysis where we will decide on the annotation to use and, most crucially, the research methods that will be used to identify our annotation system (theme analysis, etc.).
• Using our own annotation tags in combination with the interviews’ metadata (school, city, country, demographic information, gender, age, role of participants in the interviews, languages used, etc.) will give us greater insight into the corpus.
• If we are thinking about exploiting CL methods to the max, we will need to annotate our corpus using some kind of guidelines. TEI guidelines have been devised to transcribe and annotate written and spoken language.
• To do this, we will need to familiarise ourselves with the metadata that will appear in the TEI header of the file and declare our taxonomy there. If we do this, our searches will be incredibly sophisticated, allowing for greater insight.

Notes
1 There are many more types of annotation:  syntactic annotation (parsing), semantic
annotation, pragmatic annotation, discourse annotation, stylistic annotation, lexical
annotation, etc. Parsing and semantic annotation can be carried out automatically.
2 https://​childes.talkbank.org/​
3 She also argues that the limited space in research journals excludes a more in-​depth dis-
cussion of the transcription decisions adopted. We wonder whether this is also having
a negative impact on younger researchers who very rarely see that transcription itself is
part of the data gathering and analysis process.
4 www.inqscribe.com
5 https://​exmaralda.org/​en/​release-​version/​
6 www.fon.hum.uva.nl/​praat/​
7 https://​tla.mpi.nl/​tools/​tla-​tools/​elan/​
8 www.corpustool.com
9 https://​proycon.github.io/​folia/​
10 https://​folia.readthedocs.io/​en/​latest/​introduction.html
11 https://​otranscribe.com/​
12 https://​transcribe.wreally.com/​
13 http://​brat.nlplab.org/​
14 https://​uclouvain.be/​en/​research-​institutes/​ilc/​cecl/​transcription-​guidelines.html
15 http://​users.ox.ac.uk/​~lou/​wip/​metadata.html
16 www.sketchengine.eu/​corpus-​annotation-​and-​structures/​
17 www.sketchengine.eu/​corpus-​annotation-​and-​structures/​
18 https://​talkbank.org/​manuals/​CHAT.pdf
19 Recent TEI guidelines suggest the use of a container element (xenoData) if metadata
from non-​TEI schemes is used in the document.
20 http://​users.ox.ac.uk/​~lou/​wip/​metadata.html
21 www.tei-​c.org/​release/​doc/​tei-​p5-​doc/​en/​html/​examples-​teiHeader.html
22 The <fileDesc> element is compulsory and so are <titleStmt>, <publicationStmt>,
and <sourceDesc>.
23 At the time of writing, the latest TEI guidelines version was 3.6.0, updated 16
June 2019.
24 https://​tei-​c.org/​Vault/​P5/​3.6.0/​doc/​tei-​p5-​doc/​en/​html/​TS.html
25 https://​tei-​c.org/​Vault/​P5/​3.6.0/​doc/​tei-​p5-​doc/​en/​html/​TS.html
26 This is based on the guidelines at: https://​teibyexample.org/​
27 See the BERA Ethical Guidelines for Educational Research: www.bera.ac.uk/publication/ethical-guidelines-for-educational-research-2018

References
Bailey, J. (2008). First steps in qualitative data analysis: transcribing. Family Practice, 25(2),
127–​131.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor
& Francis.
Day, C., Gu, Q. & Sammons, P. (2016). The impact of leadership on student outcomes: How
successful school leaders use transformational and instructional strategies to make a
difference. Educational Administration Quarterly, 52(2), 221–​258.
Gilquin, G., De Cock, S. & Granger, S. (2010). Louvain international database of spoken English
interlanguage (CD-​ROM + handbook). Louvain-​la-​Neuve, BE:  Presses Universitaires de
Louvain.
Gray, D.E. (2004). Doing research in the real world. London: Sage Publications Limited.
Gu, Q. (2015). Interviews at four secondary case study schools. [data collection]. UK Data
Service. SN: 851579, http://​doi.org/​10.5255/​UKDA-​SN-​851579
King, N., Horrocks, C. & Brooks, J. (2019). Interviews in qualitative research. 2nd edition.
London: Sage Publishing Company.
Leavy, P. (2017). Research design: Quantitative, qualitative, mixed methods, arts-​based, and community-​
based participatory research approaches. New York: Guilford Publications.
MacWhinney, B. (2019). Tools for Analyzing Talk. Part 1: The CHAT Transcription Format.
https://​childes.talkbank.org/​
Pérez-Paredes, P. & Alcaraz-Calero, J. (2009). Developing annotation solutions for online
data driven learning. ReCALL, 21(1).
Reppen, R. (2010). Building a corpus: what are the key considerations? In O’Keeffe, A. &
McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics. London: Routledge,  31–​37.
Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Silverman, D. (1993). Interpreting qualitative data: methods for analysing talk, text and interaction.
London: Sage Publishing Company.
Chapter 6

Examining lexis
Analysing peace treaties and
children’s literature

6.1  Examining lexis
In ­chapter 5, we looked at the role of transcription critically, trying to go beyond
some standard practices that see transcription as a mere recording of the words
used in interviews. This chapter will examine how CL methods can provide us
with insights into how vocabulary is used in a corpus. Building on the discussions
in previous chapters, we will offer useful insights into the vocabulary found in a
corpus and discuss ways in which the lexical component of a corpus can con-
tribute to our understanding of how speakers have used vocabulary in distinct
ways. We will use a corpus of peace treaties to examine the relevance of education
in these treaties and try to illuminate how education and possible related concepts
are conceptualised in these documents. We will also look at children’s fiction and
try to showcase some CL methods that can be used to research textual data.
In this chapter we will first look at keyword analysis as a powerful method that
can reveal aspects of language use that might go unnoticed without the use of stat-
istical inference. As Baker (2004) put it, the examination of keywords can reveal,
among other things, aspects of ideology and embedded discourse:

Keywords […] will direct the researcher to important concepts in a text (in
relation to other texts) that may help to highlight the existence of types of
(embedded) discourse or ideology. Examining how such keywords occur in
context and which grammatical categories they appear in, and looking at
their common patterns of co-​occurrence should therefore be revealing.
(Baker, 2004: 347)

6.2  Researching the lexicon: keywords


Keywords can help us identify ‘concepts in discourses, typical vocabulary in a
genre/​language variety, lexical development over time, etc.’ (Brezina, 2018: 79–​80).
Keyword analysis can be used to discover the lexical items that characterise a given
corpus when compared to a reference, usually bigger, corpus. Note that computer
scientists refer to the ‘aboutness’ of online resources or websites as the keywords

that have been set up to describe a resource. For example, when we read a piece
of news in an online newspaper such as The Guardian, we can actually see and
read the URL of the resource and the title1 of the piece on our screens. This
type of information is defined in the <title> of the resource in the following way:

<title>Will Brexit spell the end of English as an official EU language? | Jane Setter | Opinion | The Guardian</title>

In this case, the author and the name of the publication are also displayed and
can be read by the users. However, an online article is also defined by a set of
keywords2 that have been identified either by the author or the editorial team of
the newspaper in the following way:

<meta name=‘keywords’ content=‘European Union,Language,Europe,World news,Brexit’/>

These keywords are not immediately visible to readers as their main ‘readers’ are
search engines such as Google or Bing, enabling them to locate this resource and
‘know’ what it is about. Shari Thurow (2010) puts it this way:

Many search engine optimization (SEO) professionals feel that a web page’s
aboutness is communicated simply by keyword repetition. If you use keywords
many times on a web page, then clearly the page is focused on those keyword
phrases, right? I wish it were that simple. First of all, search engines haven’t
measured keyword density as a ranking factor for a very long time. However,
that doesn’t mean that web pages (and graphic images and multimedia files)
shouldn’t contain keywords. Keywords are essential for communicating
aboutness. But keywords should be placed judiciously so that the aboutness of
the page is clear to both search engines and web searchers.
(Thurow, 2010)

So, in the context of search engine optimisation, keywords are meta tags that
describe what the resource is about. The choice of these meta tags is intentional
and carried out by individuals in order to provide a description of a text or a
resource. In this chapter, we are referring to keywords as items that are identified
exclusively through quantitative methods (Scott & Tribble, 2006), and not to the
notion of keywords as used by SEO experts or as used in sociocultural studies as
glossed by Pérez-​Paredes (2017):

O’Halloran (2010) and Taylor (2017) have suggested that two keyword con-
ceptualization traditions have co-​existed in the past. One is influenced by
cultural studies traditions and sees these words as the body of meanings of
the practices that are central to our societies and institutions. The second
tradition is embodied by corpus linguistics research methodology, one of

its empirical principles being that ‘repeated events are significant’ (Stubbs
2007: 130). In this light, the clustering of lexical items reveals different co-​
textual environments that are built upon co-​collocation and colligation (Pace-​
Sigge 2013) […] Keywords, and, particularly, keyness (Scott & Tribble, 2006),
identify the lexical items that characterize a text or a whole corpus.
(Pérez-​Paredes, 2017: 163)

Scott & Tribble (2006) report that 57% of the keywords from 1,000 randomly
selected texts from the BNC are nouns, determiners, prepositions and pronouns.
Specifically, pronouns, proper nouns, possessive ’s, the verb ‘be’ and common
nouns are the most likely sources of keywords in the English language (Culpeper
& Demmen, 2015). In a study of English news language, Fuentes (2015) reports
that over 70% of the keywords are nouns.
Keywords convey aboutness and, more specifically, they represent keyness. Scott
& Tribble (2006) have defined keyness as:

[…] a quality words may have in a given text or set of texts, suggesting that
they are important, they reflect what the text is really about, avoiding trivia
and insignificant detail. What the text ‘boils down to’ is its keyness, once we
have steamed off the verbiage, the adornment, the blah blah blah […]
(Scott & Tribble, 2006: 55–​56)

So keyness is central to the identification of propositional content in a text or in a
corpus. Part of the method to identify keywords is based on identifying words that
are often repeated in a text:

The basic principle is that a word-​form which is repeated a lot within the text
in question will be more likely to be key in it. A recipe for a cake may well
have several mentions of eggs, sugar, flour, cake. In our case, it is simple ver-
batim repetition, allied to a statistical estimate of likelihood. The method uses
words, no sentences or propositions, and relies on a simple decision as to what
constitutes a ‘word’, namely the presence of space or punctuation at each end
of a candidate string […]
(Scott & Tribble, 2006: 58)

However, simple repetition is not enough. It is necessary to observe the behaviour
of words in a reference corpus: ‘Finding keywords requires a “reference corpus
word-​list” which can indicate how often any given word can be expected to occur
in the language or genre in question. This will be used as a filter.’ (Scott & Tribble,
2006: 58). This comparison between keywords in our target corpus and those in
a reference corpus will determine our final keyword candidate list. As noted by
Culpeper & Demmen (2015), the choice of the reference corpus will definitely
affect the potential for obtaining keyword results that are relevant to the text or the
corpus that we are researching. Usually, a large reference corpus will allow us to

identify more keywords. However, a reference corpus whose domain and register
are far away from the ones of our target corpus will result in the identification of
keywords that do not necessarily reflect the propositional content of the corpus.
Let us focus on some methodological aspects in the following sections.

6.2.1  Introducing keyword analysis


Keyword analyses can be generated through different CL software such as
WordSmith Tools (Windows only) or WMatrix (web service), among many others.
For the sake of simplicity and coherence with the rest of the sections in this book,
we will focus here on using AntConc and Sketch Engine. However, it is only fair to
recognise the extraordinary contribution of WordSmith Tools3 and its developer
Mike Scott to the development of this area of analysis and study (Scott, 2008).
A keyword analysis offers statistical comparisons between the words in a target
corpus4 and a reference corpus. Two word lists are generated, one of the lexical
items in the target corpus and the other of those in the reference corpus, and then
a statistical significance test such as the log-​likelihood test or the chi-​square test is
run. This test will give us a list that shows the keyness value of each word. Culpeper
& Demmen (2015: 96) have stressed that while the procedure is straightforward
and takes no time, ‘the downside of the comparative ease of keyword analysis is
the potential for performing studies in a relatively mechanical way without a suffi-
ciently critical awareness of what is being revealed or how it is being revealed’. We
will try to provide some reflection on how more critical awareness can be achieved
and how researchers can understand the different choices at their disposal and
their implications.
Keyword analysis can identify either single word keywords or multiword
keywords. In our own research (Pérez-Paredes, 2017), the latter have been useful
in identifying recurrent topics and topoi, while the former are broadly instru-
mental in identifying nouns, both proper and common, that characterise a text.
Culpeper & Demmen (2015) have illustrated the range of uses of multiword
keywords in, for example, stylistics and in literary studies:

[Multiword keywords] are best considered as recurring word sequences
which have collocational relationships. Investigating key multiword units has
been particularly fruitful in the area of stylistics, and popularised by Stubbs’
(2005) analysis of Joseph Conrad’s novel Heart of Darkness. Stubbs argues that
looking at single (key)words offers limited prospects for analysis, pointing out
that words occur in recurrent ‘lexicogrammatical patterns’ (2005:  15–​18)
and that ‘collocations create connotations’ (2005: 14). Stubbs illustrates that
recurrent multiword units provide evidence of recurrent metaphors and/​or
intertextuality (2005: 13–​14) which may not be discernible from single-​word
results, and that they can reveal pragmatic and discoursal aspects of dialogue
and narrative. [research shows] how some of the memorable characters in
Charles Dickens’ novels are created through recurrent localized word clusters,
and […] how frequent phrases in Jane Austen’s novels contribute to the con-
struction of characters, and places such as the city of  Bath.
(Culpeper & Demmen, 2015: 95)

Note that the identification of multiword keywords presents a direct way to under-
stand how lexical units convey aboutness while simultaneously revealing discoursal
features of the use of the language. So, a combination of single word keyword and
multiword keyword analysis can be extremely useful when identifying the con-
struction of discourse in a given text or corpus.
Our research project in this chapter involves the examination of how educa-
tion is conceptualised in peace treaties worldwide (Cremin, 2016). We will now
use keyword analysis and multiword keyword analysis to identify the aboutness
and propositional content of examples of these documents. We have gathered five
peace agreements that contain references to education as per the Peace Accords
Matrix search interface5 provided by the Kroc Institute for International Peace
Studies and the University of Notre Dame. The corpus contains 38,618 tokens
(words) and 4,870 types (different words). This analysis may inform areas in peace-​
keeping, peace-​making and peace-​building in the context of peace education
(Cremin & Bevington, 2017). The five peace agreements selected are listed in
Table 6.1 with their United Nations descriptions.

6.2.2  Keyword analysis: a step-​by-​s tep guide


After putting together our corpus,6 we are ready to upload it to Sketch Engine. We
have kept five separate files (see Table 6.1), which will provide us with opportun-
ities to do a fine-​grained analysis of the differences across these documents. The
first consideration to bear in mind is to choose the reference corpora that will be
used in our analysis. We recommend the use of more than one single reference
corpus, each stressing different textual qualities (see Figure 6.1). After running the
tests, we will triangulate our results to make sure that we retain those keywords
that appear consistently across the different tests.
As the data we are examining here is a corpus of peace agreements, we can
expect to find a wealth of legal jargon that will be identified as key vocabulary
through our analysis. A good strategy seems to be to use a reference corpus that
includes legal language and/​or administrative language. Also, we may want to
compare our corpus with more general, all-​purpose written English. Finally, we
can use our own corpus as reference corpus and run an analysis of each of the five
documents in order to examine the individual keywords in each agreement. The
strategy we propose here is conceptualised in Figure 6.1. As you can see in A and
B, our target corpus is the corpus of peace agreements; in C, however, the corpus
of peace agreements becomes our reference corpus. This way we can contrast
individual agreements with the corpus analysed. As for the size of the reference
corpus, we can expect that it will impact the results of the analysis. Berber-
Sardinha (2000: 12) has shown that a reference corpus that is ‘five times larger

Table 6.1 Peace agreements included in our corpus

Peace treaty    United Nations Peacemaker description or Wikipedia reference*    Signed
Peace Agreement This comprehensive agreement Date:
between the reaffirms the cessation of hostilities 7 July 1999
Government of Sierra of 18 May 1997 and provides Country/​entity:
Leone and the RUF for power-​sharing arrangements Sierra Leone
(Revolutionary United between the elected government
Front) (Lomé Peace and the RUF. It also provides
Agreement) provisions on DDR (disarmament,
demobilisation and reintegration)
and the transformation of the RUF
into a political party. The United
Nations signed as a witness to
the agreement with an explicit
reservation stating that the United
Nations does not accept immunity
for war crimes and crimes against
humanity.
General Agreement on The signing of this agreement and Date:
the Establishment of the subsequent convening of 27 June 1997
Peace and National the Commission on National Country/​ entity:
Accord in Tajikistan Reconciliation launched the Tajikistan
period of transition and the
implementation of the following
agreements:
Protocol on the Fundamental
Principles for Establishing Peace
and National Accord (17 August
1995) and, among others, the
Protocol on the Guarantees of
Implementation of the General
Agreement on the Establishment
of Peace and National Accord in
Tajikistan (28 May 1997)
The final agreement on The agreement establishes Date:
the implementation implementing structures and 2 September 1996
of the 1976 Tripoli arrangements during a transitional Country/​ entity:
Agreement between period before a new Regional Philippines
the Government of Autonomous Government is
the Republic of the established.
Philippines (GRP) and
the Moro National
Liberation Front
(MNLF).
Chittagong Hill Tracts The Chittagong Hill Tracts Peace Date: 2
Peace Accord (CHT) Accord also known as Chittagong December 1997
Hill Tracts Treaty, 1997, is a political Country: Bangladesh
agreement and peace treaty
signed between the Bangladeshi
Government and the Parbatya
Chattagram Jana Sanghati Samiti
(United People’s Party of the
Chittagong Hill Tracts), the political
organisation that controlled the
Shanti Bahini militia on 2 December
1997. The accord allowed for
the recognition of the rights of
the peoples and tribes of the
Chittagong Hill Tracts region and
ended the decades-​long insurgency
between the Shanti Bahini and
government forces.*
Comprehensive Peace This comprehensive peace agreement Date:
Agreement between seeks to end the conflict between 22 November 2006
the Government the Government of Nepal and Country/​ entity:
of Nepal and the the Communist Party of Nepal Nepal
Communist Party of (Maoist). It comprises a ceasefire
Nepal (Maoist) including the management of arms
and armies of both the national
army and the Maoist group by the
United Nations. It calls for political,
economic and social change in
the country and adherence to
humanitarian law and human rights
principles, including through the
establishment of a National Human
Rights Commission, a Truth and
Reconciliation Commission and a
National Peace and Rehabilitation
Commission. The agreement calls
for the election of a constituent
assembly to end the transition
period and calls on the UN to
observe and assist the electoral
process. The agreement also
calls for the nationalisation of all
property belonging to the King and
Queen and to decide by a simple
majority in the first constitutional
assembly meeting whether to retain
the monarchy as an institution.
*Source: Wikipedia25

than the study corpus yields a similar amount of keywords to reference corpora
that are up to 100 times larger than the study corpus’, so the five-​times rule seems
a reasonable way forward, although we recommend using a myriad of corpora
and triangulating the results.
To exemplify scenario A  in Figure  6.1, we chose the British Law Reports
Corpus,7 an 8.8 million-​word corpus of judicial decisions made between 2008 and
2010 by British courts, available on Sketch Engine. After running a keyword ana-
lysis, we find that the keywords listed in Table 6.2 yield the highest keyness scores:
Note that Table 6.2 only shows the top 10 strongest keywords in the corpus.
As expected, the strongest candidates are all proper nouns (Tajikistan, RUF, Sierra,

[Figure 6.1 shows three scenarios. A: the target corpus is our corpus and the
reference corpus reflects the domain and the register features of the target corpus.
B: the target corpus is our corpus and the reference corpus reflects the language
used widely in the language of the target corpus. C: the target corpus is an
individual file and the reference corpus is the entire target corpus (except the
file analysed).]

Figure 6.1 Different reference corpora in keyword analysis

Table 6.2 Top 10 keywords using British Law Reports Corpus as reference corpus

Keywords    Score    Frequency in the target corpus    Frequency in the reference corpus
1 Tajikistan 3756.78 168 0
2 Tajik 2661.35 119 0
3 Opposition 2370.72 106 0
4 RUF 1275.28 57 0
5 Reconciliation 1185.86 53 0
6 Autonomous 1029.37 46 0
7 Peace 878.54 51 3
8 Sierra 851.9 114 20
9 Nuri 850.52 38 0
10 Leone 769.73 103 20

Table 6.3 Top 10 keywords using the British National Corpus as reference corpus

Keywords    Score    Frequency in the target corpus    Frequency in the reference corpus
1 Tajikistan 2035.53 168 95
2 Tajik 1949.78 119 41
3 RUF 1252.98 57 2
4 Opposition 823.7 106 211
5 Leone 793.04 103 214
6 Autonomous 754.15 46 41
7 Reconciliation 718.8 53 73
8 MNLF 691.76 32 4
9 Signed 671.45 34 15
10 Nuri 661.97 38 32

Leone) or common nouns (opposition, reconciliation, peace). If we choose the entire
British National Corpus (scenario B in Figure 6.1), we obtain the top 10 relevant
keywords that are listed in Table 6.3.
Eight of the ten most relevant keywords are identical in the two lists: Tajikistan,
Tajik, RUF, Opposition, Leone, autonomous, reconciliation and Nuri. MNLF, top 8 in
Table 6.3, and signed, top 9 in Table 6.3, appear as top 13 and top 11, respectively,
in the keyword analysis that uses the British Law Reports Corpus as a reference
corpus. The keyness score used by Sketch Engine is called simple maths
(Kilgarriff, 2009), a variation on ‘word W is so-and-so times more frequent in
corpus X than corpus Y’8. The simple maths score is considered a better estimate
of keyness than chi-square (see Table 6.4). In our examples, computed over
frequencies normalised per million words, Tajikistan is 3756.78 times more
frequent in our target corpus than in the British Law Reports Corpus, and
2035.53 times more frequent in our target or focus corpus than in the British
National Corpus. If we used AntConc, we would obtain slightly different results.
AntConc is more flexible in terms of the keyness scores that can be used, allowing
for a variety of options at the disposal of researchers. The software, nevertheless,
recommends the default setting of Log Likelihood.9 Culpeper & Demmen
(2015: 97–98) have suggested that using either log-likelihood or chi-square tests
has little effect on the results, as only some minor differences in the ranking of
keywords are found, ‘which had no effect on the overall picture revealed by the
keywords’.
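
For reference, and as we understand it from Kilgarriff (2009), the simple maths
score for a word W is computed over normalised (per-million) frequencies, with a
smoothing constant n that defaults to 1:

keyness(W) = (fpm of W in the focus corpus + n) / (fpm of W in the reference corpus + n)

The smoothing constant is what allows a word that never occurs in the reference
corpus, such as Tajikistan in Table 6.2, to still receive a finite score.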
Culpeper & Demmen (2015: 98) suggest that log-​likelihood suits most analyses:

Rayson (2003), evaluating various statistical tests for data involving low fre-
quencies and different corpus sizes, favors the log-​likelihood test ‘in general’
and, moreover, a 0.01% significance level ‘if a statistically significant result is
required for a particular item’ (2003: 155).
(Culpeper & Demmen, 2015: 98)

Table 6.4 Essential terminology: keyness scores

Keyness can be measured in different ways. Corpus linguistics researchers use various statistical tests that can do the job.
Different corpus management tools offer alternative options to measure keyness. AntConc is a good example of this.
The Log Likelihood (LL) statistic is preferred by many researchers. An LL of => 3.8 is significant at the level of p<0.05, while => 6.6 is significant at p<0.01.
The chi-squared test measures whether an observed frequency and an expected frequency are significantly different. Brezina (2018: 117) suggests that ‘for the chi-squared test the following should be reported: (i) degrees of freedom, (ii) test value, (iii) p-value, (iv) effect size […] and (v) 95% confidence interval for the effect size’.

BUT: The tests described above have met criticism in the past. Culpeper & Demmen (2015: 98) argue the following: ‘[…] the significance test in keyword analyses is often erroneous, notably because of the role of the null hypothesis. The p-value is a conditional probability: it is based on a range of statistics conditional upon the null hypothesis being true […] the solution is to use Bayesian statistics which focus on the weight of evidence against the null hypothesis (i.e. against there being no true difference between the populations from which the samples were drawn). One of the benefits of this technique is that it more accurately pinpoints the items that are highly key in the list.’ Gabrielatos & Marchi (2011) suggest that LL does not measure frequency difference appropriately and propose the use of %DIFF, which measures effect size.
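
As a sketch of what the LL statistic computes (the notation is ours, following
Rayson’s work): for a word with observed frequencies O1 and O2 in two corpora
of N1 and N2 tokens, the expected frequencies are Ei = Ni × (O1 + O2) / (N1 + N2), and

LL = 2 × (O1 × ln(O1/E1) + O2 × ln(O2/E2))

The larger the LL value, the stronger the evidence that the frequency difference
between the two corpora is not due to chance.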

In AntConc, researchers can choose a significance value (p value) and an effect
size measure to rank the keywords. The main disadvantage of using AntConc is
that, compared with the range of corpora available in Sketch Engine, we need to
either upload our reference corpus or locate a keyword list that can be uploaded
and used for comparison.10 We recommend using a couple of different tools such
as Sketch Engine or AntConc and evaluating the affordances that they offer in the
context of our research project.
Note that the identification of keywords through keyword analysis is the starting
point for further, more qualitative inquiry that will examine the individual contexts
and cotexts where keywords occur. As suggested by Pérez-​Paredes (2017, 2019),
the clustering of lexical items reveals different co-textual environments built upon
co-collocation and colligation. These lexico-grammatical environments built on
repeated linguistic patterns create the conditions for the identification of lexical
items characterising a text or a whole corpus (Baker, 2004; Pérez-Paredes,
2019: 156).
Scenario C in Figure 6.1 is useful to examine each of the agreements and con-
trast their keywords against those in the corpus of peace agreements. To do this,
we need to create subcorpora that can be treated individually.11 On the Advanced
tab of any Sketch Engine tool (Concordance, Keywords, etc.), click the plus sign
next to the subcorpus option, and then click ‘Create a new subcorpus’. Figure 6.2
shows the first 13 of the top 50 keywords in the Chittagong Hill Tracts Peace
Accord (CHT). The reference corpus used here is the corpus of peace treaties
containing all five peace agreements except for the Chittagong agreement. This is
a corpus of 34,285 tokens and 4,522 types.
Note how keyness scores across the keywords are lower than those in
Tables 6.2 and 6.3 (except for the first one, Hill). This is an effect of the lexical
items in the reference corpus and the fact that, register-​wise, these are similar
documents, which was not the case when we previously used the British Law
Reports Corpus or the British National Corpus. As in the previous analyses, the
top keywords in Figure 6.2 are proper nouns that define the referential scope of
this agreement: Hill, Chittagong, CHT, Jana, Samhati, Samiti and so on. Other
lexical items are difficult to spot if we are not using CL methods. These include
petition (as in file a petition), vacant, frame, absence and so on. All of these and many
more keywords are used in administrative and legal language, which is to some
extent surprising as they are not found at all in the other four agreements. An
interesting finding is that the first 224 keywords in our results are not found in the
rest of the peace agreements. This initial list would require further examination and
analysis. As for education, we find this lemma at position 579 of the keywords list,
which suggests that in the Chittagong agreement the term education is not particu-
larly relevant in comparison with the rest of peace agreements. Actually, education
is not a keyword in the entire five-​document corpus12 despite occurring 33 times
in total. In the Chittagong Hill Tracts Peace Accord, education occurs only three
times. These are the concordance lines:

• The following subjects shall be added in the No. 3 of the function of the
Council-​(1) Vocational training; (2) Primary education in mother tongue;
(3) Secondary education.
• […] No. 3 of the function of the Council-​(1) Vocational training; (2) Primary
education in mother tongue; (3) Secondary education. b. The words ‘or
protected’ placed in sub-​section 6 (b) of the function of the Council in the first
schedule shall be […]
• […] the educational institution. The govt shall provide necessary scholarships
for research works and receiving higher education in abroad. 11.The govt
and elected representative shall make efforts to maintain separate culture and
tradition of  […]
Figure 6.2 Keywords in the Chittagong Hill Tracts Peace Accord

So, the use of education in this agreement is subordinate to the legal competencies
that will be retained by the government. This lack of engagement with the notion
of education as a driving force for change would, therefore, seem like a missed
opportunity for education to play a bigger role in social justice and for education
and government policies to help ‘to bridge the growing gap between the rich and
the poor’ (Cremin, 2016:  5). Repeating this process of analysis with the other
documents would give researchers a fuller picture of how education is constructed
in peace agreements.
So far, we have looked at single word keywords. Multiword keywords will give
researchers a closer look at how discourse is built through larger units, typically,
but not exclusively, noun phrases. Those researchers interested in understanding
how people construe reality will find that multiword keyword tests can give them
an excellent opportunity to start their analysis by examining
what is unique or different in an interview or in a corpus. This is so because
multiword keywords exemplify how speakers construe a range of relations through
their choice of words. For example, noun phrases modified by adjectives can offer
a robust picture of the external reference used by a speaker or a writer and their
stance towards their propositions. Pre-​and post-​modification in noun phrases
are extremely useful linguistic tools to do so. Table 6.5 gives the picture of the
top 40 multiword keywords that are found in the Chittagong Hill Tracts Peace
Accord. In this analysis, the reference corpus is the peace treaties corpus minus
the Chittagong text.
Note that multiword keywords can be made up of two or more words, such
as normal life, general amnesty and determination of electoral constituency, although most
keywords tend to be two-​word clusters. Only 16 of the keywords occur more than
twice in the corpus, which is only natural given the sizes of both the target and the
reference corpora. In the Chittagong agreement, there seems to be an emphasis
on amnesty, the distinction between tribal and non-​tribal stakeholders, and land. The
references to primary education, secondary education and educational institutions
all make it into the top 100 keywords. The use of educational as in educational insti-
tution is relevant in terms of policy making and is not found in the rest of the
agreements:

Until development equal to other region of the country the govt shall continue
reservation of quota system in govt services and educational institutions for
the tribals. With an aim to this purpose, the govt shall grant more scholarships
for the tribal students in the educational institution.
(Chittagong Hill Tracts Peace Accord, 1997, https://​peaceaccords.nd.edu/​)

Now look at Table 6.6 to compare the top 50 multiword keywords in the entire
corpus of agreements (five documents), using the British National Corpus as a
reference corpus.
Note how we have shifted our interest from a more focused examination of
the circumstances of the Chittagong Hill Tracts Peace Accord (Table 6.5) to the

Table 6.5 Multiword keywords in the Chittagong Hill Tracts Peace Accord

Rank    Term    Keyness score    Frequency    Reference frequency
1 normal life 957.570 5 0
2 second line 957.570 5 0
3 priority basis 766.260 4 0
4 prior approval 383.630 2 0
5 other authority 383.630 2 0
6 tribal candidate 383.630 2 0
7 fringe land 383.630 2 0
8 third line 383.630 2 0
9 tribal member 383.630 2 0
10 rubber plantation 383.630 2 0
11 other punishment 383.630 2 0
12 permanent resident 383.630 2 0
13 original rule 383.630 2 0
14 first schedule 383.630 2 0
15 following sub-​section 383.630 2 0
16 holding tax 383.630 2 0
17 disciplinary action 192.310 1 0
18 certain address 192.310 1 0
19 civil administration 192.310 1 0
20 general administration 192.310 1 0
21 signing agreement 192.310 1 0
22 general amnesty 192.310 1 0
23 life general amnesty 192.310 1 0
24 normal life general amnesty 192.310 1 0
25 such appointment 192.310 1 0
26 getting assistance 192.310 1 0
27 giving assistance 192.310 1 0
28 proper authority 192.310 1 0
29 natural calamity 192.310 1 0
30 competent tribal candidate 192.310 1 0
31 qualified candidate 192.310 1 0
32 case of fringe land 192.310 1 0
33 police circular 192.310 1 0
34 possible cooperation 192.310 1 0
35 including coordination 192.310 1 0
36 land commission 192.310 1 0
37 proper conduct 192.310 1 0
38 firm confidence 192.310 1 0
39 electoral constituency 192.310 1 0
40 competent court 192.310 1 0

Table 6.6 Multiword keywords in the peace agreements corpus (five treaties)

Rank    Term    Keyness score    Frequency    Reference frequency (BNC)
1 present agreement 1.104.940 52 6
2 autonomous government 750.440 35 5
3 autonomous region 464.080 34 72
4 national accord 458.550 21 3
5 regional autonomous 336.340 15 0
government
6 national reconciliation 263.790 19 69
7 transition period 159.620 13 93
8 Nepali army 157.490 7 0
9 establishing peace 156.240 7 1
10 armed conflict 150.560 11 72
11 governmental power 131.790 7 22
12 humanitarian assistance 115.600 6 19
13 regional law 111.880 5 1
14 unhindered access 109.920 5 3
15 legislative assembly 109.670 7 49
16 educational system 99.500 13 217
17 Maoist army 90.420 4 0
18 security detail 90.420 4 0
19 priority basis 88.910 4 2
20 broad-​based government 87.370 4 4
21 achieving peace 86.610 4 5
22 nationwide referendum 85.870 4 6
23 lasting peace 85.440 5 36
24 political confrontation 85.140 4 7
25 general agreement 84.360 9 157
26 regular session 82.430 4 11
27 peace agreement 82.380 8 133
28 public office 71.470 5 65
29 special fund 70.920 4 31
30 reciprocal-​pardon act 68.070 3 0
31 post-​war rehabilitation 68.070 3 0
32 full exchange 67.530 3 1
33 regional principle 67.530 3 1
34 final peace agreement 66.930 3 2
35 mutual forgiveness 66.930 3 2
36 full implementation 66.680 4 40
37 sincere gratitude 66.340 3 3
38 strict implementation 66.340 3 3
39 process of national 66.340 3 3
reconciliation
40 voluntary return 66.340 3 3
41 new mandate 65.770 3 4
42 interim agreement 64.640 3 6
43 final peace 63.550 3 8
44 mutual confidence 61.540 3 12
45 civil conflict 61.540 3 12
46 priority order 61.540 3 12
47 general supervision 61.050 3 13
48 direct supervision 60.080 3 15
49 immediate implementation 59.140 3 17
50 free pardon 58.680 3 18

Table 6.7 Education-​related multiword keywords in a corpus of peace agreements

Rank    Term    Keyness score    Frequency    Reference frequency (BNC)
16 educational system 99.500 13 217
114 educational component 44.550 2 3
126 national educational system 43.780 2 5
129 non-​formal education 43.040 2 7
176 public education 30.510 2 56
317 free compulsory education 23.360 1 0
364 timely educational intervention

exploration of the aboutness of peace agreements as a semiotic resource when
implementing peace processes worldwide (Table 6.6). While some of the terms
in Table 6.6 deal with intra-​referential textuality such as present agreement or general
agreement, there are at least two other broader groups of multiword keywords of
interest:

• Some keywords are very frequent in the reference corpus. However, given
their different sizes, they still qualify as significant keywords in the corpus.
This group includes national reconciliation, armed conflict and, interestingly, educa-
tional system. The exploration of the patterns of use of these keywords in the
two corpora will give researchers a better insight into how language is being
used in the target corpus and will contribute to drawing conclusions in terms
of the research questions in their projects.
• A second group of keywords is made up of terms that are not found in the ref-
erence corpus. This is extraordinary as the reference corpus used here is the
100-​million-​word BNC. This group includes terms such as delivery of humani-
tarian assistance, geo-​political structure or democratic restructuring. Again, a closer
examination of the cotexts and the contexts of use will be necessary to gain
further understanding of what the implications are.

However, the finest-grained analyses of keywords will be driven by our interest in
specific areas of scrutiny. If we look at the role of education in peace agreements,
we will find that different keywords cover this area. Table 6.7 shows the occurrences
of education and educational in the top 1,000 keywords in the corpus.
In section 6.3 we will explore ways in which we can investigate nouns in a corpus.
Before that, Table 6.8 offers the basics of keyword analysis.

6.3  Researching nouns and noun phrases


We can use the keywords in Table  6.7 to explore how education is construed
in peace agreements. As suggested by Baker (2004) (see Section 6.1), studying
the grammatical context in which keywords occur will be useful to understand
patterns of co-occurrence in the texts we are examining. We propose that
researchers follow the pathway in Figure 6.3, where A stands for the process of
identification of the keywords (see section 6.2), B stands for the analysis of
concordance lines (see section 3.2) and C stands for the examination of the
colligational behaviour of our target words.
In what follows, we will focus on the analysis of step C in Figure 6.3: the gram-
matical and collocational behaviour of our keywords. We will exemplify this step
by means of the Word Sketch function in Sketch Engine. We will illustrate two
approaches here. The first will look at the word sketches of the noun headword
in the keywords, for example education in public education or in free compulsory educa-
tion. The second will look at a more recent feature in Sketch Engine: multiword
Sketch.
Before Sketch Engine, colligational analyses were performed manually in a
qualitative fashion by means of a detailed observation of ‘recurrent syntagmatic
proximity, rather than a statistical tendency’ (Mauranen, 2003:  33). For corpus
linguists, this is a fascinating task that involves the scrutiny of how very frequent
words, typically function words such as articles or prepositions, combine with less
frequent words, mostly lexical words such as nouns and verbs, in order to produce
meaning. Sinclair (1991) put it this way:

There is a frequent tendency for frequent words, or frequent senses of
words, to have less of a clear and independent meaning than less frequent
words or senses […] The tendency can be seen as a progressive
delexicalization, or reduction of the distinctive contribution made by that
word to the meaning.
(Sinclair, 1991: 113)

The exploration of colligation is essential in order to discover units of meaning
that go beyond words and their limited contribution to how individual word
meanings are represented either in dictionaries or even in the way in which
speakers describe their declarative knowledge13 about words.

Table 6.8 Skill 14: understanding keywords


Skill 14 •​ Keywords are words that are considerably more
frequent in corpus A than in corpus B.
•​ ‘It is important to remember that “keywords” is a
relative term depending on the differences in lexical
frequencies in the two corpora in question’ (Brezina,
2018: 79).
• ‘A word is key if […] its frequency in the text when
compared with its frequency in a reference corpus is
such that the statistical probability as computed by an
appropriate procedure is smaller than or equal to a p
value specified by the researcher’ (Scott, 2011).
• Sketch Engine and AntConc can be used to generate
keywords. However, they make use of different tests
to measure keyness and work in different ways with a
reference corpus.
• While the identification of keywords is based on
quantitative analysis of text, it is necessary that we
understand the cotexts where such words occur.
• Most keyword analyses make use of a limited number
of words (top 10, 20, 50, 100, and so on). Make sure you
understand the implications of your cut-off point (e.g. top
20 keywords) and be cautious about how you interpret your
findings within the limits of your analysis.
• ​‘Keywords are an extremely rapid and useful way of
directing researchers to elements in texts that are
unusually frequent (or infrequent), helping to remove
researcher bias and paving the way for more complex
analyses of linguistic phenomena’ (Baker, 2004: 348).
• The interpretation of the linguistic patterns will most
surely provide insights into the research questions of
our project. This interpretation will take keywords as
its starting point.
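For readers who want to see the arithmetic behind keyness, the following minimal Python sketch (our illustration, not the code used by Sketch Engine, AntConc or WordSmith) computes the log-likelihood statistic popularised by Rayson's online calculator (see note 9). The word frequencies and corpus sizes in the example are hypothetical.

import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    # Expected frequencies under the assumption that the word is
    # equally likely in the target and the reference corpus.
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical example: a word occurring 120 times in a 250,000-word
# target corpus and 320 times in a 1,000,000-word reference corpus.
print(round(log_likelihood(120, 250_000, 320, 1_000_000), 2))

The higher the score, the less likely it is that the difference in frequencies between the two corpora is due to chance.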

A: keyword analysis (identification of the significant keywords) → B: concordance lines (identification of the left and right contexts) → C: colligational analysis (exploration of the local grammars)

Figure 6.3 Working with keywords: a suggested pathway



Figure 6.4 Sketch Engine Word Sketch basic interface

6.3.1  Exploring individual nouns


Let us consider education as represented in Table 6.7. An examination of how this
noun is used in the corpus will reveal the ways in which it combines with other
words to form grammatical relations. The Word Sketch function can be accessed
from the Sketch Engine dashboard. Figure 6.4 shows the basic search interface.
Figure 6.5 shows the advanced search interface. We can input which part of speech we want to search (i.e. study as a noun or as a verb), the minimum frequency in the corpus and the minimum LogDice significance score.14
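As a brief illustration of what lies behind this score, the logDice measure documented by Sketch Engine (see note 14) follows Rychlý's formula, 14 + log2(2*f(x,y)/(f(x)+f(y))). A minimal sketch of our own, with hypothetical counts:

import math

def log_dice(f_xy, f_x, f_y):
    # f_xy: frequency of the collocation; f_x and f_y: frequencies of
    # each word on its own. The theoretical maximum score is 14.
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts for a collocation such as 'compulsory education'.
print(round(log_dice(12, 25, 180), 2))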
Figure 6.6 summarises the results of the analysis. All the collocates reported are statistically significant at p = 0.01.
The word education is modified, in decreasing order of statistical significance,
by: non-​formal, compulsory, secondary, primary, free, public, basic, high, Leone, Rights and
Human. Free education and public education are only mentioned in the Lomé
agreement, where commitments were made to guarantee a set of rights that
include other welfare services:

The Government commits itself to propose and support an amendment to


the Constitution to make the exploitation of gold and diamonds the legit-
imate domain of the people of Sierra Leone, and to determine that the
proceeds be used for the development of Sierra Leone, particularly public
education, public health, infrastructure development, and compensation of
incapacitated war victims as well as post-​war reconstruction and development.
(Lomé Agreement, 1997, https://​peaceaccords.nd.edu/​)

The Lomé agreement is unique in stating explicitly the need and pledge to pro-
mote human rights education:

The Parties further pledge to promote Human Rights education


throughout the various sectors of Sierra Leonean society, including the
schools, the media, the police, the military and the religious community.
(Lomé Agreement, 1997, https://​peaceaccords.nd.edu/​)

Figure 6.5 Sketch Engine Word Sketch advanced interface

References to educational sectors (primary, secondary, etc.) are only found in the
Chittagong agreement and in the agreement between the Government of the
Republic of the Philippines (GRP) and the Moro National Liberation Front (MNLF),
where there is an explicit reference to how education will look once the agreement
has been signed, and a full description of the new educational system is provided:

It shall develop the total spiritual, intellectual, social, cultural, scientific and
physical aspects of the Bangsamoro people to make them Godfearing, pro-
ductive, patriotic citizens, conscious of their Filipino and Islamic values and
Islamic cultural heritage under the aegis of a just and equitable society. The
Structure of Education System. The elementary level shall follow the basic
national structure and shall primarily be concerned with providing basic
education; the secondary level will correspond to four (4)  years of high
school, and the tertiary level shall be one year to three (3) years for non-​degree

Figure 6.6 Visualisation of Word Sketch for education in the peace agreements corpus

courses and four (4) to eight (8) years for degree courses, as the case may be
in accordance with existing laws. Curriculum. The Regional Autonomous
Government educational system will adopt the basic core courses for all
Filipino children as well as the minimum required learnings and orientations
provided by the national government, including the subject areas and their
daily time allotment.
(Mindanao Final Agreement, 1996, https://​peaceaccords.nd.edu/​)

However, education itself very rarely premodifies or simply precedes other nouns. The exceptions are health in the Lomé agreement, and students and system in the
Republic of the Philippines (GRP) and the Moro National Liberation Front
(MNLF) document. The only verb of which education is the subject is constitute,
in the GRP and MNLF agreement:  ‘Funds for education constituting the
share of the Regional Autonomous Government as contained in the General

Appropriations Act should be given directly to the Autonomous Government.’


Education does not appear as the subject of a verb anywhere else in the entire
corpus. We find some instances where education is used as an object; statistically
significant examples include institutionalise, promote, receive and provide. These are
some of the concordance lines:

The Regional Autonomous Government educational system shall develop the


full potentials of its human resources, respond positively to changing needs
and conditions and needs of the environment, and institutionalise non-​
formal education. (GRP and MNLF agreement)
The Parties further pledge to promote Human Rights education throughout
the various sectors of Sierra Leonean society, including the schools, the media,
the police, the military and the religious community. (Lomé agreement)
The govt shall provide necessary scholarships for research works and receiving
higher education in abroad. (Chittagong agreement)

Finally, education is found in coordinating or adjacent structures where the following


collocates are involved: health, student, programme and Leone.
The picture that emerges from this analysis shows that the noun education combines with only a narrow range of collocates, as if the word itself afforded a limited use in the context of the peace agreements analysed.
As Baker (2004: 352) puts it, ‘a close analysis of concordances and collocations of individual keywords’ and their grouping together ‘according to the purposes that they serve in contributing to particular discourses’ is essential. This is how Baker (2004) illustrates the analysis of keywords – mainly nouns – in his study of gay narratives and masculinity:

For example, the gay keywords sweat, smelly, beer, football, duty, army, and mili-
tary all contributed toward a discourse of hypermasculinity within the gay
narratives. Some of these keywords have semantic links –​for example, army
and military, but it is only by looking at their overall functions in the texts
that stronger links can be made between them (e.g., there is no immediately
obvious link between the words smelly and military). Only through a con-
cordance-​based analysis of these words was it made clear that smelly was
consistently used in a way to construct hypermasculine identities in the gay
texts.15
(Baker, 2004: 352)

Coming back to our analysis of the keyword education in the corpus of peace
agreements texts, we find that the low frequency of occurrence of the noun
itself and the use of a limited range of collocates conspire to present few gram-
matical relations with a very limited set of collocates, as reflected in Figure 6.6.
Particularly, the role of education in clauses either as a subject or as an object is

marginal, even more so when one examines the low frequency of such words (see
Figure 6.6) in the corpus.
Clearly, the term education16 is not a priority in the discourse of peace treaties. If the researcher wishes to abandon the keyword analysis path for a moment, it may be useful to check whether a collocation analysis will return different results. This is not always successful, but it will typically offer a slightly wider range of results that can be tested. In order to test our claim that education is largely irrelevant in the corpus, we will approach the analysis of colligation patterns using the multiword Sketch function.

6.3.2  Exploring multiword units


Sketch Engine looks at both the left and right contexts of a phrase (in our case, the multiword keywords discussed in the previous section; see Table 6.7) and identifies the collocates of each word in the phrase. This is what the process looks like according to the Sketch Engine documentation:

The collocations are only extracted from sentences which contain the col-
location (phrase) in question. In other words, the collocates only come from
contexts where the collocation (phrase) is used. Contexts where the members
of the phrase are used on their own are excluded. This makes it possible to
only display collocates related to a particular word sense or subject.
(www.sketchengine.eu)
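The logic described in this excerpt can be mimicked outside Sketch Engine. The sketch below is ours, over toy sentences; it is not the multiword sketch algorithm itself, which also draws on grammatical relations and statistical scores. It keeps only the sentences that contain the phrase and counts the remaining words in them:

from collections import Counter

def phrase_collocates(sentences, phrase):
    # sentences: lists of tokens; phrase: the multiword unit of interest.
    phrase = [w.lower() for w in phrase]
    n = len(phrase)
    counts = Counter()
    for sentence in sentences:
        tokens = [w.lower() for w in sentence]
        contains = any(tokens[i:i + n] == phrase
                       for i in range(len(tokens) - n + 1))
        if not contains:
            continue  # contexts without the phrase are excluded
        counts.update(t for t in tokens if t not in phrase)
    return counts

sentences = [["The", "national", "educational", "system", "was", "reformed"],
             ["The", "educational", "reform", "failed"]]
print(phrase_collocates(sentences, ["educational", "system"]).most_common(3))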

Educational system is relatively frequent both in our target corpus (13 times) and in the BNC (217 times) when compared with the rest of the multiword keywords. Figure 6.7 shows the colligational patterns of educational system in the target corpus of peace agreements. When compared with Figure 6.6, one can see that the number of collocates has gone down. This can be explained by the fact that the larger a unit of analysis is, the fewer occurrences and colligational patterns emerge in the analysis. In Figure 6.7, we note the following: (1) the words that premodify educational system tend to describe its scope (national, regional) in a matter-of-fact, non-specific way; (2) educational system plays no substantial role either as subject or object in clauses. In other words, its impact on the discourse in the corpus
is very limited. This contrasts with the usage observed in the British National
Corpus (Figure 6.8). This is a larger corpus and, quite naturally, will give us an
accurate insight into how other speakers across a wider set of contexts have used
this particular multiword keyword (educational system).
It is unsurprising that, given the wider range of texts included in the BNC, we can
find more lexical variation in this multiword sketch. The range of adjectives that pre-
modify our keyword is extraordinary, from bourgeois to needs-​oriented and over-​competitive.
As represented in Figure 6.8, the colligational pattern scenario is dominated by pre-
modification, which takes up almost 80% of the entire set of collocates. Equally

Figure 6.7 Visualisation of the grammatical relations of educational system in the peace agreements corpus

interesting is the fact that there is a wide range of collocates where educational system
is either a subject or an object in the clause, as shown in Table 6.9.
This picture of lexical diversity contrasts with what we found in the peace
agreements corpus. Exploring the colligational patterns in a larger, representative
corpus of language use (scenario B in Figure 6.1) can give us a measure of what
is not present in our target corpus. In other words, contrasting usage in different
corpora can provide us with what is frequent and what is not at all frequent in
the data. Baker (2004) suggests that comparing and triangulating data is essential:

Carrying out comparisons among three or more sets of data, grouping infre-
quent keywords according to similar meaning or function, showing awareness
of keyword dispersion across multiple files by using key keywords, carrying

out analyses on key clusters and on grammatically or semantically annotated


data, and conducting supplementary concordance and collocational analyses
will enable researchers to obtain a more accurate picture of how keywords
function in texts.
(Baker, 2004: 357)

Contrasting keywords and words within the same corpus can be illuminating in
many ways. A Word Sketch Difference search between education and peace can be
revealing of the way discourse is built around these two ideas in the corpus. Thus,

Figure 6.8 Visualisation of the grammatical relations of educational system in the BNC

Table 6.9 Collocates of educational system acting as subject or object in clauses in the BNC

educational system as subject of: emit, supplement, object, display, help, prepare, contribute
educational system as object of: flaw, rid, reshape, pervade, re-establish, neglect, condition, centralise, weaken

Table 6.10 Skill 15: researching nouns and noun phrases


Skill 15 • Multiword keywords are units that are considerably more
frequent in corpus A than in corpus B.
• Most multiword keywords contain nouns.
• ‘A multiword unit is key if […] its frequency in the text
when compared with its frequency in a reference corpus
is such that the statistical probability as computed by an
appropriate procedure is smaller than or equal to a p value
specified by the researcher’ (Scott, 2011). So, multiword
keywords are powerful predictors of content and
aboutness.
• While the identification of keywords is based on
quantitative analysis of text, it is necessary that we
understand the cotexts where such words occur.
• We can use nouns in the multiword keywords lists as
the starting point for the interpretation of how linguistic
patterning impacts the creation of meaning.
• Multiword word sketches offer researchers the opportunity
to examine how a phrase behaves grammatically and
collocationally.
• Nouns will typically perform different roles at the phrase or
the clause level. At the phrase level, a noun can either be the
headword of the noun phrase (primary education) or modify
another noun (education system). In clauses, nouns tend to
perform syntactic roles as subjects, objects or even
adverbials.

as we already knew, education shows a very restrictive set of collocations in the corpus (health, programme, Leone and student). Peace, on the contrary, offers a more complex discursive treatment.
Table 6.10 summarises how to research nouns and noun phrases.

6.4  Analysing children's literature: the lexicon of fiction
There are different ongoing projects that have put together corpora of children’s
books. One of them is CLiC (Corpus Linguistics in Context).17 The project
(Mahlberg, Stockwell, de Joode, Smith & O’Donnell, 2016) showcases how cor-
pora can be used as a tool to examine literary texts and, among other outcomes,
‘lead to new insights into how readers perceive fictional characters’.18 Initially
devoted to the fiction of Dickens, the project also allows researchers to examine
direct speech and characterisation in fiction. The CLiC website offers a corpus of
19th century children's fiction composed of 71 books. They also offer a 29-book corpus of general 19th century literature that can be used as a reference corpus. Their
website offers researchers various tools such as a keywords analysis function
where we can specify the number of n-​grams to search (single keywords vs.

Figure 6.9 Multiword keywords in the CLiC corpus of 19th century children’s literature

multiword keywords). Figure 6.9 is a screenshot that shows 2-​gram keywords in


the children’s literature corpus using the general 19th century books corpus as
reference corpus.
The list gives us an accurate picture of the topics and the characters that are
representative of the books written in the 19th century, some of them masterpieces
of children’s literature worldwide such as Alice’s Adventures in Wonderland or The
Happy Prince. All the texts can be reached through a dedicated GitHub repository.19
The Oxford Children’s Corpus (OCC) includes language that children are
likely to read (both fiction and non-​fiction) as well as texts written by children
themselves. This resource can be searched through Sketch Engine but it requires
permission from Oxford University Press.20 The Oxford Corpus has teamed up
with BBC Radio 2’s 500 Words,21 a UK competition for young writers. Only the
best stories can be found online on the BBC website22 but all of them have been incorporated into the Oxford Corpus over the years:

Since 2012, we’ve sent every story we’ve received to Oxford University Press.
These scholarly superstars have now collected 658,477 stories since 2012.
That’s over 328  million words! Our entrants have provided them with the
biggest collection of children’s writing in the world. Why does that matter?
Well, these stories help them to create dictionaries, to understand the lan-
guage children are using and how it’s developing over time. It helps them
work out what kids are interested in: from politics, world events, celebrities

to football. The results from this are taught in seminars and lectures around
the world and help leading figures in Education to improve the way English
is taught in schools.
(www.bbc.co.uk/​programmes/​articles/​1hvt2rmlxVfHyLXhJgDb58B/​
your-​words-​help-​us-​understand-​childrens-​language)

One of these applications is the study Oxford University Press makes every year
of the lexis used by these young writers. In 2019, use of Brexit increased by 464%
when compared to 2018. According to this study,23 children showed an interest
in politics and political language was more widely used than in previous years.
European Union was the main cluster, but trade deal and backstop also made their way into the most used words.
Thompson & Sealey (2007) drew on the British National Corpus to create a
corpus of fiction written for children and compared it with a larger reference
corpus of fiction written for an adult audience. They contrasted these two corpora
not only to examine the similarities and differences in the overall frequencies of
words, parts of speech, etc., but also to examine whether specific lexical items are
used in particular ways when representing the world to child readers. This type
of exploration is of real potential interest to educational researchers looking at specific texts or corpora in their own projects. Thompson & Sealey's (2007) work
discussed in this book (concordance lines, word lists, collocations, keywords, etc.).
Interestingly, they carried out automatic semantic analysis. To do this, they used
Paul Rayson’s Wmatrix24 (Rayson, 2008). They found that the children’s books
corpus was characterised by the following topics (in decreasing order of statis-
tical significance):  living creatures, personal names, food, plants, objects, com-
munication, future, size, sight, fear and speed. In contrast, the adult corpus was
categorised semantically by intimate and sexual relationships, drinks, life, law and
order, anatomy, medicines, strong vs. weak characterisation and thoughts and
beliefs.
N-​grams are multiword units that occur frequently in a corpus. Essentially, n-​
grams are sequences of n words (most typically three, four and five words) that
are parsed, counted and extracted from a corpus. There is substantial research
that shows that the clustering of words is a feature that varies across registers,
characterising them lexically (Gries, Newman, Shaoul & Dilts, 2009). However,
the applications of n-​gram analyses go beyond linguistic analysis. Juola (2013) has
used n-​grams to analyse cultural complexity in the US, and Stamatatos (2009) to
study plagiarism detection.
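Although AntConc and Sketch Engine will extract n-grams for you, the underlying procedure is simple enough to sketch in a few lines of Python. The file name and tokenisation below are hypothetical and deliberately crude:

import re
from collections import Counter

# Hypothetical plain-text corpus file; the regular expression is a very
# rough tokeniser, used here for illustration only.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

n = 4  # extract 4-grams
ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

for gram, freq in ngrams.most_common(20):
    print(" ".join(gram), freq)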
The extraction of the most frequent 4-​grams from the CLiC corpus of 19th
century children’s fiction returns a list of multiword sequences that can serve as a
starting point to understand the specific narrative techniques and concerns of that
period. This is the list of the 20 most frequent 4-​grams and their raw frequency
in the corpus:

1 the top of the, 329


2 the end of the, 296
3 the rest of the, 286
4 at the end of, 285
5 at the same time, 251
6 in the middle of, 246
7 I should like to, 241
8 on the other side, 230
9 the middle of the, 222
10 the other side of, 208
11 in the midst of, 205
12 a great deal of, 202
13 the bottom of the, 196
14 it would have been, 191
15 for the first time, 180
16 the edge of the, 180
17 as soon as he, 177
18 the side of the, 172
19 other side of the, 171
20 at the bottom of, 166

Notions of space, time and quantity seem to dominate the choice of words in this
list, which tentatively confirms the presence of a well-​referenced shared context
in these stories. However, an n-​gram analysis (as set out in Table 6.11) requires a
thorough examination of all relevant n-​grams as well as their distribution in the
corpus and an exploration of the concordance lines where they occur. Mahlberg
(2013) argues that n-​grams can provide a fine-​grained description of the language
used in fiction:

The methodology to identify such functions is the retrieval and analysis of


‘clusters’, i.e., repeated sequences of words such as in the middle of the, as if he
had been, or his hands in his pockets […] clusters can be classified into five groups
corresponding to five functional areas:  labelling of characters and themes
in the texts; interaction between characters through speech; body language;
narrator comments; and time and place references […] clusters are textual
building blocks and their local textual functions contribute to a description of
areas of meanings in textual worlds.
(Mahlberg, 2013: 3)

A list of n-​grams can give researchers some powerful insights into the frequent
multiword lexical items used in the corpus and into what Mahlberg has called
local textual functions (Mahlberg, 2013). McEnery & Hardie (2012) have stressed
how different areas of research are using CL methods to triangulate their results.

Table 6.11 Skill 16: looking at the lexicon of a corpus: n-​grams


Skill 16 • Both AntConc and Sketch Engine can extract n-​grams
from a corpus. From the output, we can easily obtain the
concordance lines where n-​grams occur.
• N-​grams can be used to characterise the language used
in the corpus or as a complementary technique to other
research methods, either qualitative or quantitative.
• An application of n-gram analysis is Google N-gram Viewer, a
web application that examines particular instances of
n-grams in different corpora of books over the selected
years. This analysis provides a diachronic perspective of the
use of a sequence of words over a period of time. You can
also search for POS tags. See https://books.google.com/
ngrams/info for further information.

The use of n-​grams is particularly relevant when looking at formulaicity and how
language is represented in our brains:

[…] the research by Ellis and Simpson-​Vlach (2009) on the status of n-​grams
as psychological units triangulates by incorporating corpus data within a
series of experimental investigations [and] responses by instructors of English
for Academic Purposes expressing their opinions on the formulaicity, cohe-
siveness and educational importance of the n-​grams under study. The three-​
way methodological multiplicity allows Ellis and Simpson-​Vlach to conclude
(2009: 73) that ‘formulaic sequences, statistically defined and extracted from
large corpora of usage, have clear educational and psycholinguistic validity’.
(McEnery & Hardie, 2012: 209)

Note that keyword analysis provides a different route into the content of a corpus, not necessarily rooted in frequency. When looking at n-grams we are actually evaluating effects of absolute frequency (see ­chapter 1). If there are too many concordance lines to be examined, it may be feasible to obtain a random sample that can be analysed in a realistic timeframe; Sketch Engine offers such a function. Table 6.11, meanwhile, offers a summary of how to use n-grams.

Notes
1 www.theguardian.com/​commentisfree/​2019/​dec/​27/​brexit-​end-​english-​official-
​eu-​language-​uk-​brussels
2 www.theguardian.com/​commentisfree/​2019/​dec/​27/​brexit-​end-​english-​official-​eu-
​language-​uk-​brussels
3 www.lexically.net/​wordsmith/​
4 Target corpora are also referred to as focus corpora or corpora of interest.
5 https://​peaceaccords.nd.edu/​search

6 This will involve cleaning up our data and saving .txt files free from noise and unwanted coding. See ­chapter 4 for further guidelines.
7 http://​flax.nzdl.org/​greenstone3/​flax?a=fp&sa=collAbout&c=BlaRC&if=
8 www.sketchengine.eu/​documentation/​simple-​maths/​
9 Paul Rayson has set up a resource that offers the possibility to calculate log-​likelihood
and effect size online: http://​ucrel.lancs.ac.uk/​llwisard.html
10 The AntConc website offers some useful lists, mainly BNC related: www.laurenceanthony.net/software/antconc/
11 You can find further details on how to do this:  www.sketchengine.eu/​guide/​create-
a-​subcorpus
12 British National Corpus used as reference corpus.
13 Sinclair (1991) showed how for many speakers of English back is a part of our body; yet,
in actual usage, it is just a residual sense.
14 More information on LogDice and other Sketch Engine statistics can be found at: www.
sketchengine.eu/​documentation/​statistics-​used-​in-​sketch-​engine/​
15 Baker (2004: 353) also warns us that this process is not quantitative per se: ‘[…] one
problem with combining words into conceptual groups is that it is a subjective process.
Some groups may suggest themselves more clearly to the researcher than others, and
it may be difficult to know how to specify a cut-​off point. Carrying out concordance-​
based analyses of individual keywords should ensure that the researcher first has an
understanding of what such words are used to achieve in a text, before erroneously
combining words that may appear similar at face value. Like many other forms of
linguistic analysis, researchers are required to develop skills of interpretation, which
suggests that corpus-​based research is not a merely quantitative form of analysis.’
16 A thorough analysis would include all keywords (or words if our analysis is preliminary) that in some way or another are related to this notion.
17 Developed as part of the AHRC-​funded project: “CLiC Dickens –​Characterisation in
the representation of speech and body language from a corpus linguistic perspective”
(Arts and Humanities Research Council grant reference AH/​K005146/​1); www.clarin.
ac.uk/​clic
18 www.clarin.ac.uk/​clic
19 https://​github.com/​birmingham-​ccr/​corpora/​tree/​master/​ChiLit
20 www.sketchengine.eu/​oxford-​childrens-​corpus/​
21 www.bbc.co.uk/​programmes/​p00rfvk1?region=uk
22 www.bbc.co.uk/ ​ p rogrammes/ ​ a rticles/ ​ 3 Xk91WDG700VjPYNGMYBrzK/​
a-​life-​sentence
23 http:// ​ f dslive.oup.com/ ​ w ww.oup.com/ ​ o xed/ ​ c hildren/ ​ 5 00- ​ w ords/ ​ B rexit_​
Children’sWordoftheYear_​Infographic_​500Words2019.pdf
24 Wmatrix 4 URL: https://​ucrel-​wmatrix4.lancaster.ac.uk/​
25 https://​en.wikipedia.org/​wiki/​Chittagong_​Hill_​Tracts_​Peace_​Accord

References
Baker, P. (2004). Querying keywords:  questions of difference, frequency, and sense in
keywords analysis. Journal of English Linguistics, 32(4), 346–​359.
Berber-​Sardinha, T. (2000). Comparing corpora with WordSmith Tools: How large must
the reference corpus be? In Proceedings of the workshop on Comparing corpora. Association for
Computational Linguistics, 7–​13.

Brezina, V. (2018). Statistics in corpus linguistics. Cambridge: Cambridge University Press.


Cremin, H. (2016). Peace education research in the twenty-​first century: Three concepts
facing crisis or opportunity? Journal of Peace Education, 13(1), 1–​17.
Cremin, H. & Bevington, T. (2017). Positive Peace in Schools: Tackling conflict and creating a culture
of peace in the classroom. London: Routledge.
Culpeper, J. & Demmen, J. (2015). Keywords. In Biber, D. & Reppen, R. (Eds.) The
Cambridge handbook of English corpus linguistics, 90–​105.
Fuentes, A.C. (2015). Exploiting keywords in a DDL approach to the comprehension
of news texts by lower-​level students. In Leńko-​Szymańska, A. & Boulton, A. (Eds.).
Multiple Affordances of Language Corpora for Data-​driven Learning. Amsterdam: John Benjamins
Publishing, 177–​198.
Gabrielatos, C. & Marchi, A. (2011). Keyness: Matching metrics to definitions. Presentation
at Corpus Linguistics in the South:  Theoretical-​methodological challenges in corpus
approaches to discourse studies –​and some ways of addressing them. University of
Portsmouth.
Gries, S. Th., Newman, J., Shaoul, C. & Dilts, P. (2009). N-​grams and the clustering of
genres. Presented at workshop on Corpus, Colligation, Register Variation at the 31st
Annual Meeting of the Deutsche Gesellschaft für Sprachwissenschaft.
Juola, P. (2013). Using the Google N-​Gram corpus to measure cultural complexity. Literary
and Linguistic Computing, 28(4), 668–​675.
Kilgarriff, A. (2009). Simple maths for keywords. In Mahlberg, M., González-​Díaz, V. &
Smith, C. (Eds.). Proceedings of Corpus Linguistics Conference CL2009, University of
Liverpool.
Mahlberg, M. (2013). Corpus stylistics and Dickens’ fiction. London: Routledge.
Mahlberg, M., Stockwell, P., de Joode, J., Smith, C. & O’Donnell, M.B. (2016). CLiC
Dickens: Novel uses of concordances for the integration of corpus stylistics and cogni-
tive poetics. Corpora, 11(3), 433–​463.
Mauranen, A. (2003). But here’s a flawed argument:  Socialisation into and through
metadiscourse. In Leistyna, P. & Meyer, C.F. (Eds.). Corpus analysis: Language structure and
language use. Amsterdam: Rodopi, 19–​34
McEnery, T. & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge:
Cambridge University Press.
O’Halloran, K. (2010). How to use corpus linguistics in the study of media discourse.
In O’Keeffe, A. & McCarthy, M. (Eds.) The Routledge handbook of corpus linguistics.
London: Routledge, 563–​577.
Pérez-​Paredes, P. (2017). A keyword analysis of the 2015 UK Higher Education Green
Paper and the Twitter debate. In Orts, M.A., Breeze, R. & Gotti, M. (Eds.) Power, persua-
sion and manipulation in specialised genres: providing keys to the rhetoric of professional communities.
Bern: Peter Lang, 161–​191.
Pérez-​Paredes, P. (2019). Little old UK voting Brexit and her Austrian friends: A corpus-​
driven analysis of the 2016 UK right-​wing tabloid discourse. In Hidalgo, E., Benítez,
M.A. & de Cesare, F. (Eds.) Populist discourse:  critical approaches to contemporary politics.
London: Routledge, 152–​171.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus
Linguistics, 13(4), 519–​549.
Scott, M. (2008). Developing wordsmith. In Pérez-​Paredes, P., Scott, M. & Sánchez-​
Hernández, P. (Eds.) Software-​aided analysis of language, International Journal of English Studies,
8(1), 95–​106.

Scott, M. (2011). WordSmith tools manual, version 6. Liverpool: Lexical Analysis Software Ltd.


Scott, M. & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education.
Amsterdam: John Benjamins Publishing.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Stamatatos, E. (2009). Intrinsic plagiarism detection using character n-​gram profiles. In
Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.):  PAN’09, 38–​46. URL:  https://​www.
uni-​weimar.de/​medien/​webis/​events/​pan-​09/​pan09-​papers-​final/​pan09-​plagiarism-​
detection/​stamatatos09-​notebook.pdf
Taylor, C. (2017). Togetherness or othering? Community and comunità in the UK and Italian press. In Chovanec, J. & Molek-Kozakowska, K. (Eds.) Representing the other in European media discourses. Amsterdam: John Benjamins Publishing, 55–80.
Thompson, P., & Sealey, A. (2007). Through children’s eyes?:  Corpus evidence of the
features of children’s literature. International Journal of Corpus Linguistics, 12(1), 1–​23.
Thurow, S. (2010). Keywords, Aboutness & SEO Do your web pages, graphic images, and
videos effectively communicate aboutness to both web searchers and search engines?
How do search engines and searchers determine aboutness? Search Engine Land, https://​
searchengineland.com/​keywords-​aboutness-​seo-​49210
Chapter 7

Analysing talk
Complex searches

In this chapter we will shift our focus to spoken language and interviews. Readers
will examine spoken data and will integrate complex searches into their inquiry
process. A section of the chapter will be devoted to a review of skills 12–​17.

7.1  Examining talk: a linguistic perspective


This chapter will offer insights into how usage can be relevant when analysing talk. Nick Ellis (2020: 239) has recently pointed out the importance of corpus methods in researching how humans use language and how language meaning is crystallised in usage: ‘Corpus linguistics demonstrates that language usage is pervaded by collocations and phraseological patterns, that every word has its own local grammar, and that particular language forms communicate particular functions: lexis, syntax, and semantics are inseparable.’ Corpus linguistics methods
can give researchers unique perspectives into language usage that can serve a
myriad of purposes across the complex area of educational research. Some of
these purposes have been discussed in ­chapters 3 to 6. However, we note that the
best applications of corpus methods to educational research are yet to come, as
educational researchers become more familiar with the range of CL methods at
their disposal.
This chapter examines ways in which we can look at talk by operating more complex corpus searches. Ideally, you will already have transcriptions of classroom talk, seminars or interviews of your own that you want to explore, and we encourage you to do so by uploading your texts to Sketch Engine or AntConc.
In order to illustrate how CL methods can be used together with more complex
searches, we will use a corpus of interviews with professionals across Europe: the
Backbone Corpus of English as a Lingua Franca.1 Backbone (Kohn, 2012) is
an EU-​funded project set up to meet the pedagogic content needs of language
teachers in secondary, higher and vocational education with regard to a pedagogic
integration of content and language integrated learning (CLIL) and e-​learning.
The thematic areas in the interviews incorporate regional perspectives and
include culture, world of work, urban and rural life, social issues, health and social
security, education, environment, government and politics (Kohn, 2012:  5–​13).

Figure 7.1 Interviews in the Backbone corpus of English as a Lingua Franca

The Backbone corpora are freely available2 for learning and academic purposes3.
Figure  7.1 shows a screenshot of the search interface of Backbone corpus of
English as a Lingua Franca.
The corpus consists of 50 interviews containing 81,607 tokens and 5,817 types.
The adults interviewed come from different sectors including education, research,
engineering and art. They speak English as a Lingua Franca (ELF) and during the
interviews they discuss a variety of topics, from environmental issues to work and
family balance, from culture to politics. In this chapter, we will query a text version
of the corpus that was transcribed and annotated following the TEI guidelines
discussed in ­chapter 5. The corpus was cleaned up and, among other things, references to entities such as the project and the staff involved in the transcription and annotation were removed.
From a linguistic perspective, spoken language shows distinctive features when
compared with the written form. Staples (2015) has summarised some of the areas
where spoken English language4 behaves in unique ways.

1 N-​grams are more common in spoken than in written English. Formulaic lan-
guage is a distinctive feature of spoken communication.
2 Stance features are much more common in spoken than in written language.
Sometimes stance devices (adverbs, adverbials, verb + that clause and, among
others, modal verbs) are used frequently for both epistemic and interactional,
strategic purposes.
3 Discourse markers are much more common in spoken language.
4 Spoken discourse is characterised by vagueness.

Although educational research will most likely examine personal experience


and idiographic elements in interviews and other spoken data collection formats

(King, Horrocks & Brooks, 2019), it is necessary to understand that the social
world that the researcher is trying to capture is constrained by the characteristics
of spoken communication. Most educational research drawing on qualitative
methodology tries to understand people as constructing their own meanings of
situations. Cohen, Manion & Morrison (2018) frame this constructionist approach
in the following way:

People are deliberate, intentional and creative in their actions, and meaning
arises out of social situations, interactions and negotiations, and is handled
through the interpretive processes of the humans involved. Meanings used by
participants to interpret situations are culture-​and context‑bound, and there
are multiple realities, not single truths in interpreting a situation.
(Cohen, Manion & Morrison, 2018: 288)

CL methods offer practical ways to unpack the constructed reality. For example,
Locke (2004) has captured relevant areas of Fairclough’s5 text analysis that can be
used by researchers when examining the linguistic fabric of discourse:

• vocabulary (individual words)
  • word meaning: it explores the meaning potential and the changes involved to accommodate (new) discourses
  • wording: the ways in which extralinguistic referents are coded into words is a manifestation of interdiscursivity
  • metaphors: the figures of speech used to structure the way we express our systems of beliefs and knowledge
• grammar (phrases and clauses)
  • modality: evaluates how speakers see propositions, from certainty to possibility
  • transitivity: the ideational dimension of grammar. It explores verb valency to understand relational meanings (be, have, etc.), actions (and the arguments involved, such as agents), events and mental verbs
• cohesion (clause and sentence linking)
  • connectives and argumentation: different types of argumentation have cultural and ideological significance
• text structure (organisational properties of the text)
  • interactional control: explores turn-taking and topic shift, among others.

These areas are worth exploring in the data and, most certainly, can be used to
comprehend how language is interwoven with how humans code their experiences.
Thompson (2013) notes how language uses these resources:

From the experiential perspective, language comprises a set of resources for


referring to entities in the world and the ways in which those entities act on
or relate to each other. At the simplest level, language reflects our view of the
world as consisting of ‘goings-​on’ (verbs) involving things (nouns) that may

have attributes (adjectives) and which go on against background details of


place, time, manner, etc. (adverbials).
(Thompson, 2013: 92)

While this may seem obvious to many of us, it offers a good starting point to
classify experiences in terms of participants, processes and circumstances. The
analysis of verbs in clauses has advanced the study of transitivity (Locke, 2004), that
is, the analysis of the relationship between processes and participants. In short,
processes can be classified in the following ways.

1 Material processes: these involve physical actions as well as different types of


doers, goals and circumstances. Thompson (2013) offers a detailed descrip-
tion of the different types of doers, goals (creative vs. transformative) and
processes (intentional vs. involuntary).
2 Mental processes: these involve activities that occur in our minds. Thompson
(2013) notes that these activities involve at least one human being, who is
described as the senser, and a phenomenon that is sensed. Mental processes can
project, that is, they can be followed by that clauses which introduce a new
process. Most mental processes are expressed in the present simple tense (not
in the continuous aspect and not in the past tense). Mental processes can be
categorised as perceptive (hear), emotive (feel), cognitive (think) and desidera-
tive (want).
3 Relational processes:  these signal a relationship between two concepts.
Attributional processes involve a carrier and an attribute (‘the music is so good’).
Identifying processes identify one entity in terms of another (this framework
is an interpretation of Foucault’s theories). In these cases, we talk about value
and token, the former being the most abstract of the two and the latter the
more concrete.

The actual language used in interviews provides evidence of how these processes
are present in our data. In this chapter, we will adopt a case study strategy to
explore three open research questions:

• How does living in a city affect professional adults?


• How do they understand cultural differences?
• How is their family life impacted by work?

We will use a combination of different methods and search criteria so as to paint a


broad picture of the range of possibilities available to researchers when examining
data in interviews.

7.2  Complex searches


This section explores the use of complex queries when consulting corpora.
Table 7.5 offers a breakdown of how to approach this skill.

7.2.1  Living in a city


We will start our examination of the corpus by running a multiword keyword analysis to identify the top 100 keywords; a selection is listed in Table 7.1.
This keyword analysis can be used to automatically single out those concepts that are more prominent in the corpus. The items in Table 7.1 can then be used as a first-candidate list of nouns and adjectives used by the interviewees, including not only city but also village, town and rural.
We will start our exploration by querying the corpus to obtain a list of
expressions where the lemma city is preceded by an adjective. To do this we need
to build a query that combines words and POS tags. In Sketch Engine, this type
of query is built by selecting Advanced search and then CQL (corpus query lan-
guage) as shown in Figure 7.2.
After selecting the CQL query type, we type our query paying attention to how
we combine its different slots. We first obtain the concordance lines where the
lemma city is preceded by an adjective. To do this we specify in square brackets
that we need to look at city as a lemma ([lemma=‘city’]). The lemma has to be

Table 7.1 A selection of multiword keywords in the Backbone corpus of English as a Lingua Franca

Keyword Keyness score Raw frequency in the corpus


Big city 104.17 27
Small village 81.34 16
City centre 76.52 22
Rural life 40.47 5
Rural area 37.37 6
Village life 35.63 4
Small city 33.22 5
Big town 31.67 3

Figure 7.2 Advanced query interface in Sketch Engine



typed between inverted commas. Before the lemma, we then specify that we want an adjective. We do this by selecting the appropriate POS tag. As seen in Figure 7.2, Sketch Engine offers a list of the tags available once we click on the box TAGS on the right-hand side. Table 7.2 shows some of the most common tags in a corpus tagged by Sketch Engine.

Table 7.2 List of some POS tags used by Sketch Engine

POS tag Description Example


CC coordinating conjunction and
DT determiner the
IN preposition, subordinating conjunction in, of
IN/​that that as subordinator that
JJ adjective big
JJR adjective, comparative bigger
JJS adjective, superlative biggest
MD modal can, will
NN noun, singular or mass city
NNS noun plural cities
NP proper noun, singular Worker
NPS proper noun, plural Workers
NPSZ possessive proper noun, plural Boys’, Workers’
NPZ possessive noun, singular Britain’s, God’s
PP personal pronoun I, he, we
PPZ possessive pronoun my, her
RB adverb usually, officially, here
RBR adverb, comparative
RBS adverb, superlative
RP particle give up
SENT sentence-break punctuation . ! ?
SYM symbol /, [, =, *
TO infinitive ‘to’ to
VB verb ‘be’, base form be
VBD verb ‘be’, past tense was, were
VBG verb ‘be’, gerund/​present participle being
VBN verb ‘be’, past participle been
VBP verb ‘be’, singular present, non-​third person am, are
VBZ verb ‘be’, third person singular present is
VH verb ‘have’, base form have
VHD verb ‘have’, past tense had
VHG verb ‘have’, gerund/​present participle having
VHN verb ‘have’, past participle had
VHP verb ‘have’, singular, present, non-​third person have
VHZ verb ‘have’, third person singular, present has
VV verb, base form think
VVD verb, past tense thought
VVG verb, gerund/​present participle thinking
VVN verb, past participle thought
VVP verb, present, non-​third person think
VVZ verb, third person singular, present thinks

Note that some tags give us specific information about a word's class. For example, in the case of adjectives we can choose between comparative (JJR) or superlative forms (JJS). As for nouns, we can examine common nouns in the singular (NN), common nouns in the plural (NNS), singular proper nouns (NP) or plural proper nouns (NPS). Let us go back to our initial search: [tag="J.*"][lemma="city"]. This will return all the instances in the corpus where the lemma city is preceded by an adjective. Table 7.3 shows a breakdown of the lexical items in the [tag="J.*"][lemma="city"] search.
Note that [tag="J.*"] includes ".*" after J. What we are doing here is asking Sketch Engine to retrieve all tags that start with a "J", including JJ, JJR and JJS. This is a flexible option that can be used to either narrow down or widen the range of results we want to examine. For example, [tag="V.*"] includes the following verb POS tags: VV, VVD, VVG, VVN, VVP and VVZ (see Table 7.2). However, this type of search returns such a wide variety of POS tags and lexical items that the results will be very challenging to process. How do we make sense of all of this information? We can either examine the range of POS tags and/or the lexical elements that fill the POS tag slots. Let us see how this can be done.
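The wildcard logic itself is ordinary regular-expression matching, as the following minimal sketch of ours (over a toy list of tagged tokens) illustrates:

import re

# A toy list of (word, POS tag) pairs.
tagged = [("big", "JJ"), ("cities", "NNS"), ("think", "VVP"),
          ("thought", "VVD"), ("bigger", "JJR")]

for pattern in ("J.*", "V.*"):
    # re.fullmatch requires the whole tag to match the pattern.
    hits = [word for word, tag in tagged if re.fullmatch(pattern, tag)]
    print(pattern, "->", hits)  # J.* -> big, bigger; V.* -> think, thought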

Table 7.3 [tag="J.*"][lemma="city"] search in the Backbone corpus of English as a Lingua Franca

Word Frequency Frequency per million words
1 big city 27 290.27
2 big cities 9 96.7
3 small city 5 53.7
4 small cities 2 21.5
5 many cities 2 21.5
6 different cities 2 21.5
7 young city 1 10.7
8 wonderful city 1 10.7
9 whole city 1 10.7
10 smaller cities 1 10.7
11 particular cities 1 10.7
12 monumental cities 1 10.7
13 mega city 1 10.7
14 main cities 1 10.7
15 industrial city 1 10.7
16 huge city 1 10.7
17 future city 1 10.7
18 current city 1 10.7
19 biggest cities 1 10.7
20 Spanish cities 1 10.7
21 European cities 1 10.7
22 Chinese city 1 10.7
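The per-million column in Table 7.3 is a simple normalisation of the raw counts, which makes frequencies comparable across corpora of different sizes. A minimal sketch, with a hypothetical example:

def per_million(raw_frequency, corpus_size):
    # Normalised frequency: occurrences per one million words.
    return raw_frequency / corpus_size * 1_000_000

# Hypothetical example: 50 occurrences in a one-million-word corpus.
print(per_million(50, 1_000_000))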

[Chart: most frequent verb POS tags by raw frequency: VVP 2,928; VV 2,900; VBZ 2,307; VBP 1,288; VVG 1,131; VHP 854; VVD 732; VVN 719; VVZ 527; VBD 421]

Figure 7.3 Verb POS tags in the corpus analysed

We would first like to know how many of the possible V tags are present in the
corpus. To do this we look at the distribution of these POS tags in the corpus.
Figure 7.3 shows the top 10 most frequent verb tags in the corpus and their raw
frequencies.
The most frequent tags are, in decreasing order of frequency, inflected present tense forms other than the third person singular (I think, we hope, I want); base forms of verbs (I must say, difficult to understand); simple present forms of the verb to be, both third person singular (is) and first and second person (am, are); and gerund forms (talking, working). These tags can be explored individually so as to map out the functions they carry out in the interviews. We can use the tag VVP to explore some of the mainly ideational functions of the language used to express opinions on what it feels like to live in cities or villages. Contrasting VVP and VVD tags seems like a good way into the data. To do this, we can use [tag="VVP|VVD"], where the vertical line | asks Sketch Engine to retrieve either of the two tags.
The results will be easy to sort out by selecting the Show frequency option and
the KWIC (keyword-​in-​context) POS tag (see Table  7.4). From there the con-
cordance lines for each of the tags can be shown.
The new concordance line results can be further sorted, and a rank of verbs
obtained. Figure  7.4 shows the top 20 most frequent verbs used in the present
tense (VVP) in the corpus.
From this screen we can access new concordance lines, for example we may
want to explore think, and we can ask Sketch Engine to return its collocates. The
results will include adjectives such as good (10.21),6 important (10.07), interesting (9.15),

Table 7.4 Essential terminology: KWIC (keyword-in-context)

Essential terminology KWIC

KWIC stands for keyword-​in-​context. The term is used to


denote how concordance lines are shown to users.
The search result is found in a central column, and on
either side we find some of the context preceding and
following each result.

BUT KWIC is also used to denote the search term itself, whether this is a word, a string of words, a POS tag or a lemma.

European (9.05) and different (8.67). These are the adjectives that appear to con-
centrate the expression of opinion in the dataset. However, our main focus is
to explore opinions about living in the city. We could approach this search in
different ways. One of them is to use the following CQL:

(meet [lemma="city|village"] [tag="VVP"] -5 5)

Note that the search needs to be inserted in brackets. What we are doing here
is asking Sketch Engine to give us all the concordance lines where a verb in the
present tense is found five slots to the left or to the right of either city or village. We
could include an adjective in our search instead and increase our search to eight
slots to the left and the right:

(meet [lemma="city|village"] [tag="J.*"] -8 8)

Or we could combine both searches:

(meet [lemma="city|village"] [tag="J.*|VVP"] -8 8)

This is a powerful method to identify language uses that can help us answer our
research questions. Note that the concordance lines obtained from the results
of our search can be exported to a spreadsheet where we can filter out results,
examine the evidence in the data and come up with new hypotheses and target
language patterns. In ­chapter 3, we presented a step-​by-​step procedure to read
and interpret concordance lines. This procedure (summarized in Table  3.7) is
based on the notion of constant recycling of our findings and hypotheses.
Figure  7.5 captures the dynamic nature of this process and acknowledges the
agency of the researcher while assessing the usage that characterises the lived
experiences of those interviewed. We suggest that researchers start the exploration of

Figure 7.4 Most frequent verbs used in the present tense (VVP)

the language in the texts with a word list (either a general word list or a word list of
nouns, verbs or adjectives) and then move on to explore specific items in the corpus.
An alternative way in is to start with those lexical items that will be key to interpreting
the research questions in the project. The two circles in Figure 7.5 exemplify how CL
methods (grey arrows) can guide our inquiry (black arrow, inner circle).
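To make the 'meet' logic above concrete, here is a minimal sketch of ours over a hypothetical list of word/lemma/tag triples; it collects VVP verbs occurring within five tokens of city or village:

def window_hits(tokens, node_lemmas, target_tag, span=5):
    # tokens: (word, lemma, tag) triples; node_lemmas: the lemmas acting
    # as the node of the search; target_tag: the POS tag we collect in
    # the window around each node.
    hits = []
    for i, (_, lemma, _) in enumerate(tokens):
        if lemma in node_lemmas:
            window = tokens[max(0, i - span):i + span + 1]
            hits += [word for word, _, tag in window if tag == target_tag]
    return hits

tokens = [("I", "I", "PP"), ("love", "love", "VVP"),
          ("this", "this", "DT"), ("big", "big", "JJ"),
          ("city", "city", "NN")]
print(window_hits(tokens, {"city", "village"}, "VVP"))  # ['love']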
Let us now explore our second question.

[Figure: a cyclical model. Outer circle (CL methods): word lists, lexical items, collocations, more complex searches. Inner circle (inquiry): initiate a search → results → first hypothesis → collocations → more complex searches → hypothesis testing + report → recycle]
Figure 7.5 A model for the examination of concordance lines

7.2.2  Understanding cultural differences


A simple search of the lemma difference will return all the concordance lines where
it appears in the corpus. So, one of the options available to the researchers will
be to use this lemma as KWIC and explore the context to the left and the right.
This will lead to the formulation of our first hypotheses in terms of how the
lemma was used by the interviewees (see Figure 7.5). Of course, if we focus on
the interviewees and not on the language used by the interviewer, we will need to
run a search that does not include the questions or the exchanges produced by the
interviewers. This search will only be possible if the corpus has been annotated to
mark up who is speaking. In this dataset, the TEI structure where this happened
was <u>, utterance, and the attribute is ‘who’:

<u who="#Interviewer">

A search like this will return all the concordance lines where the lemma difference has been used by speakers other than the interviewer:

[lemma="difference"] !within <u who="#Interviewer"/>

Note that we are searching within structures in a corpus that had previously been
marked up by either transcribers or annotators (see ­chapter 5). The structures of
a corpus can be found in their XML structure or, if we are using Sketch Engine,
in Corpus Information > Structures and Attributes. This bit from the previous
search:

!within <u who="#Interviewer"/>

is asking Sketch Engine to retrieve all the instances of the lemma difference not used by speakers marked up as interviewers. ‘!’ is an operator that means ‘not’, hence not within the structure <u who="#Interviewer"/>. This will let us focus our searches on those speakers that are of interest. Now try to think how useful this will be if, for instance, you are working with a corpus of classroom talk where we can identify everyone in the room in terms of their role (teachers, learners, etc.), gender, age, ID or name. As we saw in previous chapters, the main analytical
tool for corpus users is the comparison between corpora or datasets. Hence the
importance of isolating parts of a corpus based on specific criteria. Once this is
done, a subcorpus can be created. We can then perform specific searches within it
or compare this subcorpus against another dataset.
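If you prefer to build such a subcorpus yourself before uploading it, the same filtering can be scripted. A minimal sketch, assuming a TEI file with <u who="..."> utterances as described in ­chapter 5 and, for simplicity, no XML namespace; the file name is hypothetical:

import xml.etree.ElementTree as ET

tree = ET.parse("interview.xml")  # hypothetical TEI transcript
# Keep the text of every utterance not produced by the interviewer.
interviewee_turns = ["".join(u.itertext())
                     for u in tree.iter("u")
                     if u.get("who") != "#Interviewer"]
print(len(interviewee_turns), "utterances kept")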
A search like this one:

[lemma="similarity|difference|culture|cultural"] !within <u who="#Interviewer"/>

will return all the concordance lines where those interviewed discuss the differences
that they found when living or working in a different country or region. Some of
the significant collocations with the lemma difference include countries (12.19), prob-
ably (11.78) and but (11.23). A first hypothesis indicates that the interviewees tend
to hedge their opinions considerably, smoothing out their judgements as regards
other cultures. Using the procedure in Figure 7.5 we can refine and fine-​tune our
hypotheses. The next section explores how family life is impacted by work in our
corpus.

7.2.3  How is their family life impacted by work?


We now turn our attention to opinions where interviewees discuss issues concerned
with family life and work. We could start our initial search with any of these lex-
ical items or search for [lemma="family|life|work"], which will probably return

far too many concordance lines. We may want to reduce our focus by looking at
uses of work as a noun (and not as a verb). To do this we need to specify that we
are only interested in results that have been tagged as nouns. We can do it in the
following way:

[lemma="family|life" | (lemma="work" & tag="N.*")]

We have here specified that we are only interested in the lemma work when it is a noun, using the ampersand symbol to let Sketch Engine know that both conditions, the lemma work and a noun tag, must be met by the same token: & tag="N.*".
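The same token-level condition can be expressed in a few lines of Python over a list of word/lemma/tag triples; a minimal sketch of ours with toy data:

# Keep tokens whose lemma is family or life, or whose lemma is work
# only when it carries a noun tag (any tag beginning with N).
tokens = [("work", "work", "NN"), ("works", "work", "VVZ"),
          ("family", "family", "NN"), ("life", "life", "NN")]

matches = [word for word, lemma, tag in tokens
           if lemma in ("family", "life")
           or (lemma == "work" and tag.startswith("N"))]
print(matches)  # ['work', 'family', 'life']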
If the corpus has been previously annotated or coded, we may want to use this
annotation to search within structures that are likely to contribute to our explor-
ation of the data. In the case of our TEI corpus, the coding team decided to
include this information at the <div> level (as the corpus was segmented into
thematic sections) and within the decls attribute. The following is an example of a
complex search that integrates various lemmas and a specific part of the corpus
annotated as particularly relevant for the study of the world of  work:

[lemma="family|life|work"] within <div decls="#worldofwork"/>

Given the complexity of the annotation of this corpus, with different overlapping
topics or tags annotated simultaneously by a team of international educators, the
search above returns exclusively those parts of the corpus that were annotated
with #worldofwork as their only tag. A simple examination of the different decls
attributes in Sketch Engine reveals that the corpus was annotated with 332
different combinations of tags. This no doubt enriched the search experience
of the corpus users and gave the annotators flexibility when making decisions
about the coding scheme. Another option is to search for
our lemmas within any annotated part of the corpus. To do this we can use the
following CQL:

[lemma="family|life|work"] within <div decls="#[A-Za-z].*"/>

This is an alternative way into the data: this search string retrieves our lemmas
from any annotated part or section (div) of the corpus. Note that we are asking
Sketch Engine to match any upper- or lower-case alphabetical character following
the hashtag symbol ([A-Za-z]), plus any other characters that may follow it (.*).
The information symbol on the left of each concordance line specifies all the
associated metadata. As we saw in chapter 5,⁷ using an ad hoc corpus gives
researchers plenty of freedom and flexibility in terms of how they want to annotate
their texts. A customised annotation will make it dramatically easier to address
our research questions. Remember: a corpus is just an instrument to look at the
data that we deem relevant to our research questions.

A collocational analysis of the results of this search will provide us with a long
and potentially interesting list of collocates that we may want to explore. Within
just the top 80 collocates we find, among others, the following: company (10.12),
different (10.07), but (9.88), together (9.58), people (9.58), they (9.55), balance (9.32), better
(9.20), hard (9.11), job (9.03), good (9.01), always (8.86) and enterprises (8.79). An
analysis of company and balance alone lends support to the following:

• Working in a team is considered to have a positive impact on workers' lives.
• Companies should pursue better work–life balance: it is in their interest.
• International work experience is highly regarded by organisations.
• Face-to-face environments increase quality of life at work.
• It is not possible for those in top management posts to achieve a good work–
life balance.
• Having a family and being a woman worsen your work–life balance.
• Working away from home has a negative impact on the quality of your family
and personal life.

These findings are grounded in some of the principles discussed in chapter 1.
To start with, any corpus needs to display total accountability relative to the
dataset analysed (and accordingly not to other uses or informants, etc.). In
other words, we need to be absolutely clear about how our corpus was collected
and analysed. Our findings can only be applied within the boundaries of
our dataset, and acknowledging the defining design features of a corpus as
an instrument can only strengthen the validity of the research. Second, the
methods discussed in this book involve at some point or other the examination
of concordance lines. We would like to reflect on the precise moment when
the concordance line is examined and whether different examination times
may have implications for our research. In 7.2.1 we started our analysis with
a data-driven exploratory analysis of the aboutness of the corpus by running a
multiword keyword analysis. In other words, we started our examination of
relevant topics in the interviews by looking exclusively at statistically significant
keywords. Note that this is fundamentally different from either content analysis
or theme analysis as seen in ­chapter 2. In keyword analyses, our engagement
with a keyword is mediated by a process of information reduction following
a statistical test; that is, while much interpretive and qualitative research uses
interview data to understand a problem, keyword analysis uses the same data
to explore sites of analysis.
Brindle (2016), for example, used keyword analysis to study the language of
hate in the white supremacist community. He obtained a list of lexical keywords
from the Stormfront⁸ corpus and used the 1.2-billion-word enTenTen12⁹ as a reference
corpus (see section 6.3 for a discussion of keyword analysis as a research method).
Brindle uses the same design as shown in Figure 1.2: while his main interest
lies with the Stormfront corpus, he is deriving results from a cross-examination
with a larger corpus of English that is instrumental in identifying the most rele-
vant topics in white supremacist discourse. Then, collocation analysis is used to
refine his understanding of how these words are used. Brindle (2016: 22) explains
that concordance lines ‘are then employed to facilitate the observation of the
actual contexts of these words in order to comprehend their meanings’. Brindle
categorises the Stormfront keywords into three groups – sexuality, race and
evaluation – which in many ways resemble the kind of result that might derive from
theme analysis. However, the ways in which these findings are arrived at differ
markedly, both ontologically and epistemologically.
An analysis of collocates can provide us with even more fine-grained results.
Brindle used collocation networks to understand these relationships, and plotting
collocates on a network graph can be truly illuminating. Figure 7.6 shows
the relationships among the top 20 keywords in the corpus. Note how some of the
words attract more words than others and, conversely, how some words are more
unidimensional, their association power limited to one or two other
keywords.
Brindle (2016) interpreted some of these relationships in the light of the links
in Figure 7.6.

Figure 7.6 Collocational network of first 20 keywords
Source: Brindle, 2016: 66

The diagram indicates that pedophiles could be considered a nuclear node as it
has a significant association with several other keywords, thereby suggesting
that it is a central theme in the writers' texts on homosexuality, and used in
a way which links other key concepts together. Jew also has a considerable
number of links with the other keywords, thus demonstrating the significance
of the Jewish people to white supremacists.
(Brindle, 2016: 67)

While the visual representation of these relationships is not crucial to the analysis,
we can certainly rely on a diagram to spot these links more easily. #LancsBox¹⁰
(Brezina, Timperley & McEnery, 2018) is a multiplatform software package that can
generate such visual network representations. Using a diagram can helpfully stimulate
our understanding of how a particular phenomenon is represented in a dataset.
Table 7.5 gives a breakdown of how to conduct the complex searches we have
dealt with in this section.

Table 7.5 Skill 17: complex searches

Skill 17
• There is a wide range of possibilities when it comes to searching for information in a corpus. Simple searches will be operationalised as searches for words, strings of words, lemmas or POS tags.
• More complex searches will require you to combine several of these elements: for example, searching for a word as a verb, or searching for a lemma preceded by any adverb that starts with the letter 'r'.
• These searches make use of largely simplified versions of the Regular Expressions language. Regular expressions are patterns used in complex searches to match character combinations in texts. You will need to become familiar with these versions. Here is a reference list for Sketch Engine, AntConc and Mark Davies's sets of corpora online:
  Sketch Engine: https://www.sketchengine.eu/guide/regular-expressions/#toggle-id-2
  AntConc (via http://www.linguisticsweb.org): https://bit.ly/2S78FTv
  CQP tutorial: http://cwb.sourceforge.net/temp/CQPTutorial.pdf
• The most complex searches will combine some of the features above with searches that exploit the markup of the corpus. These searches will let you find text contributed by specific speakers across the structures annotated in the corpus (sentences, utterances, turn-takings, paragraphs, etc.).
• When using Sketch Engine, an understanding of the CQL language is essential: https://www.sketchengine.eu/documentation/cql-basics/
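As a sketch of the second bullet point in Table 7.5, and assuming Sketch Engine's
CQL together with an English tagset in which adverb tags begin with 'R' (as in the
TreeTagger tagset), a lemma preceded by any adverb starting with the letter 'r'
could be retrieved with:

[tag="R.*" & word="r.*"] [lemma="think"]

Here the first token must be an adverb whose word form begins with 'r' (really,
rather, rarely) and the second token must belong to the lemma think; the same
two-token pattern generalises to any combination of POS and word-form conditions.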

7.3  Putting it all together: reviewing skills 12–17


7.3.1  Chapter 5
Consider a research project that you would like to develop in the future. The pro-
ject involves the transcription and the analysis of interviews. Now, think about the
transcription process (see Table 5.2). How has this chapter changed your views on
the implications of adopting a transcription standard?
Go to Figure 5.1. How would you plan both the transcription and the markup
of the data? How can you design your data transcription and markup so that you
maximise the opportunities to exploit the structure(s) of the transcribed file(s)?
(See Table 5.6.)

7.3.2  Chapter 6
In this chapter we explored how keyword analysis (see Table 6.8) can be used
to investigate corpora of policies and long documents. Revisit Figure 6.1 and think
about the implications of your choice of target and reference corpora. Why is
your choice so important?
What do keyness scores tell you about the lexical items in a corpus or in a
document?
You have decided to use keyword analysis to investigate your data. Go to
Figure 6.3 and try to outline step-​by-​step guidelines to look at your data. Can you
envisage the outcome of your analysis?
What are the differences between keyword analysis and n-gram analysis
(Table 6.11)?

7.3.3  Chapter 7
Again, focus on a research project that you would like to do in the future. The
project involves the transcription and the analysis of interviews. Try to identify
some of the themes that you expect to find in the interviews. How would you go
about investigating them? Can you outline a strategy to examine these themes in
your corpus? Where will you start? Which of the linguistic elements in 7.1 may
be relevant?
Ideally, your corpus has been annotated and you can search within attributes
and structures (see ­chapter 5). Can you envisage the outcome of your searches?
How does a plain transcription differ from an annotated corpus? How can com-
plex searches (Table 7.5) help you retrieve results from your dataset?

Notes
1 The entire XML TEI annotated corpus: http://webapps.ael.uni-tuebingen.de/backbone-search/corpora/backbone_english_as_lingua_franca.xml
2 http://webapps.ael.uni-tuebingen.de/backbone-search/
3 Backbone – Corpora for Content & Language Integrated Learning. EU Lifelong Learning Programme (grant agreement 143502-LLP-1-2008-1-DE-KA2-KA2MP) 2008–10. The overall project objective was to offer do-it-yourself e-learning solutions based on a pedagogic corpus approach involving spoken interviews on a wide range of topics. Special attention is given to lesser taught languages, to regional, sociocultural and subject-related varieties of more frequently taught languages, as well as to English as a Lingua Franca.
4 Other languages may not necessarily behave in exactly the same way.
5 Norman Fairclough is the father of textually oriented discourse analysis and modern critical discourse analysis. He is the author of, among others, Language and Power (1989) and Critical Discourse Analysis (1995).
6 LogDice statistic.
7 Table 5.3 offers some basic principles and tips to annotate and query your data using your own annotation taxonomy.
8 According to Wikipedia.org, 'Stormfront is a white nationalist, white supremacist, antisemitic, Holocaust denialist, and neo-Nazi Internet forum, and the Web's first major racial hate site. In addition to its promotion of Holocaust denial, Stormfront has increasingly become active in the propagation of Islamophobia'. https://en.wikipedia.org/wiki/Stormfront
9 This is a crawled corpus obtained from the Internet. www.sketchengine.eu/ententen-english-corpus/
10 Download available from: http://corpora.lancs.ac.uk/lancsbox/download.php

References
Brezina, V., Timperley, M. & McEnery, T. (2018). #LancsBox v. 4.x [software]. Available
at: http://​corpora.lancs.ac.uk/​lancsbox.
Brindle, A. (2016). The language of hate: A corpus linguistics analysis of white supremacist language.
London: Routledge.
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor & Francis.
Ellis, N. (2020). Usage-based theories of construction grammar: Triangulating corpus linguistics and psycholinguistics. In Egbert, J. & Baker, P. (Eds.) Using corpus methods to triangulate linguistic analysis. London: Routledge, 239–267.
King, N., Horrocks, C. & Brooks, J. (2019). Interviews in qualitative research. 2nd edition.
London: Sage Publishing Company.
Kohn, K. (2012). Pedagogic corpora for content and language integrated learning. Insights
from the BACKBONE Project. The Eurocall Review, 20(2), 3–​22.
Locke, T. (2004). Critical discourse analysis. London: Bloomsbury.
Staples, S. H. (2015). Spoken discourse. In Biber, D. & Reppen, R. (Eds.) The Cambridge
handbook of English corpus linguistics. Cambridge: Cambridge University Press, 271–​291.
Thompson, G. (2013). Introducing functional grammar. London: Routledge.
Chapter 8

Conclusion

This book has provided a practical introduction to the use of corpus linguis-
tics methods in education research. We have presented an entry-​level research
guide to the world of corpus linguistics for those who do not necessarily have
a linguistic background or have not used corpora before. Our discussion aimed
to bridge the gap between educational research, admittedly a massively diverse
area of practice, and corpus linguistics, a very specific area with a huge poten-
tial to become a useful set of research methods across a variety of disciplines.
Despite the methods’ potential, however, we find two major challenges to this
collaboration.
The first of these challenges is inherent to the notion of interdisciplinary work.
McEnery, Brezina, Gablasova & Banerjee (2019) have stressed that, despite the
promise of collaboration, working across disciplinary boundaries has long been
acknowledged to be difficult. This difficulty is expressed in the absence of corpora
in major publications on research methods in education (e.g. Cohen, Manion &
Morrison, 2018; Gray, 2018). The fact that corpus methods may be either ignored
or seen as out-​of-​bounds in education research reinforces the perceived distance
between those practising educational research in the inner circle and those con-
tributing from the outside.
The second challenge is more elusive and difficult to address. Corpus linguistics
methods are seen as essentially quantitative, at least in linguistics research and,
particularly, in research involving lexicography and the building of representative
corpora. The use of representative corpora in CL research has received much
scholarly attention and, arguably, most linguists make use of corpora as proxies
for usage based on the sophisticated design features of their corpus of choice.
However, CL-inspired research that exploits some of the CL research methods
with more modest corpora is also relevant across educational contexts, and par-
ticularly in second language education.
The use of corpus linguistics as a complementary research method to inform
the qualitative analysis of language has rarely been discussed in education, where
its use in research-methods design has been limited. The perception that corpus
linguistics is a quantitative methodology may put off education researchers who
work within an interpretive paradigm and who may feel that corpus methods, as
post-positivist practice, may be useless in their context:

Positivism strives for objectivity, measurability, predictability, controllability,
patterning, the construction of laws and rules of behaviour, and the ascrip-
tion of causality; interpretive paradigms strive to understand and interpret
the world in terms of its actors. In the former, observed phenomena are
important; in the latter, meanings and interpretations are paramount. Giddens
(1976) describes this latter as a 'double hermeneutic', where people strive to
interpret and operate in an already interpreted world; researchers have their
own values, views and interpretations, and these affect their research, and,
indeed, that which they are researching is a world in which other people act
on their own interpretations and views.
(Cohen, Manion & Morrison, 2018: 51)

In this book we have tried to provide an account that takes stock of the main-
stream quantitative tradition in corpus linguistics, while presenting opportun-
ities for collaboration and mixed-​methods research to emerge. Using Egbert &
Baker’s (2020) classification of when corpus methods are used in triangulation,
our approach can be described either as sequential or cyclical. Sequential CL
methods were mainly used in this book when corpora were devised as the main
research data collection method. On the other hand, most of the discussions in
this book seem to favour the use of CL methods as cyclical. Our model for the
examination of concordance lines (Figure 7.5) is an example of this approach.
We need more reflection and conversations with educational researchers in order
to understand better how these two approaches can be used in their research
designs and how they can contribute to the use of mixed-​methods or CL-​only
approaches.
Apart from these challenges, we want to emphasise the many opportunities that
lie ahead for educational researchers who are interested in corpus linguistics. CL
methods can help them with the triangulation and validation of their research: tri-
angulation of methods and also of datasets, as well as validation of results and an
evaluation of researcher bias. We note that many of the data collection methods
used in education, such as interviews or focus groups, are likely to be explored
either automatically or semi-automatically by means of CL methods such as col-
location or keyword analysis. The use of policy documents and media texts is
absolutely central to existing CL work in other areas of research such as the social
sciences. Moreover, this book has provided plenty of examples where CL methods
have already been used in education.
We believe that the best applications of CL methods to educational research are
yet to come. We hope that this book will contribute to extending the popularity of
corpus linguistics outside its area of specialisation.
Table 8.1 offers a summary of our crucial, final skill: remaining critical.

Table 8.1 Skill 18: remaining critical


Skill 18 • Corpus linguistics is a very recent research area. McEnery
& Hardie (2012: 1) note that ‘the procedures themselves
are still developing, and remain an unclearly delineated
set –​though some of them, such as concordancing, are well
established and are viewed as central to the approach’.
• CL methods will very likely become widely used across
the social sciences in the forthcoming years. The Big Data
revolution will knock on our door for sure, but corpus
linguistics has already started to speak this language. Now
we use corpora of billions of words and can retrieve data
from these datasets in no time.
• It seems that increased computational power and data
will be more and more readily available to researchers.
However, we need to remain highly critical of the
implications of both existing and new research methods, in
particular fully automatic analysis of language.
• Keyword analysis is an excellent example of why it is absolutely
essential that researchers understand how results need to be
interpreted, and of the impact of the reference corpus on
those results.
• An understanding of how we want to use CL methods
in education research needs to consider some of the issues
underlying the methods discussed in Figures 1.4, 3.1 and 7.5.

References
Cohen, L., Manion, L. & Morrison, K. (2018). Research methods in education. London: Taylor & Francis.
Egbert, J., & Baker, P. (Eds.). (2020). Using corpus methods to triangulate linguistic analysis.
London: Routledge.
Gray, D.E. (2018). Doing research in the real world. 4th Edition. London:  Sage Publications
Limited.
McEnery, T & Hardie, A. (2012). Corpus linguistics:  method, theory and practice. Cambridge:
Cambridge University Press.
McEnery, T., Brezina, V., Gablasova, D. & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyse language use. Annual Review of Applied Linguistics, 39, 74–92.
Index

Note: Page numbers in italic denote figures and in bold denote tables.

annotation 90–91, 91, 92–113, 96, 98, 101, 108, 109, 110, 114–115; see also part-of-speech (POS) tagging
AntConc software 23, 43, 43, 52–53, 53, 54, 55, 58; annotation 95–96, 96; comparison 76, 77, 78, 79, 80; keyword analysis 120, 125–126; n-grams 146
Anthony, Laurence 43, 78
association measures 57
Aston, Guy 27–28
Atkinson, J.M. 25
Australia, early childhood education (ECE) policy 65–69, 68, 71
automated content analysis 23
automated transcription 88, 92–93
average reduced frequency (ARF) 85

Backbone Corpus of English as a Lingua Franca 150–151, 151, 154–163, 154, 156, 157, 159
Bailey, J. 91–92
Baker, Paul 12–14, 117, 133, 134, 138, 140–141, 169
Barroso, J. 20
Bednarek, M. 3
Berber-Sardinha, T. 121–124
Bergmann, J.R. 24–25
Bergström, G. 25
Biber, Douglas 4, 7, 16, 29, 29, 30–32, 30, 73
BNC see British National Corpus (BNC)
Bond, M. 23
Brat software 93
Brenchley, M. 35
Brezina, Vaclav 52, 55, 85, 117, 126, 134
Brindle, A. 163–165, 164
British Academic Written English Corpus (BAWE) 3–4, 5, 16–17
British Journal of Educational Technology (BJET) 23
British Law Reports Corpus 124–125, 124
British National Corpus (BNC) 10, 15, 16, 17, 51, 97, 119, 125, 125, 129–133, 131–132, 139–140, 141, 141, 144
Brown Corpus 1
Burnard, Lou 96–97
Bybee, J. 11, 12

Callies, M. 28
Canada, financial literacy education policy 69–71, 71, 71
Cheung, L. 4, 5
Child Language Data Exchange System (CHILDES) project 90–91, 99–100
children's literature, keyword analysis 142–146, 143
chi-square test 120, 125, 126
CLAWS tagset 75–76
CLiC (Corpus Linguistics in Cheshire) 142–143, 143, 144–145
Coates, J. K. 23–24
coding 90–91, 91, 92–113, 96, 98, 101, 108, 109, 110, 114–115; in content analysis 20–22
Cohen, L. 8, 9, 9, 20, 21, 89–90, 152, 169
colligational analysis 133, 134, 135–142, 135, 136, 137, 140, 141, 141, 142
collocation analysis 28, 56–61, 56, 57, 58, 59, 60, 70–71, 72, 85, 139, 163, 164
comparison 73–83, 77, 79, 80, 81, 82, 84, 85
complex searches 153–165, 154, 154, 155, 156, 157, 158, 159, 160, 165
computer-assisted content analysis 23
computer-assisted transcription 88, 92–93
concordance lines 28, 38–39, 42–50, 43, 44, 44, 45, 46, 46, 49, 50, 134
Conrad, S. 4, 7, 29, 29, 30, 30, 73
constructivist paradigm 8–9, 9, 152
content analysis 20–23, 39
conversational analysis (CA) 24–25
corpora: creation and design 64–72, 67, 68, 71, 71, 72; existing corpus use 40–42, 42; as primary data 14, 15, 37–38, 65–69; as secondary data 14–15, 15, 37–39, 65, 69–72; size 51–52, 64–65
Corpus del Español 51, 52
corpus linguistics, defined 1
corpus linguistics research methods 35–61; collocation analysis 28, 56–61, 56, 57, 58, 59, 60, 70–71, 72, 85, 139, 163, 164; comparison 73–83, 77, 79, 80, 81, 82, 84, 85; as complementary research methods 14–15, 15, 37–39, 65, 69–72; concordance lines 28, 38–39, 42–50, 43, 44, 44, 45, 46, 46, 49, 51, 134; definitions and overview 1–7, 27–29; existing corpus use 40–42, 42; genre–activity continuum 37–38, 37; interpreting frequencies 16–17, 18, 50–55, 52, 53, 54, 55; as main research methods 14, 15, 37–38, 65–69; register analysis 29–32, 29, 30, 31; role in education research 9–10; use outside linguistics 35–38; see also keyword analysis
Corpus of Contemporary American English (COCA) 51, 52
corpus-assisted discourse analysis (CADA) 13–14
CQL (corpus query language) searches 107–108, 108, 154–159, 154, 155, 156, 157, 159, 162
critical, remaining 170
critical discourse analysis (CDA) 25, 26
Crosthwaite, P. 4, 5
Crotty, M. 8
CSV files 47
Culpeper, J. 10, 119, 120–121, 125, 126

data collection 65, 67–72, 68
Data Protection Act, UK 40
Davies, Mark 5, 5, 51
Day, C. 100–101
Demmen, J. 10, 119, 120–121, 125, 126
discourse analysis (DA) 25–27; corpus-assisted discourse analysis (CADA) 13–14; critical discourse analysis (CDA) 25, 26
dispersion 55
Durrant, P. 35, 41

early childhood education (ECE) policy, Australia 65–69, 68, 71
Economic and Social Research Council (ESRC) 40
Education Act 2011, UK 6–7
education policy research 64–83; early childhood education (ECE) policy, Australia 65–69, 68, 71; financial literacy education policy, Canada 69–71, 71, 71; international education policies, UK and New Zealand 73–83, 77, 79, 80, 81, 82; university language policies, Spain 39
education research, definitions and overview 8–10, 8
Egbert, J. 169
Ellis, N. 11–12, 150
English as a Medium of Instruction (EMI) 39
English TreeTagger PoS tagset 75
English-Corpora.org TV Corpus 2–3, 5, 5
Evert, S. 56, 57
EXMARaLDA Partitur Editor 92

Fenech, M. 65–69, 68, 71
Fest, J. 38–39
file formats 47
financial literacy education policy, Canada 69–71, 71, 71
Flynn, N. 22
Folia software 92
Forest School approach 23–24
Francis, Nelson 1
French Web corpus 51, 52
frequency 11–17; dispersion and 55; importance of 15–16, 17; in individual texts or groups of texts 14–15; interpretation of 16–17, 18, 50–55, 52, 53, 54, 55; in language learning and use 11–12, 35; normalised 16–17, 18, 52–53, 55, 85; in public discourse 12–14; raw 7, 16, 18, 55; relative 52–53, 55
Fuentes, A.C. 119
functional linguistics approach 16

Gablasova, D. 35
Gabrielatos, C. 126
Gardner, S. 3–4
genre–activity continuum 37–38, 37
Gray, D.E. 8, 8, 9, 29, 88
Growth in Grammar Corpus (GIGC) 41–42, 42
Gu, Q. 40–41
The Guardian corpus of education and inclusion 46, 48–50, 49, 52–55, 53, 54, 57–60, 57, 58, 59

Habermas, Jürgen 72
Hadley, G. 24
Halliday, M.A.K. 57
Hardie, Andrew 4–5, 6, 16–17, 27, 35, 42, 43, 56, 145–146, 170
Hennessy, S. 21
Heritage, J. 25
Hunston, S. 29

Inscribe software 92
international education policies, UK and New Zealand 73–83, 77, 79, 80, 81, 82
interview data: annotation 90–91, 91, 92–113, 96, 98, 101, 108, 109, 110, 114–115; complex searches 153–165, 154, 154, 155, 156, 157, 158, 159, 160, 165; transcription 88–96, 91, 94–95, 96, 97
iWeb corpus 2

Juola, P. 144

Kellams, S. 21
keyword analysis 7, 10, 14, 28, 73, 85, 117–133, 124, 126, 134, 134; British Law Reports Corpus 124–125, 124; British National Corpus (BNC) 125, 125, 129–133, 131–132, 139–140, 141, 141, 144; children's literature 142–146, 143; colligational analysis 133, 134, 135–142, 135, 136, 137, 140, 141, 141, 142; keyness defined 119; keyness scores 120, 124–127, 124, 125, 126, 128, 130, 131–132; multiword keywords 120–121, 129–133, 130, 131–132, 139–142, 140, 141, 141, 142, 154, 154; n-grams 16, 142–143, 143, 144–146, 146; nouns and noun phrases 133, 135–142, 135, 136, 137, 140, 141, 141, 142; peace agreements corpus 121–133, 122–123, 128, 130, 131–132, 135–142, 137, 140; Stormfront corpus 163–165, 164
King, N. 89, 93
Kitzinger, C. 25
Kolesnikova, O. 57
KWIC (keyword-in-context) 43, 43, 157, 158, 160

language learning 11–12, 35
Leavy, P. 88
Leech, Geoffrey 1
lemmas, defined 44
Leximancer software 23
lexis see keyword analysis
Locke, T. 152
LogDice measure 57, 135
log-likelihood test 120, 125, 126
Louvain International Database of Spoken English Interlanguage (LINDSEI) 93, 94–95

McEnery, Tony 1, 4–5, 6, 16–17, 27, 35, 42, 43, 56, 145–146, 168, 170
MacWhinney, Brian 90–91
Mahlberg, M. 145
Marchi, A. 126
material processes 153
Matthiessen, M.I.M. 57
Mauranen, A. 133
Mautner, G. 10
MAXQDA software 22, 22
mental processes 153
Mercer, N. 24, 25, 36–37
metadata 90–91, 91, 92–113, 96, 98, 101, 108, 109, 110, 114–115
multiword keywords 120–121, 129–133, 130, 131–132, 139–142, 140, 141, 141, 142, 154, 154
Mutual Information (MI) scores 57–58, 57, 60–61, 60

narrative policy analysis (NPA) 69–71, 71, 71
National Quality Framework (NQF), Australia 65–69, 68, 71
Nelson, M. 67
Nesi, H. 3–4
New Zealand, international education policies 73–83, 77, 79, 80, 81, 82
n-grams 16, 142–143, 143, 144–146, 146, 151
normalised frequency 16–17, 18, 52–53, 55, 85
nouns and noun phrases 133, 135–142, 135, 136, 137, 140, 141, 141, 142
NVivo software 22

objectivism 29
orthographic transcription 90, 93
Orwell, George 43, 43, 44, 45, 45
Otranscribe software 92
Oxford Children's Corpus (OCC) 143–144
Oxford Text Archive 15

Parker, I. 25–26
part-of-speech (POS) tagging 14, 75–83, 77, 79, 80, 84, 154–157, 155, 157
peace agreements corpus, keyword analysis 121–133, 122–123, 128, 130, 131–132, 135–142, 137, 140
Pérez-Paredes, P. 14
phenomenological paradigm 8–9, 9
Pimlott-Wilson, H. 23–24
Pinto, L.E. 69–71, 71, 71
POS see part-of-speech (POS) tagging
positivism 8, 9, 9, 27, 169
primary data, using corpora as 14, 15, 37–38, 65–69
Pring, R. 8–9, 8
prosodic transcription 90

Quirk, Randolph 1

raw frequency 7, 16, 18, 55
Rayson, P. 144
register analysis 29–32, 29, 30, 31
relational processes 153
relative frequency 52–53, 55
Reppen, R. 64–65, 89, 90
representativeness 28–29, 64
Rogers, R. 25, 26

Sandelowski, M. 20
scientific paradigm 8, 9, 9
Scott, Mike 7, 35–36, 100, 119, 120, 134
Sealey, A. 144
search engine optimization (SEO) 118
second language learning 35
secondary data, using corpora as 14–15, 15, 37–39, 65, 69–72
Silverman, D. 93
Sinclair, J. 27, 43–44, 56, 68, 133
Siyanova-Chanturia, A. 56–57
Sketch Engine software 43, 44, 47, 46, 55, 57; annotation 98–100, 107–109; complex searches 154–163, 154, 155, 156, 159, 165; CQL (corpus query language) searches 107–108, 108, 154–159, 154, 155, 156, 157, 159, 162; keyword analysis 120, 121, 124–125, 126, 127, 133, 135, 135, 136, 137, 139–142; n-grams 146; part-of-speech (POS) tagging 74, 75, 76, 77, 84; Word Sketch function 135, 135, 136, 137, 141–142
sociocultural theory (SCT) 25
software: Brat 93; EXMARaLDA Partitur Editor 92; Folia 92; Inscribe 92; Leximancer 23; MAXQDA 22, 22; NVivo 22; Otranscribe 92; TagAnt 78, 79, 84; Transcribe 92–93; UAM Corpus Tool 92; WMatrix 83, 120, 144; WordSmith Tools 35–36, 66, 120; see also AntConc software; Sketch Engine software
software-aided content analysis 23
software-aided transcription 88, 92–93
Spain, university language policies 39
Spina, S. 56–57
spoken language 150–165; complex searches 153–165, 154, 154, 155, 156, 157, 158, 159, 160, 165; linguistic perspective 150–153
Stamatatos, E. 144
Stanford Part of Speech Tagger 75
Staples, S.H. 151
statistical tests 39, 57, 66, 81–83, 85
Statistics and Registration Services Act, UK 40
Stormfront corpus 163–165, 164
Stubbs, M. 28–29, 35
Swales, John 27–28
Sydney Corpus of Television Dialogue (SydTV) 3

TagAnt software 78, 79, 84
tagging: part-of-speech (POS) 14, 75–83, 77, 79, 80, 84, 154–157, 155, 157; see also annotation
talk see spoken language
TenTen corpora 52
text analysis approaches 20–27, 152; content analysis 20–23, 39; conversational analysis (CA) 24–25; discourse analysis (DA) 25–27; register analysis 29–32, 29, 30, 31; theme analysis 20, 23–24, 38, 56
Text Encoding Initiative (TEI) 76–78, 97, 98, 100–104, 101, 108–113
theme analysis 20, 23–24, 38, 56
Thompson, G. 152–153
Thompson, P. 144
Thurow, Shari 118
The Times corpus of education and inclusion 46, 46, 46
Timmis, I. 9–10
Tognini-Bonelli, E. 13
tokens, defined 46
total accountability principle 4–5, 5
Transcribe software 92–93
transcription 88–96, 91, 94–95, 96, 97; see also annotation
transitivity 152, 153
Tribble, C. 7, 36, 100, 119
T-scores 57, 59–61, 59, 60
T-SEDA project 21
TV Corpus 2–3, 5, 5
Type Token Ratio (TTR) 74–75
types, defined 46

UAM Corpus Tool 92
UK, international education policies 73–83, 77, 79, 80, 81, 82
UK Data Service 40–41
university language policies, Spain 39

Viana, V. 1, 27–28, 36
Villares, R. 39
Vrikki, M. 21–22

web services, transcription 92–93
Wilkins, D.P. 65–69, 68, 71
Wilkinson, S. J. 25
Wilson, Andrew 1
WMatrix software 83, 120, 144
Woodside-Jiron, H. 26–27
Wooffitt, R. 24
WordSmith Tools 35–36, 66, 120

XLS files 47
XML files 47
