Statistics in Corpus Linguistics

Statistics in Corpus Linguistics: A Practical Guide, by Vaclav Brezina,
Cambridge, Cambridge University Press, 2018, xix+296 pp., £21.99
(paperback), ISBN: 978-1107565241
Article in Journal of Quantitative Linguistics · July 2019

DOI: 10.1080/09296174.2019.1646069
1 author:
Cunxin Han
Jimei University

Journal of Quantitative Linguistics
ISSN: 0929-6174 (Print) 1744-5035 (Online) Journal homepage: https://www.tandfonline.com/loi/njql20
Statistics in Corpus Linguistics: A Practical Guide
Cunxin Han
To cite this article: Cunxin Han (2019): Statistics in Corpus Linguistics: A Practical Guide, Journal
of Quantitative Linguistics, DOI: 10.1080/09296174.2019.1646069
To link to this article: https://doi.org/10.1080/09296174.2019.1646069
Published online: 29 Jul 2019.
JOURNAL OF QUANTITATIVE LINGUISTICS
BOOK REVIEW
Statistics in Corpus Linguistics: A Practical Guide, by Vaclav Brezina,
Cambridge, Cambridge University Press, 2018, xix+296 pp., £21.99
(paperback), ISBN: 978-1107565241
Corpus linguistics is a powerful quantitative methodology that relies heavily
on frequency data and statistical procedures. It is difficult to talk about corpus
linguistics without mentioning statistical measures based on frequency and
distribution. Many learners of linguistics find corpus linguistics a worthwhile
method of study because its grounding in statistics offers replicable and
highly systematic forms of analysis. However, ‘not all corpus-based studies
utilize the available statistical methods to their fullest extent’ (Gries, 2010,
p. 269). In this sense, Brezina’s book is a timely contribution because hands-on
books are rarely found in the literature. The first book dedicated to this field,
Statistics for Corpus Linguistics (Oakes, 1998), is already 20 years old, while a
more recent book, Corpus Linguistics and Statistics with R: Introduction to
Quantitative Methods in Linguistics (Desagulier, 2017), caters more to the needs
of advanced researchers. The book under review is both up to date and accessible
because it is aimed at researchers of all levels and is written in a
non-specialist style.
This book is primarily written to help readers understand the key
principles of statistical thinking so that they can apply those principles to
their own research. It can be divided into three parts. The first (Chapter 1)
offers an introduction to corpus linguistics, statistics and their
interrelationship. The second (Chapters 2–8) employs corpus statistics to address
various research issues ranging from vocabulary to register variation. Each
chapter contains an introduction, an application, accompanying exercises, a brief
summary of key points, and a list of references for advanced reading. In the
third, the author concludes the book with final remarks, references and an index
of statistical techniques.
Chapter 1 begins by introducing readers to the relationship between statistics
and corpus linguistics. It initiates a discussion of the role of statistics in
scientific research and corpus linguistics in particular. The author argues that
‘statistics is crucial for corpus linguistics because it helps us work effectively
with quantitative information’. After that, basic terms in corpus statistics are
introduced and explained. Next, the creation of corpora and research design, as
well as data exploration and visualization, are discussed. A real-life research
example is provided to illustrate the application of statistics in corpus research.
Finally, information about the companion website is given.
Chapter 2 deals with the most fundamental topic in corpus linguistics,
vocabulary. It starts with a question that is fundamental to the measurement of
frequencies, i.e. what is a word? Other terms relating to words in corpus
linguistics are then discussed: token, type and lemma. Some simple statistical
techniques used to measure the frequencies and distribution of words and phrases
are introduced afterwards. The chapter ends with measures of lexical
diversity.
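The notions of token, type, relative frequency and lexical diversity mentioned above can be illustrated with a minimal Python sketch. It is not taken from the book: the tokenizer is a deliberately naive answer to the question of what a word is, and the function names and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens (a naive definition of 'word')."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequency(count, corpus_size, basis=1_000_000):
    """Normalize an absolute frequency to a common basis (per million tokens by default)."""
    return count / corpus_size * basis


def type_token_ratio(tokens):
    """Simple lexical diversity measure: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


# Hypothetical toy 'corpus' used only to show the calculations.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)
freq = Counter(tokens)

print(len(tokens), "tokens,", len(freq), "types")
print("TTR:", round(type_token_ratio(tokens), 3))
print("'the' per 100 tokens:", round(relative_frequency(freq["the"], len(tokens), basis=100), 2))
```

The type-token ratio is sensitive to text length, so on real corpora it is usually computed on samples of equal size or replaced by a standardized variant.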
Chapter 3 shifts to the discussion of semantics and discourse, looking at
words in context. This chapter rests on a simple premise, namely that ‘word
meanings can best be investigated through the repeated linguistic patterns in
corpora’. Methods for measuring repeated occurrences and co-occurrences,
including keywords and collocation, are presented. The chapter first discusses
the concept of collocation, i.e. ‘combinations of words that habitually co-occur
in texts and corpora’, offering an overview of 14 association measures of
collocation. The author holds that the choice of a specific association measure
depends on the type of collocations we are interested in, which in turn
depends on our research design. Ways of visualizing collocation, i.e.
collocation graphs and networks, are given particular emphasis in the part that
follows. Then, the chapter moves on to a detailed explanation of keywords and
lockwords, concluding with inter-rater agreement measures, which
are relevant to the reliability of manual coding.
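To make the idea of an association measure concrete, the sketch below computes two widely used ones, the MI score and log Dice, from raw frequencies. These are the standard textbook formulations rather than a reproduction of the book's own equations; the optional window-size correction in the expected frequency and all the numbers in the example are assumptions.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size, window=1):
    """Expected co-occurrence frequency under independence.

    The window factor is an assumed correction for the collocation span;
    with window=1 this is the plain f_node * f_collocate / N.
    """
    return f_node * f_collocate * window / corpus_size


def mi_score(observed, f_node, f_collocate, corpus_size, window=1):
    """Mutual information: log2 of observed over expected co-occurrence frequency."""
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size, window)
    return math.log2(observed / expected)


def log_dice(observed, f_node, f_collocate):
    """Log Dice: 14 + log2 of the Dice coefficient; it does not depend on corpus size."""
    return 14 + math.log2(2 * observed / (f_node + f_collocate))


# Invented frequencies: node occurs 1,000 times, collocate 500 times,
# they co-occur 50 times in a corpus of one million tokens.
mi = mi_score(50, 1000, 500, 1_000_000)
ld = log_dice(50, 1000, 500)
print(f"MI = {mi:.2f}, log Dice = {ld:.2f}")
```

The MI score tends to reward rare, exclusive combinations, whereas log Dice is less affected by low frequencies, which is one reason the choice of measure depends on the kind of collocation under study.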
Chapter 4 looks at lexico-grammatical features in language, for example,
modality. It starts with a discussion of two approaches to lexico-grammar in
corpora, the ‘Whole corpus’ research design and the ‘Linguistic feature’ design.
Following that, this chapter shows how cross-tabulation can be used to
summarize lexico-grammatical variation. Some measures that are calculated based
on cross-tabulation summary tables are also explained, for example, simple
percentage, the chi-square test and logistic regression. Since the latter is
a comparatively complex model, it is explained at length including the method
of calculation and the interpretation of outputs.
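As an illustration of a measure computed from a cross-tabulation, the following sketch calculates the chi-square statistic, its p-value for a 2 × 2 table, and Cramér's V as an effect size. The counts are invented and the function names are hypothetical; the p-value shortcut via the complementary error function is valid only for one degree of freedom.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def cramers_v(table, stat):
    """Cramer's V effect size derived from the chi-square statistic."""
    total = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(stat / (total * (k - 1)))


# Invented counts: rows are two registers, columns are two variants of a feature.
table = [[30, 70], [60, 40]]
stat = chi_square(table)
print(f"chi2 = {stat:.2f}, p = {p_value_df1(stat):.5f}, V = {cramers_v(table, stat):.2f}")
```

Reporting an effect size such as Cramér's V alongside the test statistic matters because, with large corpora, even trivial differences can reach statistical significance.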
Chapter 5 focuses on register variation. To begin, a group of statistical
methods is introduced for the simultaneous analysis of several linguistic
variables that are found to be characteristic of certain texts and registers.
Next, the chapter explores the interrelationship between two linguistic variables
via correlation. Two basic kinds of correlation, Pearson’s correlation and
Spearman’s correlation, are explained. In the author’s view, Pearson’s correlation
is designed for scale variables, while Spearman’s correlation handles
ordinal variables. Following that, the technique of hierarchical agglomerative
clustering is introduced for the classification of words, texts, registers, etc.,
and the interpretation of the output of the cluster analysis is
explained. Finally, multidimensional analysis (MD), a sophisticated methodology
devised by Douglas Biber, is introduced; it characterizes particular registers
by means of factors reduced from a range of variables. Step-by-step guidance on
MD follows, from ‘the initial process of variable selection through to the
interpretation of factor loadings and dimension plots’.
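The difference between the two correlation coefficients can be shown in a few lines of Python. This is a minimal sketch under the usual definitions: Spearman's coefficient is computed as Pearson's coefficient applied to ranks, with tied values receiving the average of their ranks. The data are invented and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's r for two equally long sequences of scale values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: a monotonic but non-linear relationship between two variables.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson r = {pearson(x, y):.3f}, Spearman rho = {spearman(x, y):.3f}")
```

Because the relationship in the example is perfectly monotonic but not linear, the rank-based coefficient reaches 1 while Pearson's r stays slightly below it.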
Chapter 6 shifts to two branches of linguistics other than corpus linguistics,
i.e. sociolinguistics and stylistics, both of which rely heavily on statistical
methods. In this chapter, techniques such as the t-test, ANOVA, the
Mann–Whitney U test, the Kruskal–Wallis test, correspondence analysis and
mixed-effects models are discussed, along with their respective value for
different types of research. The t-test, ANOVA, the Mann–Whitney U test, and the
Kruskal–Wallis test are applicable to group comparisons, while correspondence
analysis can be used to explore individual linguistic style. For the traditional
(Labovian) strand of sociolinguistic research with a focus on variation, mixed-
effects models are recommended.
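A group comparison of the kind listed above can be sketched as follows. The review does not say which variant of the independent samples t-test the book uses, so Welch's version (which does not assume equal variances) is an assumption here, as are the function names and the made-up relative frequencies for two speaker groups. Cohen's d is added as an accompanying effect size.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom for two independent samples."""
    var_a, var_b = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_sd = math.sqrt(((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2))
    return (mean(a) - mean(b)) / pooled_sd


# Invented relative frequencies of a feature, one value per speaker.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [9.0, 10.0, 8.0, 11.0, 7.0]

t, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}, Cohen's d = {d:.2f}")
```

The p-value would be read from the t distribution with the computed degrees of freedom; it is omitted here to keep the sketch free of external dependencies.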
Chapter 7 discusses the statistical procedures that can trace the development
of a linguistic variable over time in corpora. It starts with a discussion of the
specific features that apply to diachronic studies. Techniques for
visualizing diachronic change are then introduced, for instance
line graphs, boxplots and error bars, sparklines and candlestick plots. Next,
a specific type of cluster technique, neighbouring cluster analysis, is introduced
for diachronic application. Finally, the chapter presents a method called Usage
Fluctuation Analysis (UFA), which can be used to identify statistical peaks and
troughs in a diachronic data set. This method is found to be particularly
effective in the identification of extreme points in time, which may suggest
a dramatic change in discourse.
Chapter 8 is the concluding chapter, which summarizes the
discussion in this book. First, it reviews the statistical knowledge discussed
in the book, from which it draws 10 key principles of statistical thinking
applied to corpora. Second, meta-analysis, a new way of systematically
integrating the results of multiple studies, is introduced. The author points
out that this method can integrate results from multiple studies into a single
mathematical synthesis and so help us gain a better understanding of research
results in our field, but ‘its application in corpus linguistics is still
problematic due to the general lack of reporting of effect size measures’.
Therefore, the author calls for standardized reporting of effect sizes in corpus
research. Finally, the chapter offers a review of common effect size measures
and guidance on their interpretation.
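The idea of a single mathematical synthesis can be illustrated with the simplest meta-analytic model, a fixed-effect inverse-variance average of effect sizes. This is a generic sketch, not the procedure described in the book: the effect sizes and variances are invented, the function name is hypothetical, and a random-effects model would often be more appropriate for heterogeneous studies.

```python
import math


def fixed_effect_summary(effects, variances):
    """Pool effect sizes by weighting each study by the inverse of its variance.

    Returns the pooled effect, its standard error and a 95% confidence interval.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)


# Invented effect sizes (e.g. Cohen's d) and their variances from three studies.
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.01, 0.04]

pooled, se, ci = fixed_effect_summary(effects, variances)
print(f"pooled d = {pooled:.2f}, SE = {se:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The calculation needs an effect size and a variance from every study, which is why missing effect size reports make this kind of synthesis difficult.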
Overall, the book offers an overview of state-of-the-art corpus linguistic
methodologies of language analysis and updates our understanding of statistics
in corpus linguistics. To achieve this aim, a wide range of statistical techniques
is presented, including some that are relatively new in corpus
linguistics, for instance mixed-effects models. Additionally, the book sheds
light on the theories underpinning each statistical method, which are
often not clearly explained elsewhere because statistical procedures are often
embedded in the tools for corpus analysis and thus under-explained.
One strength of this book lies in its organizational strategies. Every
chapter begins with a brief introduction and several research questions.
Readers may gain a better understanding of the statistical method
under discussion by bearing these questions in mind as they read.
Next, the ‘Think out’ part preceding each subsection prepares readers
to connect with the statistics under discussion and so achieve a better learning
outcome. Furthermore, the statistical techniques presented in this book are
organized according to their complexity, from the simplest frequency count
to the most complex analysis models. Statistical techniques are also
illustrated with multiple examples in each chapter, and an example of reporting
statistics follows each statistical method and formula, which enables
readers to apply it directly to their own research. Finally, readers may find it
very convenient to grasp the key points of each chapter in the ‘Things to
remember’ section.
Another notable strength of this book is that it strikes a good balance
between theory and practice. First, it presents many statistical methods as
well as their applications in various types of research. Second, a selection of
exercises after each chapter is designed to enhance readers’ understanding
of the theory. Finally, all the research exemplified in this book can be easily
replicated using the Lancaster Stats Tools online, which offers readers
additional practice.
The reviewer identified the following weaknesses in this book. First,
although the author claims that ‘No prior knowledge of statistics is assumed;
instead all necessary concepts and methods are explained in non-technical
language’, the reviewer, despite having some statistical knowledge, still had
some difficulty understanding some of the technical terms and equations.
For instance, the application part of Chapter 1 employs a statistical method,
the independent samples t-test, which is not addressed until
Chapter 6. Beginners may find it difficult to understand
the interpretation of the results without background knowledge of the test.
Second, some research examples are drawn from the author’s real life and are
intended to enliven a dry subject, but some readers may not take them as serious
research. For example, the application example for Chapter
6 is ‘Who is this person from the White House?’ Finally, the research
discussed in Chapter 6 extends to sociolinguistics and stylistics, which
does not echo the theme set for this book, that is, statistics in corpus
linguistics. However, some readers may regard this as a strength rather than
a weakness.
Despite these few issues, Statistics in Corpus Linguistics has achieved its
aim of giving readers a practical guide to the statistics used in corpus
linguistics. It should be a worthwhile and rewarding read not only for novice
but also for experienced researchers in linguistics, sociology and other areas
of social research.
Acknowledgments
I would like to thank Professor Paul Baker for his comments on an earlier version
of this review.
Funding
This review was supported by the National Social Science Foundation of China
(Grant No. 17BYY066) and a grant from the Ph.D. Start-up Fund of Jimei
University (Grant No. Q201504).
References
Desagulier, G. (2017). Corpus linguistics and statistics with R: Introduction to quantitative
methods in linguistics. Switzerland: Springer International Publishing.
Gries, S. T. (2010). Useful statistics for corpus linguistics. In A. Sánchez & M. Almela (Eds.),
A mosaic of corpus linguistics: Selected approaches (p. 269). Frankfurt: Peter Lang.
Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.
Cunxin Han
School of Foreign Languages, Jimei University, Xiamen, China
hancunxin@126.com; hancx@jmu.edu.cn http://orcid.org/0000-0003-1973-9334
© 2019 Cunxin Han
https://doi.org/10.1080/09296174.2019.1646069