To cite this article: Cunxin Han (2019): Statistics in Corpus Linguistics: A Practical Guide, Journal
of Quantitative Linguistics, DOI: 10.1080/09296174.2019.1646069
BOOK REVIEW
linguistics are discussed: token, type and lemma. Some simple statistical
techniques used to measure the frequency and distribution of words and phrases
are introduced afterwards. The chapter ends with the measurement of lexical
diversity.
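As a rough illustration of the token/type distinction and of one simple
lexical-diversity measure, the sketch below counts tokens and types in a toy
sentence and computes a type-token ratio (TTR). The naive tokenizer, the sample
sentence and the function name are assumptions made for illustration, not
material from the book; lemma counts are left out because they would require
a lemmatizer.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Return (number of tokens, number of types, type-token ratio)."""
    # Assumption: any run of lowercase letters counts as one word token.
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(tokens)          # frequency of each distinct word form
    n_tokens = len(tokens)          # running words
    n_types = len(freq)             # distinct word forms
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ttr

# Hypothetical sample text, not taken from any corpus.
sample = "The cat sat on the mat and the dog sat on the rug"
n_tokens, n_types, ttr = type_token_stats(sample)
print(n_tokens, n_types, round(ttr, 3))
```

A raw TTR falls as texts get longer, so it is only comparable across texts of
similar length.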
Chapter 3 shifts to the discussion of semantics and discourse, looking at
words in context. The chapter rests on a simple premise, namely that ‘word
meanings can best be investigated through the repeated linguistic patterns in
corpora’. Methods used to measure repeated occurrences and co-occurrences,
including keywords and collocation, are presented. The chapter first discusses
the concept of collocation, i.e. ‘combinations of words that habitually
co-occur in texts and corpora’, offering an overview of 14 association measures
of collocation. The author holds that the choice of a specific association
measure depends on the type of collocations we are interested in, which in turn
depends on our research design. Ways of visualizing collocation, i.e.
collocation graphs and networks, are discussed with particular emphasis in the
following part. Then, the chapter moves on to a detailed explanation of keywords
and lockwords, concluding with inter-rater agreement measures, which
are relevant to questions of reliability in manual coding.
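To make the idea of an association measure concrete, the following minimal
sketch computes a mutual information (MI) score, a widely used measure of
collocation strength, as the base-2 logarithm of observed over expected
co-occurrence frequency. Treating MI as one of the 14 measures the book covers
is an assumption, and this simplified form ignores any correction for
collocation window size; the function name and all counts are hypothetical.

```python
import math

def mi_score(observed, f_node, f_collocate, corpus_size):
    """MI = log2(observed co-occurrences / expected co-occurrences)."""
    # Expected co-occurrence frequency if the two words were independent.
    expected = f_node * f_collocate / corpus_size
    return math.log2(observed / expected)

# Hypothetical counts, not taken from any real corpus.
score = mi_score(observed=30, f_node=1000, f_collocate=600,
                 corpus_size=1_000_000)
print(round(score, 2))
```

MI tends to reward rare, exclusive word pairs, which is one reason the choice
of measure depends on the kind of collocation a study is looking for.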
Chapter 4 looks at lexico-grammatical features in language, for example,
modality. It starts with a discussion of two approaches to lexico-grammar in
corpora, the ‘Whole corpus’ research design and the ‘Linguistic feature’ design.
Following that, the chapter shows how cross-tabulation can be used to summarize
lexico-grammatical variation. Some measures calculated from cross-tabulation
summary tables are also explained, for example, simple percentages, the
chi-square test and logistic regression. Since logistic regression is
a comparatively complex model, it is explained at length, including the method
of calculation and the interpretation of outputs.
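The step from a cross-tabulation to simple percentages and a chi-square
statistic can be sketched as follows. The table of counts, the row and column
labels and the function name are invented for illustration and do not come
from the book; the sketch computes only the test statistic, not a p-value.

```python
def chi_square(table):
    """Pearson chi-square statistic for a cross-tabulation (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = two corpora, columns = two modal verbs.
table = [[30, 70],
         [50, 50]]

# Simple row percentages: the share of each modal within each corpus.
percentages = [[100 * count / sum(row) for count in row] for row in table]
stat = chi_square(table)
print(percentages, round(stat, 3))
```

For a 2 x 2 table there is one degree of freedom, for which the conventional
0.05 critical value is 3.84.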
Chapter 5 focuses on register variation. To begin, a group of statistical
methods is introduced for the simultaneous analysis of more than one linguistic
variable found to be characteristic of certain texts and registers.
Next, the chapter explores the interrelationship between two linguistic variables
via correlation. Two basic kinds of correlation, Pearson’s correlation and
Spearman’s correlation, are clarified. In the author’s view, Pearson’s correlation
is designed for scale variables, while Spearman’s correlation handles
ordinal variables. Following that, the technique of hierarchical agglomerative
clustering is introduced for the classification of words, texts, registers, etc.,
and the interpretation of the outputs of the cluster analysis is
explained. Finally, multidimensional analysis (MD), a sophisticated methodology
devised by Douglas Biber, is introduced; it characteristically employs
factors reduced from a range of variables to characterize particular registers.
In the following part, step-by-step guidance on MD is given, from ‘the
initial process of variable selection through to the interpretation of factor
loadings and dimension plots’.
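The contrast between the two correlation coefficients can be illustrated with
a short sketch: Pearson’s r works on the raw values, while Spearman’s rho is
Pearson’s r computed on ranks. The data and function names below are
hypothetical and are not drawn from the book.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences of scale values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson(ranks(x), ranks(y))

# Hypothetical relative frequencies of two features in five texts.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [1.0, 4.0, 9.0, 16.0, 25.0]
r = pearson(feature_a, feature_b)
rho = spearman(feature_a, feature_b)
print(round(r, 3), round(rho, 3))
```

Because the second variable rises monotonically but not linearly with the
first, rho is exactly 1 while r is slightly below 1.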
Chapter 6 shifts to two branches of linguistics other than corpus linguistics,
i.e. sociolinguistics and stylistics, both of which rely heavily on
statistical methods. In this chapter, statistics like the t-test, ANOVA, the
JOURNAL OF QUANTITATIVE LINGUISTICS 3
to the most complex analysis models. Statistical techniques are also
illustrated with multiple examples in each chapter, and an example of reporting
statistics follows each statistical method and formula, which enables
readers to apply it directly to their own research. Finally, readers may find it
very convenient to grasp the key points of each chapter in the ‘Things to
remember’ section.
Another notable strength of this book is that it strikes a good balance
between theory and practice. First, it presents many statistical methods as
well as their applications in various types of research. Second, a selection of
exercises after each chapter is designed to enhance readers’ understanding
of the theory. Finally, all the research exemplified in this book can be easily
replicated using the Lancaster Stats Tools online, which gives readers
additional practice.
The reviewer identified the following weaknesses in this book. First,
although the author claims that ‘No prior knowledge of statistics is assumed;
instead all necessary concepts and methods are explained in non-technical
language’, the reviewer, despite having some statistical knowledge, still had
some difficulty understanding some of the technical terms and equations.
For instance, the application part of Chapter 1 employs a statistical method,
the independent samples t-test, which is not addressed until
Chapter 6. Beginners may find it difficult to understand
the interpretation of the results without background knowledge of the test. Second,
some research examples are drawn from the author’s real life and are
intended to enliven a dry subject, but some readers may not take them as serious
research. For example, the application example for Chapter
6 is ‘Who is this person from the White House?’ Finally, the research
discussed in Chapter 6 extends to sociolinguistics and stylistics, which
does not match the theme set for this book, that is, statistics in corpus
linguistics. However, some readers may regard this as a strength rather than
a weakness.
Despite these few issues, Statistics in Corpus Linguistics has achieved its
aim of giving readers a practical guide to the statistics used in corpus linguistics.
It should be a worthy and rewarding read not only for novice but also for
experienced researchers in linguistics, sociology and other areas of social
research.
Acknowledgments
I would like to thank Professor Paul Baker for his comments on an earlier version
of this review.
Funding
This review was supported by the National Social Science Foundation of China
(Grant No. 17BYY066) and a grant from the Ph.D. Start-up Fund of Jimei
University (Grant No. Q201504).
Cunxin Han
School of Foreign Languages, Jimei University, Xiamen, China
hancunxin@126.com; hancx@jmu.edu.cn http://orcid.org/0000-0003-1973-9334
© 2019 Cunxin Han
https://doi.org/10.1080/09296174.2019.1646069