A Taste for Corpora
Studies in Corpus Linguistics (SCL)

SCL focuses on the use of corpora throughout language study, the development
of a quantitative approach to linguistics, the design and use of new tools for
processing language texts, and the theoretical implications of a data-rich
discipline.
General Editor
Elena Tognini-Bonelli
The Tuscan Word Centre / The University of Siena

Consulting Editor
Wolfgang Teubert
University of Birmingham

Advisory Board
Michael Barlow, University of Auckland
Douglas Biber, Northern Arizona University
Marina Bondi, University of Modena and Reggio Emilia
Christopher S. Butler, University of Wales, Swansea
Sylviane Granger, University of Louvain
M.A.K. Halliday, University of Sydney
Yang Huizhong, Jiao Tong University, Shanghai
Susan Hunston, University of Birmingham
Graeme Kennedy, Victoria University of Wellington
Geoffrey N. Leech, University of Lancaster
Michaela Mahlberg, University of Nottingham
Anna Mauranen, University of Helsinki
Ute Römer, University of Michigan
Jan Svartvik, University of Lund
John M. Swales, University of Michigan
Martin Warren, The Hong Kong Polytechnic University
Volume 45
A Taste for Corpora. In honour of Sylviane Granger
Edited by Fanny Meunier, Sylvie De Cock, Gaëtanelle Gilquin and Magali Paquot
A Taste for Corpora
In honour of Sylviane Granger
Edited by
Fanny Meunier
Sylvie De Cock
Gaëtanelle Gilquin
Magali Paquot
Université catholique de Louvain
John Benjamins Publishing Company

Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of
American National Standard for Information Sciences – Permanence of
Paper for Printed Library Materials, ANSI Z39.48-1984.
Cover design: Françoise Berserik

Cover illustration from original painting Random Order
by Lorenzo Pezzatini, Florence, 1996.
Library of Congress Cataloging-in-Publication Data
A Taste for Corpora : In honour of Sylviane Granger / Edited by Fanny Meunier, Sylvie De
Cock, Gaëtanelle Gilquin and Magali Paquot.
p. cm. (Studies in Corpus Linguistics, ISSN 1388-0373 ; v. 45)
Includes bibliographical references and index.
1. Corpora (Linguistics) 2. Language and languages--Computer-assisted instruction. 3.
Second language acquisition--Computer-assisted instruction. I. Meunier, Fanny.
II. Granger, Sylviane, 1951-
P128.C68.T37 2011
410.1’88--dc22 2011008291
ISBN 978 90 272 0350 2 (Hb ; alk. paper)
ISBN 978 90 272 8708 3 (Eb)
© 2011 – John Benjamins B.V.

No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any
other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
To Sylviane Granger,
once our professor,

always our mentor,
now our colleague
and dear friend
Table of contents

Acknowledgements

List of contributors

Preface
Bengt Altenberg

Putting corpora to good uses: A guided tour
Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and Magali Paquot

Frequency, corpora and language learning
Geoffrey Leech

Learner corpora and contrastive interlanguage analysis
Hilde Hasselgård and Stig Johansson†

The use of small corpora for tracing the development of academic literacies
JoAnne Neff van Aertselaer and Caroline Bunce

Revisiting apprentice texts: Using lexical bundles to investigate
expert and apprentice performances in academic writing
Christopher Tribble

Automatic error tagging of spelling mistakes in learner corpora
Paul Rayson and Alistair Baron

Data mining with learner corpora: Choosing classifiers for L1 detection
Scott Jarvis

Learners and users – Who do we want corpus data from?
Anna Mauranen

Learner knowledge of phrasal verbs: A corpus-informed study
Norbert Schmitt and Stephen Redwood

Corpora and the new Englishes: Using the ‘Corpus
of Cyber-Jamaican’ to explore research perspectives for the future
Christian Mair

Towards a new generation of Corpus-derived lexical resources
for language learning
David Wible and Nai-Lung Tsao

Automating the creation of dictionaries: Where will it all end?
Michael Rundell and Adam Kilgarriff

Addendum
Select list of publications by Sylviane Granger

Subject index

Name index

Acknowledgements
We would first of all like to thank all the contributors to this volume for their enthusiasm,
their diligence in keeping to deadlines, and their patience in complying with our
editorial demands, one of which was secrecy for quite a while!
We would also like to express our most sincere gratitude to Elena Tognini-Bonelli,
editor of the Studies in Corpus Linguistics series, and to Kees Vaes and his team at
Benjamins for their much appreciated trust and support.
Last but not least, we would like to thank Sylviane for not finding out about our
secret project and meetings before all was (officially) revealed to her!
List of contributors
Bengt Altenberg Lund University, Sweden

Alistair Baron Lancaster University, United Kingdom
Caroline Bunce Universidad Complutense de Madrid, Spain
Sylvie De Cock University of Louvain, Belgium
Gaëtanelle Gilquin University of Louvain, Belgium
Hilde Hasselgård University of Oslo, Norway
Scott Jarvis Ohio University, United States of America
Stig Johansson† University of Oslo, Norway
Adam Kilgarriff Lexical Computing Ltd., Brighton, United Kingdom
Geoffrey Leech Lancaster University, United Kingdom
Christian Mair University of Freiburg, Germany
Anna Mauranen University of Helsinki, Finland
Fanny Meunier University of Louvain, Belgium
JoAnne Neff van Aertselaer Universidad Complutense de Madrid, Spain
Magali Paquot University of Louvain, Belgium
Paul Rayson Lancaster University, United Kingdom
Stephen Redwood University of Nottingham, United Kingdom
Michael Rundell Lexicography MasterClass and Macmillan Dictionaries, United Kingdom
Norbert Schmitt University of Nottingham, United Kingdom
Christopher Tribble London University, United Kingdom
Nai-Lung Tsao National Central University, Taiwan
David Wible National Central University, Taiwan
Preface
Bengt Altenberg
The digital revolution has had a profound effect on contemporary life. It has changed
our way of communicating with each other and our ways of gathering and processing
information. In linguistics the change has also been dramatic. It has made it possible
to develop models for simulating language behaviour and practical applications in
human-machine interaction and to create tools for storing, processing and analysing
large amounts of text.
The development of computer corpus linguistics is now familiar to most scholars
interested in the study of language. The fact that we can analyse large corpora of vari-
ous kinds has provided a solid empirical basis for the description of language and
language use. Although corpus linguistics is strictly speaking a methodology rather
than a theory of language, it has opened up new approaches to the study of language
and new and fruitful ways of matching theory and data.
Today we tend to take this development for granted. But it is profitable to remem-
ber that carefully compiled computer corpora and tools for exploring them did not
arise ‘out of the blue’. They were – and are – the laborious achievement of inspired
linguists who understood the potential of the new technology and knew how to use it
for linguistic purposes. Computer corpus linguistics has had several pioneers of this
kind since its beginning in the 1960s. This book is a tribute to one of these pioneers:
Sylviane Granger, professor of English at the University of Louvain, Belgium.
Sylviane Granger began her career the hard way, in what has humorously been
called the era BC (‘Before Computers’) when corpus data were stored on cards in shoe-
boxes or filing cabinets. Her Ph.D. thesis on the use of the passive in spoken English
(published in 1983) was the result of a painstaking manual inventory and analysis of
be + past participle forms in the files of the Survey of English Usage (then not yet avail-
able in computerized form) at University College, London. That experience undoubt-
edly trained her in handling and analysing a large amount of corpus data but it also,
one can imagine, made her appreciate the advantages offered by computerized corpora
which were being developed at the time.
But Sylviane Granger also had another fervent interest. Being tri-lingual in French,
English and Dutch, she was deeply concerned with second language learning and
teaching, notably the learning and teaching of English as a foreign language (EFL) and
– almost as a logical consequence – with contrastive analysis. These interests can be seen
as the main driving forces behind the research conducted at the Centre for English
Corpus Linguistics (CECL) which she founded at Louvain-la-Neuve in 1990. Since
then, her wide-ranging interests in English corpus linguistics, her ambition to use the
results for pedagogical purposes, and her enthusiasm as a teacher and project leader
have made the CECL a veritable hothouse of corpus research and development which
has inspired a large number of scholars around the world and fostered a new generation
of enthusiastic co-workers at Louvain-la-Neuve and abroad.
The research activities at the CECL have undergone a remarkable expansion since
its beginning 20 years ago. Broadly speaking, the development has focused on four
related areas:
– The creation and analysis of computer corpora of various kinds: learner corpora,
multilingual corpora, corpora of English for Specific Purposes, etc.
– Linguistic research on these corpora ranging from lexis and phraseology to gram-
mar and discourse with special emphasis on the development of corpus-related
methodologies and on matching empirical data and linguistic theories
– Pedagogical applications, for example in learner-oriented lexicography, textbooks,
web-based dictionaries, proficiency testing, etc.
– The development and use of computer-aided tools in research and pedagogical
applications
The work in these areas has expanded organically in a series of related steps, each serv-
ing to supplement or refine the results of the previous one. For instance, the first cor-
pus initiated by Sylviane Granger was the widely successful International Corpus of
Learner English (ICLE), a computerized corpus of written argumentative essays pro-
duced by advanced learners of English with a number of different mother tongues.
This written corpus was soon supplemented with a spoken counterpart (LINDSEI)
consisting of interviews of intermediate to advanced EFL learners. However, both
these corpora offered a cross-sectional view of the learners’ interlanguage. To redress
this limitation a new longitudinal learner corpus project (LONGDALE) has recently
been launched, again involving advanced learners with different mother tongues but
followed over a period of three years.
Another example of the expansion of the work at CECL is the development of
cross-linguistic research. The learner corpora give evidence of errors as well as quantita-
tive deviations – overuse or underuse – from a (selected) native English norm or stan-
dard of comparison. These errors and deviations tend to differ in type and frequency
depending on the L1 of the learners. One natural question that arises is to what extent
L1 interference (transfer) plays a role in the learners’ production. This question encour-
ages a contrastive perspective and the development of multilingual (comparable or
translation) corpora which can provide empirical evidence for testing claims in second
language acquisition theory which have previously mainly been based on intuition.
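
The notion of overuse and underuse relative to a reference norm can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it uses the two-corpus log-likelihood statistic, which is widely used in corpus linguistics but is not named in this preface, and the counts in the usage lines are invented for the example rather than taken from ICLE or any other corpus. The function names (`per_million`, `log_likelihood`, `compare`) are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Relative frequency of an item, normalised per million words."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood (G2) for a single item."""
    total = count_a + count_b
    # Expected counts if the item were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def compare(count_learner, size_learner, count_ref, size_ref, critical=3.84):
    """Label an item as overused or underused by learners.

    `critical` is the chi-square critical value for p < 0.05 with one
    degree of freedom; stricter thresholds are a matter of study design.
    """
    g2 = log_likelihood(count_learner, size_learner, count_ref, size_ref)
    if g2 < critical:
        return "no significant difference"
    learner_rate = per_million(count_learner, size_learner)
    ref_rate = per_million(count_ref, size_ref)
    return "overuse" if learner_rate > ref_rate else "underuse"


# Invented counts, for illustration only: 150 hits in 100,000 learner
# words against 60 hits in 100,000 words of the reference corpus.
print(compare(150, 100_000, 60, 100_000))
```

Such a comparison only says that a difference exists with respect to the chosen reference corpus; whether it stems from L1 transfer or from another source is exactly the question that the contrastive perspective described above is meant to address.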
However, interlanguage phenomena like underuse or overuse of a target language
feature may also be the result of overgeneralization of a target language structure or,
alternatively, reflect special characteristics of the selected native English norm. The
choice of target norm is therefore problematic. Which variety (or varieties) of English
should be the target in second language research and teaching? Should all learners
have the same target? Questions like these inevitably lead to a concern with language
variation and the characteristics of different varieties of English. The result has been a
development of learner corpora and multilingual corpora representing English for
Specific Purposes (such as newspaper editorials, business English, academic English,
law, etc.). Another recent interest at the CECL is to compare learner English with indi-
genized varieties of English (‘World Englishes’).
All these perspectives require special methodologies and the use of various com-
puter-aided tools for marking, analysing and presenting the data and for the creation
of pedagogical applications of various kinds (e.g. learner-oriented dictionaries, text-
books, multilingual term banks). For example, in order to compare learner data with
native data (L2 vs. L1) or different kinds of learner data (L2 vs. L2) Sylviane Granger
developed the Contrastive Interlanguage Analysis (CIA) methodology which has been
a fruitful approach in many ICLE studies. In addition, to integrate the CIA method
with contrastive observations from multilingual corpora, she developed the Integrated
Contrastive Model which helps the researcher to predict or explain various deviant
interlanguage phenomena.
Examples of computer-aided tools developed by the team at the CECL are the
error-tagging system designed for the ICLE corpus and various learner-oriented projects
in electronic lexicography, such as the creation of a web-based phraseological dictionary
of English for Academic Purposes intended for non-native writers and of a
tri-lingual terminological database of university-related terms (English-French-Dutch).
This short survey of the work carried out at the CECL can only give an indication
of the varied and rapidly expanding activities initiated by Sylviane Granger (for details,
see the CECL homepage at www.uclouvain.be/en-cecl.html). Apart from her central
influence as an enthusiastic organizer and creative researcher, she has inspired a large
number of scholars around the world and created fruitful international cooperation
around her projects. The collection of articles presented here on the occasion of her
60th birthday gives a good indication of her wide research interests. They illustrate the
variety of topics and approaches that characterize the field as well as new lines of
development. In presenting this collection the editors and contributors wish to join
her colleagues and friends around the world in celebrating her pioneering achieve-
ment, hoping that her enthusiasm and creativity will continue to inspire us in the years
to come.
Putting corpora to good uses
A guided tour
Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and Magali Paquot
This volume is a tribute to Professor Sylviane Granger, and a special gift for her 60th
birthday. Its eleven chapters tackle corpora from a wide range of
perspectives, thus reflecting Sylviane’s insatiable taste for corpora and her many interests.
They were written by distinguished scholars whose work is appreciated by
Sylviane, but who are also (long-standing) friends of hers. The different contributions
aim to shed light on the numerous linguistic and pedagogical uses to which corpora
can be put. They present cutting-edge research in the authors’ respective domain of
expertise and suggest directions for the future. Given the many potential uses of cor-
pora, the volume is inevitably incomplete and limited in size and focus, but we never-
theless believe that it will offer readers an informed account of the important role that
corpora play in applied linguistics today.
In this chapter, we will first guide readers through the main paths that Sylviane has
explored in her career so far, and then provide an overview of the articles that are
brought together in this volume.
Sylviane Granger is a corpus linguist, a specialist in contrastive linguistics, a lexi-
cographer, and also an English as a Foreign Language teacher. She is a polymathic ap-
plied linguist and her impressive list of publications (see Addendum, this volume)
reflects her numerous research interests including corpus linguistics (native, learner
and bilingual corpora), phraseology, lexicography, English as a Foreign Language,
English for Academic Purposes, Second Language Acquisition, contrastive linguistics
and technology-enhanced language learning.
Twenty years ago, Sylviane founded the Centre for English Corpus Linguistics (CECL)
at the Université catholique de Louvain (UCL), Belgium. From a modest start in 1990,
with one table, one chair, one computer, one bookcase and one researcher, the centre has
gradually grown to include many more tables and computers, but above all many more
researchers. To date some twenty researchers have been directly involved in the work
done at the CECL, a world-renowned research centre. This growth is the
result of Sylviane’s enthusiasm, work, vision and leadership. Sylviane has always been an
enthusiastic project and team leader. In addition she has always put a lot of energy and
effort into promoting learner corpus research through her publications, the many talks
she has given all over the world, but also via the (co-)organization of Summer/Easter
schools and international conferences in Louvain-la-Neuve. Recent conferences include
Phraseology 2005: The Many Faces of Phraseology, and eLexicography in the 21st Century:
New Challenges, New Applications (2009). Today, the CECL is busy organizing the Learner
Corpus Research 2011 conference to mark the 20th anniversary of its creation.
Sylviane has been one of the main driving forces behind learner corpus research and
she initiated two pioneering projects in the field: the International Corpus of Learner
English (ICLE, Granger et al. 2009) and the Louvain International Database of Spoken
English Interlanguage (LINDSEI, Gilquin et al. 2010). The ICLE project started in 1990
and the second version of ICLE, released in 2009, contains data from 16 mother tongue
backgrounds, for a total of 3.3 million words. As for LINDSEI, whose first version has
recently been released, it was started in 1995 and to date contains about 800,000 words
produced by learners from 11 different mother tongue backgrounds. Methodological
issues have also been a major concern for Sylviane and, in 1996, she proposed the Con-
trastive Interlanguage Analysis (CIA) (Granger 1996) approach to analyze learner cor-
pora. The advent of learner corpus research can be said to have taken place with the
publication of Learner English on Computer (Granger 1998), a collection of pioneering
papers on learner language, largely based on ICLE, which has inspired many publications
in learner corpus research. The volume Sylviane co-edited in 2002, entitled Computer
Learner Corpora, Second Language Acquisition and Foreign Language Teaching
(Granger et al. 2002), provides a follow-up with further developments in the field.
At the time of writing, Sylviane’s research appetite and enthusiasm remain undi-
minished and her head is full of new ideas and exciting projects! She often says that
none of this would have been possible without her team at the CECL, and this is prob-
ably true. But what would a team be without an inspirational team leader who always
looks on the bright corpus side of life? With this book, we explicitly want to thank her
for her catching enthusiasm, her intellectual perceptiveness, her unfailing expert
guidance, her sparkling personality, but also for her friendship and for the time she
spends with us, be it to discuss academic or more personal everyday life matters, or
even to party and have a good laugh.
As highlighted at the beginning of this introduction, the different contributions
included in the book reflect the numerous linguistic and pedagogical uses to which
corpora can be put. The first two chapters address two central issues in corpus research:
the notion of frequency and the role of contrastive analysis. In Chapter 1, Leech exam-
ines the role of frequency, as established on the basis of corpus evidence, in language
learning. He shows that after early word-frequency lists such as West’s General Service
List, followed by a generative period characterised by rejection of frequency, the advent
of electronic corpora has led to renewed interest in frequency (frequency of words, but
also of collocations, constructions, etc.). This movement is supported by recent trends
in linguistics such as the development of usage-based theories or the recognition of
frequency effects in grammaticalisation. Leech claims that frequency, though not the
only relevant factor, is important for language teaching, because of the principle of
‘more frequent = more important to learn’, according to which the most frequent words
are the most useful ones to the learner (for comprehension as well as production purposes).
The chapter finishes with some words of caution (what is most frequent does
not necessarily correspond to what is most salient, and corpora from which frequencies
are extracted do not always match learners’ needs) and some words of comfort (ordinal
frequencies, i.e. how words are ordered along a frequency list, are normally sufficient,
and these are usually quite similar across corpora). In the second chapter, Hasselgård
and Johansson† start their paper with a select review of pre-corpus interlanguage stud-
ies, focusing on three Scandinavian research projects, before moving on to the devel-
opment of computerized learner corpora. They focus on the ICLE project and on CIA,
and present a number of valuable insights into advanced learner English that were
gained from using comparable corpora and a common model of analysis. They then
introduce another framework developed by S. Granger, viz. the Integrated Contrastive
Model (ICM), which makes it possible to explain and/or predict mother-tongue (L1)-
specific learner problems on the basis of systematic comparisons of the first language
and the target language. The two research models are illustrated by means of three case
studies. The first two studies adopt CIA to investigate the use of quite and I would say
in four ICLE sub-corpora and the third one uses the ICM to analyse seem in the inter-
language of Norwegian learners. After identifying a number of challenges that learner
corpus research needs to meet, Hasselgård and Johansson conclude by praising the
dynamism and enthusiasm that characterise this relatively new field.
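
Leech's point that ordinal frequencies tend to be similar across corpora can be illustrated with a short sketch. It is a toy illustration under stated assumptions, not a method taken from the chapter: words are ranked by raw frequency, ties are broken alphabetically (an arbitrary simplification), and the two rank lists are compared with Spearman's rank correlation over the words they share. The function names and the miniature token lists are invented for the example.

```python
from collections import Counter


def rank_list(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Ties are broken alphabetically, so every word gets a distinct rank.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def spearman(ranks_a, ranks_b):
    """Spearman's rho over the words that occur in both rank lists."""
    shared = sorted(set(ranks_a) & set(ranks_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words")
    # Re-rank the shared words so that both lists run from 1 to n.
    new_a = {w: i for i, w in enumerate(sorted(shared, key=ranks_a.get), 1)}
    new_b = {w: i for i, w in enumerate(sorted(shared, key=ranks_b.get), 1)}
    d_squared = sum((new_a[w] - new_b[w]) ** 2 for w in shared)
    # Formula for distinct ranks (no ties), guaranteed by rank_list.
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Two invented miniature "corpora".
corpus_a = "the cat sat on the mat and the dog sat too".split()
corpus_b = "the dog and the cat sat on the rug".split()
print(spearman(rank_list(corpus_a), rank_list(corpus_b)))
```

A value close to 1 means that the two corpora order their shared vocabulary in much the same way, even if the absolute frequencies differ.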
Chapters 3 and 4 discuss the development of academic literacies. Neff van Aertselaer
and Bunce do so by examining the use of reporting verbs and evaluative lexical resources
in two small corpora of texts written within the framework of an academic
writing (AW) course by EFL Spanish university students at B1 and B2 levels of the Common
European Framework of Reference. The AW course was organised
around a series of can do descriptors to make explicit the required structural
and rhetorical features to be learned. The authors compare their results with the ICLE
Spanish sub-corpus and show that, by providing explicit descriptors for argumentative
writing, the syllabus for the two AW courses did actually support students’ literacy
growth. This is also confirmed by a comparison of the AW texts written at the beginning
and end of the academic writing course. The study also illustrates how learner corpus
data can be used to evaluate the syllabus and modify classroom teaching practices. In
Chapter 4, Tribble investigates expert and apprentice performances in academic writ-
ing, drawing on Biber’s (2006) account of lexical bundles. He compares lexical bundles
in a corpus of apprentice written production (KCL Apprentice Writing Corpus) and a
close analogue corpus of British Academic Written English (BAWE), an exemplar cor-
pus (Applied Linguistics Corpus) and two progressively more distant analogue corpora
(BNC Baby, Academic and Acta Tropica). The chapter provides concrete illustrations of
how the written production of postgraduate students in a single disciplinary area can be
used to trace contrasts between apprentice and expert writing, and how the account of
such contrasts can be exploited in materials development for English for Academic Pur-
poses (EAP) writing courses. Tribble’s study demonstrates how corpus analysis can help
meet the learners’ linguistic needs; it also shows that a focus on lexical bundles fosters a
better understanding of apprentice writers’ strategies.
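
Lexical bundles, as used in Tribble's chapter, are recurrent word sequences identified by frequency and by their spread across texts. The sketch below shows one way such an extraction could look. It is a minimal illustration under stated assumptions: tokenisation is plain whitespace splitting, the thresholds (`min_freq`, `min_texts`) are placeholder values rather than the cut-offs used in the chapter, and the three sample texts are invented.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-word sequences that recur often enough and widely enough.

    min_freq  -- minimum number of occurrences in the whole corpus
    min_texts -- minimum number of different texts containing the sequence
    """
    freq = Counter()
    text_range = defaultdict(set)
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive tokenisation, no punctuation handling
        for gram in ngrams(tokens, n):
            freq[gram] += 1
            text_range[gram].add(text_id)
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and len(text_range[gram]) >= min_texts
    }


# Invented sample texts, for illustration only.
sample = [
    "on the other hand the results differ",
    "on the other hand it is clear",
    "it is clear that on one hand",
]
print(lexical_bundles(sample))
```

With real corpora of different sizes, the frequency threshold would normally be expressed as a rate (for example per million words) so that bundle lists from apprentice and expert corpora remain comparable.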
Issues pertaining to the automatic analysis of learner corpora are addressed in
Chapters 5 and 6. In Chapter 5, Rayson and Baron present the novel application of a
hybrid approach to the detection of spelling errors in learner data. They use a modified
version of the Variant Detector (VARD) software, initially developed to match historical
spelling variants to modern equivalents, to detect spelling errors in ICLE sub-corpora
consisting of 50,000 words from three different mother tongue backgrounds
(French, German and Spanish). They show the potential of natural language process-
ing methods to contribute to the automatic error analysis of learner corpora as VARD
can both assist a manual editing process of a sample corpus and be trained and run
automatically to generate larger amounts of data for analysis. The authors explain,
however, that despite the very high precision rate obtained by VARD, further research
is still needed to improve the recall rate of detection of learner errors, especially those
that can only be found using contextual patterns. In the next chapter, Jarvis uses data-
mining techniques to automatically detect the L1 of learners. The influence of the first
language on a second has been one of the most researched topics in learner corpus
studies. Most of these studies have used CIA to identify features of non-nativeness in
learner productions and assess whether these features are peculiar to one language
group, and thus possibly due to the influence of the learners’ mother tongues. In a
number of recent publications, however, Jarvis has put forward the detection-based
approach to cross-linguistic influence, a complementary and largely automatic
approach to detect cross-linguistic influence. The author compares 20 learning algo-
rithms used for supervised classification, i.e. classifiers, and assesses their ability to
learn to detect L1-related patterns of use of n-grams in 12 ICLE sub-corpora. He also
explains that the applications of the detection-based approach to cross-linguistic
influence are tremendous and largely transcend the field of language learning and
teaching, as they could for instance be used for intelligence purposes.
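
The idea of training a classifier to detect L1-related patterns of n-gram use can be sketched as follows. This is a deliberately small illustration, not a reconstruction of Jarvis's study: it uses a multinomial naive Bayes classifier with add-one smoothing, chosen here only because it is short to write and not because the chapter singles it out among the classifiers it compares. The training sentences and the labels `L1_A` and `L1_B` are invented placeholders.

```python
import math
from collections import Counter, defaultdict


def features(text, n=2):
    """Word unigrams plus word n-grams (bigrams by default)."""
    tokens = text.lower().split()
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.feature_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(self.doc_counts[label] / total_docs)
            for feat in features(text):
                if feat in self.vocab:  # features unseen in training are ignored
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data with placeholder L1 labels.
train_texts = [
    "i am agree with this opinion",
    "i am agree that it is important",
    "it depends of the situation",
    "this depends of many factors",
]
train_labels = ["L1_A", "L1_A", "L1_B", "L1_B"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("i am agree with you"))
```

In a real study the classifier would be evaluated on texts held out from training, since accuracy on the training data says little about how well L1-related patterns generalise.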
Chapters 7 to 9 deal with the sometimes blurred frontiers between second/foreign
language acquisition, second language use and new varieties of English. Mauranen
compares learner corpora, which contain data produced by second/foreign language
learners, and corpora of English as a Lingua Franca (ELF), which contain data pro-
duced by non-native speakers who use English as a contact language. She first high-
lights the differences between the two, making a distinction between second language
acquisition and second language use, and showing how this distinction, and the social,
cognitive and interactive differences it implies, may impact corpus compilation and
interpretation. While the division according to mother tongue background makes sense
in learner corpora, for example, it is much less relevant (and feasible) in ELF corpora,
which usually incorporate unpredictable combinations of mother tongues. On the other
hand, learners and ELF users are also shown to share certain features that can be seen
to reflect the cognitive processes underlying the production of (non-native) language.
The processes of overgeneralisation and simplification, for instance, are important in
both second language acquisition and second language use, and can result in similar
lexicogrammatical or phraseological features, as exemplified by Mauranen. On the ba-
sis of these similarities and differences, the author argues that learner corpora and ELF
corpora should be kept separate, but are of great mutual interest. In Chapter 8, Schmitt
and Redwood analyze 68 second language learners’ productive and receptive knowl-
edge of some of the most common phrasal verbs in English with the help of productive
and receptive tests. In addition to frequency effects, the authors also address the poten-
tial link between mode (spoken vs. written) and phrasal verb knowledge, as well as the
interactions of other factors that can lead to individual differences in the acquisition of
phrasal verbs (second/foreign language proficiency, gender, age, and amount and type
of exposure to the target language inside and outside the classroom). The authors demonstrate
that frequency can predict phrasal verb knowledge to a considerable degree in terms
of productive mastery, but not in terms of receptive mastery. Whilst their results show
no effect for formal-instruction-based variables, they show that more out-of-class ex-
posure facilitates the learning of phrasal verbs. In the next chapter, Mair shows how
corpus linguistics has contributed to the study of the so-called ‘New Englishes’. His own
research focus is on Jamaican English and Jamaican Creole, which he explores on the
basis of a large corpus of diasporic Jamaican web-posts, called the Corpus of Cyber-Jamaican.
Mair highlights interesting features of Jamaican English and Creole as they are used
in computer-mediated communication, for example the higher frequency of basilectal
variants in cyber-Jamaican than in face-to-face interaction, which he explains by the
phenomenon of ‘anti-formality’, i.e. “conscious closing of social distance”. He also deals
with lexical borrowings from African languages in Jamaican English and Creole, with
words such as mzungu (‘white person’ in Kiswahili) or wahala (‘trouble/problem’ in
Nigerian Pidgin) being found in the Corpus of Cyber-Jamaican. More generally, the
paper underlines the benefits of relying on data derived from the World Wide Web,
which includes more non-standard forms than corpora of face-to-face interaction, in
order to investigate variation in the New Englishes. It also argues that web-forums can
provide an arena for language contact that would be unlikely to occur in the real world,
resulting in the rapid globalisation of certain vernacular features.
The last two chapters of the book are devoted to the role that corpora can play in the
development of lexical and lexicographical resources for language learning. Wible and
Tsao report on a new corpus-derived lexical resource designed to help bridge the gap
between language learners’ needs and what corpora can offer when it comes to vocabu-
lary learning. After arguing that vocabulary knowledge is best seen as a rich network of
interconnections among words and that corpora as collections of texts and tokens fail to
give language learners direct access to this web of interconnections, the chapter describes
the lexical knowledgebase StringNet, which has been specifically created to reflect what
learners need to master. The authors explain how corpus-derived ‘hybrid n-grams’, in
which part-of-speech categories can occur alongside lexemes or word forms, have been
instrumental in automatically discovering not only patterns of word behaviour but also
the relations among these patterns and words. In addition, they show how hybrid n-grams
make it possible to uncover the larger patterns in which collocations often tend to
be embedded. Finally, Wible and Tsao suggest that language learners could be given ac-
cess to the lexical knowledgebase StringNet via a browser-based tool which could help
them discover patterns they had not thought of looking for. As for Rundell and Kilgarriff,
they examine and evaluate the role of computers and automation in modern dictionary
making and more specifically in the period from the late 1990s onwards. The focus is on
a number of lexicographic tasks that have been or are in the process of being automated
to a significant degree. These include the compilation of lexicographic corpora (with the
advent of the ‘web corpus’), the development of headword lists (e.g. selecting headwords,
identifying multiwords or new words), the identification of the key linguistic features of
the lexical units included in the dictionary (e.g. their collocational/colligational prefer-
ences, the grammatical or register labels they should be assigned), and the selection of
examples to be included (e.g. using the GDEX [‘good dictionary examples’] algorithm).
The contribution and development of word sketches and the Sketch Engine are also high-
lighted and amply illustrated. Throughout the chapter the authors show how automation
has made it possible not only to relieve lexicographers of more tedious work involved in
dictionary making but also to increase consistency and reliability when describing lan-
guage and compiling dictionary entries. Their paper is rounded off by a discussion of
possible further developments of the process of automation in lexicography.
We hope that the guided tour of some of the key approaches, methods and do-
mains of applications of (learner) corpus research provided in this volume will help
readers refine and/or develop their own taste for corpora, and that it will prompt them
to discover and freely explore new paths.
References
Biber, D. 2006. University Language: A Corpus-based Study of Spoken and Written Registers
[Studies in Corpus Linguistics 23]. Amsterdam: John Benjamins.
Gilquin, G., De Cock, S. & Granger, S. (eds). 2010. The Louvain International Database of Spoken
English Interlanguage. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires
de Louvain.
Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized bilingual
and learner corpora. In Languages in Contrast. Text-based Cross-linguistic Studies [Lund
Studies in English 88], K. Aijmer, B. Altenberg & M. Johansson (eds), 37–51. Lund: Lund
University Press.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison Wesley Longman.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. The International Corpus of
Learner English. Handbook and CD-ROM. Version 2. Louvain-la-Neuve: Presses universi-
taires de Louvain.
Granger, S., Petch-Tyson, S. & Hung, J. (eds). 2002. Computer learner corpora, second language
acquisition and foreign language teaching. Amsterdam & Philadelphia: Benjamins.
Frequency, corpora and language learning
Geoffrey Leech
I begin this chapter with a brief survey of how frequency – in particular,
frequency of words – had a role in language learning in the days before
electronic corpora existed. Then I consider how the ‘corpus revolution’ made
frequency information available in a totally unprecedented way from the 1960s
onward. But how far is this useful to the language learner and teacher? Is the
right kind of frequency knowledge being captured? In the second half of this
chapter, I will consider the equation ‘more frequent = more important to learn’,
what questions of frequency we really need to ask, and how far they can be
answered in the present state of corpus linguistics.1
1. Introduction
If asked what is the one benefit that corpora can provide and that cannot be provided
by other means, I would reply ‘information about frequency’. Frequency is also a theme
which has recurred in language learning – although it has also suffered from neglect
(as will be briefly explained below). Hence there is need for a re-appraisal of the links
between frequency, corpora and language learning. Following this introduction, the
chapter is divided into four main sections: Section 2: ‘A brief glance at history’; Section
3: ‘Recent progress in frequency studies relevant to language learning’; Section 4: ‘New
directions in applied linguistics favourable to frequency’; Section 5: ‘Challenges and
possible solutions’. The chapter ends with some concluding remarks (Section 6).
To begin with, it is as well to make clear that there are three usages of frequency
that might be confused.
a. ‘Raw frequency’ is simply a count of how many instances of some linguistic
phenomenon X occur in some corpus, text or collection of texts.
b. ‘Normalized frequency’ (sometimes called ‘relative frequency’) expresses
frequency relative to a standard yardstick (e.g. ‘tokens per million words’).
c. In what I will call ‘ordinal frequency’, the frequency of X is compared with the
frequencies of Y, of Z, etc. Thus a rank frequency list, in which words are listed in
order of frequency, is the classic example of ordinal frequency.
Although (a) is the raw measure from which (b) and (c) are derived, it is of little or no
use in itself. Normalized frequency (b) is of course essential if we are to make
comparisons between corpora, texts, etc., of different sizes. But my view is that ordinal
frequency (c) is the most useful measure to use when we are considering language
learning. It is of no use for the language teacher to be told that shall occurs
175 times per million words in a corpus. But to be told that will is much (15 times)
more frequent than shall may well be pedagogically useful.
1. I am very grateful to the editors, Sylvie De Cock, Gaëtanelle Gilquin, Fanny Meunier and
Magali Paquot, for their valuable suggestions and support in helping me to improve this chapter.
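The three usages of frequency distinguished above can be illustrated with a minimal sketch. It assumes a corpus that has already been split into word tokens; the function name `frequency_profile` and the toy sentence are invented for illustration and do not come from any corpus or tool discussed in this chapter.

```python
from collections import Counter

def frequency_profile(tokens, per=1_000_000):
    """Return {word: (raw, normalized, rank)} for a list of word tokens."""
    raw = Counter(tokens)                      # (a) raw frequency
    total = len(tokens)
    # Rank 1 = most frequent; ties are broken alphabetically so the
    # ordering is deterministic.
    ordered = sorted(raw.items(), key=lambda item: (-item[1], item[0]))
    profile = {}
    for rank, (word, count) in enumerate(ordered, start=1):
        normalized = count / total * per       # (b) tokens per million words
        profile[word] = (count, normalized, rank)  # rank gives (c) ordinal frequency
    return profile

# Toy data, invented purely for illustration.
tokens = "i will go and she will stay but we shall see".split()
profile = frequency_profile(tokens)

# An ordinal-style comparison: how many times more frequent is 'will' than 'shall'?
ratio = profile["will"][0] / profile["shall"][0]
print(profile["will"], profile["shall"], ratio)
```

In this toy sample will occurs twice and shall once, so the ratio is 2.0. The point of the sketch is only that the comparison between two items, not either normalized figure on its own, is what a teacher can act on.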
2. A brief glance at history
The historical sketch I am about to give roughly divides into three epochs: (a) early
frequency studies; (b) the rejection of frequency; (c) the computer age and the revival
of frequency studies.
2.1 Early frequency studies
The early chapters of introductions to corpus linguistics by Kennedy (1998) and by
McEnery & Wilson (2001) give something of the background to this. But for my present
purpose, it is enough to refer to one or two landmarks in the provision of
word-frequency information on English. Thorndike (1921, 1932), Thorndike & Lorge (1944),
and West (1953)2 are noted examples of word-frequency lists produced by counting
and calculating word frequencies by hand in the first half of the twentieth century –
before, that is, the development of computers. By present-day standards, the corpora
used were pitifully small, and the selection of texts they contained included some
choices hardly ideal for learners of the current language. For example, Thorndike
(1921, 1932) made use of a corpus containing such classics from the 17th, 18th and
19th centuries as Dryden’s Dramatic Essays, the American Declaration of Indepen-
dence, and Jane Austen’s Pride and Prejudice. However, the important point here is that
word frequency was taken seriously as a guide for language teaching in those days, and
in spite of the enormous amount of unrewarding ‘slave labour’ involved, building fre-
quency lists was felt to be a worthwhile exercise. The simple postulate justifying this
effort was: ‘more frequent = more important to learn’.
Of greater interest from the theoretical point of view was the mathematical work
of Zipf. Zipf’s Law (1935, 1949) held that the frequency of any word is inversely
proportional to its rank in the frequency list, such that the nth word has a frequency of
approximately 1/n × the frequency of the word of highest rank. Zipf’s Law gave a more
heavily weighted importance to the most frequent words than would be expected
according to normal distribution. Language is such that the most frequent 50 words
(i.e. word-types) account for 40% of word-tokens in a corpus of texts; the most
frequent 3,000 words account for 85% of word-tokens; and the most frequent 10,000
words account for 92% of word-tokens (see Figure 1). Carroll’s (1971)
mathematically-induced estimate of the number of word-types in the English language was 609,606
words, of which a majority have extremely small probabilities. For practical purposes
we can say that the wordstock of English is both very large and open-ended.
[Figure 1: graph with word rank (1 to 10,000) on the horizontal axis and % of total words in
the LCN on the vertical axis. Annotations: top 10 words (the, be, of, and, a, to, in, have, it, I)
– 25% of the language; top 100 words – 50% of the language; top 3000 words – 86% of the
language. Example words labelled on the graph: year, up, good, very, deep, fresh, unique,
carefully, overwhelm, viable, therapeutic, mingle, consumption, concession, aspiration,
erode, consolation, stylistic.]
Figure 1. Frequency graph of the 10,000 most frequent words in the Longman Corpus
Network (Reproduced by permission of Pearson Education Limited from: Stephen Bullon
and Geoffrey Leech, ‘Longman Communication 3000 and the Longman Defining
Vocabulary’. In Longman Communication 3000. 1. Harlow, Essex: Pearson/Longman.)
2. West’s book was called A General Service List of English Words, and recorded frequencies of
senses, not just words. Although not published until 1953, West’s book was based on counts
laboriously undertaken in the decades before 1950.
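The 1/n relationship stated in Zipf’s Law, and the idea of a small number of word-types covering a large share of word-tokens, can be made concrete with a short sketch. It is an idealization under stated assumptions: a closed vocabulary of exactly 10,000 types whose frequencies follow 1/n perfectly, with an arbitrary top frequency of 60,000. The function names are hypothetical, and the resulting coverage figures are those of the idealized model, not of any real corpus.

```python
def zipf_expected(top_frequency, rank):
    """Expected frequency of the word at `rank` under the 1/n idealization."""
    return top_frequency / rank

def coverage(counts, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(ordered)

# Idealized vocabulary of 10,000 types whose frequencies follow 1/n exactly.
# The top frequency of 60,000 is an arbitrary illustrative value.
counts = [zipf_expected(60_000, rank) for rank in range(1, 10_001)]

top10 = coverage(counts, 10)
top100 = coverage(counts, 100)
print(round(top10, 3), round(top100, 3))
```

Under these assumptions the ten most frequent types cover roughly 30% of tokens and the hundred most frequent roughly 53%. Real corpora have an open-ended vocabulary and deviate from the pure 1/n curve, so the figures quoted in the text and in Figure 1 should not be expected to match this model exactly.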
2.2 The rejection of frequency
In linguistics, the second half of the twentieth century, at least up to the 1990s, was
dominated by the generative school of Noam Chomsky, who rejected the value of
frequency in the study and understanding of language. Chomsky famously used the
illustration of I live in Dayton, Ohio and I live in New York to show that the greater
frequency of the latter sentence as compared with the former was of no linguistic rel-
evance or interest. Of course, this had more to do with the differences of population
between Dayton, Ohio and New York – from Chomsky’s point of view, a matter of
performance (and hence of no value to linguistics) rather than competence. He
concluded that “probabilistic considerations have nothing to do with grammar”
(Chomsky 1964 [1962]: 215) – using grammar in a broad Chomskyan sense to include
the whole language system. From that time until (roughly) the end of the century,
since Second Language Acquisition (SLA) research was heavily influenced by the gen-
erative paradigm, it was difficult to find any serious reference to frequency in publica-
tions about the learning of languages, and where frequency was discussed, it was dealt
with perfunctorily and sometimes negatively. The well-known authoritative handbook
by Rod Ellis, The Study of Second Language Acquisition (1994), has little to say about
frequency, and offers very little extra in its second edition of over a thousand pages,
published as recently as 2008. The only substantial reference to frequency is in the sec-
tion headed ‘The frequency hypothesis’, in which the emphasis is wholly on the learn-
er’s input frequency (see Ellis 1994: 269–273, 2008: 241–246). For corpus linguistics, a
more relevant question is: how can both the learner’s input and output be adjusted to
the future likely needs of the learner as revealed in corpora?
2.3 The computer age and the revival of frequency studies
It can be said that the corpus revolution in linguistics began with the completion and
distribution of the Brown Corpus in 1964.3 Shortly after, Kučera & Francis (1967) used
this to create the first word frequency lists for English based on corpus data. Later, in
Francis & Kučera (1982), they published lemmatized frequency lists, based on the
part-of-speech (POS) tagged version of the corpus. Further word frequency lists were
derived from the Lancaster-Oslo/Bergen (LOB) Corpus of British English (Hofland &
Johansson 1982; Johansson & Hofland 1989). For the first time, grammatically informed
word frequency lists derived automatically from matching computer corpora became
available to the language researcher and the language teacher, permitting comparison
of American and British English.
Of course, this was only the first step: in the last forty years, there has been an im-
mense increase in the number of corpus-based frequency studies both for written and
spoken English, as more diversified corpora as well as much larger corpora have be-
come available. Apart from word frequency lists and studies (e.g. those derived from
the British National Corpus [BNC] – Leech et al. 2001), corpus-based frequency stud-
ies have dealt with collocations (e.g. Sinclair et al. 1970, republished in Krishnamurthy
2005), and with frequency of grammatical categories, structures, etc. Here hundreds of
grammatical studies could be mentioned, starting from Ehrman (1966), and culminat-
ing in a corpus-based frequency grammar of English (Biber et al. 1999) as well as with
frequency studies of the language of learners (Granger 1997, 1998). It goes almost
without saying that the availability of electronic corpora has revolutionized the
application of frequency information whether derived from general corpora, specialized
corpora, written texts or spoken transcriptions.
3. The Brown Corpus was originally issued by W. Nelson Francis and Henry Kučera of Brown
University, under the title A Standard Sample of Present-Day Edited American English, for use
with Digital Computers.
It is also clear that frequency data from authentic texts have been one of the major
driving forces of natural language processing (NLP), leading to the development of so-
phisticated statistical methods and probabilistic systems. One of the first steps was taken
in the probabilistic POS tagging of the LOB Corpus, employing a modified Hidden
Markov Process model (Marshall 1983, 1987). The history of statistical modelling in
NLP, however, cannot be pursued further here. See Jelinek (1998) for further coverage.
2.4 Co-frequency, collocation
Another great step forward was taken through the pursuit of co-frequency – i.e. the
frequency of X and Y occurring together in a corpus, as measured against the probabil-
ity of their occurring together by chance. A serious beginning was made in Sinclair’s
research discussed in his (and colleagues’) OSTI report (1970), using a small corpus of
spoken English of 135,000 words. Obviously, as Sinclair pointed out, a much bigger
corpus (of 20 million words or more) was needed to produce significant results for col-
locational analysis. This was achieved and surpassed in the 1980s and 1990s with
Sinclair’s development of the Birmingham Collection of English Texts, later known as
the Bank of English, as well as by other corpora such as the BNC. To give an impression
of how vastly the size of corpora on which frequency studies are based has mushroomed
in the last forty years: in comparison with Sinclair’s spoken corpus of 135,000 words in
1970, a recently published frequency dictionary of American English (Davies &
Gardner 2010) is based on a corpus of 385,000,000 words, including 79,000,000 words
of speech. This dictionary is also an innovation in providing, alongside individual word
frequencies, a classified list of common collocations for each word.
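The notion of co-frequency as observed co-occurrence measured against chance can be sketched as follows. The formulas are the simple textbook versions of mutual information and the t-score, which ignore refinements such as window size; the function name `association_scores` and all counts are invented for illustration.

```python
import math

def association_scores(pair_count, x_count, y_count, corpus_size):
    """Return (expected, mutual_information, t_score) for a word pair X, Y."""
    # Co-occurrences expected by chance if X and Y were independent.
    expected = x_count * y_count / corpus_size
    # Mutual information: log (base 2) of observed over expected co-occurrence.
    mutual_information = math.log2(pair_count / expected)
    # t-score: difference between observed and expected, scaled by the
    # square root of the observed count.
    t_score = (pair_count - expected) / math.sqrt(pair_count)
    return expected, mutual_information, t_score

# Invented counts: X occurs 500 times and Y 600 times in a corpus of
# 1,000,000 words, and the two occur together 30 times.
expected, mi, t = association_scores(30, 500, 600, 1_000_000)
print(round(expected, 2), round(mi, 2), round(t, 2))
```

With these invented counts, chance alone would predict 0.3 co-occurrences, so 30 observed co-occurrences give a mutual information of about 6.64 and a t-score of about 5.42. The sketch also shows why corpus size matters: with very low counts the expected value is tiny and the scores become unstable.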
Word frequency lists such as those of Francis and Kučera were of limited interest
to corpus linguists like John Sinclair, who urged the inadequacy of the open choice
principle of treating every word-token in a string as if independently selected, as con-
trasted with the idiom principle whereby texts are observed to be constructed in terms
of “a large number of semi-preconstructed phrases that constitute single choices, even
though they might appear to be analysable into segments” (Sinclair 1991: 110).
Sinclair’s idiom principle has since been followed up by many corpus linguists and
lexicographers for whom multi-word units – collocations, lexical bundles, and the like
– are essential to the fabric of language, as well as to the learning of language. Indeed,
corpus research itself has shown observationally the importance of word combina-
tions, whose significance is capable of being measured by statistical formulae such as
mutual information, t-test, and log likelihood. Sinclair, in championing the idiom
principle, was following to some extent in the footsteps of his former Edinburgh col-
league M.A.K. Halliday, and Halliday’s teacher J.R. Firth (1957), who first gave promi-
nence to the co-frequential concept of collocation (Halliday 1966; Sinclair 1966).
Halliday (1961: 273–277) had stated that the level of lexis (including collocation) had
to be a distinct level of linguistic description,4 and at the same time had proposed that
the levels of grammar and of lexis were interrelated along a cline or continuum of
delicacy (ibid. 276–277). For him, the levels of grammar and lexis constituted a single
lexico-grammatical level accounting for the formal structuring of language. Many cor-
pus linguists have espoused something like this model, evidenced as it is by a multi-
tude of studies,5 with the result that the interpenetration of grammar and lexis
(and hence the spread of lexical frequency-based concepts into grammar) has become
widely accepted. In this respect, it can be said that the corpus revolution has intro-
duced a new theoretical perspective on linguistic structuring: one in bold contrast to
the mainstream paradigm of Chomsky (e.g. Chomsky 1965: 84–88) whereby grammar
and lexicon are two clearly distinct components. It also challenges a tradition long es-
tablished in language study, whereby grammars and dictionaries provide distinct kinds
of information about a language, and are published in separate covers.
3. Recent progress in frequency studies relevant to language learning
In this part of the chapter I will revisit four topics already briefly touched on, showing
how studies of frequency have been increasingly applied to various linguistic units or
components:
a. word frequency (by register, region, etc.)
b. co-frequency between words – lexis and collocation
c. grammatical frequency
d. lexico-grammatical frequency – co-frequency between lexis and grammatical
structures
I will consider how these topics have been advanced by recent research. Other linguis-
tic levels at which frequency has been somewhat less investigated, e.g. semantics, will
remain in the background.
3.1 How frequency is important for English Language Teaching (ELT)
First, let us revisit word frequency lists. The case for ‘more frequent = more important
to learn’ is simply put: “The reasoning behind this position is that learners should be
taught what is most frequent in language, since it is what is of most use to them”
(Gilquin 2006a: 58). In other words, the more frequent a word is in language use, the
more likely it is to be useful to the learner. This is (a) because it will be more frequently
encountered in the language use of other people, and (b) because it will be more
frequently needed for the learner’s own language use.
4. “So there must be a theory of lexis, to account for that part of linguistic form which
grammar cannot handle” (Halliday 1961: 273).
5. For example, Moon (1998), Nesselhauf (2005), Adolphs (2008), Römer & Schulze (2009),
and a rich range of studies contributed to Granger & Meunier (2008).
3.2 Word frequency associated with language varieties
However, frequency counts are least useful when they are based on a general corpus
covering the range of the language; they are more useful if they are differentiated for
region and register. This is one advantage that the corpus revolution has brought, and
which was lacking in earlier manually-based studies. Earlier I mentioned briefly
Johansson & Hofland’s (1989) frequency lists of comparable corpora for American
English (AmE) and British English (BrE): Brown and LOB. These and other corpora in
the Brown family show differences in regional varieties of written English and show,
for example, that the auxiliaries must, may, should and shall have been declining sharp-
ly in frequency, and that in this decline BrE is following in the wake of a sharper de-
cline in AmE (Leech et al. 2009: 71–83). It is also possible to compare AmE and BrE in
terms of spoken English: two comparable corpora of conversation (the demographi-
cally sampled part of the BNC and the Longman Corpus of Spoken American English
[LCSAE]) show that in AmE, much more than in BrE, ‘core’ modals like must, may,
should and shall are less frequent than in the written language, whereas some ‘semi-
modals’ resulting from grammaticalization – constructions like be going to and have to
– have reached a greater frequency than most core modals (see Leech et al. 2009: 100).
In Leech et al. (2001), based on the BNC, we presented word frequency lists for both
written and spoken English, and also lists of words which were most ‘key’ in these two
varieties (i.e. those most strongly associated with written texts or with spoken texts).
Dictionaries such as the Longman Dictionary of Contemporary English (LDOCE) also
give variety-differentiated frequency information. Since its third edition (1995),
LDOCE has flagged words in the first thousand, second thousand, and third thousand
in terms of frequency in speech and in writing, differentiating between their occur-
rence in the two media, where not surprisingly word frequencies differ greatly. For
example, in Table 1 the items in List A are in the top 1,000 words for speech, but below
the top 3,000 in writing: in other words, these words are much more at home in the
spoken medium. On the other hand, those in List B are in the top 1,000 words for
writing, but below the top 3,000 in speech: they strongly prefer the written medium.
Table 1. Words strongly associated with spoken (List A) and written (List B) English
List A: Words strongly associated with the spoken medium
awful, basically, bet (verb), daddy, dear (interj), exam, go (noun), hello (interj), hi (interj),
hopefully, like (adv), like (conj), mine (pron), mom, mummy, OK (adv), OK (interj),
ours (pron.), penny, phone (verb), rid (adj), yeah (interj), yep (interj)
List B: Words strongly associated with the written medium
authority, institution, security, program (noun), reveal, sector, king, thus
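The band-based selection described for Table 1 (in the top 1,000 for one medium, but below the top 3,000 for the other) can be sketched as a simple comparison of two rank-ordered word lists. This is an illustrative reconstruction, not the procedure actually used for LDOCE; the function name `medium_specific` and the miniature word lists are invented, and the bands are shrunk so that the toy data can show the effect.

```python
def medium_specific(ranked_a, ranked_b, top=1000, cutoff=3000):
    """Words in the top `top` of list A that fall outside the top `cutoff` of list B.

    Both arguments are lists of words ordered from most to least frequent.
    """
    inside_b = set(ranked_b[:cutoff])
    return [word for word in ranked_a[:top] if word not in inside_b]

# Invented miniature rank lists (most frequent first).
spoken = ["the", "yeah", "of", "hi", "thus"]
written = ["the", "of", "thus", "sector", "yeah"]

# With real lists the defaults (top=1000, cutoff=3000) would apply;
# here the bands are reduced to 4 and 3.
speech_words = medium_specific(spoken, written, top=4, cutoff=3)
writing_words = medium_specific(written, spoken, top=4, cutoff=3)
print(speech_words, writing_words)
```

On the toy lists this returns yeah and hi as speech-associated and thus and sector as writing-associated. A fuller treatment would keep word class distinct (e.g. like as adverb versus conjunction), as the entries in Table 1 do.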
In addition to frequency information about speech and writing, there are also
corpus-based frequency lists relating to different registers or domains – such as the
Academic Word List of Coxhead (2000).
Such differentiated frequency information is potentially very useful for learners of
a language, or more directly, for those preparing teaching materials, selecting reading
materials, or devising tests. Until recently, corpora were restricted largely to the
written medium, and frequency lists were presented as undifferentiated as to variety:
words like daddy and institutional would appear side by side in the same
list without much distinction (in fact, their overall frequencies in the entire BNC are
close – 22 and 20 occurrences per million words respectively). So this is a decided step
forward: for the learner, vocabulary resources for speech are very different from vo-
cabulary resources for writing, and corpora have enabled us to see this clearly and in
considerable detail.6
A further innovation in recent lexicography for advanced learners has been the
recognition also of semantic frequency. Using again the example of LDOCE (1995 and
later editions), the various senses of a word are listed under each headword in fre-
quency order, and likewise homographs are presented in frequency order. In such ways,
dictionaries using corpus resources for the advanced learner have been striving to sup-
ply the information the learner needs most frequently in readily accessible form.
3.3 A more considered view
The principle “more frequent = more important to learn” can scarcely be gainsaid as a
general principle. However, one of the discoveries from the study of learner corpora is
that non-native students of English tend to overuse the words towards the top of the
frequency lists:
A number of studies reveal that learners from a wide variety of (unrelated) mother
tongue backgrounds display a common tendency to overuse common, non-spe-
cific words such as important (...) or big or nice (De Cock & Granger 2004: 78)
Part of this effect may well be due to failure to adapt to the written medium: it is true
that words such as big and nice are very frequent – but this is only in the spoken me-
dium, whereas nice, in particular, is rather infrequent in writing. A more general rea-
son for overuse of common words is that they are the words learners have encountered
and used most in the past. They are inevitably words with which the learners feel most
familiar, most confident and most comfortable – “lexical teddy bears”, as Hasselgren
(1994) calls them. Hence it is important to make a distinction between frequency in
past experience for the learner, and frequency in projected future experience. The
reason for prioritizing commoner items over less common items in teaching is that
they are predictably the items the learner will encounter, and need to use, more
frequently in the future. But the reason why learners overuse common words must be
that they are the words they have encountered and used most frequently in the past.
The conclusion is that, if we are to follow the ‘more frequent = more important to learn’
principle, attainment in vocabulary acquisition must be linked progressively and
systematically to extending the range of use to less common vocabulary, including less
common uses of frequent words (cf. Lennon 1996).7 The focus of learning should be
step by step on less frequent items which the learner needs to entrench for further use.
From a testing perspective too, as Alderson (2007) points out, less common items are
more discriminatory in the evaluation of levels of performance – for example, in
vocabulary size placement tests. All this indicates that, applied to learning processes,
frequency should be a relative, not an absolute quantity. What is important is that more
common words are most usefully learned before less common words, whether
those more common words are in the top bands of frequency or not.
6. This is all the more important since learners tend to confuse spoken and written registers
(see e.g. Altenberg & Tapper 1998 or Gilquin & Paquot 2008).
So far, then, the postulate ‘more frequent = more important to learn’ has not been
overthrown. The overuse effect implies simply that the students, in their learning pro-
cess, have not progressed down the frequency list as far as is desirable. They are relying
too much on well-worn and well-loved paths of expression.
3.4 Frequency of word combinations: Is it more important than frequency of individual words?8
Teubert (2004: 188) goes so far as to claim: “Not simple words but collocations consti-
tute the true vocabulary of a language”. This may be going too far, as Teubert here
embraces the idiom principle one hundred percent. Nevertheless, the formulaicity of
English has been calculated as around 21% in written texts and even higher (30%) in
spoken language (Biber et al. 1999: 993–994).9 Learning vocabulary is not just a matter
of acquiring individual words, but of acquiring phraseology. Hence frequency of word
combinations, as well as of words, should be an important input to the learning
process. The strange thing is that, according to De Cock (1998), the percentage of
formulaicity in learners’ productions is even higher than in those of native speakers (although
some formulae are erroneous). Again, this may be the result of a ‘teddy bear’ effect,
whereby learners hang on to the use of well-worn and familiar phrases, rather than
risking new ones.
7. Also we should include here the collocational patterns of frequent words, usually
disregarded at more advanced levels because the words are considered as easy or known, whilst
studies have shown that these collocational patterns were not mastered even at an advanced level
(see for instance Nesselhauf 2003).
8. As background to this section, see Durrant (2009) and Ellis & Simpson-Vlach (2009).
9. There are many ways of defining formulaicity (see Moon 1998) and the percentage figures
here are derived from a very specific definition: lexical bundles (3-grams and 4-grams)
recurring at least 10 times per million words. These percentages are estimated from Biber et al.’s
(1999: 993–994) Figures 13.2 and 13.3. A somewhat earlier study by Eeg-Olofsson & Altenberg
(1996) reported that as many as 86% of words in two 5,000-word samples (one monologue, the
other dialogue) “were part of a recurrent word combination in one way or another”. See also
Altenberg (1998).
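The operational definition of a lexical bundle used for the percentages above (a 3-gram or 4-gram recurring at least 10 times per million words) lends itself to a short sketch. This is a simplified illustration, not Biber et al.’s procedure: it ignores text and sentence boundaries and any requirement that a bundle be spread across several texts. The function name `lexical_bundles` and the toy sentence are invented.

```python
from collections import Counter

def lexical_bundles(tokens, n, min_per_million=10):
    """Return {n-gram: raw count} for n-grams meeting the per-million threshold."""
    if len(tokens) < n:
        return {}
    # Count every contiguous sequence of n tokens.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Factor converting a raw count into a frequency per million words.
    scale = 1_000_000 / len(tokens)
    return {gram: count for gram, count in ngrams.items()
            if count * scale >= min_per_million}

# Toy data: in a nine-word sample every n-gram trivially exceeds 10 per
# million, so the threshold is raised here to keep only sequences that recur.
tokens = "i do not know what i do not know".split()
bundles = lexical_bundles(tokens, n=3, min_per_million=200_000)
print(bundles)
```

On the toy sample only i do not and do not know survive, each with a count of 2. With a corpus of realistic size the default threshold of 10 per million would be the relevant setting.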
3.5 Grammatical frequency
The focus has been mainly on lexical frequency so far – the easiest kind of frequency
data to extract from corpora. The collection of data on frequency of grammatical cate-
gories, grammatical constructions and the like can be achieved automatically only if the
corpus has been annotated with the grammatical information supplied by POS tagging
and (ideally) parsing.10 This annotation process is far from easy. Alternatively, unless the
grammatical items happen to be unambiguously identifiable from their orthographic
form, grammatical information has to be gathered laboriously by manual intervention.
Nevertheless, much has been learned from corpora about grammatical frequency since
the first POS tagging (of the Brown Corpus) was achieved in 1970 (Greene &
Rubin 1971). Many results come from individual case studies of particular areas of Eng-
lish grammar. A more concerted corpus-based account of grammatical frequency is
provided by the Longman Grammar of Spoken and Written English (Biber et al. 1999).
At the more theoretical level, one rather unexpected finding (Sampson 2007) is
that frequency of grammatical structures, defined as tree fragments or mother-daugh-
ter sequences, follows a Zipfian curve similar to that of word frequency, with an enor-
mous tail of structures occurring only once in a corpus, just as the tail of vocabulary
frequency (around 50%) consists of words that only occur once (hapax legomena).
This is surprising for those brought up in the Chomskyan framework (Chomsky 1957:
13) where there is assumed to be a clear dividing line between items which are gram-
matical and those which are not. The common assumption up to recently has been that
the grammar is a closed system of rules whereas the lexicon is open-ended.
On grammatical frequency, perhaps even more than lexical frequency, corpus
findings can be surprising alike to native speakers and to experienced teachers of the
language. For example, it has been reported that teachers of English, when asked
whether the present progressive or the present simple is more common, typically opt
for the progressive.11 This choice is reinforced by the fact that in syllabuses, the present
progressive has sometimes been taught before the present simple. Teachers, it can be
supposed, are hugely surprised to be told that (according to corpus evidence) the
progressive aspect is about 20 times less common than the simple non-progressive aspect
(see Figure 2).
10. However, tagging and even parsing do not necessarily imply that the retrieval of
grammatical phenomena is fully automatic. Sometimes, considerable manual post-editing is
necessary (cf. Gilquin 2002). On the other hand, advanced corpus software such as BNCweb (for use
with the BNC) can undertake queries leading to the retrieval of syntactic patterns by use of a
powerful query syntax known as CQP employing regular expressions – see Hoffmann et al.
(2008: 215–243).
Another illustration of how teaching practices in grammar have been notoriously
at odds with corpus evidence is that of conditional sentences. For a generation at least,
Thomson & Martinet’s (1980: 186–192) best-selling grammar textbook helped to per-
petuate the time-honoured assumption that there are just three categories of condi-
tional which learners of English have to master:12
First conditional: Protasis: present simple Apodosis: will + infinitive
e.g. If you don’t get it he’ll repeat it.
Second conditional: Protasis: past simple Apodosis: would + infinitive
e.g. If I had an acre to plant, I would spend all day working on it.
Third conditional: Protasis: past perfect Apodosis: would have + infinitive
e.g. If I’d owned it, I would have thrown it away.
[Examples from the LCSAE]
Figure 2. Chart showing the frequency of the simple aspect (non-perfect,
non-progressive) compared with those of the perfect and progressive aspects (based on
Biber et al. 1999: 461–462; the portions represent percentages of all verb phrases)
[Chart segments: simp (simple), perf (perfect), prog (progressive), perf+prog (perfect progressive)]
11. Douglas Biber, personal communication: “I have used this example in literally dozens of
talks, and I consistently get the same result. The most dramatic case was probably a plenary that
I gave at AAAL several years ago. The estimated attendance was c. 800, but only c. 20 raised their
hand to vote for simple aspect as more frequent”.
12. To be fair, Thomson and Martinet allow for variants on these three patterns, for example
where other modals than will and would occur. More recent pedagogical accounts of grammar tend
to include the zero type. For further corpus evidence and discussion, see Gabrielatos (2003, 2007).
Figure 3. Likelihood of Manner preceding Place, Place preceding Time, and Manner
preceding Time (based on Biber et al. 1999: 811; frequencies per 10,000 words)
[Three bar charts: Manner & Place (M-->P vs P-->M), Place & Time (P-->T vs T-->P),
Manner & Time (M-->T vs T-->M). Key: --> means “before”.]
However, corpora show that more frequent than each of these three is the unmodal-
ized conditional, often called the ‘zero type’, typically with the present simple tense in
both clauses:
If you do it in twenty days, you’re wonderful. [Example from the LCSAE]
For the millions of learners who have sweated over the second and third conditionals,
it might be a comfort (or alternatively, a vexation) to know that these are fairly rare in
comparison with the type just illustrated.
Yet a further example is the ‘MPT rule’, repeated in many books and materials,
decreeing that the order of adverbials at the end of a clause is ‘Manner followed by
Place and Place followed by Time’. In practice, this turns out to be a probabilistic rule,
and not a very good one at that. The charts in Figure 3 show the likelihood that these
three classes will occur in the order stated.13
As the anecdote of the progressive just mentioned suggests, ‘authoritative’ figures
in language teaching, whether teachers, materials writers or just native speakers, are
very poor at guessing relative frequencies of grammatical classes and structures. If the
time wasted teaching rather uncommon structures and weak rules is to be avoided, the
‘more frequent = more important to learn’ principle should be applied to grammar.
This is where corpus evidence again becomes crucial.
13. The data for Figure 3 comes from Biber et al. (1999: 811), Figures 10.14–16.
3.6 Phraseology and the interaction of lexis and grammar
In the interaction of lexis and grammar, frequency helps to unlock predictable
patterns of meaning. This is definitely an area of corpus-based investigation whose hour
has come. Recently, various frameworks have been put forward extending the
collocational analysis paradigm to apply to frequency of co-occurrence of both lexical
and grammatical choices:
a. Pattern grammar: described as “a corpus-driven approach to the lexical grammar
of English” (Hunston & Francis 2000)
b. Collostructions: the statistical measurement of the degree of attraction or repul-
sion between words and constructions (Stefanowitsch & Gries 2003)
c. Word sketches: use of the Sketch Engine software to derive a summary of a word’s
collocational behaviour in terms of grammatical slots (Kilgarriff & Tugwell 2002)
d. Concgrams: use of ConcGram software to generate word-collocations of variable
position and distance, such that (for example) play a role, play an important role,
a key role to play can all be listed as belonging to the same concordance output
(Cheng et al. 2006)
Here I will not go into the technical characteristics distinguishing these approaches
from one another. The important point, as I see it, is that they all explore statistically
the until-recently-neglected interface between lexis and grammar. Lexis, in its pure
Hallidayan and Sinclairian form, focuses on patterns of word co-occurrence while ex-
cluding generalizations on the level of grammatical structure.14 On the other hand,
many approaches to grammar have neglected the level of lexical patterning. Surely the
most valuable way to synthesise the relations between lexis and grammar within a
single lexico-grammatical framework is to use corpus linguistic techniques such as
those in (a)–(d) above. I will illustrate this with just two examples, the first of word
sketches and the second of collostructions.
Table 2 displays a word sketch of the noun bank, showing its co-occurrence con-
nections in terms of frequency and salience (a strength-of-association measure), with
verbs in the Subject-of relation, with verbs in the Object-of relation and with adjec-
tives or nouns as modifiers of bank.
The automated analysis of grammatical structure, as shown by the Sketch En-
gine, has reached a stage where errors are rather few, and results can be regarded as
substantially reliable. On the other hand, a semantic element of analysis is still lack-
ing, as we can see from the juxtaposition at the top of the Object-of list of burst
(where the bank is obviously a river bank) and rob (where the bank is obviously a fi-
nancial institution).
The second and rather similar technique derives from Stefanowitsch & Gries’s
(2003) statistical concept of collostructional analysis interrelating (as its name sug-
gests) collocational analysis and construction grammar. It can be illustrated from the
analysis of the construction [Verb NP as X] by Gries et al. (2005: 649) – see Table 3.
14. See Halliday (1961: 273–277), Halliday (1966) and Sinclair (1966).
Table 2. Part of a word sketch (after Kilgarriff & Tugwell 2002: 131) of the noun bank
subject-of   num   sal   object-of   num   sal   modifier        num   sal
lend          95  21.2   burst        27  16.4   central         755  25.5
issue         60  11.8   rob          31  15.3   Swiss            87  18.7
charge        29   9.5   overflow      7  10.2   commercial      231  18.6
operate       45   8.9   line         13   8.4   grassy           42  18.5
step          15   7.7   privatize     6   7.9   royal           336  18.2
deposit       10   7.6   defraud       5   6.6   far              93  15.6
borrow        12   7.6   climb        12   5.9   steep            50  14.4
eavesdrop      4   7.5   break        32   5.5   issuing          23  14.0
finance       13   7.2   oblige        7   5.2   confirming       13  13.8
underwrite     6   7.2   sue           6   4.7   correspondent    15  11.9
account       19   7.1   instruct      6   4.5   state-owned      18  11.1
wish          26   7.1   owe           9   4.3   eligible         16  11.1
num = number of tokens; sal = salience (roughly: strength of association)
Table 3. A partial collostructional listing (from Gries et al. 2005: 649) of verbs most
strongly attracted to the construction [Verb NP as X]
verb            tokens    strength   verb            tokens    strength
regard              80     166.476   recognise/ize       12      12.159
describe            88     134.870   categorise/ize       6      11.525
see                111      78.790   perceive             6       8.304
know                79      42.796   hail                 3       6.316
treat               21      28.224   appoint              5       6.073
define              18      23.843   interpret            5       5.920
use                 42      21.425   class                3       5.379
view                12      17.861   denounce             3       5.158
map                  8      12.796   dismiss              4       5.079
verb = verb in construction; tokens = number of tokens; strength = collostruction strength
The interesting debate here lies in two different measures, item frequency and
strength-of-association (collostructional strength), which can produce different
results. In Table 3, the verbs see and describe are more frequent in this construction than
regard. But regard is a more ‘typical’ verb to use with the [Verb NP as X] construction,
because a larger proportion of its tokens occur with this construction as compared
with others. It is more securely attracted (or ‘bonded’) to this construction than to
others. Describe and (especially) see are more general-purpose verbs that do not have this
special relationship with the construction.
As another illustration, Stefanowitsch & Gries (2003: 231) determine the collostruc-
tional strength of verbs with the progressive. The most strongly bonded verbs, in order,
are talk, go, try, look, work, sit and wait. This order is obviously not that of the frequency
of the verbs themselves, which is (as it happens) as follows: go, look, work, try, talk, sit
and wait. The debate is to determine whether learners acquire the construction better
with common verbs or with bonded verbs: arguably a matter for SLA specialists, rather
than corpus linguists. But surely both measures are potentially useful to the learner.
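The contrast drawn in this section between item frequency and strength of association can be made concrete with a small sketch. The construction counts below are those of Table 3, but the overall verb frequencies are hypothetical figures invented for illustration, and a simple proportion (the share of a verb's tokens that occur in the construction) stands in for the published collostructional statistic, which is computed differently.

```python
# Tokens of each verb in the [Verb NP as X] construction (from Table 3).
in_construction = {"regard": 80, "describe": 88, "see": 111, "know": 79}

# HYPOTHETICAL overall corpus frequencies of each verb, invented purely
# for illustration; they are not taken from any corpus.
overall = {"regard": 100, "describe": 300, "see": 18000, "know": 17000}


def rank_by_frequency(in_cx):
    """Order verbs by raw number of tokens in the construction."""
    return sorted(in_cx, key=in_cx.get, reverse=True)


def rank_by_reliance(in_cx, total):
    """Order verbs by the proportion of their tokens found in the construction.

    This proportion is a simplified stand-in for strength of association.
    """
    return sorted(in_cx, key=lambda verb: in_cx[verb] / total[verb], reverse=True)


by_frequency = rank_by_frequency(in_construction)
by_reliance = rank_by_reliance(in_construction, overall)

print("by frequency:", by_frequency)
print("by reliance: ", by_reliance)
```

Under these assumed totals, see heads the frequency ranking while regard heads the proportion-based ranking, which is the kind of divergence between the two measures that the text describes.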
4. New directions in applied linguistics favourable to frequency
In this section, striking an optimistic, forward-looking note, I take account of present
directions of research favouring the importance of frequency. After this, I turn less
optimistically in Section 5 to the problems of determining frequency relevant to lan-
guage learning.
Twenty years ago, there was very little support for the idea that frequency phe-
nomena contribute to our understanding of language and language learning. Now, I
believe, there has been something of a transformation which brings frequency increas-
ingly into the limelight. I will say something about:
a. theoretical positions favouring frequency (Section 4.1)
b. frequency effects in language change (Section 4.2)
c. frequency effects in language acquisition, including both L1 and L2 learning
(Section 4.3)
4.1 Theoretical positions favouring frequency
Three theoretical positions which have been gaining momentum since the 1990s all
implicitly or explicitly give frequency a role in the workings of language: usage-based
linguistics, cognitive linguistics, and construction grammar. These three differently-
labelled approaches are so closely linked that they could be called different facets of the
same theoretical paradigm.
Usage-based linguistics (based on observation and analysis of language in use – see
Barlow & Kemmer 2000) reacts strongly against Chomsky’s position that linguistics is
concerned with competence (a mental phenomenon) rather than with performance
(the use of language in utterances and texts) – or, to use a later terminology, with
(internal) I-language rather than with (external) E-language. During the heyday of the
generative paradigm, as we have seen, performance-based theorizing was inevitably
eclipsed, although the usage-oriented paradigm of Halliday’s systemic functional
grammar, for example, maintained a following (largely outside the USA). More
recently, usage-based approaches have made a significant comeback, especially in the
western part of the USA.
Cognitive grammar/linguistics has also gained momentum in the western states of
the USA since the 1970s, and is perhaps found in its most influential form in the
cognitive grammar of Langacker (1987). Although this is not the place to expound the
theoretical foundations of the cognitive linguistics enterprise, among its important
tenets is that the way we use and process language is integral to the nature of language
as a cognitive phenomenon. In this sense cognitive linguistics is usage-oriented. The
notion of entrenchment is key to Langacker’s cognitive grammar: repeated exposure to
a linguistic item makes the difference between an item that is strongly and centrally
established as part of language cognition (entrenched), and one that is weakly estab-
lished and peripheral. Entrenchment is central to processes of language acquisition,
and it is dependent on frequency: the more frequently a linguistic item has been en-
countered and used, the more entrenched in the language user’s competence it is likely
to be (see Langacker 1987: 100; Gries 2006).
Construction grammar (Fillmore et al. 1988; Goldberg 1995) is a framework for
describing and accounting for language structure in terms of constructions, rather
than ‘words and rules’. A construction is a symbolic unit that combines both form and
meaning, and may be linguistically complex. It is commonly postulated that construc-
tions are learned and stored as wholes, and that they are learned from the bottom up,
on the basis of actual language use. A construction can be an idiomatic combination of
words, like garbage in, garbage out; it can also be semi-idiomatic, like the let alone
construction, or an abstract pattern such as the double-object construction. Hence
constructions accord with the phraseologists’ view of a grammar-lexicon continuum,
for which Goldberg has coined the term constructicon. Once again, frequency plays a
role, in that frequency of occurrence in the learning process is seen as a necessary pre-
condition for construction status.
These three approaches are indeed so closely linked that some might object to
their being distinguished from one another. For example, ‘cognitive linguistics’ could
be regarded as a cover term that includes construction grammar, and has the usage-
based approach as one of its chief tenets.
4.2 Frequency effects in language change
In diachronic linguistics, frequency has come to the fore above all in the theory of gram-
maticalization (Hopper & Traugott 2003), which focuses on the way lexical material
becomes (over time) converted into grammatical material as a prime force in language
change. Many studies (e.g. Hooper 1976; Bybee & Hopper 2001; Bybee 2007) show the
relevance of frequency, both as an input and as an output to the grammaticalization
process. For example, frequent expressions are susceptible to phonetic reduction
(e.g. don’t know --> dunno; kind of --> kinda), a trigger of grammaticalization. Also, after
the criterial changes of grammaticalization have taken place, the increase in frequency
can continue for centuries – witness the rise in frequency of the English progressive, a
continuous development from before Early Modern English up to the present day.
Recent short-term diachronic studies using the Brown family of corpora show
significant trends in change of grammatical frequency partly motivated by grammati-
calization as well as other processes, such as colloquialization. Leech et al.
(2009: 142–143) find that frequency changes like the increasing use of the progressive
cannot be attributed to expansion of the progressive to particular verb classes or other
categorical, structural or semantic trends. Rather, there seems to be a general increase
of frequency across the board. It seems a fairly natural assumption that one result of a
strengthened cognitive representation of a linguistic form is that it gets used more of-
ten by individuals, and more generally by the language community. Thus, from this
perspective, input frequency and output frequency are both concomitants of gram-
matical change:
greater input frequency → greater entrenchment → greater output frequency
4.3 Frequency effects in language acquisition
The sequence represented graphically above is primarily, of course, to be applied to
language development in the individual, and only secondarily to a whole language
community of users. Tomasello (2003), more than anyone else, has demonstrated the
case for a usage-based theoretical position on first language acquisition, rejecting
Chomsky’s view of universal grammar as a genetic basis for language acquisition, and
instead arguing for the view that language acquisition takes place through implicit
learning (using cognitively generic learning strategies) of patterns of form and mean-
ing encountered in the child’s language input.
Further, Ellis (2002a, 2002b) has presented persuasively the evidence of fre-
quency effects in language processing generally, and more particularly in SLA. He
finds that explicit and implicit learning and memory are complementary, implicit
learning being driven by frequency of exposure. These two learning processes are
seen as coming from very different neurological sources, the implicit capability de-
riving from the hippocampal system, and explicit learning from the neo-cortical
system. Frequency of activation leads to the (implicit) learning of prototype catego-
ries. However, our knowledge of frequency is unconscious, and research has shown
that even experts in language and language teaching have a poor record of guessing
the frequency of linguistic items such as verbs (cf. Alderson 2007). These findings
explain why (in the anecdote mentioned earlier) language teachers are unable to
recognize that the present simple is many times more frequent than the present pro-
gressive. Ellis’s frequency effects link SLA with the idiom principle, the phraseologi-
cal perspective on learning, construction grammar, and data-driven learning
(Johns 1994). They indicate how learning is adaptive to an unfolding history of in-
puts, how change is incremental and cumulative, and how prior activation facilitates
subsequent activation.
As we learn to process high-frequency phenomena such as multi-word expres-
sions and collostructions faster, we become more adapted to identifying them as units
and processing them holistically. Priority in learning goes to formulae, then to higher
structures (both subsumed under the constructions of construction grammar). Ellis’s
line of research ties psycholinguistic research in language processing and SLA closely
to learner corpus research. Researchers in SLA and in learner corpora, which seemed
to be on separate tracks a few years ago, are at last coming together (see Granger et al.
2002) and frequency appears to be a key link between them.
We can now begin to see how the principle of ‘more frequent = more important to
learn’ fits in with advances in learning theory and SLA. Institutional L2 teaching often
has to implement adaptive learning within the confines of a curriculum where oppor-
tunities for L2 input and L2 output are severely limited in time. An important goal, in
this case, is to present the learners with materials and productive tasks that extend
their range of competence by moving them as far as possible from frequent towards
less frequent. The implicit learning which is dominant in L1 acquisition can, of course,
be complemented in L2 acquisition by explicit learning, which, through the conscious
‘noticing’ of language phenomena (see Schmidt 1990, 1995), can improve the learner’s
control of the language.
5. Challenges and possible solutions
The preceding section leads to the conclusion that frequency is an important consid-
eration in language learning, and, since corpora are the only practicable means of sup-
plying frequency information, this is where corpus linguistics should be able to make
a key contribution. However, we should not paint too rosy a picture of this marriage
between corpus linguistics and SLA: there are difficulties in determining the relevance
of frequency, and in supplying the corpus-derived information needed.
5.1 Challenge I: Bringing together corpus linguistic and cognitive linguistic approaches
We have seen that corpus linguistics and cognitive linguistics are becoming strongly
linked through the usage-based paradigm. But there are some signs that the ‘more
frequent = more important to learn’ principle is not always supported by cognitive
linguistics. Gries (2006) and Gilquin (2006b) present two examples where what is pro-
totypical (and therefore more salient and central from a cognitive perspective) does
not correspond to what is most frequent. Gries’s analysis of the verb run from both the
cognitive and the corpus angle suggests that there is a discrepancy between the most
likely prototype sense of run (motion) and the most frequent sense (fast pedestrian
movement). Similarly, Gilquin’s analysis of causative verb constructions leads her to
the conclusion that the prototypical case of causation is not the most frequent. Although
determination of what is the prototype is far from clear-cut, these results appear
to contradict the implication, for example, from Ellis’s work, that the most frequent
category is the most entrenched and therefore the most cognitively salient. Perhaps
one way of resolving this conundrum is to recognize that the establishment of a proto-
type category in the adult competence may have taken place at a relatively earlier stage
of language acquisition, when (for example) the ‘fast pedestrian movement’ sense of
run would in fact be the sense in commonest use. Hence the most prototypical usage
would not necessarily be the one found most frequently in an adult corpus. However,
there is much more work to be done on this.
5.2 Challenge II: Corpora do not always match learners’ needs
There are many different kinds of corpora, but none of them seem to be exactly the
kind of corpus that will give frequency information relevant to learners. For English,
for example, the following varieties of corpora have been, or can be, used to provide
the empirical basis for ELT materials:
a. General purpose reference or monitor corpora (e.g. the BNC, the Bank of English)
b. Corpora of English for Specific Purposes (ESP) and English for Academic Pur-
poses (EAP) (e.g. Corpus of Professional English, CSPAE, MICASE, BASE
Corpus)15
c. Corpora of EFL (English as a Foreign Language) learner language (e.g. ICLE,
LINDSEI)16
d. Corpora containing the language of native speaker (NS) children (e.g.
CHILDES)17
e. Corpora of teenager and young adult NSs (e.g. LOCNESS, COLT)18
f. Corpora of English as a Lingua Franca (e.g. VOICE, ELFA)19
This list is far from complete and new corpora are making their appearance month by
month. In fact, there are so many corpora of potential use for English language educa-
tion that it may seem perverse to suggest that they are not enough. To some extent,
though, it is a matter of debate what kind of corpus best suits the needs of a learner.
The general principle, I suggest, is that such a corpus should represent as far as possible
the target linguistic communicative behaviour to which learning is directed. Despite
the usefulness of the above types of corpora for various purposes, there are reasons
why they are not optimal for particular groups of language learners.
15. Corpus of Spoken, Professional American-English; Michigan Corpus of Academic Spoken
English; corpus of British Academic Spoken English.
16. International Corpus of Learner English; Louvain International Database of Spoken
English Interlanguage.
17. Child Language Data Exchange System.
18. Louvain Corpus of Native English Essays; Bergen Corpus of London Teenage Language.
19. Vienna-Oxford International Corpus of English; English as a Lingua Franca in
Academic Settings.
General purpose corpora (a), containing both written and spoken material, although
they yield frequency data useful for adult learners, are less useful for younger adults such
as the average undergraduate student, and because of their ‘adult’ style and content,
might be considered quite unsuitable for primary or secondary school learners.
The same applies to ESP and EAP corpora (b) such as MICASE: these are well
tailored to the academic needs of students or those training for a professional career,
but not for more general groups.
Corpora of learner language (c) such as ICLE and LINDSEI do, of course, provide
vital frequency data for comparison of learners’ language to that of NSs, as well as
comparison of the interlanguage of learners of different mother tongue backgrounds.
Even here, however, it remains somewhat problematic whether the target linguistic
behaviour with which the language of such student learners should be compared is that
of NSs of the target language of their own age group, or the specialist adults we typi-
cally find as authors in written corpora of NSs, or indeed some other target communi-
ties such as non-native speakers (NNS) using English as a Lingua Franca (ELF), whose
language use is recorded in a corpus such as VOICE. For learners of primary school
and high school ages, there is as yet a dearth of corpora of NS children’s/teenagers’ language of
primary or secondary school age (d–e), although CHILDES contains a wide variety of
spoken data of earlier age groups. Corpora of ELF (f), e.g. VOICE (Seidlhofer 2004) or
ELFA (Mauranen 2006), are new contenders on the scene, and raise the whole question
of whether NSs’ language should any longer be regarded as the standard to aim at, as it
has been unquestioningly considered in the past. In all these kinds of corpora (except
for the largest reference and monitor corpora) there is an issue, also, about the size of
available corpora and their representativeness in terms of different registers and activ-
ity types. A corpus intended to represent frequency data of target language behaviour
should ideally be large and wide-ranging enough to yield reliable frequencies not only
of words but of collocations: something that requires large corpora.
For the normal EFL educational curriculum, the ideal corpus should be longitudi-
nal, representing competent target language use appropriate to the age cohort of the
learners. An early example of such a corpus (for NS learners, however) was the 5-mil-
lion-word text collection used for the AHI Frequency Book (Carroll et al. 1971), which
consisted of reading text materials used in US schools from the third grade to twelfth
grade. Textbooks, readers, and other learning materials have been used for research
both in Germany and in Japan, but the emphasis of the research (e.g. Mindt 1996;
Römer 2005) has been to show how far the language to which students are exposed in
school is divergent from that of NS corpora. Recent research on corpora of textbooks
is reported in Meunier & Gouverneur (2009), who also give an account of their TeMa
(Textbook Material) corpus consisting of general-purpose best-selling international
ELT textbooks.
So here is another issue: how appropriate is the teaching-induced language on
which students are led, through their curriculum, to model themselves? It seems
that, for various reasons, we are far from an ideal situation in which the frequency
information applied to learner input comes from a corpus tailor-made to meet the
learner’s needs.
6. Conclusion: With words of comfort
In spite of the negative points raised in the preceding section, it should be emphasized
in conclusion that frequency information remains a highly valuable resource for input
to language learning materials and testing, and that it is increasingly available. To insist
on precise frequency counts is often to aim at too high an ideal, for, as Halliday put it
long ago (1971: 344), “a rough indication of frequency is often just what is needed”. The
afore-mentioned case of teachers who believed the present progressive to be more
frequent than the present simple illustrates just how wildly wrong people’s intuitions of
linguistic frequency can be: virtually any corpus representing NS productions, spoken
or written, would correct this erroneous belief. A further point (referring back to dis-
tinctions I made in Section 1) is that in general, corpora differ much more in terms of
raw frequency or normalized frequency than in terms of ordinal frequency (the plac-
ing of items in an order of frequency). Fortunately, raw or normalized frequency
counts are rarely needed: ordinal frequency (allowing certain items to be prioritized
above others) is usually all that matters for language learning and teaching purposes.
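The point about ordinal frequency can be illustrated with a short sketch. The two sets of counts and the corpus sizes below are hypothetical, invented for illustration only, and the function names are likewise assumptions. The sketch normalizes raw counts to a rate per million words and then reduces them to a rank order.

```python
def per_million(raw_counts, corpus_size):
    """Normalized frequency: occurrences per million words of the corpus."""
    return {item: count * 1_000_000 / corpus_size for item, count in raw_counts.items()}


def ordinal(raw_counts):
    """Ordinal frequency: rank of each item, with 1 as the most frequent."""
    ranked = sorted(raw_counts, key=raw_counts.get, reverse=True)
    return {item: rank for rank, item in enumerate(ranked, start=1)}


# HYPOTHETICAL raw counts from two corpora of different sizes
# (invented numbers; not taken from Biber et al. or any other source).
corpus_a = {"simple": 9000, "perfect": 600, "progressive": 400}
size_a = 100_000
corpus_b = {"simple": 40000, "perfect": 3500, "progressive": 1500}
size_b = 500_000

print(per_million(corpus_a, size_a))
print(per_million(corpus_b, size_b))
print(ordinal(corpus_a), ordinal(corpus_b))
```

In this invented case the raw and normalized figures differ between the two corpora while the rank order is identical, which is the situation described above: for prioritizing items in teaching, the ordering may be all that is required.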
The greatest need, I believe, is for the development of longitudinal corpora of both
NSs and NNS learners. However, without waiting for the Holy Grail of the ideally tai-
lored corpus for a given learner group, much could be achieved by building a database
of frequency data from a range of different corpora and subcorpora, to enable ELT
professionals to compare frequencies in different styles, registers, age groups, etc. For
a given target learner group, corpora could be given weightings relative to their rele-
vance to the group, resulting in optimal frequencies approximating to the learners’
needs. In this way the best available value could be put on Halliday’s call for approxi-
mate frequency.
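One way of reading the weighting proposal is as a weighted average of normalized frequencies across corpora. The sketch below works under that assumption; the corpus names, items, frequencies and weights are all hypothetical, and the choice of a weighted mean is only one possible interpretation of ‘weightings relative to their relevance’.

```python
def weighted_frequencies(corpora, weights):
    """Combine per-corpus normalized frequencies into one weighted average.

    corpora -- mapping: corpus name -> {item: frequency per million words}
    weights -- mapping: corpus name -> relevance weight for the learner group
    Items missing from a corpus are treated as having frequency 0 there.
    """
    total_weight = sum(weights[name] for name in corpora)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    items = set().union(*(freqs.keys() for freqs in corpora.values()))
    return {
        item: sum(weights[name] * corpora[name].get(item, 0.0) for name in corpora)
        / total_weight
        for item in items
    }


# HYPOTHETICAL per-million frequencies and relevance weights, invented
# for illustration; a teenage learner group might weight speech by peers
# more heavily than a general reference corpus.
corpora = {
    "general": {"moreover": 100.0, "gonna": 50.0},
    "teen_speech": {"moreover": 20.0, "gonna": 200.0},
}
weights = {"general": 1.0, "teen_speech": 3.0}

combined = weighted_frequencies(corpora, weights)
print(sorted(combined.items(), key=lambda pair: pair[1], reverse=True))
```

Changing the weights for a different learner group would change the combined ordering without any new corpus having to be compiled.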
One final point: the emphasis on frequency in this chapter should not mislead any
reader into thinking that ‘all we need to do is to count things’. In the selecting, devising
and grading of learning materials, not only frequency, but other values, such as learner
interest and motivation, learner difficulty, etc. need to be factored in. But, to correct
what I believe to have been the neglect of frequency in thinking up to now, I suggest
that from now on, there is no reason why any choices regarding learner input, learner
performance and learner evaluation should not be frequency-informed.
References
Adolphs, S. 2008. Corpus and Context: Investigating Pragmatic Functions in Spoken Discourse
Alderson, J.C. 2007. Judging the frequency of English words. Applied Linguistics 28(3):
383–409.
Altenberg, B. 1998. On the phraseology of spoken English: The evidence of recurrent word
combinations. In Phraseology: Theory, Analysis and Applications, A.P. Cowie (ed.), 101–122.
Oxford: Clarendon Press.
Altenberg, B. & Tapper, M. 1998. The use of adverbial connectors in advanced Swedish learners’
written English. In Learner English on Computer, S. Granger (ed.), 80–93. London: Addison-
Wesley Longman.
Barlow, M. & Kemmer, S. 2000. Usage-based Models of Language. Stanford CA: CSLI.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. Longman Grammar of Spoken
and Written English. London: Longman.
Bybee, J. 2007. Frequency of Use and the Organization of Language. Oxford: OUP.
Bybee, J. & Hopper, P. (eds). 2001. Frequency and the Emergence of Linguistic Structure
[Typological Studies in Language 45]. Amsterdam: John Benjamins.
Carroll, J.B. 1971. Statistical analysis of the corpus. In The American Heritage Frequency Book,
J.B. Carroll, P. Davies & B. Richman (eds), xxi-xl. Boston MA: Houghton Mifflin.
Carroll, J.B., Davies, P. & Richman, B. 1971. The American Heritage Frequency Book. Boston MA:
Houghton Mifflin.
Cheng, W., Greaves, C. & Warren, M. 2006. From n-gram to skipgram to concgram. Interna-
tional Journal of Corpus Linguistics 11(4): 411–433.
Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton.
Chomsky, N. 1964 [1962]. A transformational approach to syntax. In Proceedings of the Third
Texas Conference on Problems of Linguistics Analysis, A.A. Hill (ed.), 124–158. Austin TX:
University of Texas. (Reprinted in J.A. Fodor & J.J. Katz. 1964. The Structure of Language,
211–241. Englewood Cliffs NJ: Prentice-Hall.)
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge MA: The MIT Press.
Coxhead, A. 2000. A new Academic Word List. TESOL Quarterly 34(2): 213–238.
Davies, M. & Gardner, D. 2010. A Frequency Dictionary of Contemporary American English.
London: Routledge.
De Cock, S. 1998. A recurrent word combination approach to the study of formulae in the
speech of native and non-native speakers of English. International Journal of Corpus
Linguistics 3: 59–80.
De Cock, S. & Granger, S. 2004. Computer learner corpora and monolingual learners’ dictionar-
ies: The perfect match. In The Corpus Approach to Lexicography, W. Teubert & M. Mahlberg
(eds), Special issue of Lexicographica 20: 72–86.
Durrant, P. 2009. Investigating the viability of a collocation list for students of English for Aca-
demic Purposes. English for Specific Purposes 28(3): 157–169.
Eeg-Olofsson, M. & Altenberg, B. 1996. Recurrent word combinations in the London-Lund
Corpus: Coverage and use for word-class tagging. In Studies in Synchronic Corpus Linguis-
tics, C.E. Percy, C.F. Meyer & I. Lancashire (eds), 97–107. Amsterdam: Rodopi.
Ehrman, M.E. 1966. The Meanings of the Modals in Present-Day American English. The Hague:
Mouton.
Ellis, N.C. 2002a. Frequency effects in language processing: A review with implications for
theories of implicit and explicit language acquisition. Studies in Second Language Acquisition
24(2): 143–188.
Ellis, N.C. 2002b. Reflections on frequency effects in language processing. Studies in Second
Language Acquisition 24(2): 297–339.
Ellis, N.C. & Simpson-Vlach, R. 2009. Formulaic language in native speakers: Triangulating
psycholinguistics, corpus linguistics, and education. Corpus Linguistics and Linguistic
Theory 5: 61–78.
Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: OUP.
Ellis, R. 2008. The Study of Second Language Acquisition, 2nd edn. Oxford: OUP.
Fillmore, C.J., Kay, P. & O’Connor, M.K. 1988. Regularity and idiomaticity in grammatical con-
structions: The case of let alone. Language 64: 501–538.
Firth, J.R. 1957. Modes of meaning. In Papers in Linguistics 1934–51, 190–215. Oxford: OUP.
Francis, W.N. & Kučera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar.
Boston MA: Houghton Mifflin.
Gabrielatos, C. 2003. Conditional sentences: ELT typology and corpus evidence. Paper given at
the Annual Meeting of the British Association of Applied Linguistics, University of Leeds,
4–6 September 2003. <http://eprints.lancs.ac.uk/140/1/Conditional_Sentences_-_ELT_ty-
pology_and_corpus_evidence.pdf>
Gabrielatos, C. 2007. If-conditionals as modal colligations: A corpus-based investigation. In
Proceedings of the Corpus Linguistics Conference: Corpus Linguistics 2007, M. Davies, P.
Rayson, S. Hunston & P. Danielsson (eds). Birmingham: University of Birmingham. <http://
www.corpus.bham.ac.uk/corplingproceedings07/paper/256_Paper.pdf>
Gilquin, G. 2002. Automatic retrieval of syntactic structures: The quest for the Holy Grail. Inter-
national Journal of Corpus Linguistics 7(2): 183–214.
Gilquin, G. 2006a. Highly polysemous words in Foreign Language Teaching: How to give learn-
ers a flying start. In Proceedings of the 7th Conference on Teaching and Language Corpora,
Université Paris 7 – Denis Diderot, 1–4 July 2006, 58–60.
Gilquin, G. 2006b. The place of prototypicality in corpus linguistics. Causation in the hot seat.
In Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, S.T. Gries
& A. Stefanowitsch (eds), 159–191. Berlin: Mouton de Gruyter.
Gilquin, G. & Paquot, M. 2008. Too chatty: Learner academic writing and register variation.
English Text Construction 1(1): 41–61.
Goldberg, A. 1995. Constructions: A Construction Grammar Approach to Argument Structure.
Chicago IL: University of Chicago Press.
Granger, S. 1997. On identifying the syntactic and discourse features of participle clauses in
academic English: Native and non-native writers compared. In Studies in English Language
and Teaching, J. Aarts, I. de Mönnink & H. Wekker (eds), 185–198. Amsterdam: Rodopi.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison-Wesley Longman.
Granger, S., Hung, J. & Petch-Tyson, S. (eds). 2002. Computer Learner Corpora, Second Lan-
guage Acquisition and Foreign Language Teaching [Language Learning & Language Teach-
ing 6]. Amsterdam: John Benjamins.
Granger, S. & Meunier, F. (eds). 2008. Phraseology: An Interdisciplinary Perspective. Amsterdam:
John Benjamins.
Greene, B.B. & Rubin, G.M. 1971. Automatic Grammatical Tagging of English. Providence RI:
Department of Linguistics, Brown University.
Gries, S.T. 2006. Corpus-based methods and cognitive semantics: The many senses of to run. In
Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, S.T. Gries &
A. Stefanowitsch (eds), 57–99. Berlin: Mouton de Gruyter.
Gries, S.T., Hampe, B. & Schönefeld, D. 2005. Converging evidence: Bringing together
experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics
16(4): 635–676.
Halliday, M.A.K. 1961. Categories of the theory of grammar. Word 17(3): 241–292.
Halliday, M.A.K. 1966. Lexis as a linguistic level. In In Memory of J.R. Firth, C.E. Bazell, J.C.
Catford, M.A.K. Halliday & R.H. Robins (eds), 148–162. London: Longman.
Halliday, M.A.K. 1971. Linguistic functions and literary style. In Style: A Symposium, S. Chatman
(ed.), 330–365. Oxford: OUP.
Hasselgren, A. 1994. Lexical teddy bears and advanced learners: A study into the ways Norwe-
gian students cope with English vocabulary. International Journal of Applied Linguistics
4(2): 237–258.
Hoffmann, S., Evert, S., Smith, N., Lee, D. & Berglund Prytz, Y. 2008. Corpus Linguistics with
BNCweb – A Practical Guide. Frankfurt: Peter Lang.
Hofland, K. & Johansson, S. 1982. Word Frequencies in British and American English. Bergen:
Norwegian Computing Centre for the Humanities.
Hooper, J. 1976. Word frequency in lexical diffusion and the source of morphophonological
change. In Current Progress in Historical Linguistics, W. Christie (ed.), 96–105. Amsterdam:
North Holland.
Hopper, P.J. & Traugott, E.C. 2003[1993]. Grammaticalization. Cambridge: CUP.
Hunston, S. & Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical
Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John Benjamins.
Jelinek, F. 1998. Statistical Methods for Speech Recognition. Cambridge MA: The MIT Press.
Johansson, S. & Hofland, K. 1989. Frequency Analysis of English Vocabulary and Grammar:
Based on the LOB Corpus, 2 Vols. Oxford: Clarendon Press.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of
data-driven learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–317.
Cambridge: CUP.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. London: Addison-Wesley Longman.
Kilgarriff, A. & Tugwell, D. 2002. Sketching words. In Lexicography and Natural Language Pro-
cessing: A Festschrift in Honour of B.T.S. Atkins, M-H. Corréard (ed.), 125–137. EURALEX.
<http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf>
Krishnamurthy, R. (ed.). 2005. English Collocation Studies: The OSTI Report, by J. Sinclair, S.
Jones & R. Daley. London: Continuum.
Kučera, H. & Francis, W.N. 1967. Computational Analysis of Present-day American English.
Providence RI: Brown University Press.
Langacker, R.W. 1987. Foundations of Cognitive Grammar, Vol. I: Theoretical Prerequisites.
Stanford CA: Stanford University Press.
Leech, G., Hundt, M., Mair, C. & Smith, N. 2009. Change in Contemporary English: A Gram-
matical Study. Cambridge: CUP.
Leech, G., Rayson, P. & Wilson, A. 2001. Word Frequencies in Written and Spoken English: Based
on the British National Corpus. Harlow: Longman.
Lennon, P. 1996. Getting ‘easy’ verbs wrong at the advanced level. International Review of Ap-
plied Linguistics 34(1): 23–36.
Longman Dictionary of Contemporary English, 3rd edn, Dir. D. Summers. 1995. London:
Longman.
Marshall, I. 1983. Choice of grammatical word-class without global syntactic analysis. Comput-
ers and the Humanities 17: 139–150.
Marshall, I. 1987. Tag selection using probabilistic methods. In The Computational Analysis of
English: A Corpus-based Approach, R. Garside, G. Leech & G. Sampson (eds), 42–56.
London: Longman.
Mauranen, A. 2006. A rich domain of ELF – the ELFA Corpus of Academic Discourse. Nordic
Journal of English Studies 5(2): 145–159.
McEnery, T. & Wilson, A. 2001. Corpus Linguistics, 2nd edn. Edinburgh: EUP.
Meunier, F. & Gouverneur, C. 2009. New types of corpora for new educational challenges:
Collecting, annotating and exploiting a corpus of textbook material. In Corpora and Language
Teaching [Studies in Corpus Linguistics 33], K. Aijmer (ed.), 179–201. Amsterdam:
John Benjamins.
Mindt, D. 1996. English corpus linguistics and the foreign-language teaching syllabus. In Using
Computer Corpora for Language Research: Studies in Honour of Geoffrey Leech, J. Thomas &
M. Short (eds), 232–247. London: Longman.
Moon, R. 1998. Fixed Expressions and Idioms in English: A Corpus-based Approach. Oxford: OUP.
Nesselhauf, N. 2003. The use of collocations by advanced learners of English and some implica-
tions for teaching. Applied Linguistics 24(2): 223–242.
Nesselhauf, N. 2005. Collocations in a Learner Corpus [Studies in Corpus Linguistics 14].
Amsterdam: John Benjamins.
Römer, U. 2005. Progressives, Patterns, Pedagogy. A Corpus-driven Approach to English Progres-
sive Forms, Functions, Contexts and Didactics [Studies in Corpus Linguistics 18]. Amsterdam:
John Benjamins.
Römer, U. & Schulze, R. (eds). 2009. Exploiting the Lexis-Grammar Interface [Studies in Corpus
Linguistics 35]. Amsterdam: John Benjamins.
Sampson, G. 2007. Grammar without grammaticality. Corpus Linguistics and Linguistic Theory
3(1): 1–32, 111–129.
Schmidt, R.W. 1990. The role of consciousness in second language learning. Applied Linguistics
11: 129–158.
Schmidt, R.W. 1995. Consciousness and foreign language teaching: A tutorial on the role of
attention and awareness in learning. In Attention and Awareness in Foreign Language Learning
and Teaching, R.W. Schmidt (ed.), 1–63. Honolulu HI: University of Hawai‘i.
Seidlhofer, B. 2004. Research perspectives on teaching English as a lingua franca. Annual Review
of Applied Linguistics 24: 209–239.
Sinclair, J. 1966. Beginning the study of lexis. In In Memory of J.R. Firth, C.E. Bazell, J.C. Catford,
M.A.K. Halliday & R.H. Robins (eds), 410–430. London: Longman.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.
Sinclair, J., Jones, S. & Daley, R. 1970. English Lexical Studies: Report to OSTI on Project C/
LP/08. Ms, University of Birmingham 1970. Reprinted in Krishnamurthy (ed.) 2005.
Stefanowitsch, A. & Gries, S.T. 2003. Collostructions: Investigating the interaction of words and
constructions. International Journal of Corpus Linguistics 8(2): 209–243.
Teubert, W. 2004. Units of meaning, parallel corpora, and their implications for language teach-
ing. In Applied Linguistics: A Multidimensional Perspective, U. Connor & T.A. Upton (eds),
171–189. Amsterdam: Rodopi.
Thomson, A.J. & Martinet, A.V. 1980 [1960]. A Practical English Grammar, 3rd edn. Oxford: OUP.
Thorndike, E.L. 1921. Teacher’s Word Book. New York NY: Columbia Teachers College.
Thorndike, E.L. 1932. A Teacher’s Word Book of 20,000 words. New York NY: Columbia Teachers
College.
Thorndike, E.L. & Lorge, I. 1944. The Teacher’s Word Book of 30,000 Words. New York NY:
Columbia Teachers College.
Tomasello, M. 2003. Constructing a Language: A Usage-based Approach to Child Language.
Cambridge MA: Harvard University Press.
West, M. 1953. A General Service List of English Words. London: Longman.
Zipf, G.K. 1935. The Psychobiology of Language. Boston MA: Houghton Mifflin.
Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. Reading MA: Addison-Wesley.
Learner corpora and contrastive interlanguage analysis
Hilde Hasselgård and Stig Johansson1
This paper gives a glimpse of pre-corpus interlanguage studies, focusing on
some Scandinavian research projects, before moving on to the development of
computerized learner corpora and computer-aided interlanguage analysis with
special reference to the International Corpus of Learner English (ICLE) project.
Contrastive interlanguage analysis (CIA) is defined and discussed, followed by
a presentation of the so-called Integrated Contrastive Model (ICM). The two
models of analysis are illustrated by means of three case studies: two using CIA
to study the use of ‘quite’ and ‘I would say’ across four learner groups in ICLE, and
one using the ICM to analyse ‘seem’ in the interlanguage of Norwegian learners.
Towards the end, some challenges for interlanguage research are discussed.
1. Introduction
Learning a foreign language is a slow and, for most people, difficult process which
rarely leads to full mastery. Even advanced language learners make mistakes and
normally have a limited repertoire compared with native speakers of the target language.
Problems may be linked to features of the target language, to the learner’s first language
or to the learning process itself. Revealing features of learner language, or
interlanguage, has become an important means of surveying both obvious and more subtle
differences between interlanguage and native speaker performance, and can
potentially lead to improved language teaching as well as insights into the processes of
language learning.
1. Stig Johansson sadly passed away before the article was finalized, but contributed sub-
stantially to the first submission of it and read and commented on a near-final version. The
authors thank Bengt Altenberg for insightful comments on an early and a near-final version of
this paper.
2. Interlanguage studies before computer corpora
In the 1940s and 1950s, linguists interested in language teaching emphasized the role
of contrastive analysis, on the assumption that “in the comparison between native and
foreign language lies the key to ease or difficulty in foreign language teaching”
(Lado 1957: 1). The aim of the comparison was to identify both easy and difficult fea-
tures of the language to be learnt. Lado considered that first language transfer in the
foreign language might either help the learners or cause them to produce grammatical
and lexical structures that deviate from the target norm (e.g. Lado 1957: 58).
Observations of deviant features of learner language have probably always been
made by language teachers, but it was not until about 50 years ago that they were sub-
jected to systematic analysis. The 1960s and the early 1970s were the heyday of error
analysis. Error analysis could be based on elicitation data and/or (pre-electronic) cor-
pus data.2 Unlike contrastive analysis, error analysis is not restricted to interlingual
transfer (Hammarberg 1973: 29). However, although Nickel (1973: 24) saw the “grow-
ing interest in error analysis [...] in connection with the efforts undertaken [...] to ob-
jectify measuring and grading of achievement in language testing”, it quickly became
apparent that it was not sufficient to focus on errors, as pointed out in Hammarberg’s
(1973) paper entitled “The insufficiency of error analysis”. At the same time, Enkvist
(1973) put the question “Should we count errors or measure success?” Other perspec-
tives on learner language were suggested: Levenston (1971) drew attention to over-
indulgence and under-representation in learner language, i.e. features which may not
be overtly wrong but differ in e.g. style and register from the language of native speak-
ers.3 One of the most important figures in the development of learner language re-
search, Pit Corder, also pointed out that there are both overt and covert errors: overt
errors produce linguistically unacceptable sentences, while “covertly erroneous sen-
tences are those which are not appropriate in the context in which they occur”
(Corder 1973: 272–3). More important, he stressed the significance of errors in pro-
viding a window into the learner’s mind; i.e. the study of a learner’s errors enables the
researcher to “infer the nature of his knowledge at that point in his learning career and
discover what he still has to learn” (ibid.: 257). Thus, the aim of the study is not only to
map the errors, but also to represent the learner’s level of proficiency. Svartvik (1973: 8) goes
a step further in suggesting that the term ‘error analysis’ should be replaced by the
more appropriate ‘performance analysis’: “Although the study of errors is a natural
starting-point, the final analysis should include linguistic performance as a whole, not
just deviation”.4
2. Note that the term ‘corpus’ is used in this section to denote a “collection of naturally
occurring examples of language [...] which has been collected for linguistic study” (Hunston 2002: 2).
Nowadays, however, the term tends to imply that the corpus is “stored and accessed
electronically” (ibid.).
3. Levenston relates over-indulgence and under-representation to contrastive analysis;
learners are found (or predicted, in the case of learner groups other than Levenston’s own
Hebrew-speaking students) to overindulge in “structures which closely resemble translation-equivalents
in the mother tongue, or L1, to the exclusion of other structures (‘under-representation’) which
are less like anything in L1” (1971: 115).
To illustrate features of these early studies of learner errors and learner perfor-
mance, we will present a few investigations, chosen from the work of researchers in
Scandinavia. The first two investigations were initiated within the context of the
Swedish-English Contrastive Studies project directed by Jan Svartvik (see Svartvik
1973). While Thagg Fisher (1985) focuses on a grammatical problem, Linnarud (1986)
is concerned with lexis. Finally we will outline a more large-scale Danish project that
aimed at a comprehensive description of language learning as well as learner language.
Thagg Fisher (1985) is a study of Swedish learners’ concord problems in English.
Concord errors produced by Swedish learners, as found in three situations (essays,
translations, and recorded speech), were excerpted and analysed. This material was
supplemented by elicitation tests given to learners and native speakers. The outcome
was a detailed account of the frequency of different types of concord errors, a compari-
son of the three situations, and an analysis of the major causes of concord difficulty. A
hierarchy of concord error gravity was established, taking into account the behaviour of
native speakers and their reactions to the learners’ errors. Besides pointing out difficult
areas for Swedish learners, Thagg Fisher discovered conflicts between grammar/textbook
norms and actual language use. An important finding was that concord ‘errors’ are not
a matter of either/or, since there are ‘vague’ areas where the norms for concord depend
on contextual factors such as medium and style (1985: 177ff.). There is thus a scale of
error gravity implying varying degrees of irritation and negative evaluation by native
speakers (see also Johansson 1978). Some errors were classified as ‘nativelike’, reflecting
areas where native speaker usage may differ from the prescriptive norm, and others as
‘non-nativelike’ (Thagg Fisher 1985: 191), reflecting problems that are characteristic of learners
and that are generally evaluated more negatively. Teaching should thus emphasize the
latter type and de-emphasize the former. Pedagogical applications of the study also
include improved descriptions of concord in English teaching materials.
4. The term which eventually became established was ‘interlanguage studies’, connected with
Selinker’s (1972) term ‘interlanguage’.
Linnarud’s (1986) investigation is a performance analysis of lexis in general, not
just errors. The material consisted of English compositions written by Swedish
17-year-old learners and a comparable group of native speakers of English. A number of
quantitative measures were used, the most important of which were lexical
individuality (lexical words unique to the writer), lexical sophistication (the number of less
frequent words), lexical variation (type-token ratio), and lexical density (the proportion
of lexical words in relation to the total number of words). The compositions were
assessed by both Swedish L1 (i.e. first language) and English L1 evaluators. Not
surprisingly, the native speaker group wrote longer texts and made fewer mistakes. There was
a large difference in lexical individuality between the learner group and the native
speakers, and a strong positive correlation with evaluations; lexical creativity was
appreciated by all evaluators, but slightly more by the native speakers of English. Just as
the native speakers used more unique words, they also used more rare words; there
was thus a great difference between the two groups in lexical sophistication, but
without a corresponding correlation with evaluations. Lexical variation was greater for the
native speakers, but this measure turned out to be unsatisfactory as it was not adjusted
for the length of the compositions. The native speaker essays also had a slightly higher
lexical density, but no correlation was found with evaluations. Commenting on the
findings, Linnarud stresses the importance of lexis in composition and makes a
number of pedagogical recommendations for teaching vocabulary and grading
compositions. In both areas the importance of communication and context is emphasized.
Writing a composition is not primarily an exercise in using correct language, but a
means of expressing ideas and communicating a message where lexical choice plays a
crucial role (Linnarud 1986: 120).
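Two of the measures described above, lexical variation (type-token ratio) and lexical density, can be illustrated with a minimal Python sketch. It is not Linnarud's procedure: the small FUNCTION_WORDS list, the tokenizer and the function names are assumptions made purely for illustration, and a real study would identify lexical words by part-of-speech tagging. The third function, a mean segmental type-token ratio, is one common way of adjusting for text length and is likewise an assumption, not something taken from the study.

```python
import re

# Hypothetical, deliberately tiny list of function (grammatical) words.
# A real study would identify lexical words with a part-of-speech tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "be", "it", "he", "she", "they", "we",
    "i", "you", "that", "this", "with", "for", "not", "as", "by",
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens):
    """Lexical variation: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Proportion of lexical (non-function) words among all tokens."""
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens)


def mean_segmental_ttr(tokens, window=50):
    """Average the type-token ratio over consecutive fixed-size windows.

    Because every window has the same length, texts of different
    lengths become comparable. Falls back to the plain ratio when the
    text is shorter than one window.
    """
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(s) for s in segments) / len(segments)
```

The plain type-token ratio falls as a text gets longer, simply because frequent words recur, which is why an unadjusted comparison between longer native-speaker essays and shorter learner essays is hard to interpret. Averaging over equal-sized windows is only one of several possible corrections.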
Although the studies by Thagg Fisher and Linnarud are very different in most re-
spects, they are alike in the comparison of learner language with the language of native
speakers. Both used text material combined with elicitation, and both were also very
much concerned with pedagogical applications of their research.
A much more comprehensive project was going on in Denmark around the same
time, the Project In Foreign language pedagogy (PIF), one outcome of which was the
book Learner Language and Language Learning (Færch et al. 1984). Many aspects of
language learning and the study of language learning are discussed in the book, draw-
ing on the corpus compiled for the project. This was an extensive collection of samples
of the written and spoken English (including video-recordings) of more than a
hundred Danes, ranging from the near-beginner (after one year of instruction) to the
near-native (higher education students) stage. The cross-sectional data allowed
‘pseudo-longitudinal’ studies of language learning (Færch et al. 1984: 297). Note this comment
in the description of the corpus:
With the one exception that the 12 learners at the lowest level did not provide
written texts, each of these texts was elicited from all our informants. So as to
hold as many factors constant as possible, learners with different ages, experience
and personalities were given the same tasks. Most of these tasks were familiar
from school, e.g. reading aloud and writing an essay, whereas the video-taped
conversation was novel and represented an attempt to place the learner in a real
communicative situation. (Færch et al. 1984: 295f.)
At that time, the PIF learner corpus was unique both in size and range and, most im-
portantly, in the systematic way in which the corpus was developed. In a working pa-
per Færch (1979) reported that the corpus of written learner language amounted to
about 100,000 words and the corpus of spoken learner language to about 250,000
words, and he presented plans for computerization of the material.5 Here we are very
close to the stage of computer-aided analysis of learner language.
3. Learner computer corpora
A significant step in interlanguage studies was the development of computerized
learner corpora and computer-aided interlanguage analysis. Whereas earlier work was
generally limited in scale and range, it now became possible to increase the size and
variety of the material; and whereas the material used earlier rarely went beyond the
individual researcher, the new electronic corpora could be developed as research tools
to be used more generally by scholars in the field. The new technology and the re-
search methods developed in corpus linguistics in general allowed new kinds of stud-
ies to be performed, for example with easier access and greater attention to frequency
of occurrence and patterns of language use. Interest in learner corpora increased
rapidly,6 to a great extent inspired by the work of Sylviane Granger and her team at the
Centre for English Corpus Linguistics, Université catholique de Louvain, which we
will focus on below.7
In 1990 Sylviane Granger initiated a highly successful project to collect an Inter-
national Corpus of Learner English (ICLE), which inspired similar work in many
other countries. The background was her interest in interlanguage studies and also a
wish to extend English corpus research beyond native and second-language varieties
of English, which were the focus of the International Corpus of English (ICE), initiated
by Sidney Greenbaum (1991). Both ICE and ICLE should in turn be seen against the
background of the development of ‘families’ of corpora within English corpus linguis-
tics, i.e. corpora that are compiled according to the same design criteria and therefore
lend themselves to comparative studies.8
Apart from the computerization of the material and the development of
computational analysis tools, the main innovative aspect of ICLE is the systematic approach to
corpus design and the compilation of comparable sub-corpora produced by learners
with a wide range of different mother-tongue backgrounds (e.g. Granger 1994, 1996).9
These make it possible to examine the extent to which learner language is
mother-tongue specific or reflects general language learning processes.
5. The death of Claus Færch in 1987 hampered the further development of the PIF Project.
6. A detailed survey of learner corpora can be found in Pravec (2002). See also
www.uclouvain.be/en-cecl-lcWorld.html.
7. See the website of the Centre for English Corpus Linguistics:
www.uclouvain.be/en-cecl.html.
8. The best-known of these ‘families’ is probably the ‘Brown family’, including the Brown
Corpus, the LOB Corpus and their younger siblings FROWN and FLOB; see
http://icame.uib.no/newcd.htm.
4. Contrastive interlanguage analysis
A special feature of the ICLE project is that a framework for learner corpus research
has been developed alongside the corpus. This is Contrastive Interlanguage Analysis
(CIA), said to lie “at the core of the ICLE project” (Granger 1996: 43). Unlike contras-
tive analysis, which involves the linguistic comparison of (normally) two languages,
CIA concerns varieties of the same language. It “involves quantitative and qualitative
comparisons between native language and learner language (L1 vs. L2) and between
different varieties of interlanguage (L2 vs. L2)” (Granger 2009: 18; see also Granger
1996). The former type of comparison thus presupposes a comparable corpus of native
speaker (NS) data, whose role is to serve as a yardstick for measuring the extent to
which L2 English differs from L1 English.
As pointed out by Barlow (2005: 345), “a variety of issues arise” when “a learner
corpus is to be contrasted with an NS corpus”, for example concerning regional variety
and text type. In addition, the level of proficiency of the native speakers should be
considered to avoid inadvertent comparisons between novice and professional writers
(Granger 2002: 12). The solution to these issues within the ICLE project was the com-
pilation of the Louvain Corpus of Native English Essays (LOCNESS), consisting of
essays written by British and American students. ICLE and LOCNESS are relatively
closely matched for text type (mostly argumentative writing) as well as writer age and
experience. However, there is less information available on contributors in LOCNESS
than in ICLE (age, sex, writing conditions, etc.). Furthermore, the LOCNESS texts are
more heterogeneous as to essay topics as well as contributors (both university students
and A-level pupils). This has caused many researchers to use only a sample of it, for
instance by excluding A-level essays, or by using only US or only UK texts. Still,
LOCNESS remains the best available comparable corpus to match ICLE and continues
to be widely used.
The extent to which an NS reference corpus is adequate for CIA is intimately
connected with the aim of the comparison; cf. the discussions by Ädel (2006: 206) and
Gilquin et al. (2007: 326 f.). From the point of view of descriptive linguistics, it is a
clear advantage that the corpora can be closely matched on the most relevant variables,
such as the age and level of expertise of the writers. From an English Language
Teaching (ELT) perspective, however, a student corpus such as LOCNESS may be
considered unsuitable as a reference corpus because it does not represent the desired target
norm for proficiency or the type of language one would like to teach (cf. Leech 1998:
xix f.). Thus, if the aim is to identify areas of argumentative or academic writing in
which learners need to improve, an NS corpus consisting, for example, of press
editorials or academic articles may be preferable.
9. The first edition of ICLE, released on CD-ROM in 2002, contained about 2.5 million words
of English, chiefly argumentative essays written by university students representing 11 different
mother-tongue backgrounds. In the second edition, ICLEv2, released in 2009, the number of
sub-corpora has increased to 16, and the material has been enriched with analysis tools
(see Granger et al. 2009).
Comparing data from a learner corpus and an NS corpus enables the researcher to
identify overuse, underuse and misuse in the English of the learners. As Granger has
repeatedly emphasized (e.g. 1998a: 18), the terms over- and underuse are intended as
neutral, quantitative measures of linguistic differences, not as qualitative judgements
on interlanguage performance. Importantly, the study of overuse and underuse marks
a widening of the scope of traditional error analysis as these phenomena, which are
difficult to identify reliably other than by computational methods, often do not
constitute errors. Instead, they reflect areas in which learner language differs from NS
language in terms of frequency of distribution rather than correctness. For example, the
expression ‘kind of’ occurs 49 times per 100,000 words in the Norwegian sub-corpus of
ICLE (ICLE-NO)10 and 12.3 times per 100,000 words in LOCNESS. This shows clearly that the
Norwegian learners overuse the expression. The question of whether or not they use it correctly,
however, requires a qualitative investigation.
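The arithmetic behind such figures can be sketched in a few lines of Python. Raw counts are first normalized to a rate per 100,000 words so that corpora of different sizes can be compared, and the learner rate is then set against the reference rate. The function names are hypothetical; the two rates are the ones quoted above.

```python
def per_100k(raw_count, corpus_size):
    """Convert a raw frequency into a rate per 100,000 running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 100_000 / corpus_size


def usage_ratio(learner_rate, reference_rate):
    """Learner rate divided by reference rate.

    A value above 1 points to overuse and a value below 1 to underuse,
    in the purely quantitative sense; it says nothing about correctness.
    """
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return learner_rate / reference_rate


# Rates per 100,000 words as reported in the text.
icle_no_rate = 49.0
locness_rate = 12.3

ratio = usage_ratio(icle_no_rate, locness_rate)
print(f"Learner rate is about {ratio:.1f} times the reference rate")
```

A ratio of this kind describes the size of the difference only. Whether the difference is statistically reliable depends on the raw counts and corpus sizes, which are not given here, so a significance test would be a separate step.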
Contrastive Interlanguage Analysis also includes the comparison of different non-
native-speaker (NNS) varieties. With ICLE, such comparisons are greatly facilitated by
the common design of the sub-corpora, with control of a range of relevant variables
(see Granger et al. 2009: 3ff.). For example, a comparison of the Norwegians’ use of
kind of with that of their Swedish neighbours reveals that the Swedes overuse the ex-
pression almost as much as the Norwegians with 44.8 occurrences per 100,000 words.
French learners overuse it even more, with 73.1 occurrences. In fact, kind of is univer-
sally overused across the sub-corpora of the second edition of the International Cor-
pus of Learner English (ICLEv2) (Granger et al. 2009), ranging from 29.1 (Tswana) to
138.5 occurrences per 100,000 words (Mandarin), which may be linked to the fact that
the expression represents a way of making up for insufficiently nuanced vocabulary.
Other lexicogrammatical items may be underused by some learner groups and over-
used by others. For example, French learners are known to overuse indeed in contrast
to some other learner groups (Granger 2004: 135), such as Norwegians, who underuse
it at 11.2 occurrences per 100,000 words vs. 17.9 in LOCNESS.
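The comparisons quoted above can be retraced with a few lines of code. The following Python sketch is an illustration only and not part of the studies cited: the function names and the raw count in the last line are assumptions, and the ratio of learner rate to LOCNESS rate is used purely as a descriptive convenience, not as a test of significance. All rates are those given in the text (occurrences per 100,000 words).

```python
# Rates per 100,000 words as quoted in the text; LOCNESS is the NS reference.
REFERENCE_RATE = {"kind of": 12.3, "indeed": 17.9}

LEARNER_RATES = [
    ("kind of", "Norwegian", 49.0),
    ("kind of", "Swedish", 44.8),
    ("kind of", "French", 73.1),
    ("kind of", "Tswana", 29.1),
    ("kind of", "Mandarin", 138.5),
    ("indeed", "Norwegian", 11.2),
]


def per_100k(raw_count, corpus_size):
    """Normalise a raw count to occurrences per 100,000 words."""
    return raw_count * 100_000 / corpus_size


def use_ratio(learner_rate, reference_rate):
    """Learner rate divided by NS rate (above 1: more frequent in learner data)."""
    return learner_rate / reference_rate


def describe(item, group, rate):
    """Return a one-line descriptive summary for one item and learner group."""
    ratio = use_ratio(rate, REFERENCE_RATE[item])
    if ratio > 1:
        label = "overuse"
    elif ratio < 1:
        label = "underuse"
    else:
        label = "same rate"
    return f"{item:8s} {group:10s} ratio {ratio:5.2f} ({label})"


if __name__ == "__main__":
    for item, group, rate in LEARNER_RATES:
        print(describe(item, group, rate))
    # Hypothetical raw count and corpus size, for illustration only.
    print(per_100k(98, 200_000))
```

A ratio of this kind says nothing about whether the item is used correctly, and it does not replace the statistical testing and concordance reading that a full CIA study requires.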
The potential and usefulness of CIA have been demonstrated in a wide range of
studies, as evidenced by e.g. Granger (1998c) and Granger et al. (2002). It should be
noted that CIA is by no means restricted to the ICLE corpus or to English; the
10. The ICLE sub-corpora are referred to here by means of their tags in ICLE with the last two
letters showing the L1 background of the learners (Norwegian, Swedish, German, French,
Spanish, Hong Kong Chinese).
 Hilde Hasselgård and Stig Johansson
methodology has been adopted by other researchers using interlanguage corpora of
for instance German, Italian and Norwegian.11 Nor is it restricted to written language.
Spoken learner language is being explored by means of, for example, the Louvain
International Database of Spoken English Interlanguage (LINDSEI),12 compiled as a
spoken counterpart of ICLE (Brand & Kämmerer 2006: 130) and comprising differ-
ent L1 backgrounds and an NS reference corpus (ibid.: 134). Because the compilation
of spoken corpora is costly in terms of time as well as money, the sub-corpora of
LINDSEI are rather small (about 100,000 words). At present the completed sub-cor-
pora represent 11 L1 backgrounds (Bulgarian, Chinese, Dutch, French, German,
Greek, Italian, Japanese, Polish, Spanish and Swedish), but more teams are joining.
Until very recently (2010), the corpus was not publicly available outside the national
project teams, and not all the sub-corpora have so far been used much in research.
Hence the remainder of this chapter will continue to focus on the analysis of
written corpora.
5. Some significant findings of CIA
The availability of similar corpora with a common design as well as a common re-
search model for investigating them has led to a number of important insights into
advanced learner English. In this section we will present what we consider to be sig-
nificant findings within the lexis, grammar and discourse of advanced learners of Eng-
lish (see also Hunston 2002: 206 ff.). Most of them come from studies of one or more
non-native varieties compared to an NS corpus, usually LOCNESS.
NNS vocabulary is found to be generally less varied than that of native speakers.
According to Ringbom (1998), learners rely greatly on a relatively small vocabulary
containing many words with a general meaning, such as people and thing. Similarly,
Hasselgren (1994) observes that learners tend to overuse frequent words belonging to
the core vocabulary at the expense of more precise synonyms, i.e. they cling to their
‘lexical teddy bears’, which is Hasselgren’s term for “the words they feel safe with”
(ibid.: 237). Furthermore, learners tend to use a slightly greater number of recurrent
word combinations than native speakers do (De Cock et al. 1998: 72 f.), and the fre-
quently recurring word combinations are not always the same in L1 as in L2 English
(cf. Wiktorsson [2003], who found that the prefabs used by Swedish learners were
more informal than those of native speakers).
Another common finding is that the written English of advanced learners is to a
great extent influenced by informal spoken language. This shows up clearly in the use
11. For information on the FALKO corpus of learner German, the VALICO corpus of learner
Italian and the ASK corpus of learner Norwegian, as well as other learner corpora, see www.
uclouvain.be/en-cecl-lcWorld.html.
12. See www.uclouvain.be/en-cecl-lindsei.html.
Learner corpora and contrastive interlanguage analysis 
of features of interactiveness, such as first- and second-person pronouns and other
signs of writer/reader visibility (Petch-Tyson 1998) and the high frequency of various
modal expressions (Aijmer 2002) and questions (Virtanen 1998). In their study of
connector use, Altenberg & Tapper (1998) found that Swedish learners tend to overuse
informal connectors (such as sentence-initial and and but) at the expense of more
formal connectors. Eia (2006) found the same tendency among Norwegian learners.
Gilquin & Paquot (2008: 50) likewise found an overuse of sentence-initial and and but
as well as other spoken-like features in learner writing. In addition to the influence of
spoken English they suggest that this may be explained by L1 transfer (in the case of
different style levels of otherwise equivalent expressions in the L1 and the L2), teach-
ing-induced factors, and developmental factors (ibid.: 52 ff.).
Interestingly, Ädel (2008) shows that the use of interactional features seems to
depend on factors such as task setting and intertextuality; untimed essays written by
students who used topical texts as a starting point for their discussion displayed far
fewer interactional features than those in ICLE-SW, although written by Swedish
students at the same stage of their studies. The claim that non-native written English
borrows features from the spoken language should thus be treated with some
caution.
It should also be remembered that although learners may import features of spo-
ken English into their writing, there are huge differences between real conversation
and ICLE essays (see Gilquin & Paquot 2008). When learner writing is compared to
spoken data, one finds that the ‘spoken’ features are relatively modestly represented in
the NNS essays after all. A small indication of this is given in Table 1, in which the first
four rows reproduce Petch-Tyson’s (1998: 112) figures for first- and second-person
pronouns in some sub-corpora of ICLE. The Swedish learners come across as most
interactive in their writing; however, the pronouns are twice as frequent in the spoken
dialogues in the British National Corpus (BNC).
Table 1. Use of first- and second-person reference across a number of corpora (based on
Petch-Tyson [1998: 112] with added figures for Hong-Kong Chinese and the BNC)

Corpus                          Per 50,000 words
Dutch L1                                   1,195
Finnish L1                                 1,531
French L1                                  1,202
Swedish L1                                 1,998
HK Chinese L1                                449
BNC spoken dialogue                        3,973
BNC written (press editorials)               834
US English (LOCNESS)                         449
Petch-Tyson’s (1998) study of writer/reader visibility was carried out at a time when
the ICLE corpus contained only Western L1 backgrounds. Interestingly, a correspond-
ing investigation of the more recent Hong-Kong sub-corpus of ICLE (ICLE-HK) indi-
cates that first- and second-person pronouns are not overused by Hong Kong learners
(fifth row of Table 1). The difference between ICLE-HK and the other ICLE sub-cor-
pora is likely to have cultural explanations. Returning to the issue of reference corpus,
however, it is also noteworthy that the press editorials in the BNC have nearly twice as
many first- and second-person references as the US section of LOCNESS (see Table 1),
thus potentially reducing the degree of overuse by the European learners of English
and suggesting underuse in the US and HK Chinese groups.
The question of authorial presence has also been investigated by Hyland (2002),
who compares the use of first-person reference in student reports to that of published
journal articles within the same disciplines. Hyland finds that the student reports con-
tain four times fewer references to first person than the journal articles do; i.e. the
student reports have 10.1 references per 10,000 words and the published articles have
41.2 (ibid.: 1099). The findings are explained by reference to the students’ lack of au-
thority in the field with a concurrent reluctance to assert themselves. This is backed up
by the students’ own comments in interviews (ibid.: 1097). By comparison, ICLE-HK
(cf. Table 1) contains about 90 first- and second-person pronouns per 10,000 words, of
which the majority (81/10,000 words) are first-person. In other words, first-person
reference is eight times more frequent in argumentative essays than in the reports ex-
amined by Hyland (2002), thus suggesting that the use of interactive features may vary
with text type.
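Because Table 1 is normalised per 50,000 words and Hyland's figures per 10,000 words, the comparison above depends on a change of base. The short Python sketch below retraces that arithmetic; it is an illustration only, the helper name `rescale` is an assumption, and all input figures are taken from the text.

```python
def rescale(rate, from_base, to_base):
    """Convert a frequency normalised per `from_base` words to per `to_base` words."""
    return rate * to_base / from_base


# ICLE-HK: 449 first- and second-person pronouns per 50,000 words (Table 1).
icle_hk_per_10k = rescale(449, 50_000, 10_000)

# First-person reference per 10,000 words, as quoted in the text.
essays_vs_reports = 81 / 10.1      # ICLE-HK essays vs. Hyland's student reports
articles_vs_reports = 41.2 / 10.1  # published articles vs. student reports

if __name__ == "__main__":
    print(f"ICLE-HK per 10,000 words: {icle_hk_per_10k:.1f}")
    print(f"Essays vs. reports:       {essays_vs_reports:.1f} times")
    print(f"Articles vs. reports:     {articles_vs_reports:.1f} times")
```

The results agree with the rounded figures in the text: about 90 pronouns per 10,000 words in ICLE-HK, and ratios of roughly eight and four respectively.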
A number of studies find that learners transfer syntactic patterns as well as dis-
course patterns from their L1 to their written English. For example, Osborne (2008)
revealed strong L1 influence as regards the learners’ placement of adverbs; contrasts
between language families could be clearly seen in the patterns found in the learner
corpora. More specifically, the sequence V-Adv-O was overused by Romance L1 learn-
ers, underused by Germanic L1 learners and used with a frequency similar to that of
the NS control corpus by a group consisting of Slavic and Finnish L1 learners (Osborne
2008: 134). Nesselhauf (2005: 242) found that L1 influence occurred in about half of
the non-nativelike collocations identified in the German ICLE sub-corpus (ICLE-GE),
which suggests that phraseological patterns are transferred in a similar manner to syn-
tactic patterns.
The transfer of L1 syntactic patterns into NNS English need not result in errors,
but may lead to an overuse of the pattern in question, possibly with unintended discourse
effects. Boström Aronsson (2003), for example, found that Swedish learners
overuse cleft constructions, which could to some extent be explained by analogy with
Swedish style. The clefts are generally not ungrammatical, but according to Boström
Aronsson (2003: 209), they may entail “unmotivated focus and emphasis, and implica-
tions of contrastiveness when there is none”. Extraposition was also found to be twice
as frequent in ICLE-SW as in NS writing. As the construction often has an evaluative
function, its overuse is interpreted as “a tendency for NNS to foreground their opin-
ions and evaluative comments” (Herriman & Boström Aronsson 2009: 109). Hasselgård
(2009a) found equal overuse of extraposition in ICLE-NO. A later study (Hasselgård
2009b) showed that fronted time and space adverbials were overused in ICLE-NO
compared to LOCNESS. The frequencies were, however, similar to those found in a
collection of Norwegian NS argumentative essays. The fronted time and space adverbi-
als in ICLE-NO furthermore had discourse functions more typical of Norwegian than
of English (particularly as text organizers).
A feature of learner language that is attributable to either learner strategies or lack
of proficiency and/or register awareness (Altenberg 1997: 130) is the use of metadis-
course. Ädel (2006: 189) found that Swedish advanced learners used metadiscourse
twice as often as American students, who in turn used it more often than British stu-
dents. The overuse among learners concerned above all “personal metadiscourse”
(ibid.: 190), i.e. items that refer directly to the writer and/or reader of the text. The
functions of personal metadiscourse items are “to introduce the topic and to repeat
(or review) some preceding discourse unit” (ibid.: 94), i.e. to negotiate the text as dis-
course between writer and reader. Such items may also be involved in definitions of
terms and concepts (ibid.). The quantitative differences between the writer groups may
be due to different writing conventions in the three cultures (ibid.: 154), but may also
reflect the learners’ consciousness that they are writing in a foreign language.
As mentioned above, most of the CIA studies carried out so far involve the com-
parison of native English to only one or two non-native varieties. Studies that involve
a wider range of non-native varieties often make interesting observations, such as the
scale of writer/reader visibility revealed by Petch-Tyson (1998: 112), the typological
differences found by Osborne (2008), and the differences in the use of academic vo-
cabulary observed by Paquot (2010). It is to be hoped that the greater cultural and
linguistic variation in L1 background represented in the latest version of ICLE, along
with the improved facilities for searching and analysing the corpus, will inspire more
such studies.
6. From CIA to the integrated contrastive model
Like the analysis of interlanguage, contrastive analysis has profited greatly from the
development of corpus research methods. The English-Norwegian Parallel Corpus
(ENPC) was the first electronic bidirectional translation corpus of its kind (see Johan-
sson 2007: 10 ff.). The model combines the idea of a translation corpus with that of a
comparable corpus, i.e. one in which the original texts in both languages are matched
for genre, publication date and size. This design allows the researcher to study transla-
tion correspondence in both directions of translation and to compare original and
translated texts in the same language or original texts in different languages.
The method for contrastive analysis based on parallel corpora has lately been suc-
cessfully paired with the CIA method; see Granger (1996) and Gilquin (2000/2001) on
the Integrated Contrastive Model (ICM). This model offers a new dimension to inter-
language studies, enabling the researcher not only to differentiate general from L1-spe-
cific learner problems but also to explain and/or predict such problems on the basis of
contrastive analyses of the L1 and the target language, in the spirit of the weak version
of the contrastive analysis hypothesis (Wardhaugh 1970: 123). The link between learner
corpus research and contrastive analysis is explored e.g. in Gilquin et al. (2008).
The Integrated Contrastive Model is visualized in Figure 1.13 Granger (1996: 46)
points out that “the model involves constant to-ing and fro-ing between CA [Contrastive
Analysis] and CIA. CA data helps analysts to formulate predictions about interlanguage
which can be checked against CIA data”. This part of the procedure follows the arrow
marked “predictive” in Figure 1. In the opposite direction, deviations between learner
language and native language can be explained (or ‘diagnosed’) by recourse to the contras-
tive analysis. The arrows pointing out of the figure were added by Gilquin (2000/2001: 100
f.) to show that not all errors can be explained by a contrastive analysis (see also Corder
1973: 288). The other change in Gilquin’s version of Granger’s (1996) diagram is the use of
broken lines between CA and CIA to indicate a weaker connection between the two.
[Figure: CA (OL vs. OL; SL vs. TL) and CIA (NL vs. IL; IL vs. IL), connected by arrows
labelled “predictive” and “diagnostic”; further label: TRANSFER]
Figure 1. The Integrated Contrastive Model (quoted from Gilquin 2000/2001: 100, based
on Granger 1996: 47)
13. Key to the abbreviations found in Figure 1: CA = Contrastive Analysis; OL = Original Lan-
guage; SL = Source Language; TL = Target Language; CIA = Contrastive Interlanguage Analysis;
NL = Native Language; IL = InterLanguage.
The weak connection was also pointed out by Corder (1973: 229 ff.), who argued that
differences between the native language and the foreign language need not produce
learning difficulty. Differences between the native and the target language can also
have unexpected effects on interlanguage, as demonstrated by Johansson & Staves-
trand (1987). Since Norwegian does not have a grammaticalized progressive aspect, a
natural assumption would be that Norwegian learners will have difficulties acquiring
the progressive, and furthermore that they will underuse it. The investigation showed
that the Norwegian learners indeed made a number of mistakes with the form. Curi-
ously, most of the errors consisted in using the progressive where a simple form was
required. Hence, the second prediction failed: the learners in fact overused the pro-
gressive. The overuse is believed to be caused by factors such as (intralingual) hyper-
correction, overexposure in teaching and the simpler morphology of the progressive
(i.e. only one form of the lexical verb needs to be mastered).
7. Case studies
As an additional demonstration of contrastive interlanguage analysis, we will present
two small-scale case studies based on ICLEv2, namely the use of the single lexical item
quite and the phraseological item I would say. Four L1 groups have been selected:
Norwegian, German, French and Spanish, thus representing two Germanic and two
Romance L1 backgrounds. Texts have been identified on the basis of the learners’ first
language, irrespective of home country. LOCNESS has been used for comparison. A
third study makes use of the Integrated Contrastive Model in an investigation of seem
in ICLE-NO against the background of a contrastive study based on the ENPC.
7.1 Quite
Granger (1998a and b) has drawn attention to the overuse of the all-round intensifier
very at the expense of collocationally restricted -ly intensifiers such as closely or highly.
Do we find a similar tendency for quite? Table 2 shows that quite is overused in all the
learner groups but most markedly so among the Germans, followed at a distance by
the Norwegians (both at significance levels of p < 0.01).14 The overuse of quite in ICLE-
GE ties in with the general overuse of adjective modification by German learners iden-
tified by Lorenz (1998: 57). In ICLE-FR and ICLE-SP, the overuse is less dramatic
(significant at p < 0.05 for ICLE-FR, but less obviously so at p = 0.1 for ICLE-SP).
The overall frequency distribution shown in Table 2 thus seems to reflect the
14. The ICLE frequencies were found using the statistics function on the ICLEv2 CD, while LOC-
NESS was analysed using the corpus tool AntConc (www.antlab.sci.waseda.ac.jp/software.html).
The frequencies from each ICLE sub-corpus and LOCNESS were compared using chi-square.
Table 2. Quite across corpora: Raw frequencies and relative frequencies per 100,000 words

Corpus     Occurrences  Rel. freq.
ICLE-NO             92        43.7
ICLE-GE            147        62.3
ICLE-FR             78        38.0
ICLE-SP             63        31.8
LOCNESS             67        20.5
Germanic – Romance distinction. The question of how the learners use this word,
however, can only be answered by studying concordance lines.
The word quite can enter into a number of grammatical patterns, notably as: (i)
modifier of adjective – quite safe; (ii) modifier of adverb – quite easily; (iii) modifier of
predicate – never quite enter the big money fights; (iv) modifier of indefinite or quanti-
fied noun phrase – quite a remarkable feat, quite some time; (v) modifier of definite
noun phrase/nominalized adjective – quite the opposite; (vi) modifier of prepositional
phrase – quite by chance. Table 3 gives the relative frequencies of the different patterns
across the corpora under study. Strikingly, the overuse of quite among German and
Norwegian learners is visible across the patterns, while the French and Spanish learn-
ers differ from the native speakers mainly in the use of quite as a modifier of an adjec-
tive. Figure 2 shows the proportional distribution of the patterns across the corpora.
The adjective modifier function of quite is most common in all the learner groups as
well as in the NS corpus. However, the groups differ as to the use of other patterns:
Spanish learners use other patterns very little, while Norwegian and German learners
use quite for indefinite NP modification significantly more often than native speakers
(p < 0.05) and also for adverb modification more often than native speakers though
not at significant levels. French learners use other patterns more than the Spanish
learners, but less than Norwegian and German learners. The adverb-modifying quite
takes up a larger proportion in NS than in NNS writing, but as Table 3 shows, this pat-
tern is actually more frequent in the learner corpora, except ICLE-SP. All other types
are too rare to show reliable tendencies, but we may note that the category of ‘other’
(which includes cases of misuse) does not occur in LOCNESS.
Table 3. Patterns of quite across corpora, relative frequencies per 100,000 words

Corpus    +adj  +adv  +pred  +indef NP  +PP  +def NP  other
ICLE-NO   24.7   4.3    1.9       10.5  1.0      1.0    0.5
ICLE-GE   38.5   6.8    2.1       12.3  0.4      2.1      0
ICLE-FR   25.4   3.9    1.0        5.9    0      0.5    1.5
ICLE-SP   25.2   2.5    1.0          0  0.5      1.0    1.5
LOCNESS   12.6   3.4    0.9        2.8  0.6      0.3      0
[Figure: proportional distribution (0% to 100%) of the patterns +adj, +adv, +pred,
+indef NP, +PP, +def NP and other in ICLE-NO, ICLE-GE, ICLE-FR, ICLE-SP and LOCNESS]
Figure 2. Patterns of quite across corpora
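The proportional distribution shown in Figure 2 can be derived from the relative frequencies in Table 3 by dividing each pattern's rate by the row total. The Python sketch below illustrates this under one assumption: that the proportions are computed from the rounded rates in Table 3 rather than from raw counts, so the percentages are approximate. The names used are illustrative only.

```python
PATTERNS = ["+adj", "+adv", "+pred", "+indef NP", "+PP", "+def NP", "other"]

# Relative frequencies per 100,000 words, copied from Table 3.
TABLE_3 = {
    "ICLE-NO": [24.7, 4.3, 1.9, 10.5, 1.0, 1.0, 0.5],
    "ICLE-GE": [38.5, 6.8, 2.1, 12.3, 0.4, 2.1, 0.0],
    "ICLE-FR": [25.4, 3.9, 1.0, 5.9, 0.0, 0.5, 1.5],
    "ICLE-SP": [25.2, 2.5, 1.0, 0.0, 0.5, 1.0, 1.5],
    "LOCNESS": [12.6, 3.4, 0.9, 2.8, 0.6, 0.3, 0.0],
}


def pattern_shares(rates):
    """Return each pattern's percentage share of the row total."""
    total = sum(rates)
    return {name: 100 * rate / total for name, rate in zip(PATTERNS, rates)}


SHARES = {corpus: pattern_shares(rates) for corpus, rates in TABLE_3.items()}

if __name__ == "__main__":
    print("Corpus   " + "".join(f"{name:>11s}" for name in PATTERNS))
    for corpus, shares in SHARES.items():
        print(f"{corpus:9s}" + "".join(f"{shares[name]:10.1f}%" for name in PATTERNS))
```

On these figures the adjective-modifying pattern accounts for roughly 80% of the uses in ICLE-SP against roughly 61% in LOCNESS and 62% in ICLE-GE, while adverb modification takes up a larger share in LOCNESS than in any of the learner corpora.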
The Spanish learners have the smallest extent of overuse, but at the same time differ most
from native speakers in their use of quite. German learners, on the other hand, have a
proportional distribution of patterns that does not differ much from that of the NS group
in spite of the overuse shown in Tables 2 and 3. As noted above, Norwegian and German
learners often use quite as a modifier of noun phrases. Examples are given in (1) – (3).
(1) ... which now suddenly requires an education with quite a lot of theory.
(ICLE-NO)
(2) ... reading my way through the book itself, which turned out to be quite an
adventure given my poor standard of French. (ICLE-GE)
(3) Stating that the time of dreaming and imagination is over is quite a sad
statement. (ICLE-NO)
Norwegian and German learners have a potential problem in placing the indefinite
article between quite and a premodifying adjective, as in (3), since both Norwegian
and German place the article before the equivalent of quite in a corresponding con-
struction. The pattern seen in (2) and (3) must thus be a result of successful learning.
The pattern ‘quite a(n) + adjective’ occurs 5 times in ICLE-NO; however, ‘a quite +
adjective’ is found 6 times. The corresponding figures for ICLE-GE are 9 vs. 8. Thus,
both learner groups use the pattern of their L1 in about half the cases. Interestingly, a
similar variation is found in LOCNESS. The pattern ‘a quite + adjective’, illustrated by
(4), occurred twice while the other pattern occurred only once. However, in the BNC,
the ‘quite a(n) + adjective’ pattern is clearly most frequent, with 27 instances per mil-
lion words as against 5.6 for ‘a quite + adjective’.15
15. By comparison, the French learners had ‘quite a(n) + adjective’ 7 out of 11 times. The
Spanish learners used quite with a premodified indefinite noun phrase only once, with the arti-
cle preceding quite.
(4) One possible solution is a quite radical one. (LOCNESS)

(5) Passengers whose life seems to revolve around annoying others – listening to
not-quite-personal stereos, smoking in no smoking sections, ... (LOCNESS)
Example (5) shows a creative use of quite. No similar uses were found in the NNS cor-
pora. However, a close examination of the NNS concordances for quite also shows
some cases of dissonance (Hasselgren’s [1994] term for non-nativelike usage):
(6) Even in the text there are quite allusions to Pamela. (ICLE-SP)
(7) This kind of allusion is quite used in abstracts or introductions. (ICLE-SP)
The dissonance can be due to grammatical error as in (6), where quite modifies a bare
noun phrase. In (7) the predicate is not one that can be modified for degree. Both
cases of dissonance can possibly be explained as equivalence errors between quite and
Spanish bastante, which carries much the same meaning as quite, but unlike quite can
be used as a modifier of a noun or a participle verb.16 Similarly, there are examples
from ICLE-FR where the dissonant use of quite is due to an equivalence error; in (8)
this probably concerns quite/assez as well as changing/changeant. In (9) the collocation
quite many is one that is not found in the BNC, but which may reflect the French ex-
pression (d’)assez nombreux.
(8) Whereas political borders can be quite changing, cultural ones are not.
(ICLE-FR)
(9) On a human level, I met quite many foreigners, but no Dutch people.
(ICLE-FR)
German and Norwegian learners do not seem to have much difficulty with quite, prob-
ably due to the semantic and syntactic similarity with the nearest L1 equivalents ganz
and ganske. A typical example of dissonance in these two sub-corpora is given in (10),
where the dissonance is caused by a confusion between a good deal and quite a lot. In
(11) the problem with the adjective modification is the context, i.e. the use of a ‘com-
promiser’ (Lorenz 1998: 56) where understatement does not seem intentional.
(10) ... but the figures clearly show that men on the average earn quite a deal more
than women here in Norway. (ICLE-NO)
(11) ... and we had to spent nearly two, quite exiting years in the monster’s dungeon.
(ICLE-GE)
This CIA study of quite yielded some interesting findings. First of all, the quantitative
investigation showed overuse of quite in all four learner groups, though to different
degrees. The overuse was most pronounced in ICLE-GE and least in ICLE-SP. How-
ever, a qualitative study showed that quite is not used in the same way in the five cor-
pora examined. The Spanish learners use quite as a modifier of an adjective at the cost
16. Thanks to Magali Paquot and Maximino Jesus Ruiz Rufino for identifying the Spanish
source of transfer.
of all other constructions, while the Germans and the Norwegians overuse it as a
modifier of indefinite noun phrases. Finally, dissonant uses were studied. Most uses of
quite are correct in all the corpora. However, the most serious cases of dissonance
were found among the Spanish and French learners, possibly because the greater sim-
ilarity between quite and its closest equivalent in the Germanic languages led to fewer
problems among the German and Norwegian learners. The qualitative analysis thus
uncovered problems in those learner groups that were quantitatively closer to native
speaker usage.
7.2 I would say
In recent years, a great deal of research has focused on recurrent sequences in lan-
guage, largely inspired by John Sinclair’s insightful work on collocations and his insis-
tence on the importance of the ‘idiom principle’ (Sinclair 1991). Studies comparing
learner and NS phraseology have shown important differences in this area (see e.g.
Wiktorsson 2003; Meunier & Granger 2008). Hasselgård (2009a: 134) found that
Norwegian learners overuse the string I would say. In ICLE-NO it typically functioned
as an expression of stance, often prefacing a conclusion. In the native speaker data used
for comparison (from the British component of the International Corpus of English,
ICE-GB), the expression was found either in its literal sense or with the meaning of
approximation. As a follow-up to this, we have studied the same expression across dif-
ferent learner groups and in LOCNESS. Table 4 shows Norwegian and French learners
to have approximately the same degree of overuse, while the German and Spanish
learners are closer to the distribution found in LOCNESS.
Unlike the results for quite, the use of I would say does not reflect the Germanic
– Romance distinction. Still, the use of the expression may be attributed to L1 transfer,
or it may even be teaching-induced (some Norwegian textbooks list the expression as
a possible turn of phrase in argumentation). Incidentally, the expression is mentioned
by Granger (1998b: 156) as part of the learner’s (restricted) repertoire “for introducing
arguments and points of view”.
Table 4. I would say across learner groups: Raw frequencies and relative frequencies per
100,000 words

Corpus     Occurrences  Rel. freq.
ICLE-NO             27        12.8
ICLE-GE             10         4.2
ICLE-FR             23        11.2
ICLE-SP              7         3.5
LOCNESS              5         1.5
First we examined I would say in LOCNESS. Surprisingly it was found with functions
not attested in ICE-GB (Hasselgård 2009a: 134), namely as a stance marker (12) and as
an introduction to a conclusion (13).
(12) ... and so in some ways I would say that he is of use to the party. (LOCNESS)
(13) In conclusion, I would say that a single europe would lead to a damaging loss
of sovereignty for Britain ... (LOCNESS)
Both instances in LOCNESS of I would say as a stance marker have the function of
signposting the following proposition as the speaker’s considered, but tentative opin-
ion. As shown in (14), this use can also be identified in other NS material, such as the
BNC. We may note that the meaning of say in example (14) (taken from the academic
writing section of the corpus) is close to suggest. The fairly literal implication (i.e. the
writer’s response to a question) seems typical of NS use of the expression.
(14) So, what is to be done about sexism in language? I would say, whatever is most
effective in making people think about the implications of the expressions
they use. (BNC: CGF)
In ICLE-FR I would say is by far most frequent (80–90%) as part of a conclusion. The
expression is most often accompanied by phrases such as to conclude or in conclusion,
as exemplified by (15). This conclusive use carries a higher degree of modal certainty
than the tentative use illustrated by (12) and (14). The conclusive use of I would say in
ICLE-FR is most likely related to similar expressions in French as illustrated in exam-
ple (16).17
(15) To conclude with this whole debate, I would say that I can hardly find positive
arguments to stand for the compulsory military service. (ICLE-FR)
(16) En conclusion je dirais que ce baladeur m’a complètement séduit.
(www.iaddict.fr/ipod-shuffle.php)
The Norwegian learners also use I would say in conclusions, but the stance marker use
is about equally common. The latter typically occurs earlier on in the essay, prefacing
a proposition that the writer is going to argue for. For example, (17) is the second
sentence of an essay on ‘dreaming and imagination’. The expression can also have a
meaning similar to ‘I think’, as shown by (18), and this use may be found anywhere in
the text.
(17) I would say that it is a statement close to the truth of today’s society, and in this
essay I will give my opinions on the topic, and some reasons why this could be
a fact. (ICLE-NO)
17. Google searches restricted to the domain .fr showed that je dirais often collocates with en
conclusion or en somme. Interestingly, the one example of in conclusion I would say in the BNC
comes from a school essay.
(18) There is a vast difference between speeding and intentionally murdering an-
other human being. In this first case I would say that punishment is just right,
by removal of the driver’s license for a period of time ... (ICLE-NO)
A formal difference between I would say and its closest Norwegian equivalent jeg vil si
(lit: ‘I will say’) is that the Norwegian modal vil has the present tense form, which is a
potential source of transfer errors. However, the expression I will say occurs only three
times in ICLE-NO. It signals either stance or conclusion, as shown by (19), which oc-
curs towards the end of a text.
(19) Anyway, from my point of view, I will say there is a great space for both dream-
ing and imaginations in our lives. (ICLE-NO)
The Norwegian expression is fairly close to ‘from my point of view’, i.e. it flags the fol-
lowing proposition as the speaker’s opinion, but not necessarily as tentative. In Eng-
lish, however, the past-tense form of the modal gives the expression I would say a
tentative ring (e.g. Biber et al. 1999: 496). It is thus possible that the Norwegian learn-
ers, through L1 transfer, invest the English expression with a higher degree of asser-
tiveness than it seems to have in NS usage.
The German learners use I would say mostly to express stance, but also in a more
literal sense as a metatextual device (Ädel 2006); in (20) the writer simply explains how
s/he would answer a question. There are also a few cases of I would say in conclusions,
as in (21).
(20) Well, what is best for them, what is it they love? I would say: sitting on their
mothers’ or fathers’ lap while being told a story ... (ICLE-GE)
(21) On balance, I would say that corporal punishment is no appropriate means to
fight against criminality. (ICLE-GE)
In example (22), from ICLE-SP, I would say has a slightly different metatextual func-
tion, namely that of commenting on the use of a word, while (23) shows the expression
of stance. These are the main uses of I would say in ICLE-SP, and the Spanish learners
use them about equally often.
(22) The recruit spends (“wastes” I would say) almost a year of his life (nine month
is the average in Europe) doing nothing except ... (ICLE-SP)
(23) First of all I would say that love was completely under the social convections
and prejudice, ... (ICLE-SP)
This investigation has shown clear overuse of I would say by Norwegian and French
learners. The overuse can probably be explained in both cases by the existence of sim-
ilar expressions in the learner’s L1. The qualitative study shows that the learners use the
expression for different functions: the conclusive use is most frequent in ICLE-FR,
where the expression often collocates with conclude, sum up or similar words. A plau-
sible explanation for the overuse of this function is the frequent collocation in French
of je dirais with expressions such as en conclusion. As for the Norwegian learners, we
suggested that they overuse the expression in conclusions because of the different de-
gree of modal certainty carried by the Norwegian cognate expression. The conclusive
use is absent from the ICE-GB material used by Hasselgård (2009a), but is found in
LOCNESS. Yet, the phraseology of I would say in native speaker material suggests a
lower degree of assertiveness than would normally be desirable in the conclusion to a
line of argumentation. It is thus possible that conclusive I would say is related not just
to L1 influence but also to developmental factors or to (lack of) speaker authority,
though this is a point that needs further study. The stance-marker function of I would
say is found in all the corpora, though it dominates most in ICLE-GE and ICLE-SP.
The metatextual function would seem to constitute a relatively simple way of marking
a rhetorical structure of question and answer in the text, and may thus be a feature of
novice writing.
Phraseological usage clearly depends on style and register and consequently re-
flects the proficiency level and writing experience of the writers. For this reason a ref-
erence corpus of ‘expert’ writing might usefully complement the NS student corpus.
Furthermore, the study of the phraseology of learner language shows very clearly that
contrastive interlanguage analysis would profit vastly from being supplemented by a
contrastive analysis of the learner’s first language and the target language.18
7.3 A Norwegian perspective on seem
To give an example of how the Integrated Contrastive Model can work, we will take as
our starting point Johansson’s (2007: 117 ff.) analysis of seem and its Norwegian cor-
respondences in the ENPC and supplement this with an investigation of seem in ICLE-
NO and LOCNESS. Johansson’s study was triggered by the observation that seem
“sometimes seems to disappear without a trace in translations into Norwegian and
likewise may be added, seemingly without any motivation, by English translators”
(ibid.). Seem indeed turned out to be much more frequent in English originals than in
translations (145.8 vs. 100.5 occurrences per 100,000 words). When comparing the
English constructions with seem to their Norwegian correspondences, Johansson
found that (i) English catenative constructions are strikingly more common than the
corresponding syntactic choice in Norwegian; (ii) copula constructions are far more
common in English than in Norwegian; those with a noun phrase complement are
found in English only; (iii) English clauses with dummy subject it or there + seem(s)
are less common than the corresponding Norwegian structures with the dummy sub-
ject det; (iv) an experiencer is more commonly expressed in Norwegian than in Eng-
lish; and (v) Norwegian uses more comparative structures, particularly with som
(‘as (if)’, ‘like’) (2007: 123 and 138).
18. For a good example, see Paquot (2008).

Apart from the expectation that seem will be underused, these findings give rise to
the following predictions for ICLE-NO compared to LOCNESS: (i) catenative seem
will be underused; (ii) copular patterns will be underused, especially those with a noun
phrase complement; (iii) a dummy subject will be used more often by the Norwegian
learners; (iv) an experiencer will be expressed more often by the Norwegian learners;
and (v) comparative structures will show up more often in the context of seem.
The overall expectation is in fact not met: ICLE-NO has a higher frequency of
seem per 100,000 words than LOCNESS (117 vs. 90). Even more surprisingly, the cat-
enative function accounts for a slightly higher proportion of the occurrences of seem
in ICLE-NO (51%) than in LOCNESS (47.5%). On the other hand, the copular func-
tion is, as predicted, more common in LOCNESS, with a proportion of 35.5%, com-
pared to 28% of the occurrences of seem in ICLE-NO.
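The figures above mix two measures: the overall frequency of seem per 100,000 words and the share of each function among the occurrences of seem. As a minimal sketch, assuming the rounded figures quoted in the text are exact, the two can be multiplied to estimate how often each function occurs per 100,000 words. The variable and function names are illustrative and not part of the original study.

```python
# Overall frequency of "seem" per 100,000 words, as quoted in the text.
seem_per_100k = {"ICLE-NO": 117, "LOCNESS": 90}

# Share of all occurrences of "seem" taken up by each function (rounded in the text).
share = {
    "catenative": {"ICLE-NO": 0.51, "LOCNESS": 0.475},
    "copular": {"ICLE-NO": 0.28, "LOCNESS": 0.355},
}


def function_rate(total_rate: float, proportion: float) -> float:
    """Estimated occurrences of one function per 100,000 words."""
    return total_rate * proportion


# Estimated per-100,000 rate of each function in each corpus.
rates = {
    function: {
        corpus: function_rate(seem_per_100k[corpus], proportion)
        for corpus, proportion in by_corpus.items()
    }
    for function, by_corpus in share.items()
}

for function, by_corpus in rates.items():
    for corpus, rate in by_corpus.items():
        print(f"{function:<11} {corpus:<8} {rate:5.1f} per 100,000 words")
```

On these rounded figures, catenative seem comes out at roughly 60 vs. 43 occurrences per 100,000 words, and copular seem at roughly 33 vs. 32. A lower proportion of copular seem in ICLE-NO therefore need not mean a lower absolute frequency, since the learners use seem more often overall.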
The third prediction is partly met; in ICLE-NO, 31.5% of the occurrences of seem
collocate with the dummy subject it, as against 23% in LOCNESS. Existential there,
however, is more common with seem in LOCNESS (6 vs. 3 occurrences) but these
figures are too low to reveal patterns. The predicted overuse of comparative structures
is also to some extent confirmed. In any case the collocations seem like and seem as if
are about twice as common in ICLE-NO as in LOCNESS. Finally, explicit experiencers
are almost twice as common in ICLE-NO as in LOCNESS, which was expected on the
basis of the contrastive study. An example is given in (24).
(24) The oral examinaton seems to me to be more of a test in how to tackle stress ...
(ICLE-NO)
To dig further into the (mis-)match between the predictions based on Johansson’s con-
trastive study and the evidence from ICLE-NO we need to take a closer look at the
learner data. First, the overuse of seem must be seen in connection with the general
overuse of modal and hedging expressions in learner data, as shown by Aijmer (2002).
Though not a modal auxiliary, seem clearly has modal meanings, particularly of evi-
dentiality, and is thus handy for writers who want to hedge their claims.
The unexpected overuse of catenative seem may take place at the expense of copu-
lar seem, as the most common lexical verb following catenative seem in ICLE-NO is be,
often with a copular function, as seen in (25). By contrast, in (26) the predicative fol-
lows seem directly, without the aid of copular be. Admittedly, be is the most frequent
verb following catenative seem in LOCNESS too, but it is more predominant in ICLE-
NO (41.5% vs. 33% of all occurrences of catenative seem). It is thus likely that the
Norwegian learners add be by analogy with corresponding Norwegian constructions
(cf. Johansson 2007: 120).
(25) The characters seem to be able to come to terms with Willie Loman’s death.
(ICLE-NO)
(26) This idea does not seem acceptable to the British public. (LOCNESS)
The high frequency of dummy it in clauses with seem might be expected from the
more general tendency of Norwegian to prefer light sentence openings (Hasselgård
2005). The dummy subject typically refers forward to a clause in extraposition. Inter-
estingly, the Norwegian learners use the conjunction like more often than the more
formal that in the extraposed clauses, as in (27). This may be due to the frequent use of
som (‘as’, ‘like’) found in a number of Norwegian correspondences of seem in Johansson
(2007). The learners also use the subordinator as if much more often than the native
speakers (11 vs. 3 occurrences), no doubt influenced by the Norwegian equivalent som
om illustrated in (28).
(27) To me it seemed like some of the teachers had never been teaching school
children (ICLE-NO)
(28) ... but it seems you also know that if that happens it would be just as easily
finished. (ENPC: ABR1)
... men det virker som om du også vet at hvis det skjer, kan det avsluttes like
lett. (ABR1T) [lit: but it seems as if ...]
In (28) som om corresponds to as if. However, om can be omitted in this construc-
tion, which is probably the cause of some dissonant occurrences like (29), where as
is not followed by if. This type of dissonance can thus be explained by reference to the
learner’s L1.
(29) It might seem as it will cost a lot in the beginning ... (ICLE-NO)
The frequent expression of experiencers with seem in ICLE-NO, illustrated by (24)
above, correlates with the general tendency to writer/reader visibility in learner texts
(Petch-Tyson 1998), as the most common realization of the experiencer is to me. The
native speakers use seem(s) to me in 7 out of 14 experiencer phrases, but the Norwegian
learners use it in 14 out of 22, and in addition three of the remaining cases have an
experiencer that includes the speaker, e.g. many of us. The tendency to overuse experiencer
phrases may thus have two explanations: the more frequent expression of an
experiencer in Norwegian and/or the learners’ inclination to be visible in their texts.
Further exploration of other sub-corpora of ICLE is needed to check which of the ex-
planations is more plausible.
This case study has illustrated that the connection between learner data and con-
trastive data is far from straightforward. As discussed by Gilquin (2008), even features
of learner language that may be attributed to L1 transfer on the basis of a contrastive
analysis may in fact have other causes. In the study of seem it seems that the overuse of
the word is related to the general overuse of modal markers by learners of English. The
expression of experiencers may be either L1-related or due to the tendency for learners
to use colloquialisms in their written texts (e.g. Altenberg & Tapper 1998). The preference
for like over that in subordinate clauses may likewise have two explanations. However,
the overuse of it as a dummy subject and the occasional omission of if in as if are
very likely caused by L1 transfer.
It should be noted that the ICM, with the parallel corpora available, suffers from a
mismatch of genres and/or writer proficiency. The ENPC consists of fictional and non-
fictional texts. None of them are argumentative or academic (with the possible excep-
tion of a few popular science texts) and all are produced by professional writers and
translators. Thus, an ICM analysis based on a corpus such as the ENPC should ideally
be checked against a (monolingual) corpus of student writing in the learner’s L1 to
control for genre and writer variables. The contrastive analysis based on ‘OL vs. OL’ in
Figure 1 above might thus include a comparison of comparable monolingual corpora
of student writing.
8. Some challenges
Granger has often discussed (e.g. Granger 2004: 134; Granger 2009: 14) the challenge of
translating findings from CIA studies into pedagogical issues and EFL practice (see also
Hunston 2002: 208). On the one hand, CIA studies usually outline potential pedagogical
implications of the investigation, typically measures that will bring the learners closer to
NS performance; on the other, these measures are not necessarily directly translatable to
classroom practice. In any case, the recommendations should probably take fuller account
of the reference corpus used as well as learner needs and teaching objectives
(Granger 2009: 22). As pointed out by Ädel (2006: 206), “if we take it for
granted that learners aim to achieve as professional a style of writing as possible, we
should not make recommendations to learners based on native-speaker student usage,
but rather should use professional native-speaker writing as the target”. For example,
compared with LOCNESS, Norwegian advanced learners appear to underuse the connector
however, even though they use it 66 times per 100,000 words (N = 139), since LOCNESS
has 181 instances per 100,000 words (N = 591). But a change of reference corpus alters
the picture dramatically. The press editorials in the BNC, for example, have 58
occurrences of however per 100,000 words. We may thus wonder whether the Norwegian
learners really underuse the word, or whether it is the LOCNESS writers who overuse it.
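The comparison above rests on normalised frequencies (occurrences per 100,000 words). The following is a minimal sketch of that arithmetic, using only the figures quoted in the text. The corpus sizes are not given in the text, so the sizes computed here are back-calculated estimates, and the function names are illustrative.

```python
def per_100k(hits: int, corpus_words: float) -> float:
    """Normalised frequency: occurrences per 100,000 running words."""
    return hits * 100_000 / corpus_words


def implied_corpus_size(hits: int, rate_per_100k: float) -> float:
    """Corpus size implied by a raw count and its normalised rate."""
    return hits * 100_000 / rate_per_100k


# Figures quoted in the text for the connector "however".
learner_rate = 66        # ICLE-NO, per 100,000 words (N = 139)
locness_rate = 181       # LOCNESS, per 100,000 words (N = 591)
bnc_editorial_rate = 58  # BNC press editorials, per 100,000 words

# Assumption: sizes are estimated from the rounded rates, not taken from the source.
icle_no_words = implied_corpus_size(139, learner_rate)
locness_words = implied_corpus_size(591, locness_rate)

# Ratio below 1 suggests underuse, above 1 overuse, relative to the chosen reference.
ratio_vs_locness = learner_rate / locness_rate
ratio_vs_bnc = learner_rate / bnc_editorial_rate

print(f"Implied ICLE-NO size: about {icle_no_words:,.0f} words")
print(f"Implied LOCNESS size: about {locness_words:,.0f} words")
print(f"ICLE-NO relative to LOCNESS: {ratio_vs_locness:.2f}")
print(f"ICLE-NO relative to BNC editorials: {ratio_vs_bnc:.2f}")
```

The same learner rate thus falls well below one reference corpus and slightly above the other, which is the point made in the text about the choice of reference corpus.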
In some cases of underuse, EFL teaching might focus on the underused items,
though at the risk of inducing overuse instead. In the case of overused items, as noted
by Hunston (2002: 209), there may be little point in saying “Use thing less often” with-
out knowing what the relevant alternatives would be in specific contexts. The example
of however given above also illustrates that the concepts of overuse and underuse are
not straightforward, and quantitative findings need to be carefully considered and
cross-checked with qualitative analyses before exposing learners to them.19 This is,
19. In fact, Granger (2009: 22) points out that “features of learner language uncovered by
L[earner] C[orpus] research need not necessarily lead to targeted action in the classroom”. This
will depend on the degree of divergence between learner and native speaker usage as well as on
learner needs.
however, not to deny the immense value of quantitative studies based on the CIA
method and the ICLE corpus collection, but researchers should keep their eyes open
for alternative reference corpora and external causes for some of the findings; cf. Ädel
(2008) and Gilquin & Paquot (2008).
Another important challenge concerns genre, as Biber et al. (1999) convincingly
demonstrate that grammar depends on register. Studies of advanced learner language
often suggest that learners are unaware of genre requirements (e.g. Altenberg 1997,
Gilquin & Paquot 2008), and that this may be part of the explanation for the general
overuse of informal and spoken-like features. This may well be true. However, the
comparison of Hyland’s (2002) study of scientific reports written in English by Hong
Kong learners with the figures for ICLE-HK (see Table 1 above) may indicate that
learners of English can adapt their style to different registers. The challenge for CIA is
thus to expand its empirical base to include more registers. This work has been started
with the ongoing compilation of a new international learner corpus, the Varieties of
English for Specific Purposes dAtabase (VESPA). With this corpus alongside ICLE and
LINDSEI it will be possible to extend the field of CIA into studies of genre, medium
and style.
Finally, the study of corpora such as ICLE and LINDSEI can give invaluable in-
sights into the interlanguage of learners at a particular proficiency level. However, such
corpora cannot reveal much about language learning. For example, dummy it is not
often used instead of existential there in ICLE-NO even though this is a well-known
learning problem for Norwegians (cf. Hasselgård 2009a). When do learners begin to
keep the two constructions apart? At what stage do learners whose native language
does not have a grammaticalized progressive start overusing the form in English
(cf. Johansson & Stavestrand 1987)? When do learners acquire syntactic patterns that
are different from those of their own native tongue, and by what steps? To answer such
questions, we need data representing different stages of the learning process, from
beginners to advanced learners, for instance along the lines of the Danish PIF project
(Færch et al. 1984). Hopefully, the new Longitudinal Database of Learner English
(LONGDALE) project will bring corpus-linguistic studies closer to the language learn-
ing process.20
9. The revolution continues
About twenty years after the ICLE project was conceived, the achievement seems im-
mense. This applies not just to the important work done by Sylviane Granger and her
team at the Centre for English Corpus Linguistics. No less important is the enthusiasm
which has spread to many countries across the world (a good overview is given at
20. For information on the VESPA and the LONGDALE projects, see www.uclouvain.be/en-
cecl-vespa.html and www.uclouvain.be/en-cecl-longdale.html, respectively.
www.uclouvain.be/en-cecl-lcWorld.html). The study of learner corpora is now an established
field of applied linguistics. But it is a field which keeps evolving; new projects
emerge, and thereby the potential for renewed research procedures, more sophisticat-
ed corpus tools, new types of investigations and new applications. An important ex-
ample of the recognition of interlanguage research is the ICLE-based contribution of
the Centre for English Corpus Linguistics to the Macmillan Dictionary (Rundell 2007).
‘Get-it-right’ boxes as well as a section entitled “Improve your writing skills” are adver-
tised as key features of the dictionary.21
One of the earliest articles presenting the ICLE project, Granger (1994), carries the
title “The Learner Corpus: A revolution in applied linguistics”. It has indeed been revo-
lutionary in the sense that it has opened up a whole range of new research questions.
Contrastive Interlanguage Analysis has turned out to be a fruitful paradigm. And yet
there were significant studies of learner language preceding ICLE. At the outset of our
paper we drew attention to some early work in Scandinavia. A hallmark of these stud-
ies is the concern with pedagogical applications (Thagg Fisher 1985; Linnarud 1986)
and with issues of language learning (Færch et al. 1984). What they lacked was the
comparison across different mother-tongue groups. In contrast, the CIA paradigm in-
cludes both learner vs. native speaker comparison and the possibility of comparing
across groups of learners with different mother-tongue backgrounds. Moreover, the
Integrated Contrastive Model has a great advantage over the error analysis and contrastive
studies previously undertaken to improve language teaching: the combined resources
inherent in the model secure a much better basis for explaining errors as well as
for making and testing predictions of learning difficulties.
In spite of the wealth of studies, Granger (2009: 14) admits that “learner corpus
research has not yet fully realized its stated ambition as its links with SLA have been
somewhat weak and it has given rise to relatively few concrete pedagogical applica-
tions”. But the potential is definitely there, and Granger points out some important
directions to go. If these are followed, the future seems bright for foreign-language
pedagogy and for understanding interlanguage and the processes of foreign language
acquisition.
References
Ädel, A. 2006. Metadiscourse in L1 and L2 English [Studies in Corpus Linguistics 24] Amsterdam:
John Benjamins.
Ädel, A. 2008. Involvement features in writing: do time and interaction trump register aware-
ness? In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp & M.B.
Díez-Bedmar (eds), 35–53. Amsterdam: Rodopi.
21. See www.macmillandictionaries.com/about/MED2/keyfeatures.htm.
Aijmer, K. 2002. Modality in advanced Swedish learners’ written interlanguage. In Computer
Learner Corpora, Second Language Acquisition and Foreign Language Learning [Language
Learning & Language Teaching 6], S. Granger, J. Hung & S. Petch-Tyson (eds), 55–76.
Aijmer, K. (ed.). 2009. Corpora and Language Teaching [Studies in Corpus Linguistics 33].
Altenberg, B. 1997. Exploring the Swedish component of the International Corpus of Learner
English. In Proceedings of PALC’97 Practical Applications in Language Corpora (Łódź, 10–14
April 1997), B. Lewandowska-Tomaszczyk & P.J. Melia (eds), 119–132. Łódź: Łódź University
Press.
Altenberg, B. & Tapper, M. 1998. The use of adverbial connectors in advanced Swedish learners’
written English. In Learner English on Computer, S. Granger (ed.), 80–93. London:
Longman.
Barlow, M. 2005. Computer-based analyses of learner language. In Analysing Learner Language,
R. Ellis & G. Barkhuizen (eds), 335–357. Oxford: OUP.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. Longman Grammar of Spoken
and Written English. London: Longman.
Boström Aronsson, M. 2003. On clefts and information structure in Swedish EFL writing. In
Extending the Scope of Corpus-based Research. New Applications, New Challenges, S. Granger
& S. Petch-Tyson (eds), 197–210. Amsterdam: Rodopi.
Brand, C. & Kämmerer, S. 2006. The Louvain International Database of Spoken English
Interlanguage (LINDSEI): Compiling the German component. In Corpus Technology and Language
Pedagogy, S. Braun, K. Kohn & J. Mukherjee (eds), 127–140. Frankfurt: Peter Lang.
Corder, S.P. 1973. Introducing Applied Linguistics. Harmondsworth: Penguin.
De Cock, S., Granger, S., Leech, G., & McEnery, T. 1998. An automated approach to the phrasicon
of EFL learners. In Learner English on Computer, S. Granger (ed.), 67–79. London:
Longman.
Eia, A.-B. 2006. The use of linking adverbials in Norwegian advanced learners’ written English.
MA thesis, University of Oslo.
Enkvist, N.E. 1973. Should we count errors or measure success? In Errata: Papers in error analy-
sis, J. Svartvik (ed.), 16–23. Lund: Gleerup/Liber.
Færch, C. 1979. Computational analysis of the PIF Corpus of learner language. PIF Working
Papers 1, 2nd rev. version. Department of English, University of Copenhagen.
Færch, C., Haastrup, K. & Phillipson, R. 1984. Learner Language and Language Learning.
Copenhagen: Nordisk Forlag A.S. & Clevedon: Multilingual Matters.
Gilquin, G. 2000/2001. The Integrated Contrastive Model: Spicing up your data. Languages in
Contrast 3(1): 95–124. (Printed in 2003).
Gilquin, G. 2008. Combining contrastive and interlanguage analysis to apprehend transfer: de-
tection, explanation, evaluation. In Linking up Contrastive and Learner Corpus Research, G.
Gilquin, S. Papp & M.B. Díez-Bedmar (eds), 3–34. Amsterdam: Rodopi.
Gilquin, G., Granger, S. & Paquot, M. 2007. Learner corpora: The missing link in EAP pedagogy.
In Corpus-based EAP Pedagogy, P. Thompson (ed.). Special issue of Journal of English for
Academic Purposes 6(4): 319–335.
Gilquin, G., Papp, S. & Díez-Bedmar, M.B. (eds). 2008. Linking up Contrastive and Learner Cor-
pus Research. Amsterdam: Rodopi.
Granger, S. 1994. The Learner Corpus: A revolution in applied linguistics. English Today 10(3):
25–32.
Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized bilingual
and learner corpora. In Languages in Contrast. Papers from a Symposium on Text-based
Cross-linguistic Studies, Lund 4–5 March 1994 [Lund Studies in English 88], K. Aijmer, B.
Altenberg, & M. Johansson (eds), 37–51. Lund: Lund University Press.
Granger, S. 1998a. The computer learner corpus: A versatile new source of data for SLA research.
In Learner English on Computer, S. Granger (ed.), 3–18. London: Longman.
Granger, S. 1998b. Prefabricated patterns in EFL writing. In Phraseology. Theory, Analysis, and
Applications, A.P. Cowie (ed.), 145–160. Oxford: OUP.
Granger, S. (ed.). 1998c. Learner English on Computer. London: Longman.
Granger, S. 2002. A bird’s-eye view of learner corpus research. In Granger, Hung & Petch-Tyson
(eds), 3–33.
Granger, S. 2004. Computer learner corpus research: current status and future prospects. In Ap-
plied Corpus Linguistics: A Multidimensional Perspective, U. Connor & T. Upton (eds),
123–145. Amsterdam: Rodopi.
Granger, S. 2009. The contribution of learner corpora to second language acquisition and for-
eign language teaching: A critical evaluation. In Aijmer (ed.), 13–32.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. International Corpus of Learner
English. Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S., Hung, J. & Petch-Tyson, S. (eds). 2002. Computer Learner Corpora, Second Language
Acquisition and Foreign Language Learning [Language Learning & Language Teaching 6].
Greenbaum, S. 1991. The development of the International Corpus of English. In English Corpus
Linguistics: Studies in Honour of Jan Svartvik, K. Aijmer & B. Altenberg (eds), 83–91. Lon-
don: Longman.
Hammarberg, B. 1973. The insufficiency of error analysis. In Errata: Papers in error analysis, J.
Svartvik (ed.), 29–36. Lund: Gleerup/Liber.
Hasselgård, H. 2005. Theme in Norwegian. In Semiotics from the North: Nordic Approaches to
Systemic Functional Linguistics, K. L. Berge & E. Maagerø (eds), 35–48. Oslo: Novus.
Hasselgård, H. 2009a. Thematic choice and expressions of stance in English argumentative texts
by Norwegian learners. In Aijmer (ed.), 121–139.
Hasselgård, H. 2009b. Temporal and spatial structuring in English and Norwegian student
essays. In Corpora and Discourse – and Stuff. Papers in Honour of Karin Aijmer, R. Bowen, M.
Mobärg & S. Ohlander (eds), 93–104. Göteborg: Acta Universitatis Gothoburgensis.
Hasselgren, A. 1994. Lexical teddy bears and advanced learners: A study into the ways Norwe-
gian students cope with English vocabulary. International Journal of Applied Linguistics
4: 237–259.
Herriman, J. & Boström Aronsson, M. 2009. Themes in Swedish advanced learners’ writing in
English. In Aijmer (ed.), 101–120.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP.
Hyland, K. 2002. Authority and invisibility: authorial identity in academic writing. Journal of
Pragmatics 34: 1091–1112.
Johansson, S. 1978. Studies of Error Gravity. Native Reactions to Errors Produced by Swedish
learners of English. Gothenburg: Acta Universitatis Gothoburgensis.
Johansson, S. 2007. Seeing through Multilingual Corpora: On the Use of Corpora in Contrastive
Studies [Studies in Corpus Linguistics 26]. Amsterdam: John Benjamins.
Johansson, S. & Stavestrand, H. 1987. Problems in learning – and teaching – the progressive
form. In Proceedings from the Third Nordic Conference for English Studies [Stockholm
Studies in English 73(1)], I. Lindblad & M. Ljung (eds), 139–148. Stockholm: Almqvist &
Wiksell.
Lado, R. 1957 [1971]. Linguistics across Cultures: Applied Linguistics for Language Teachers. Ann
Arbor MI: University of Michigan Press.
Leech, G. 1998. Preface. In Learner English on Computer, S. Granger (ed.), xiv–xx. London:
Longman.
Levenston, E. A. 1971. Overindulgence and underrepresentation – Aspects of mother tongue
interference. In Contrastive Linguistics, G. Nickel (ed.), 115–121. Cambridge: CUP.
Linnarud, M. 1986. Lexis in Composition: A Performance Analysis of Swedish Learners’ Written
English [Lund Studies in English 74]. Lund: Gleerup/Liber.
Lorenz, G. 1998. Overstatement in advanced learners’ writing: Stylistic aspects of adjective in-
tensification. In Learner English on Computer, S. Granger (ed.), 53–66. London: Longman.
Meunier, F. & Granger, S. (eds). 2008. Phraseology in Foreign Language Learning and Teaching.
Nickel, G. 1973. Aspects of error evaluation and grading. In Errata: Papers in Error Analysis, J.
Svartvik (ed.), 24–28. Lund: Gleerup/Liber.
Osborne, J. 2008. Adverb placement in post-intermediate learner English: A contrastive study of
learner corpora. In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp
& M.B. Díez-Bedmar (eds), 127–146. Amsterdam: Rodopi.
Paquot, M. 2008. Exemplification in learner writing: A cross-linguistic perspective. In
Phraseology in Foreign Language Learning and Teaching, F. Meunier & S. Granger (eds), 101–119.
Paquot, M. 2010. Academic Vocabulary in Learner Writing. From Extraction to Analysis. London:
Continuum.
Petch-Tyson, S. 1998. Writer/reader visibility in EFL written discourse. In Learner English on
Computer, S. Granger (ed.), 107–118. London: Longman.
Pravec, N. A. 2002. Survey of learner corpora. ICAME Journal 26: 81–114.
Ringbom, H. 1998. Vocabulary frequencies in advanced learner English: A cross-linguistic ap-
proach. In Learner English on Computer, S. Granger (ed.), 41–52. London: Longman.
Rundell, M. (Editor in chief) 2007. Macmillan English Dictionary for Advanced Learners, 2nd
edn. Oxford: Macmillan Education.
Selinker, L. 1972. Interlanguage. International Review of Applied Linguistics 10(3): 219–231.
Svartvik, J. (ed.). 1973. Errata: Papers in Error Analysis. Lund: Gleerup/Liber.
Thagg Fisher, U. 1985. The Sweet Sound of Concord: A Study of Swedish Learners’ Concord Prob-
lems in English [Lund Studies in English 73]. Lund: Gleerup/Liber.
Virtanen, T. 1998. Direct questions in argumentative student writing. In Learner English on
Computer, S. Granger (ed.), 94–106. London: Longman.
Wardhaugh, R. 1970. The contrastive analysis hypothesis. TESOL Quarterly 4(2): 123–130.
Wiktorsson, M. 2003. Learning Idiomaticity. A Corpus-Based Study of Idiomatic Expressions in
Learners’ Written Production [Lund Studies in English 105]. Stockholm: Almqvist & Wiksell
International.
Corpora used in examples and case studies
British National Corpus (BNC) <www.natcorp.ox.ac.uk/>
English-Norwegian Parallel Corpus (ENPC)
<www.hf.uio.no/ilos/english/services/omc/enpc/>
International Corpus of English, British Component (ICE-GB) <www.ucl.ac.uk/english-usage/
projects/ice-gb/>
International Corpus of Learner English (ICLE) <www.uclouvain.be/en-cecl-icle.html>
Louvain Corpus of Native English Essays (LOCNESS) <www.uclouvain.be/en-cecl-locness.html>
The use of small corpora for tracing
the development of academic literacies
JoAnne Neff van Aertselaer and Caroline Bunce
Since Erasmus exchanges have fostered student mobility in the European Union,
various features of argumentation skills for Academic English (AE) have become
central elements of university curricula. This chapter presents an analysis of a
small corpus of texts written in an academic writing (AW) class by English as a
Foreign Language (EFL) Spanish university students at B1 and B2 levels of the
Common European Framework of Reference for Languages (CEFR). The small corpus data is
contrasted with the Spanish sub-corpus of the International Corpus of Learner
English (SPICLE) regarding the use of certain devices for intertextuality and
evaluation. The study shows that students who have been given very definite
CEFR guidelines regarding the use of specific academic features are able to
improve their writing, even though there remain certain types of errors in their
overall lexico-grammatical production.
1. Introduction
Given the increasing student mobility within the European Union, skill in the critical
argumentation indispensable for academic writing (AW) in English has become an
essential competency. This development within institutions of higher education is re-
flected in the manual called Relating Language Examinations to the Common European
Framework of Reference for Languages, published in 2009 by the Language Policy Division
of the Council of Europe. On various pages (pp. 44, 138, 177), this document
addresses the question of two text types which are essential for academic work: de-
scriptive-chronological text (as in lab reports) and argumentative text type (essential
in all academic disciplines, at least for many sections of an academic report or re-
search article). To the Appendix on ‘Written assessment criteria’ (Table C4, p. 187) of
this document, the Language Policy Division has attached additional columns for
these two text types. The specifications list features of argumentative AW, such as the
ability to present a case; provide a critical appreciation of proposals; expand and sup-
port a point of view with subsidiary points, reasons and examples and provide an ap-
propriate reader-friendly logical structure. If these characteristics constitute what
university students’ writing will be judged on, then it is crucial that university teachers
analyse academic writing in the different disciplines in order to ascertain what these
features, which include a mixture of structural and rhetorical patterns, are and how
they could best be taught. That is, these general Common European Framework of
Reference (CEFR) features do not specify the linguistic realizations that AW requires;
these must therefore be identified and incorporated into can do statements for
writing syllabi.1
In this chapter, we focus on a series of lexical choices which enter into grammar
patterns and their pragmatic associations2 – so often the focus of the work of Sylviane
Granger (Granger 1983; Granger 1998a; Granger 1998b; Gilquin et al. 2007; Meunier
& Granger 2008) – in order to show how the elaboration of can do statements for a
one-semester academic writing course can improve student writing (and reading
skills) in terms of the students’ communicative goals, if not their syntactic competency.
The use of these lexical items is traced through two corpora: the Spanish sub-corpus
of the International Corpus of Learner English (SPICLE), a collection of texts pro-
duced by Spanish English as a Foreign Language (EFL) students with no specific train-
ing in AW, as compared to a corpus consisting of texts written by similar students as
part of a course in AW.
The purpose of the various comparisons was to ascertain whether the syllabus for
the two AW courses (2007–2008 and 2008–2009) was actually beneficial to the stu-
dents’ literacy growth in the production of texts.3 Therefore, the study focuses more on
the students’ text production than on the readings used as models during the course.
In both of the years of the AW course, the ultimate aim of the study was pedagogical,
i.e. revising the syllabus and thus classroom practices.
The study shows that, while instructors of an AW course cannot hope to signifi-
cantly improve their students’ grammatical competence over a one-semester period,
by providing explicit descriptors for argumentative writing, they are able to help the
students understand the dialogic nature of argumentation. The attention given to the
frequency of different features of argumentation and the ways in which these combine
shows students how to produce more sophisticated texts. Furthermore, the study also
illustrates how small corpora can be usefully employed both to trace learners’
developmental patterns and subsequently to adapt specific classroom teaching practices
(Thompson 2001a).
1. This study forms part of the work completed for a national project funded by the Spanish
Ministry of Science and Innovation (FFI2008–03968).
2. Following Hoey (2005: 43), we define pragmatic association as the particular pragmatic
function(s) that words and nested combinations of words are primed for because of frequent
use, such as “as can be seen in Table 2” as a discourse marker for presenting information. Also see
Hunston & Francis (1999).
3. No attempt was made to measure the improvement (or not) of the students’ reading
competency.
2. The development of academic literacies in an EFL context
According to Johns (1997: 2), literacy is an inclusive term which refers to both reading
and writing, and also “encompasses ways of knowing particular content, languages
and practices”, including strategies to deal with “understanding, discussing, organiz-
ing and producing texts”. As many researchers have noted (Bazerman 1994; Johns
1997; Bhatia 2004), the development of academic literacy in particular disciplines de-
pends on the students’ having become aware of the requirements of the genre in ques-
tion – giving rise to what Hoey refers to as “productive priming” (Hoey 2005: 11) –
and also being conscious of the socio-cultural forces which give rise to the
intertextual nature of academic texts. In the context of university students of English
Studies at the Universidad Complutense de Madrid, course instructors have observed
that the students can readily classify text types4 into narrative or descriptive passages;
however, they have difficulty in explaining the reasons for their categorisation, par-
ticularly in identifying text-internal features of argumentative texts, such as the use of
modal verbs, concessive constructions, and adversative lexical phrases in order to
present various viewpoints. That is, students are intuitively aware of features of text
types but this schematic knowledge is insufficient for them to produce good argumen-
tative texts.
These linguistic forms and text patterns (text internal features) should be under-
stood as a means for negotiating a stance within a genre. But students rarely compre-
hend texts in terms of negotiating multiple text external discourses, perhaps because
they do not fully understand texts as a form of social practice. It must also be admitted
that student texts, mostly written for teacher evaluation, do not usually bring about
any “consequent social action” (Bazerman 1994: 79).
A useful concept for presenting such text external factors is genre. Swales (1990:
45–58) has defined genre as a class of communicative events with a shared set of
purposes and goals, carried out within certain conventions for the presentation of
contents, positioning and form. EFL undergraduate students, such as those whose
texts are studied here, have not had enough experience with different varieties of
academic texts, except for textbooks, to have formed prototypical concepts for these
different texts, and in particular, for highly conventional texts such as a formal re-
search paper. Since students’ contact with academic sub-genres has mostly centred
4. Text types have been defined following Werlich (1983), who proposes 5 types – description,
exposition, narration, argumentation and instruction. Genre has been defined following Biber
(1995) and Swales (1990). Text types are considered to have internal (linguistic) features which
define the types in themselves, while genres are heavily influenced by cultural, external features.
Different text types may occur within a single genre, as in a research paper, which may include
a narrative account of past research, an expository account in the Methods section and argu-
mentative text type in the Discussion section. In the 2001 book on the Common European
Framework of Reference for Languages (Council of Europe 2001: 95), text types are referred to,
but these are in fact genres (comic books, textbooks, newspapers, etc.).
on textbooks, it is very likely that they will confuse the types of text-internal fea-
tures (such as the use of imperative verbs and vocatives like let’s) found in textbooks
with the language they are to use in essays and academic papers. Therefore, se-
quenced, goal-directed reading tasks should be the starting point for genre acquisi-
tion (Swales 1990: 76).
Linked to the concept of genre is that of discourse community, which is defined by
Swales (1990: 24–27) as having “a common set of public goals” and, among the expert
members of the community, shared discursive practices, which often develop into one
or more genres. Our students need to become aware of the nature of the external and
internal factors which influence the academic discourse communities they are enter-
ing, in our case, Linguistics and Literature. These differences exist both between these
two communities and among various types of subgenre, such as textbooks, essays,
critical analyses, and term papers (Bhatia 2004: 31).
In addition to the necessity of beginning the AW course for university students
with general notions of genre and discourse community, at a very early point, intertex-
tuality should be introduced as a way of helping students realise that their texts will
enter into some academic discourse community, as limited as that may be within their
own institutions. There are various ways in which academic texts are intertextual.
Their form is a reflection of prior texts (both in structural and rhetorical features).
Their content also engages with prior texts, in that the arguments must be
strengthened by the reading and digesting of others’ texts. Additionally, academic texts
must combine the author’s intention, that is, the stance expressed towards the content,
with the evaluation of those texts read and cited as background material. Often
students do not conceive of themselves as members of an academic discourse community
and therefore do not see their texts as participating in what Briggs & Baumann
(1992: 146) have described as the “ongoing process of producing and receiving dis-
course”. Without our students’ understanding of this dialogic process, they will not be
able to make sense of the way in which structural and rhetorical features combine in
order to construct an effective academic argument.
There is a further complication for the Spanish context. Writing in academic con-
texts is often seen primarily as knowledge telling and may be governed by an assump-
tion that students should display the knowledge they have acquired, usually that given
by the teacher in class or the textbook. This attitude is reflected in examination ques-
tions which do not require the candidate to put forth stance moves or to have completed
outside critical readings. For example, a typical literature question for a Spanish univer-
sity entrance exam (Educared 2009) is the following: Características del Modernismo
(“Characteristics of Modernism”). As it is not really a question, this type of essay prompt
merely requires the candidates to list a set of characteristics, not to examine the various
issues involved, or to contrast sources; in fact, the latter are not required at all. These
types of prompts, requiring mainly descriptive answers, given over a number of years of
schooling, mean that little attention is given to argumentation, as a lesser-valued skill at
secondary level.5 In contrast, in most schooling in English-speaking contexts, narrative
and descriptive texts are the focus of instruction until approximately 9 or 10 years of age
when factual writing of different types (description, report, explanation, persuasion,
Martin 1990: 15) begins to take on importance (Perera 1989), not only for examinations
but, when students are older, for longer texts as well, such as term papers. For the latter,
argumentation becomes the main text type and descriptive text is used mostly for con-
textualization and exemplification, in support of the arguments presented.
If Spanish contexts stress description (what something is like) over persuasive ex-
position/argumentative text types (reasons and arguments),6 Spanish university stu-
dents entering English Studies may have to struggle in order to comprehend argumen-
tation patterns and incorporate them into their writing. At tertiary level, it is difficult
for instructors to convince students that they must strive to create their own voice,
perhaps because the text internal and external features still remain implicit. The pur-
pose of the can do statements elaborated for this course is to provide students with an
explicit set of such features which can serve as the basis for academic literacy exercises,
and ultimately, academic essays.
3. The academic writing course
In order to encourage knowledge transformation, and not merely the knowledge tell-
ing found in descriptive texts, the instructors found it necessary to draw up a series of
guidelines or can do descriptors to make explicit the required structural and rhetorical
features to be learned. Since the competence levels of the students are mixed, the syl-
labus for the course centres on specific genre and intertextual practices, as displayed in
Table 1, which must be learned by the students of the AW course, regardless of their
competence level in English.7
5. This assumption is corroborated by the number of points given to the students taking the
Spanish Literature and Language exams for university entrance. The argumentative essay counts
for 1 point out of 10 points in total.
6. Although argumentation is one of the text types mentioned in preparatory university
courses for Spanish students, it is not a text type that students frequently practise.
7. The classes are not streamed in the English Studies Department at our university; thus, as
was the case in both AW courses considered in this study, students’ levels may range from A2 to
C1, as tested during the first class with the Oxford Quick Placement (OQP) Test. It is not pos-
sible to simply exclude students whose level is not at B1, the level at which the first specific de-
scriptors appear on the Writing Grid for Argument (Council of Europe 2009: 187). In order to
measure students’ progress regarding the structural and rhetorical descriptors, the data from
sample 1 (AW1) had to be matched with the final essay data (AW2) from the same students. For
this purpose, we selected from each of the two courses (2007–2008 and 2008–2009), 20 initial
essays (n = 40) and 20 final essays (n = 40). The competence level of the 40 students (OQP Test)
was as follows: A2 level: 20%; B1 level: 20%; B2/C1 levels: 65%; and C2 level: 5%.
Table 1. Can do statements for B2 level
Features of structural and rhetorical competence, each followed by its qualification

Structural features
– Can reword the prompt of a writing assignment incorporating opposing points
  of view appropriate for argumentative genre
    Qualification: Proper contextualization
– Can present all claims and supporting data in a logically organized way
    Qualification: Few stranded claims or data
– Can use both prospection and encapsulation8 to create coherence
    Qualification: Few limitations regarding lexical phrases used
– Can conclude by restating major ideas and placing the arguments in a wider context
    Qualification: Suggestion of future events

Rhetorical features
– Can consider other points of view, adopting a critical stance
    Qualification: Can distinguish among the arguments in sources
– Can incorporate intertextuality by reporting others’ views and statements,
  using lexical resources, such as adjectives, adverbs and verbs, which show writer
  alignment (stance)
    Qualification: Can use a wide range of reporting verbs (suggest, claim, show, etc.)
– Can use a reasonably extensive range of hedges and boosters as well as
  impersonalization strategies in presenting claims
    Qualification: Can make effective use of passive voice, modalized utterances,
    abstract rhetors (non-human agents)
– Can successfully use a variety of discourse markers (DMs) to indicate flow of text
    Qualification: Can effectively use lexical cohesive devices (synonyms, hyponyms,
    etc.) as well as DMs
These features were also used to measure the students’ written performance
throughout and at the end of the course. These criteria enabled the instructors to avoid
focusing solely on the elimination of student errors and to concentrate instead, more
reasonably, on feasible advancement in discourse competency.
The can do statements were presented on the first day of the course and frequently
referred to before focusing on specific writing exercises. Students reported having
found these descriptors clear and useful and also having referred to them for home
assignments.
As can be observed in this table, the can do statements cover a range of genre
characteristics. By the end of the course, the student is expected to display ownership
8. Following Sinclair (1993: 8), encapsulation is defined as phrases which reformulate what has
been stated, usually in order to move on to another topic or conclusion, and prospection occurs
when “the phrasing of a sentence leads the addressee to expect something specific in the next sen-
tence” (Sinclair 1993: 12), namely because the speaker/writer has alluded to topics to be dealt with.
of the ideas presented as claims and sub-claims, and to adopt an authorial stance
suitable for nuanced argumentation. Of the above features, the ones examined in this
study are rhetorical rather than structural, particularly those related to intertextuality,
such as the range of reporting verbs used and the internal (authorial) and external
(non-authorial) voices used to present points of view.
4. The study
As previously mentioned, the aim of the study was to discover how EFL students ne-
gotiate stance in academic papers, with the ultimate aim of examining our students’
progress in the acquisition of various devices for stance-taking, an important feature
of the AW syllabus.
4.1 Texts included in the study
For purposes of measuring development in student writing, the SPICLE corpus (see
Table 2), collected throughout the 1990s, provides a picture of Spanish EFL university
writing without the benefit of a specific AW course. This corpus is a collection of texts
(194,845 words) on general interest and literature topics, written by third- and fourth-
year Spanish EFL university students, and included as the Spanish component of the
International Corpus of Learner English (ICLE), held at Louvain. The data from this
corpus is compared with the two small sub-corpora (AW1 and AW2) of English Stud-
ies students enrolled in the Academic Writing class at the Complutense University of
Madrid (UCM), during 2008 and 2009. These texts (27,462 words) are samples of ar-
gumentative essays written by second-year English Studies students on general interest
topics (i.e., approximately the same as those used for the ICLE corpus, but excluding
literature topics). Writing sample 1 (AW1), collected during the second week of the
course, was matched with the texts of the final sample (AW2), written by the same
students. The students were required to do writing assignments throughout the course,
but only course-initial and course-final samples of their writing were selected since the
aim of the study was to analyse the students’ progress and evaluate the effectiveness of
the AW course. These two AW sub-corpora show the gains made by UCM students
after explicit teaching of the features of academic writing.
The data from the two sub-corpora are also compared with each other in order to
trace the development regarding the specific features listed in the can do statements.
The study is further complemented by previous studies carried out on part of the
Louvain Corpus of Native English Essays (LOCNESS), texts written by American uni-
versity students, especially regarding the use of deictics referring to propositions in the
text. The results from all of these studies will be used to inform the syllabus design for
the AW course in the future.
Table 2. Corpora included in the study
Name of corpus    Number of words
SPICLE            194,845
AW TEXTS           27,462
AW1                10,596
AW2                16,866
Since the corpora were of different sizes, all the figures for the data were normed per
one hundred words to permit comparisons. The texts produced by the AW students
represented a very limited number of words because, for the purpose of measuring
progress, we could use only the texts written by the students who had completed both
the first and final writing assignments, elaborated in class from notes.
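The norming step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' own procedure: the function name and dictionary are hypothetical, the corpus sizes are those listed in Table 2, and the raw totals are the reporting-verb totals given in Table 5.

```python
# Corpus sizes in words, as listed in Table 2.
CORPUS_SIZES = {"SPICLE": 194_845, "AW1": 10_596, "AW2": 16_866}


def normed_per_100(raw_count: int, corpus: str) -> float:
    """Return the number of occurrences per 100 words of running text."""
    return raw_count / CORPUS_SIZES[corpus] * 100


# Total reporting-verb tokens per corpus (raw totals from Table 5).
for corpus, raw in [("SPICLE", 835), ("AW1", 98), ("AW2", 194)]:
    print(f"{corpus}: {normed_per_100(raw, corpus):.2f} per 100 words")
```

Norming in this way makes a raw count of 98 in the small AW1 sub-corpus directly comparable with a raw count of 835 in the much larger SPICLE corpus.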
In Appendix I, there is an example of a final essay (AW2) from the writing course
in 2009, and in Appendix II, in order to show developmental trends, there are two es-
says from the same student enrolled in the writing course in 2008: the initial essay
(AW1) and the final essay (AW2).
4.2 Methods and procedures
In order to investigate stance-taking, we first searched for the reporting verbs used by
the students in order to compare the latter with a list used by expert article writers in
English (cf. Neff et al. 2001) and then also carried out a more qualitative study of the
rhetors, or agents, established by the students as giving voice to evaluations or claims.
Two main criteria governed the inclusion or exclusion of data in the study: one concerning
the rhetor (usually the subject) associated with the verb and the other concerning the
ideas, statements or arguments introduced by the verb (usually the object). The first
criterion was that the verb should be associated with an identifiable rhetor which could
be considered to be one of the text’s voices (rhetors) and to participate in the textual
discussions. Thus, the instances of conclude with the function of discourse organizer
(e.g. “To conclude: the best solution is ...”) were not classified, since it was considered
that they did not give sufficient emphasis to the rhetor, but rather served principally to
organize the text. The second criterion was that the verb should introduce or be associ-
ated with propositional content which could be phrased as a statement or question
(e.g. “Many agree that TV is too violent”). Thus uses such as “the discovery of AIDS
has changed how people think” or “They should think about their morals” were not
included. The data were included in the study if they fulfilled at least one of these cri-
teria. An impersonal use such as “It is reasonable to conclude also that without the
satellite this would not have occurred” was thus accepted because it fulfilled the sec-
ond criterion though not the first, while “These works and studies have looked at this
issue from many different angles” was also included on the grounds that it fulfilled the
first criterion though not the second.
The initial quantitative approach was to focus on a range of reporting verbs, such
as argue, note, suggest and show, which we had observed as frequently used in LOC-
NESS and expert academic texts (cf. Neff et al. 2001). We first searched for the root and
irregular forms of the verbs (see Table 5 for the full list) using Wordsmith 5.0
(Scott 2007). Some instances of reporting verbs from the SPICLE corpus were not in-
cluded when they occurred in display-type answers particularly in the literature es-
says, such as “In the two final stanzas, John Donne explains the meaning of that con-
ceit” and “Joan says she will rather die than spend the rest of her days in prison”. These
uses by SPICLE writers were considered instances of contextualization and not argu-
mentation and therefore, were not taken into account in this study.
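As a rough illustration of the search for root and irregular verb forms, the sketch below uses Python regular expressions. It does not reproduce Wordsmith's own search syntax: the patterns are hypothetical approximations of three of the search strings, and the sample sentence is invented. The hit on "argument" shows why the candidate hits still have to be reviewed by hand.

```python
import re
from collections import Counter

# Hypothetical regex approximations of three search strings (root plus irregular forms).
PATTERNS = {
    "argu*": re.compile(r"\bargu\w*", re.IGNORECASE),
    "find*/found": re.compile(r"\b(?:find\w*|found)\b", re.IGNORECASE),
    "suggest*": re.compile(r"\bsuggest\w*", re.IGNORECASE),
}


def count_candidates(text: str) -> Counter:
    """Count candidate hits per search string; hits still need manual review."""
    return Counter({label: len(rx.findall(text)) for label, rx in PATTERNS.items()})


# Invented sample; "argument" is a noun and would be discarded manually.
sample = ("Opponents argue that TV is too violent. Two researchers found that "
          "this argument suggests otherwise.")
print(count_candidates(sample))
```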
The initial analysis of reporting verbs showed some basic patterns and tendencies
with regard to the different discourse verbs used by the students. It also became
apparent that certain verbs tended to be used with certain types of rhetors, e.g., “this
shows that ...”. Therefore, in a second step, we carried out a more qualitative study in
order to categorize the rhetors, that is, to classify the use of voice (abstract rhetor, non-
specific rhetors or personal pronouns, etc.) and impersonal and/or passive construc-
tions (i.e., no agent).
Academic texts frequently use rhetors of various kinds, such as those shown in
Table 3: specific human agents (I, you, we); non-specific human agents (one); specific
and/or named rhetors (two researchers from New York; Hyland); general, non-specific
and unnamed rhetors (some people may think that ...); or abstract rhetors (This study
shows that ...). There is also frequent use of impersonal constructions, such as it is
important to note that ..., and passive constructions, such as it has been said that ..., in
which the rhetorical act appears to have no human agency. Many of these latter
constructions permit the writer to present her arguments as resting upon common
knowledge and factual, objective data. All allow the writer to adopt a range of stances
with regard to the propositions put forward, from distancing herself from these
propositions to subscribing to them.

Table 3. Categorization of the different voices associated with reporting verbs
Classifications, each followed by examples
– Abstract rhetors
    Examples: An examination of the programming has concluded that ...
– Impersonal and passive constructions
    Examples: it has been said that...; it is necessary to point out that...
– General, non-specific and unnamed rhetors
    Examples: Some people may think...; Opponents claim...; Proponents of X argue
    that...; The average reader may not find...
– Specific and/or named rhetors
    Examples: Methvin believes that...; Two researchers from New York found that...
– Deictics as subject (referring to propositions in the text)
    Examples: This shows that...
– “one” subject
    Examples: One may assume that...
– “you” subject
    Examples: If you analyze many of these arguments...
– “we” subject
    Examples: Before we discuss the case of ...
– “I” subject
    Examples: Personally, I find that...; I have always believed that...

Table 4. Different types of evaluative devices examined in study
Evaluative lexical devices, each followed by examples
– it + copular verb + adjectival phrase + that
    Examples: it is obvious that...; it is indisputable that...; it seems more logical
    that...
– it + copular verb + adjectival phrase + to + verb of knowing/saying
    Examples: it is important to take into account that...; it is only natural to think
    that...; it seems contradictory to say that...
– *ly adverbs used as disjunct/used to modify discourse verb
    Examples: immigration is obviously a problem...; but unfortunately, many
    governments do not....; I will briefly summarise...; scientists plausibly claim...
There are, of course, many ways in which writer stance can be expressed and in
successful academic argumentation stance-taking consists of a complex combination
of a variety of linguistic features. Therefore, in a third step of the research, we decided
to focus on four lexical resources for evaluation (all displayed with examples in Table 4)
explicitly taught during the course, namely it + copular verb + evaluative adjectival
phrase + that, it + copular verb + evaluative adjectival phrase + to + verb of mental or
verbal processes, and two uses of adverbs ending in ly: those conveying a writer
comment on the whole content of the proposition (disjunct), and those modifying a
discourse oriented verb. As with the reporting verbs, Wordsmith 5.0 was used
for the word searches (using the strings it **** that, it **** to and *ly) and initial data
sorting, while the subsequent elimination of irrelevant data was done manually.
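The three search strings can be approximated with regular expressions, as in the Python sketch below. This is an assumption-laden illustration rather than a reproduction of Wordsmith's wildcard behaviour: here "it **** that" and "it **** to" are read as "it", then one to four intervening words, then "that" or "to", and "*ly" as any word ending in ly. The function name and the sample sentence are invented.

```python
import re

# Assumption: "it **** that" = "it" + one to four intervening words + "that".
IT_THAT = re.compile(r"\bit(?:\s+\w+){1,4}?\s+that\b", re.IGNORECASE)
# Assumption: "it **** to" = "it" + one to four intervening words + "to".
IT_TO = re.compile(r"\bit(?:\s+\w+){1,4}?\s+to\b", re.IGNORECASE)
# Assumption: "*ly" = any word ending in "ly" (this also catches non-adverbs).
LY_WORD = re.compile(r"\b\w+ly\b", re.IGNORECASE)


def find_candidates(text: str) -> dict:
    """Return candidate hits per pattern; irrelevant hits are removed by hand later."""
    return {
        "it_that": IT_THAT.findall(text),
        "it_to": IT_TO.findall(text),
        "ly": LY_WORD.findall(text),
    }


# Invented sample sentence built from the example phrases in Table 4.
sample = ("It is obvious that immigration is obviously a problem, but unfortunately "
          "it seems contradictory to say that nothing can be done.")
for name, hits in find_candidates(sample).items():
    print(name, hits)
```

Patterns this broad over-generate (the ly pattern, for instance, would also return words such as "family"), which is consistent with the manual elimination of irrelevant data mentioned above.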
For the purpose of comparison, all the figures for the various data were normed
per 100 words and chi-square was used to test for statistical significance.
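The chapter does not state how the contingency tables for the chi-square test were laid out. The Python sketch below therefore makes its own assumptions: a 2x2 table of hits versus remaining words in two corpora, Pearson's statistic without continuity correction, and the conventional critical value for p = 0.05 with one degree of freedom. The function name is hypothetical; the counts for claim are the raw figures from Table 5 and the corpus sizes are those in Table 2.

```python
def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square (no continuity correction) for hits vs. remaining
    words in two corpora. The 2x2 layout is an assumption of this sketch."""
    observed = [[hits_a, size_a - hits_a], [hits_b, size_b - hits_b]]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [hits_a + hits_b, total - hits_a - hits_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


CRITICAL_05_DF1 = 3.841  # chi-square critical value for p = 0.05, 1 degree of freedom

# claim*: 4 tokens in SPICLE (194,845 words) vs. 26 tokens in AW2 (16,866 words)
stat = chi_square_2x2(4, 194_845, 26, 16_866)
print("significant at p < 0.05:", stat > CRITICAL_05_DF1)
```

Treating every running word as an independent observation is a common simplification in corpus comparisons, and results obtained this way are best read as indicative.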
5. Analysis and discussion
Stance-taking in any piece of writing requires the use of different devices employed
within a very nuanced textual process. During the AW course comprising 37 hours, it
was not possible to teach all these diverse strategies. Thus, the instructors opted for a
limited number of structural and rhetorical indicators, which appear as can do
descriptors in Table 1. In this study we explore the use of several of these indicators,
namely reporting verbs and rhetor types that occur with these, and four types of lexical
devices for evaluation.
5.1 Reporting verbs
The principal findings for the reporting verbs in each corpus are presented in Table 5,
with the raw figures in the left-hand columns followed by the figures normed per 100
words. First we discuss the unusual frequencies of some of the individual verbs and
then some developmental trends.
Table 5. Occurrences of reporting verbs per corpus
Verb                        SPICLE            AW1               AW2
                            Raw     Normed    Raw     Normed    Raw     Normed
address*                    0       0         0       0         0       0
agree*/disagree*            32      0.02      2       0.02      6       0.04
analyz*/s*                  19      0.01      4       0.04      6       0.04
argu*                       12      0.006     4       0.04      12      0.07
assum*                      4       0.002     0       0         0       0
believ*                     51      0.03      0       0         0       0
claim*                      4       0.002     0       0         26      0.15
conclud*                    7       0.004     2       0.02      2       0.01
discuss*                    14      0.007     4       0.04      6       0.04
establish*                  0       0         0       0         0       0
explain*                    30      0.02      6       0.06      16      0.09
find*/found                 3       0.002     0       0         0       0
focus* on                   18      0.009     4       0.04      4       0.02
hypothesis*/iz*             0       0         0       0         0       0
indicat*                    3       0.002     0       0         4       0.02
look* at                    3       0.002     0       0         0       0
not* (note)                 4       0.002     4       0.04      4       0.02
point* out/to               22      0.01      8       0.08      20      0.12
present*                    4       0.002     0       0         0       0
prov* (prove)               20      0.01      4       0.04      4       0.02
provid* (+ evidential N.)   4       0.002     2       0.02      2       0.01
refer*                      25      0.01      0       0         0       0
report*                     1       0.0005    2       0.02      0       0
say*/said                   246     0.1       8       0.08      18      0.11
show*                       45      0.02      10      0.09      12      0.07
stat* (state)               8       0.004     12      0.11      24      0.14
stud* (study)               0       0         0       0         0       0
suggest*                    9       0.005     2       0.02      8       0.05
think*/thought              247     0.1       20      0.2       20      0.12
Total                       835     0.4       98      0.92      194     1.15
5.1.1 Unusual frequencies

As can be seen, four of the verbs used by expert writers (cf. Neff et al. 2001), address,
establish, hypothesise/hypothesize and study, were not used at all in any of the student corpora.
There are also very few tokens of assume, find, look at and present. These results
point to the EFL students’ lack of range in using reporting verbs, as corroborated by
other studies (Charles 2006; Neff et al. 2001). It is worth noting that some of the
reporting verbs used by experts are particularly academic in tone, such as hypothesise/hypothesize,
and are probably not commonly used even by native undergraduate students. As a re-
sult of both novice writer and EFL writer limitations, both groups of EFL university
writers show a certain tendency to rely on a limited set of discourse oriented verbs.
5.1.2 Developmental trends

In comparing the SPICLE data with the AW data, three main trends become apparent:
1. the concentration of the SPICLE tokens on two verbs
2. the progressive increase in frequencies of use of some verbs: from SPICLE to AW1
to AW2
3. the progressive decrease in frequencies of use of some verbs: from SPICLE to AW1
to AW2
The data resulting from the corpus of students who had no specific training in AW, i.e.,
the SPICLE corpus, show that there is a much greater concentration of use on very few
common discourse verbs. In fact, two verbs, think and say, account for approximately
59% of the total use of reporting verbs. The texts of students who received AW training
show a broader range of reporting verbs. Verbs that carry more discourse value,
e.g., suggest, state and claim, now appear more frequently, which allows these students
to rely less heavily on think and say. In the AW1 texts, think and say accounted for ap-
proximately 29% and in AW2, 20% of the total reporting verbs. This result suggests
that, although the AW writers still show a certain limited range of reporting verbs,
similar to that of the SPICLE group, they rely far less on the two verbs previously men-
tioned and more readily use other discourse oriented verbs which are more academic
in tone and convey a greater degree of authorial stance.
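The shares quoted above follow directly from the raw counts in Table 5, as the short Python sketch below shows. The dictionary and function names are hypothetical; the counts are the raw figures for think, say and the reporting-verb total in each corpus.

```python
# Raw counts from Table 5: (think*/thought, say*/said, total reporting verbs)
COUNTS = {"SPICLE": (247, 246, 835), "AW1": (20, 8, 98), "AW2": (20, 18, 194)}


def think_say_share(corpus: str) -> float:
    """Percentage of all reporting-verb tokens accounted for by think and say."""
    think, say, total = COUNTS[corpus]
    return (think + say) / total * 100


for corpus in COUNTS:
    print(f"{corpus}: {think_say_share(corpus):.0f}%")
```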
Of the 21 remaining verbs (after discounting the 8 verbs occurring either negligi-
bly or not at all in the corpora), 19 (agree/disagree, analyze, argue, claim, conclude,
discuss, explain, focus on, indicate, note, point out/to, prove, provide, report, say, show,
state, suggest, think) appear with greater frequency in one of the AW sub-corpora than
in the SPICLE corpus, thus, in general terms corroborating the finding that the AW
writers show less over-reliance on a limited range of verbs. It is encouraging for the
instructors to note that 8 verbs (argue, claim, explain, indicate, point out, show, state,
suggest) also show a longitudinal increase in frequency when the AW1 sub-corpus is
compared with the AW2 sub-corpus. In the case of argue, explain, point out, show, state
and suggest, the AW1 texts show a greater frequency than the SPICLE texts and the
AW2 texts, in turn, an increase in frequency vis-à-vis the AW1 texts. Claim and indicate
are not used by the AW student writers in their first essays (AW1), but the students
have incorporated them into their writing by the final week of the course (AW2) and
use them with a greater frequency than the SPICLE writers.
Finally, regarding the decrease in frequencies of use from SPICLE to the AW texts,
there are two verbs, believe and refer, that show this tendency. The explanation for this
decrease appears somewhat complex and can only be offered tentatively. In the SPICLE
texts, 64% of the instances of refer correspond to interactive9 uses with the pronouns
we and I (e.g. “We have previously refered to”, “here we are referring to the fact that”, “I
am refering to Spain”). The AW writers’ avoidance of such expressions in an attempt to
achieve a more impersonal academic voice explains, at least in part, the absence of this
verb in their data. As far as believe is concerned, the SPICLE writers use this verb in-
teractionally10 with I and we as rhetors in 43% of the cases, such as in “I believe that
university studies must be reformed”. This means that their claims are often made al-
most exclusively in terms of personal experience rather than by relying on external
authoritative sources. This overuse of believe may point to a transfer effect and a mis-
match of registers since oral Spanish prefers believe (“yo creo”, I believe) to think for
expressing personal opinions.
In the light of these data our hypothesis was that, in contrast to the SPICLE writ-
ers, the AW students express their opinions by different means. One such device would
be through evaluative adjectives and adverbs, which are used precisely to comment on
the claims made by others, as in “Actually, what it clearly reveals is that the result of this
process is ...”. This use of lexical resources to convey writer alignment would suggest
that the AW students have been successful in incorporating the rhetorical devices set
out in the can do statements.
Table 6 presents the total number of occurrences of reporting verbs in the SPICLE
data as compared with AW1 and AW2. There are statistically significant differences in
frequency between the SPICLE texts (produced without the aid of specific writing in-
struction) and the two AW sub-corpora. Moreover, significance increases as the students
progress through the coursework, from sample one (AW1) to the final essay (AW2).
Table 6. Total occurrences of reporting verbs: SPICLE, AW1 and AW2

Total reporting verbs
Corpora          SPICLE (raw figures)   AW texts (raw figures)   P
SPICLE vs AW1    835                    98                       <0.025
SPICLE vs AW2    835                    194                      <0.001
9. Thompson (2001b: 58) defines interactive textual resources as those which “help to guide
the reader through the text”.
10. Thompson (2001b: 58) defines interactional resources as involving “the reader collabora-
tively in the development of the text”.
Table 7. Total occurrences of reporting verbs: AW1 and AW2

Total reporting verbs
Corpora       AW1   AW2   P
AW1 vs AW2    98    194   <0.05
However, a chi-square comparison of AW1 and AW2, displayed in Table 7, shows that
the increase in frequency only approaches significance (p between 0.10 and 0.05). The
lack of significant findings may have occurred because in the AW2 texts, the students
use other, more varied devices to support their claims, not reflected by the reporting
verb figures. A detailed qualitative analysis of the AW texts certainly suggests this is
the case. What emerges from this is that by the end of the course the students are able
to effectively balance external voices with their own, as their expressions make clear,
e.g. “There can be no doubt that ...”, “Even if it has been clearly demonstrated that ...”
and “it seems more logical that society should struggle ...”.
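As an illustration of the kind of chi-square comparison reported in Tables 6 and 7, the
following is a minimal sketch in Python. It assumes a 2 x 2 contingency table (tokens that
are the feature vs. all other tokens, in each of two corpora) and no continuity correction;
the chapter does not state how its own test was set up, so this is one common arrangement
rather than the authors' procedure. The function names are hypothetical, and the counts and
corpus sizes in the example are invented for illustration; they are not those of SPICLE,
AW1 or AW2.

```python
def chi_square_2x2(count_a, size_a, count_b, size_b):
    """Pearson chi-square (no continuity correction) for one feature in two corpora.

    Rows are the two corpora; columns are 'tokens that are the feature'
    and 'all other tokens'.
    """
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - (count_a + count_b)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Standard critical values of chi-square with 1 degree of freedom.
CRITICAL_1DF = {0.10: 2.706, 0.05: 3.841, 0.025: 5.024, 0.001: 10.828}


def p_band(chi2):
    """Smallest tabulated alpha whose critical value chi2 exceeds, or None."""
    passed = [alpha for alpha, critical in CRITICAL_1DF.items() if chi2 > critical]
    return min(passed) if passed else None


# Invented example: 10 hits in 1,000 tokens vs. 20 hits in 1,000 tokens.
chi2 = chi_square_2x2(10, 1000, 20, 1000)
print(round(chi2, 3))  # 3.384
print(p_band(chi2))    # 0.1, i.e. p lies between 0.10 and 0.05
```

In this invented case the statistic exceeds the critical value for 0.10 but not for 0.05,
which is the situation described above as one that "only approaches significance".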
Table 8 shows the total figures for the SPICLE corpus and the two AW sub-corpora
with regard to the types of rhetors used. The clearest developmental trends can be ob-
served in the increase on the part of AW students in the use of abstract and imper-
sonal rhetors (e.g. “one” as subject), and of impersonal and passive constructions, with
a corresponding and extremely marked decrease in the use of we and I as subject. All of
these figures point to a greater degree of impersonalization in these texts and suggest,
once again, that authorial voice is being conveyed by other evaluative lexical means.
Table 8. Rhetor types: raw figures and normed frequencies

Classifications                                     SPICLE        AW1           AW2
                                                    Raw  Normed   Raw  Normed   Raw  Normed
Abstract rhetors                                     91  0.05      16  0.15      26  0.15
Impersonal and passive constructions                139  0.07      36  0.34      38  0.22
General, non-specific and unnamed rhetors           159  0.08      22  0.21      34  0.2
Specific and/or named rhetors                        78  0.04      20  0.2       80  0.5
Deictics as subject (referring to propositions        1  0.0005     0  0          0  0
  in the text)
“one” subject                                         2  0.001      2  0.02      12  0.07
“you” subject                                        10  0.005      0  0          2  0.01
“we” subject                                        125  0.06       0  0          0  0
“I” subject                                         230  0.11       2  0.02       2  0.01
A further developmental pattern is clear in the greater incorporation of named outside
sources, such as “Porter (2002) claims that ...” and “As the National Energy Education
Development Project (2008) has shown ...”. These academically appropriate references
are specifically mentioned in the can do statements and are targeted in the
course exercises.
However, one category shows no improvement, i.e. the use of a deictic as subject
(referring to propositions in the text). The AW writers do not make use of this cohesive
device in conjunction with reporting verbs and further analysis of the texts shows that
it is generally not exploited. They prefer to use deictics in noun phrases such as “These
two simple questions are answered in a book called ...”. This absence does not mean
that their texts lack cohesion, but perhaps suggests that their range of devices is still
somewhat limited. The use of deictics should therefore be a focus of future instruction
and should be included in a can do statement.
5.2 Evaluative lexical resources
We had hypothesized that the lack of a statistically significant difference between the
totals for reporting verbs in the AW1 and AW2 texts could be explained by the writers
using other lexical devices, shown in Table 9, to convey internal and external voice and
writer alignment. In order to test this assumption, a comparison was made of the figures
for four types of evaluative lexical devices: it + copular verb + evaluative adjectival
phrase + that; it + copular verb + evaluative adjectival phrase + to + verb expressing
mental or verbal processes; and two types of adverbial function realised by adverbs
ending in -ly. The data analysed here show mainly developmental patterns.
Table 9 displays the figures for these evaluative devices and shows that, as far as the
total use of such devices is concerned, while they occur with a slightly lower relative
frequency in the AW1 texts than in the SPICLE texts, their frequency in the AW2 texts
is considerably higher. The results of the chi-square tests, as shown in Table 10,
revealed that the low frequency in the AW1 sub-corpus did not differ statistically from
the SPICLE writers’ use. However, the greater frequency in the AW2 sub-corpus, as
compared to SPICLE, was highly significant (p < 0.001). Thus it can be said that the
students who had received no specific AW instruction (SPICLE) and those with just
two weeks of instruction (AW1) used these lexical resources to a similar degree.

Table 9. Distribution of evaluative devices per corpus

Type of evaluative device                           SPICLE        AW1           AW2
                                                    Raw  Normed   Raw  Normed   Raw  Normed
it * adj. + that                                     64  0.03       3  0.03      13  0.08
it * adj. + to + verb of mental or verbal process    49  0.03       5  0.05       3  0.02
*ly adverbs used as disjunct/used to modify         137  0.07       5  0.05      56  0.33
  discourse verb
Total                                               250  0.13      13  0.12      72  0.43

Table 10. Total uses of evaluative devices: SPICLE, AW1 and AW2

Total uses of evaluative devices
Corpora          SPICLE   AW texts   P
SPICLE vs AW1    250      13         Not significant
SPICLE vs AW2    250      72         <0.001

On the other hand, from the highly significant difference between the total use of
evaluative devices in the AW2 texts, as compared to those used by AW1 (Table 11), it
can be concluded that the students have become aware of the importance of such
expressions for negotiating writer stance. It can also be presumed, with a reasonable
degree of confidence, that the lack of statistical significance regarding the increase in
the total use of reporting verbs between AW1 and AW2 texts is attributable to and
compensated for by the greater use of these other lexical resources by AW2 students.
A more detailed examination of the number of occurrences of each type of lexical
device shows that the developmental increase is not uniform across the categories. In
the it * adj. + that constructions, the relative frequencies in the SPICLE and AW1 texts
are similar, while in the AW2 texts the frequency is higher. In the case of the *ly ad-
verbs, the frequency in the AW1 texts is the lowest, and in the AW2 texts is the highest
while in the SPICLE texts the frequency falls between these two. There is even a case of
the AW2 texts showing the lowest frequency of the three groups, namely in construc-
tions with it * adj. + to + verb of mental or verbal process, especially think and say. This
does not, however, detract from the relevance of the considerable development seen in
the total figures.
One individual category does merit particular comment, i.e. that of the adverbs.
Here the frequency shown in the AW2 texts – indicating a very clear developmental
pattern – is markedly higher (0.33) than that of either the AW1 texts (0.05) or the
SPICLE texts (0.07). Furthermore, when the two types of adverbial use are examined
individually, other interesting observations can be made. Table 12 shows that, in the
SPICLE corpus, 13 of the 137 adverbs (9.5%) studied correspond to adverbs used to
modify a reporting verb. The AW writers do not use adverbs in this way at all in their
course-initial texts.

Table 11. Total uses of evaluative devices: AW1 and AW2

Total uses of evaluative devices
Corpora       AW1   AW2   P
AW2 vs AW1    13    72    <0.001

Table 12. Types of *ly adverbs used in the three sub-corpora

          Disjuncts   Adverbs modifying reporting verbs
                      Modifying internal voices   Modifying external voices
SPICLE    124         7                           6
AW1       5           0                           0
AW2       36          1                           19

In their final essays, however, 20 of the 56 adverbs (35.7%) are used in this way.
Interestingly, 19 of these are used to comment on external voices. This represents a
relative frequency of 0.11 per one hundred words, in contrast to the SPICLE writers’ use
of only 6 such adverbs (relative frequency of 0.003 per one hundred words). Thus, it is
evident that the AW writers have, by the end of the course, realised
the importance of evaluative adverbs to comment on propositions from various inter-
nal and external sources. Also relevant is the fact that they use these adverbs to com-
ment on the opinions of others to a far greater degree than do students who have not
studied AW. There seems to be a shift from the type of use found in SPICLE to boost a
very personal presentation of the writer’s stance (e.g. “I strongly believe that ...”, “I com-
pletely disagree with the idea that ...”) to a more complex use in the AW2 texts (e.g. “as
scientists plausibly claim”, “it is rightly considered that ...”) which combines others’
voices and their own. While there is a certain awkwardness in some of the examples
(e.g. “as Tierney (1996) wrongly pointed out”), which is usual when learners are testing
the limits of a recently acquired lexical device, it is clear that these writers have become
aware of, and are experimenting with, a new possibility for expressing their alignment
with external propositions (see the example of a final essay in Appendix I).
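The proportions quoted above can be re-derived from the raw figures in Table 12. The short
Python sketch below does this and also shows how a frequency normed per one hundred words
is obtained from a raw count. The function names are hypothetical, and because the sizes
of the corpora are not given in this section, the token count in the last example is an
invented figure used only to show the calculation.

```python
def share(part, whole):
    """Percentage of `whole` accounted for by `part`, rounded to one decimal place."""
    return round(part / whole * 100, 1)


def per_hundred_words(raw_count, corpus_tokens):
    """Frequency normed to a base of one hundred words."""
    return raw_count / corpus_tokens * 100


# Raw figures from Table 12: (disjuncts, internal voices, external voices).
table_12 = {
    "SPICLE": (124, 7, 6),
    "AW1": (5, 0, 0),
    "AW2": (36, 1, 19),
}

for corpus, (disjuncts, internal, external) in table_12.items():
    modifying = internal + external
    total = disjuncts + modifying
    print(corpus, total, share(modifying, total))
# SPICLE 137 9.5
# AW1 5 0.0
# AW2 56 35.7

# Invented corpus size, for illustration only: 25 hits in 10,000 tokens.
print(per_hundred_words(25, 10_000))  # 0.25
```

The totals printed for each corpus (137, 5 and 56) match the *ly adverb row of Table 9,
which is a useful cross-check when figures are spread over several tables.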
6. Conclusion
This study, using corpora of various sizes, set out to examine Spanish EFL university
students’ use of various devices for intertextual dialogue, namely discourse-oriented
verbs, including the various types of rhetors which can co-occur with these, and
certain kinds of grammar patterns, such as anticipatory it constructions and modal
adverbs, which also permit the inclusion of writer evaluations.
The EU Framework adopted in the AW course and the re-working of the descrip-
tors for writing (can do statements) – developed from analysis of the student data
previously collected (SPICLE texts) – arose from the practical needs of our instructors
and their students. The instructors used structural and rhetorical features to draw up
a set of criteria for measuring the students’ written performance throughout and at the
end of the course.
Although the features examined in this study were not the only criteria used for
evaluation of the students’ final texts, these aspects of AW, adopted as criteria for
advancement, enabled the instructors to avoid focusing solely on the EFL students’
lexico-grammatical errors and, instead, to give more attention to feasible improvement
in discourse competence (see Appendix II for developmental trends).
The difference, in regard to these criteria, between the SPICLE group (with no specific
training in AW) and the two sub-groups of the AW course has shown that the academic
literacy of university students can be improved by studying the students’ use of
text-internal and text-external features and by centring sets of exercises around these
features.
Specifically, the comparison of the AW1 texts (written at the beginning of the course)
with the AW2 texts (the final sample of writing) presented here confirms that the students
benefited from a detailed list of features which they could learn to incorporate into their
texts in the limited time period of the course. To be sure, there are still some features that
call for attention. However, the use of explicitly stated text-internal and text-external
requirements has advanced the students’ understanding of the dialogic processes involved
in argumentative writing and, in many cases, their discourse competence improved so
greatly that these writers appear to have progressed to an entirely new stage of competence,11
as measured by the descriptors included in the CEFR for argumentative writing.
References
Bazerman, C. 1994. Systems of genres and the enhancement of social intentions. In Genre and
New Rhetoric, A. Freedman & P. Medway (eds), 79–101. London: Taylor & Francis.
Bhatia, V. 2004. Worlds of Written Discourse. London: Continuum.
Biber, D. 1995. Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge: CUP.
Briggs, C. & Baumann, R. 1992. Genre, intertextuality and social power. Journal of Linguistic
Anthropology 2: 131–172.
Charles, M. 2006. The construction of stance in reporting clauses: A cross-disciplinary study of
theses. Applied Linguistics 27: 492–518.
Council of Europe. 2001. Common European Framework of Reference for Languages. Cambridge:
CUP.
Council of Europe. 2009. Relating Language Examinations to the Common European Framework
of Reference for Languages: Learning, Teaching, Assessment (CEFR). Strasbourg: Language
Policy Division.
Educared. 2009. Exámenes resueltos, Literatura española.
<http://www.educared.net/universidad/> (accessed April 2009).
Gilquin, G., Granger, S. & Paquot, M. 2007. Learner corpora: The missing link in EAP pedagogy.
Journal of English for Academic Purposes 6(4): 319–335.
11. At the 2008 Montenegro meeting of the English Profile Networks: Research Network in South-
East Europe, Cambridge First Certificate Examiners estimated that the B1.1. student <011–2008>
whose initial and final texts appear in Appendix II had improved her writing to such an extent
that, in the final essay, she appeared to be approaching B2, First Certificate Level.
Granger, S. 1983. The “be” + Past Participle Construction in Spoken English: With Special Empha-
sis on the Passive. Amsterdam: North-Holland.
Granger, S. 1998a. Prefabricated patterns in advanced EFL writing: Collocations and formulae.
In Phraseology: Theory, Analysis and Applications, A. Cowie (ed.), 145–160. Oxford: OUP.
Granger, S. (ed.). 1998b. Learner English on Computer. London: Addison Wesley Longman.
Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hunston, S. & Francis, G. 1999. Pattern Grammar: A Corpus-driven Approach to the Lexical
Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John Benjamins.
Johns, A. 1997. Text, Role and Context. Cambridge: CUP.
Martin, J. 1990. Factual Writing: Exploring and Challenging Social Reality. Oxford: OUP.
Neff, J., Martínez, F. & Rica, J. P. 2001. A contrastive study of qualification devices in NS and NNS
argumentative texts in English. In ERIC Document Reproduction Service, ED 465301.
Washington DC: Educational Resource Information Center, U.S. Department of Education.
Perera, K. 1989. Children’s Writing and Reading: Analysing Classroom Language. Oxford: Basil
Blackwell.
Scott, M. 2007. Wordsmith 5.0. Oxford: OUP.
Sinclair, J.M. 1993. Written discourse structure. In Techniques of Description. Spoken and Writ-
ten Discourse, J.M. Sinclair, M. Hoey & G. Fox (eds), 6–31. London: Routledge.
Swales, J. 1990. Genre Analysis: English in Academic and Research Settings. Cambridge: CUP.
Thompson, G. 2001a. Corpus, comparison, culture: Doing the same things differently in
different cultures. In Small Corpus Studies and ELT [Studies in Corpus Linguistics 5],
M. Ghadessy, A. Henry & R. Roseberry (eds), 311–334. Amsterdam: John Benjamins.
Thompson, G. 2001b. Interaction in academic writing: Learning to argue with the reader.
Applied Linguistics 22(1): 58–78.
Werlich, E. 1983. A Text Grammar of English. Heidelberg: Quelle und Meyer.
Appendix 1
Final student essay <001–2009, level B1.3>

Nowadays, specialists of the environment are constantly insisting on the impor-
tance of recycling. It is rightly considered that recycling is the best way of lengthening
the Earth’s life. Consequently, society will experiment a healthier and longer life. The
problem comes when recycling. For many people it means something which might be
wrong. Porter(2002) claims that recycling involves processing used materials, reduce
the consumption of fresh raw materials, reduce energy usage, reduce air pollution(from
incineration) and water pollution(from landfilling).
First, it is obligatory to know that everyday day people use energy in their different
daily activities. For instance: when someone turn on a light; when someone is cooking;
when both, public transport and own cars, are being used; etc. That is, society wastes
energy unconsciously, according to The National Energy Education Development Project
(2008). In order to avoid this wasting Porter (2002) proposes some materials that could
be recycled, such as: paper, glass, metal, electronics, textils, and paper. None the less, in
some countries governments state that recycling is not only difficult to do but also useless.
That means that recycling wastes resources, as well as non-recycling do (Tierney 1996).
In spite of what Tierney claimed, the NEEDP (National Energy Education Devel-
opment Project) (2008) published some advice so that recycling would be easier. Turn-
ing off machines like the microwave, air-conditioning, etc when not necessary are
some. Furthermore, using less energy (energy conservation) and increasing ecological
machines and transport, could also help.
With regard to Economy, it is insistingly claimed that recycling would affect market
prices. Therefore, economy would experiment a descent due to the recycling process,
which is significantly expensive (Tierney 1996). Moreover, landfills would be needed and
that would arise prices and incinerators are considered as an useless recycling way. Other-
wise, they would waste more energy than save, as Tierney (1996) has intelligently stated.
In conclusion, society is divided due to the recycling process. While a part says
that it is utterly unnecessary it is unconditionally stated by the other part that it is a
favorable option and an incredible chance to save, not only human beings but also the
entire world. Unless the recycling process would be forbidden by governments, people
should act and recycle.
Appendix 2
Initial essay (AW1), B1.1. student* <011–2008>
Some students believe that university degrees do not prepare people      | Essay Prompt
for the real world. To what degree would you agree with this belief?     | (an exact copy)
P. 1 The actual Educative System is being the main theme in a lot of     | Contextualization
debats in our days. Politics and students do not agree because it is     | Claim 1
truth that a lot of students believe that the university degrees do not
prepare us for the real world and for our first job.
P. 2 On the one hand, I believe that they have some reason because       | Data 1
when you finish your degree really you don’t have made anything          | Example 1
similar to you be will have to do in your real job. In my case, for
instance, when I finished my university degree which is english filol-
ogy, I will not have any idea to teach in a school class.
P. 3 On the other hand, I think that we learn a lot of literature and    | Claim 2
english: so we will be able to make our job really well as soon as we    | Contradict
get a bit of self-confidence, because we have a good preparation.        | Claim 1
P. 4 FINALLY, students should have more practise classes and             | Conclusion
periods in our university degrees, although I think that when I have     | (repetition of P. 3)
finished my degree I will have a great knowledges for my new job
although I haven’t much experience.
* Paragraphs are indicated with “P”.
Final essay (AW2), the same B1.1. student <011–2008>
P 1. Physician-assisted suicide is a controversial theme                 | Claim 1
because of a lot of human lifes depend on this decision. In              | Data 1, 2, 3
this debat, there is a mixture of religious beliefs, opinion of          | (prospection)
the family members and of course, moral values. Nevertheless,            | Mistaken DM
the theme has become more important in the recent years.                 | Stranded information
                                                                         | for Claim 1
P 2. Against the legalisation, the religious beliefs have a big          | Data 1.1.
importance. It could be observe how people who have a strong
religious education are generally disagree with legalizing
assisted suicide. Some people belive that legalizing this type of
suicide, this could lead to euthanasia, and more terminally ill
patients could decide to dye. People who disagree with this
legalisation believe that this kind of patients show a large
value of resignation.
P 3. Other important aspect in this debat could be the                   | Data 2.1.
opinion of the family members. People who agree with the
legalization of the physician-assisted suicide, probably
consider that this solution would help to relieve families of
the burdens of caring for terminally ill relative, because it is
very exausting looking after a ill patient. Although, perhaps,
family member who disagree with this idea prefer to look
after their relatives for all their lifes. In total, there are an 35%    | Stranded data 4, better
of the people who do not agree with Euthanasia in various                | placed with Data 1.1.
European countries.
It seems that a lot of people disagree with the legalization, but        | Stranded data 5, better
in Europe more than an 60% of the population agree with                  | placed with Data 2.1.
Euthanasia. People believe that a terminal ill have the right to
control his/her own life of course, if the dead has a justifi-
cated cause.
P. 5 Doctor has a important paper [role] too, because they               | Data 3 (not previously
could be prosecuted for assisting in the suicide, although if            | mentioned)
the legalization of assisted suicide is approved they could not
be prosecuted because this process would be legal.
P. 6 In conclusion, the dead is an important aspect but, over            | Conclusion
all, it seems that it is more important the right of the human           | DM: weakens conclusion
being to decide about his/her own life, and of care it is more           | (& possible
important if they don’t want to live, although it is important to        | contradiction of
consider the opinion of the family members.                              | previous claims)
Revisiting apprentice texts
Using lexical bundles to investigate expert
and apprentice performances in academic writing
Christopher Tribble
Early developments in corpus linguistics were driven by the needs of those
interested in the description of a language, whether as grammarians or
lexicographers, not the needs of language teachers. This has hindered the
development of corpus applications in language education, but work by Granger
and others changed the situation and it is now accepted that learner data can
be a valuable resource for those concerned with language education. Drawing
on Biber’s (2006) account of lexical bundles, this chapter provides a practical
example of how the written production of postgraduate students in a single
disciplinary area can be used to build an account of contrasts between apprentice
and expert writing, and how this account can be used in the development of a
course specification for English for Academic Purposes (EAP) writing.
1. Introduction
The first phase of computer corpus development in the United Kingdom was driven by
linguists and lexicographers whose primary concern was to build a broad account of the
written and spoken production of native speakers of the language. However, from the
late 1980s onwards, there was also an interest in building corpus resources which would
have relevance to the needs of foreign language learners (e.g. Summers & Rundell 1990)
and, as a result of the Collins Birmingham University International Language Database
(COBUILD) project at Birmingham University (Sinclair 1987), this kind of data
became directly available to a group of teachers concerned with the needs of students on
English for Academic Purposes programmes at Birmingham. It was this partnership
between researchers and teachers which led to the development of corpus-informed or
Data Driven Learning (DDL) approaches to language teaching (Johns 1994).
A major motivation for these early developments in corpus informed language
teaching (e.g. Johns 1988, Tribble & Jones 1990, Stevens 1991) was linked to a concern
over the kinds of language data which had hitherto been available for classroom use.
Sinclair later summarised this concern as follows: “Linguistics has been formed and
shaped on inadequate evidence and in a famous phrase ‘degenerate data’. [...] In linguis-
tics up till now we have been relying very heavily on speculation” (Sinclair 2004: 9).
What these teachers, materials developers and researchers wanted was to give
their students access to real rather than made-up language data, with the long-term
ambition of enabling learners to become actively involved in discovery learning
themselves: “If you are able to give your students access to a PC you can also give them the
chance to discover rules of language use for themselves” (Tribble & Jones 1990: 56).
Although the claims of some of those involved in the attempt to use corpus data in
language education were criticised as having insufficient regard for the linguistic needs
of learners (see Widdowson 1991), teachers and materials developers continued to
work with corpus data to prepare syllabus specifications (e.g. Willis 1990) and practical
teaching materials (e.g. Thurston & Candlin 1997). During this period, work also
started on the collection and use of corpus data which it was felt would be more
directly relevant to language learning purposes (Willis & Willis 1988), and on the
investigation of learner language with a view to better understanding the problems which
learners faced in the process of language acquisition (Meara & English 1987, Granger
1993 and 1994). Alongside this research into issues in learner writing (i.e. error,
overuse, underuse, etc.), work was also done to use learner language as a resource in
classroom learning and syllabus design (Tribble 1989). Building on this earlier experience,
Granger & Tribble (1998) offered a set of reasons for using learner corpora in the de-
velopment of teaching materials, and provided some practical examples of how this
might be done. Granger & Tribble’s study focused on the contrast between native
speaker (NS) and non-native speaker (NNS) writing1, with an emphasis on how stu-
dents could improve their accuracy in the use of problematic words. They addressed
two main questions in the course of their study − which forms could most usefully be
presented to learners, and which model should be drawn on in the preparation of
teaching materials. These questions remain relevant today and will provide the start-
ing point for this present paper.
In the following sections, I will ask anew the questions of which forms and which
models best meet the needs of learners, this time focusing on the needs of students of
English for Academic Purposes (EAP). In answering these questions I will first discuss
the value of a focus on what are variously called lexical bundles (Biber et al. 1999),
clusters (Scott 2008), or, within the Natural Language Processing (NLP) community,
n-grams. I will then go on to demonstrate how corpora of expert and apprentice texts
can be used in the development of practical resources to support Master’s level stu-
dents who need to write English for Academic Purposes. As the texts are written by
native and non-native speakers of English, and as they are potentially of a standard
which could be offered for publication in journals belonging to the Applied Linguistics
1. It is interesting to note how the Native Speaker and Non-native Speaker dichotomy was not
contentious terminology at that time.
discourse community, I have chosen to use the terms expert and apprentice in relation
to the texts in these two corpora; native and non-native are not relevant categories in
the present context. Such an approach draws on Rampton’s (1990) account of expertise
in foreign language production and Bazerman’s (1994) notion of ‘expert performance’.
In carrying out this investigation I shall be building on Hyland’s (2008a and 2008b)
studies in which lexical bundles were used to identify contrasts both between expert
and apprentice performances, and between texts in different disciplinary areas. This
study extends work in the area by considering contrasts between expert and apprentice
performances within the same disciplinary area, thereby offering a descriptive model
which can be of value to teachers and learners in English for Specific Purposes (ESP)
and disciplinarily specific EAP programmes.
2. Forms and models
2.1 Which forms?
Granger & Tribble (1998) took individual word forms as the starting point for their
work. However, a growing number of studies (Biber & Conrad 1999; Biber 2006; Cor-
tes 2004; Scott & Tribble 2006; Hyland 2008a and 2008b) have demonstrated the value
of the investigation of lexical bundles in differentiating between language production
in different registers, genres or text populations within genres.
Although these combinations have been variously defined, a useful and simple
account of lexical bundles is given in Biber (2006: 134):
They are simply the most frequently occurring sequences of words, such as do you
want to and I don’t know what. These examples illustrate two typical characteristics of
lexical bundles: they are usually not idiomatic in meaning, and they are usually not
complete grammatical structures.
The actual cut-off which is used to determine whether or not to include a specific
lexical bundle in a study largely depends on the size of the corpus one is studying. For
Biber (2006) the threshold was 40 times per million words. Scott (2008) sets a default
of a minimum of 5 occurrences in the corpus under investigation in his account of
clusters. As this study makes use of smaller corpora than those used by Biber, I use a
lower threshold: a normalised count of 3 instances per 100,000 words. This threshold
has been arrived at empirically and was chosen to ensure that there was a sufficiently
large set of bundles in the final results to allow useful comparisons.
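The relationship between these thresholds can be checked with a little arithmetic. The following is a minimal sketch only: the function names are hypothetical and are not part of any tool used in the study. It converts Biber's per-million threshold to the per-100,000 scale and estimates the raw frequency implied by a threshold of 3 per 100,000 words in a corpus of 509,891 words (the size reported for the KCL Apprentice Writing Corpus).

```python
import math

def per_million_to_per_100k(rate_per_million):
    """Convert a rate per million words to a rate per 100,000 words."""
    return rate_per_million / 10

def min_raw_count(threshold_per_100k, corpus_words):
    """Smallest whole number of occurrences that meets a normalised threshold."""
    return math.ceil(threshold_per_100k * corpus_words / 100_000)

# Biber's (2006) threshold of 40 per million words, on the per-100,000 scale
biber_threshold = per_million_to_per_100k(40)

# Raw count implied by 3 per 100,000 in a 509,891-word corpus
kcl_min_count = min_raw_count(3, 509_891)
```

On these assumptions Biber's threshold corresponds to 4 per 100,000 words, so a threshold of 3 per 100,000 is indeed the lower of the two.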
Apart from the issue of which threshold to use, a further consideration when se-
lecting the lexical bundles to study is whether to choose bundles containing three
words, four words or more. Most studies to date have focused on 4-word combina-
tions. As Cortes (2004: 401) argues: “many 4-word bundles hold 3-word bundles in
their structures (as in as a result of, which contains as a result)”. However, in a study
which compared apprentice writing with expert performances, Scott & Tribble
(2006: 141) argue in favour of also maintaining a focus on 3-word lexical bundles,
commenting that 3-word bundles offer learners ways of discovering a range of less
frequent, but no less valuable phrasal combinations. They give as an example the fact
that: “... most may be by far and away the most frequent immediate right collocate of
‘one of the’, but there is a wealth of other combinatorial potentials which we will lose
sight of if three word clusters + their right collocates are excluded from our analysis”
(ibid.: 141). Their conclusion is that: “... while 4-word clusters are strong discrimina-
tors between registers, there is a good argument for also using 3-word clusters to-
gether with their immediate right collocates in studying the contrast between different
styles of writing or the product of different groups of writers” (ibid.: 142).
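The idea of a 3-word cluster "together with its immediate right collocates" can be illustrated with a short sketch. This is an assumption-laden illustration, not the procedure used by Scott & Tribble: the function name and the sample sentence are invented, and tokenisation is reduced to splitting on whitespace.

```python
from collections import Counter

def right_collocates(tokens, bundle):
    """Count the words that immediately follow each occurrence of `bundle`."""
    target = bundle.lower().split()
    n = len(target)
    lowered = [t.lower() for t in tokens]
    counts = Counter()
    # Stop one token early so that there is always a word to the right.
    for i in range(len(lowered) - n):
        if lowered[i:i + n] == target:
            counts[lowered[i + n]] += 1
    return counts

# Invented sample text, for illustration only
sample = "one of the most common and one of the best and one of the most useful".split()
collocates = right_collocates(sample, "one of the")
```

In this toy sample "most" is the most frequent right collocate, but the less frequent "best" is also retained, which is the kind of combinatorial information a 4-word-only analysis would lose.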
In the present study, I will likewise focus on both 4-word and 3-word lexical bundles
in an attempt both to identify contrasts between the ways in which expert and appren-
tice writers construct the texts which are required of them in order to participate in
specific academic genres, and to outline a basis from which apprentice writers can
enhance their performances within these genres.
2.2 Which models?
In foreign language teaching, the choice of which model to present to students has
become multiply problematic. Concerns around notions of authenticity summarised
in Widdowson’s (1990) discussion of the contrast between authentic and genuine in-
put for language teaching can be seen as leading to the debates around the relevance of
corpus data which are summarised in Seidlhofer (2003). In the teaching of pronuncia-
tion, concerns over the ownership of English, the status of native and non-native
speakers of the language, and the utility of native speaker standards for pronunciation
(Seidlhofer 2000) have led to Jenkins’ (2000) argument for a model for pronunciation
teaching based on empirical comprehensibility criteria rather than on idealised native
speaker models. In EAP writing instruction, Hyland (2002) has argued for greater
specificity in the kinds of models used. This stance is in opposition to approaches
which argue in favour of developing general and transferable academic competencies
(Spack 1988), or to the position of proponents of Academic Literacies (Lea & Street
1998; Lillis 2001) who can be seen as arguing against the use of models at all. In this
present study the choice of model is less problematic as it is possible to see a close cor-
relation between the kinds of research papers and dissertations which Masters’ stu-
dents have to write for purposes of assessment, and the research papers that they read
in their field of study. Indeed, in the King’s College assessment criteria for a piece of
work to achieve the highest grade it has to be: “Striking insightful, displaying for ex-
ample: publishable quality, outstanding research potential1”.
In this context, it is possible to see examples from journal articles in applied lin-
guistics as the textual result of expert performances, following Bazerman (1994: 131)
for whom the “expert performance describes the whole act [of composition] with all its
potential variety and complexity.”
[Cyclical diagram with five stages:
1. Building the context
2. Modelling and deconstructing the text
3. Joint construction of the text
4. Independent construction of the text
5. Linking related texts]
Figure 1. The Teaching/Writing Cycle (Feez 1998: 28)
Exemplar texts arising from expert performances can then be exploited within a
teaching/learning cycle such as that proposed by Feez (1998), shown in Figure 1, and
constitute an excellent basis for an exemplar corpus.
By having access to both the exemplars themselves and the results of corpus anal-
ysis, students are better placed to work through the ‘modelling and deconstructing the
text’ phase of the cycle. Here we can see modelling fitting in with Bazerman’s (ibid: 193)
proposal that modelling is a “process when we try out the behaviours we observe in
others; it is clearly related to learning by imitation as advocated in classical rhetoric”,
and with Flowerdew’s comments that:
this skill of seeking out instances of genre-dependent language modelling use in
English and incorporating them in one’s own writing or speaking is not limited to
foreign languages. Many native speakers make use of others’ writing or speech to
model their own work in their native language, where the genre is an unfamiliar
one. It is time that this skill was brought out of the closet, and exploited as an aid
to learning, instead of remaining a secret activity not acknowledged by teachers.
(Flowerdew 1993: 313)
In this study I shall use the term exemplar corpus to refer to a collection of texts made
up of expert performances which are very closely aligned with the kinds of written
production to which apprentices aspire (Tribble 2001). For technical writing students
on an ESP programme this could be a set of engineering manuals; for post-graduate
students on an in-sessional programme, it could be a set of PhD theses in disciplinarily
relevant fields. In this instance the exemplar corpus I shall draw on is a collection
of journal articles in Applied Linguistics.
I shall use the term analogue corpus (Tribble 2001) to refer to collections of texts
which are generically different from a particular group’s target performance, but which
usefully share some features (both in terms of register and organisation) with these
performances. Texts in analogue corpora can be close to or distant from the target behaviour
of a particular group of learners. Thus, a collection of factual encyclopaedia essays can
be a reasonably close analogue corpus for students on written composition courses,
while a collection of newspaper editorials can constitute a distant but still useful ana-
logue corpus for students who need to develop argumentative essay writing skills. In
this study, I will use a collection of successful student writing as a close analogue cor-
pus, and a collection of general academic writing from the British National Corpus
(BNC) as a more distant analogue corpus alongside a collection of published academ-
ic journal articles from Acta Tropica, an international journal that covers biomedical
and health sciences with particular emphasis on topics relevant to human and animal
health in the tropics and the subtropics.
3. Investigating lexical bundles in apprentice and expert texts
3.1 Data
The five data sets which have been drawn on in this particular study are listed in Table 1.
The purpose of this selection has been to enable comparisons between lexical
bundles in a corpus of apprentice written production (KCL Apprentice Writing
Corpus) and a close analogue corpus (BAWE), an exemplar corpus (Applied Linguis-
tics Corpus) and two progressively more distant analogue corpora (BNC Baby, Aca-
demic and Acta Tropica).
3.2 Method
Word (.doc) documents (KCL Apprentice Writing Corpus) and PDFs (Applied Linguistics
Corpus/Acta Tropica) were converted to Unicode text using a Word 2003 macro
or commercially available software, and, in the case of KCL Apprentice Writing
Corpus, front and end matter was separated from the body of the text with Text En-
coding Initiative (TEI2) compliant tags. It was not feasible to carry out this process
with the ad-hoc Applied Linguistics Corpus and Acta Tropica corpora, so a certain
amount of noise had to be accepted in findings from these collections.
2. http://www.tei-c.org/index.xml (accessed 28/11/09).

Table 1. Corpus data in the study
Corpus | Words | Description
Acta Tropica (distant analogue) | 3,409,969 | Acta Tropica – A 3.5 million word corpus of 1,071 articles from the journal Acta Tropica from 1989–2009 (http://www.elsevier.com/wps/find/journaldescription.cws_home/506043/description#description – accessed 01/12/09)
Applied Linguistics Corpus (exemplar) | 939,923 | Applied Linguistics Corpus – A collection of 174 research articles in applied linguistics drawn from the following journals: Applied Linguistics; Discourse and society; English for Specific Purposes Journal; English Language Teaching Journal; English World-Wide; International Journal of Applied Linguistics; Journal of English for Academic Purposes; Journal of Second Language Writing; Language and Communication; Language Learning and Technology; ReCALL; Studies in Higher Education; TESOL Quarterly
BAWE (close analogue) | 3,277,560 | Corpus of British Academic Written English – “The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented” (http://www2.warwick.ac.uk/fac/soc/al/research/collect/bawe/ – accessed 27/11/09)
BNC Baby – Academic (medium close analogue) | 1,000,198 | BNC Baby (Academic) – a one million word subset made up of thirty texts from the Academic Writing component of the British National Corpus (http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products#baby accessed 08-02-10)
KCL Apprentice Writing Corpus2 (research corpus) | 509,891 | King’s College London – Apprentice Writing Corpus – a corpus of apprentice writing donated by students on the MA programme in English Language Teaching and Applied Linguistics and BA in English Language and Communications. The corpus is under development and not publicly available as yet. For this study a subset of 119 MA level assignments and dissertations was used. 34% of texts were written by students for whom English is not a first language.
3. http://www.kcl.ac.uk/schools/sspp/education/courses/masters/elt/handbook.html (accessed
27/11/09).
UNEDITED top 5
ENGLISH FOR ACADEMIC PURPOSES 889
OF ENGLISH FOR ACADEMIC 791
JOURNAL OF ENGLISH FOR 782
ENGLISH FOR SPECIFIC PURPOSES 405
CAMBRIDGE CAMBRIDGE UNIVERSITY PRESS 264
Figure 2. Applied Linguistics Corpus – pre-edit
EDITED top 5
ON THE OTHER HAND 130
IN THE CASE OF 105
AT THE SAME TIME 101
IN THE USE OF 70
ON THE BASIS OF 66
Figure 3. Applied Linguistics Corpus – post-edit
Once the data were ready for processing, a WordSmith Tools index was created for each
corpus, and 3-word and 4-word lexical bundle lists were then generated and saved in
Excel spreadsheets. Once the lists were generated, the first task was to deal with the
noise problem in the journal article collections (Applied Linguistics Corpus and Acta
Tropica) by manually editing the lexical bundle lists. An example of the un-edited and
edited data is given in Figures 2 and 3, indicating how relatively straightforward it was
to spot lexical bundles which were part of the end matter (bibliography/footnotes) or
text meta-data. Where there was any ambiguity, this was reconciled by reviewing a
concordance of the lexical bundle in question.
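For readers without access to WordSmith Tools, the general logic of generating lexical bundle lists and then removing noise can be sketched as follows. This is not the procedure used in the study: the tokeniser is deliberately crude, the function names are hypothetical, bundles are allowed to run across punctuation, and the "noise" set stands in for the manual editing described above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic word tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def bundle_counts(texts, n):
    """Count every n-word sequence, never crossing a boundary between texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def drop_noise(counts, noise_bundles):
    """Remove bundles judged (here, by hand) to come from end matter or meta-data."""
    return Counter({b: c for b, c in counts.items() if b not in noise_bundles})

# Invented mini-corpus, for illustration only
texts = [
    "On the other hand, the results differ. Journal of English for Academic Purposes.",
    "On the other hand, we agree.",
]
counts4 = bundle_counts(texts, 4)
noise = {"journal of english for", "of english for academic", "english for academic purposes"}
cleaned = drop_noise(counts4, noise)
```

Keeping the noise list separate from the counting step mirrors the two-stage workflow described above: generate the full list first, then edit it with reference to concordances where a case is ambiguous.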
Once the lexical bundle lists had been saved in an Excel workbook, all lexical
bundle statistics were normalised to counts per 100,000 words. This has enabled a more
meaningful comparison of the relative distributions of different lexical bundles across the
corpora. However, as the main emphasis in this study has been on the top 40 lexical
bundles in each corpus, the frequency of occurrence of individual lexical bundles has
mattered little except for ranking purposes.
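The normalisation and ranking step was carried out in Excel, but the same calculation can be expressed in a few lines of code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the raw counts and the corpus size of one million words are invented for the example rather than taken from the study.

```python
def normalise(counts, corpus_words, per=100_000):
    """Convert raw bundle counts to counts per `per` words of running text."""
    return {bundle: count * per / corpus_words for bundle, count in counts.items()}

def top_n(normalised, n=40):
    """Rank bundles by normalised frequency (ties broken alphabetically)."""
    ranked = sorted(normalised.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Invented raw counts in a hypothetical corpus of 1,000,000 words
raw = {"on the other hand": 50, "in the case of": 20, "at the same time": 20}
normed = normalise(raw, corpus_words=1_000_000)
ranking = top_n(normed, n=2)
```

Because normalisation divides every count in a corpus by the same figure, it does not change the rank order within that corpus; it matters only when frequencies are compared across corpora of different sizes.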
3.3 Findings: 4-word lexical bundles
The analysis of four-word lexical bundles has been carried out in relation to two main
parameters: (a) the extent to which lexical bundles are shared between the corpora,
and (b) the extent to which specific categories of lexical bundles are shared or not
shared in the different corpora. This analysis makes use of a system of lexical bundle
categorisation proposed by Hyland (2008a and 2008b). Drawing on Halliday’s (1994)
metafunctions of language, Hyland groups lexical bundles into three broad functional
categories (Research-oriented, Text-oriented and Participant-oriented) and then offers a
more finely grained categorisation of each lexical bundle within these broad groupings.
The framework is given in Figure 4.
RESEARCH-ORIENTED. Help writers to structure their activities and experiences of the real world.
location – indicating time and place (at the beginning of, at the same time, in the present study);
procedure (the use of the, the role of the, the purpose of the, the operation of the);
quantification (the magnitude of the, a wide range of, one of the most);
description (the structure of the, the size of the);
topic – related to the field of research (in the Hong Kong, the currency board system).
TEXT-ORIENTED. These clusters are concerned with the organisation of the text and the meaning
of its elements as a message or argument.
transition signals – establishing additive or contrastive links between elements (on the other hand,
in addition to the, in contrast to the);
resultative signals – mark inferential or causative relations between elements (as a result of,
it was found that, these results suggest that);
structuring signals – text-reflexive markers which organise stretches of discourse or direct reader
elsewhere in text (in the present study, in the next section, as shown in fig);
framing signals – situate arguments by specifying limiting conditions (in the case of, with respect to
the, on the basis of, in the presence of, with the exception of).
PARTICIPANT-ORIENTED. These are focused on the writer or reader of the text (Hyland 2005).
stance features – convey the writer’s attitudes and evaluations (are likely to be, may be due to, it
is possible that);
engagement features – address readers directly (it should be noted that, as can be seen)
(Hyland 2008a: 49).
Figure 4. Functional categorisation of 4-word lexical bundles
The first analysis reported here reviews the shared and unshared lexical bundles in the
five corpora included in the study. This demonstrates the similarities and differences
between the written production of the apprentice writers in the KCL Apprentice Writing
Corpus and the written production of the authors included in the comparator corpora
(both close and distant).
The second analysis makes a direct comparison between lexical bundles in the
research corpus (KCL Apprentice Writing Corpus) and the closer analogue corpora
(BAWE/BNC Baby – Academic).
3.3.1 KCL Apprentice Writing Corpus compared with BNC Baby – Academic, BAWE,
and Applied Linguistics Corpus + Acta Tropica
A complete summary of the 4-word lexical bundle comparison is given in Figure 5.
This figure is presented here in a reduced format because, for the moment, we are
concentrating only on the distribution of lexical bundles rather than on specific lexical
bundle categories; a more legible version is provided as an appendix. Dark shading
indicates (a) those lexical bundles which are shared across the four main corpora
(BNC Baby- Academic, BAWE, Applied Linguistics Corpus, KCL Apprentice Writing
Corpus) and (b) those from this group which also occur in Acta Tropica. Paler shading
with white text indicates those which occur in at least three of the main corpora, and
lighter shading with black text those which occur in two of the main corpora. These
different groups are also shown in the Acta Tropica set. Lexical bundles outside the
shaded areas only occur in that specific corpus.
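The shading levels described above amount to counting, for each lexical bundle, how many of the main corpora include it in their top-40 list. A minimal sketch of that count follows; the function name is hypothetical, and the lists are a small excerpt of bundles from Figure 5 rather than the full top-40 lists.

```python
from collections import Counter

def sharing_levels(top_lists):
    """Map each bundle to the number of corpora whose list contains it."""
    seen = Counter()
    for bundles in top_lists.values():
        seen.update(set(bundles))  # a set, so each corpus counts a bundle once
    return dict(seen)

# Excerpt only: three bundles per corpus, taken from Figure 5
top_lists = {
    "BNC Baby Academic": ["on the other hand", "in terms of the", "the house of lords"],
    "BAWE": ["on the other hand", "in terms of the", "it is necessary to"],
    "Applied Linguistics": ["on the other hand", "in terms of the", "i would like to"],
    "KCL Apprentice Writing": ["on the other hand", "i would like to", "in this essay i"],
}
levels = sharing_levels(top_lists)
```

A level of 4 corresponds to the dark shading, 3 and 2 to the two paler shadings, and 1 to bundles that occur in one corpus only.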
BNC Baby Academic | norm | BAWE | norm | Applied Linguistics | norm | KCL Apprentice Writing | norm
ON THE OTHER HAND | 13 | ON THE OTHER HAND | 26 | ON THE OTHER HAND | 22 | ON THE OTHER HAND | 21
THE END OF THE | 9 | AS A RESULT OF | 22 | ON THE BASIS OF | 14 | IN THE FORM OF | 10
ON THE BASIS OF | 9 | THE END OF THE | 18 | THE END OF THE | 13 | AS A RESULT OF | 8
AS A RESULT OF | 8 | IT IS IMPORTANT TO | 17 | AT THE END OF | 12 | IT IS IMPORTANT TO | 7
AT THE END OF | 7 | AS WELL AS THE | 15 | AS WELL AS THE | 10 | AT THE END OF | 6
IT IS IMPORTANT TO | 6 | IN THE FORM OF | 15 | IT IS IMPORTANT TO | 9 | ON THE BASIS OF | 6
IN THE FORM OF | 5 | AT THE END OF | 13 | AS A RESULT OF | 9 | AS WELL AS THE | 5
AS WELL AS THE | 4 | ON THE BASIS OF | 6 | IN THE FORM OF | 8 | THE END OF THE | 5
IN TERMS OF THE | 11 | IN THE CASE OF | 19 | IN THE CASE OF | 19 | THAT THERE IS A | 8
IN THE CASE OF | 10 | AT THE SAME TIME | 15 | AT THE SAME TIME | 16 | IN THE CONTEXT OF | 6
THE WAY IN WHICH | 8 | THE FACT THAT THE | 13 | IN THE CONTEXT OF | 12 | A WIDE RANGE OF | 5
THE EXTENT TO WHICH | 7 | CAN BE USED TO | 12 | A WIDE RANGE OF | 11 | TO BE ABLE TO | 5
IN THE CONTEXT OF | 7 | ONE OF THE MOST | 11 | THE EXTENT TO WHICH | 10 | I WOULD LIKE TO | 11
AT THE SAME TIME | 7 | THAT THERE IS A | 10 | ONE OF THE MOST | 7 | AS A LINGUA FRANCA | 9
THAT THERE IS A | 5 | THE WAY IN WHICH | 9 | THE REST OF THE | 7 | OF ENGLISH AS A | 8
A WIDE RANGE OF | 5 | THE REST OF THE | 9 | IN TERMS OF THE | 7 | IS ONE OF THE | 7
ONE OF THE MOST | 5 | IN TERMS OF THE | 9 | THE FACT THAT THE | 6 | ENGLISH AS A LINGUA | 7
THE REST OF THE | 5 | THE EXTENT TO WHICH | 8 | CAN BE USED TO | 5 | NATIVE SPEAKERS OF ENGLISH | 6
CAN BE USED TO | 4 | IT IS POSSIBLE TO | 12 | TO BE ABLE TO | 5 | THE INVOLVEMENT LOAD HYPOTHESIS | 15
THE FACT THAT THE | 4 | TO BE ABLE TO | 10 | THE BEGINNING OF THE | 11 | WHEN IT COMES TO | 8
IT IS POSSIBLE TO | 8 | IS ONE OF THE | 10 | THE USE OF THE | 9 | OF TEXTS HAVE YOU | 8
IT IS CLEAR THAT | 5 | THE NATURE OF THE | 10 | NATIVE SPEAKERS OF ENGLISH | 8 | TEXTS HAVE YOU WRITTEN | 8
THE SIZE OF THE | 4 | IT IS CLEAR THAT | 9 | THE NATURE OF THE | 7 | I AM GOING TO | 7
IN RELATION TO THE | 4 | IT IS DIFFICULT TO | 8 | OF ENGLISH AS A | 6 | ENGLISH LANGUAGE OF INSTRUCTION | 7
IT IS DIFFICULT TO | 4 | THE USE OF THE | 7 | AS A LINGUA FRANCA | 6 | IN THIS ESSAY I | 7
THE HOUSE OF LORDS | 8 | THE BEGINNING OF THE | 6 | ENGLISH AS A LINGUA | 6 | THE ROLE OF THE | 7
PERCENT OF THE | 7 | THE SIZE OF THE | 6 | I WOULD LIKE TO | 5 | IN THE FIELD OF | 6
IN THE UNITED STATES | 5 | IN RELATION TO THE | 6 | THE WAYS IN WHICH | 7 | ELT AND APPLIED LINGUISTICS | 6
AT THE TIME OF | 5 | CAN BE SEEN IN | 11 | AT THE UNIVERSITY OF | 11 | NATIVE LIKE USE OF | 6
THE HOUSE OF COMMONS | 5 | IT CAN BE SEEN | 9 | AT THE BEGINNING OF | 11 | IN THE PROCESS OF | 6
AS SHOWN IN FIG | 5 | TO THE FACT THAT | 9 | IN THE USE OF | 10 | LIKE USE OF ENGLISH | 6
IN THE ABSENCE OF | 4 | A RESULT OF THE | 8 | IN THE PRESENT STUDY | 7 | AS A FOREIGN LANGUAGE | 6
RULE IN RYLANDS V | 4 | CAN BE SEEN THAT | 7 | THE RESULTS OF THE | 7 | IMPLICIT AND EXPLICIT KNOWLEDGE | 6
IS LIKELY TO BE | 4 | IT IS NECESSARY TO | 7 | OF SECOND LANGUAGE WRITING | 7 | IN THE EXPANDING CIRCLE | 6
THE RULE IN RYLANDS | 4 | IN THE SAME WAY | 6 | THE USE OF CORPORA | 6 | IN THE NEXT SECTION | 6
IN THIS CASE THE | 4 | THE ROLE OF THE | 6 | ON THE ONE HAND | 6 | ENGLISH AS A SECOND | 5
THE NATURE OF THE | 4 | IS DUE TO THE | 6 | NATIVE AND NON NATIVE | 6 | TO LOOK AT THE | 5
THE COURT OF APPEAL | 4 | DUE TO THE FACT | 6 | AS CAN BE SEEN | 5 | IT SHOULD BE NOTED | 5
THAT THERE IS NO | 4 | THAT THERE IS NO | 6 | IT SHOULD BE NOTED | 5 | MEANING OF A WORD | 5
AS WE HAVE SEEN | 4 | CAN BE SEEN AS | 6 | OF SPOKEN AND WRITTEN | 5 | OF THE INVOLVEMENT LOAD | 5
Figure 5. 4-word lexical bundles
Two kinds of contrast emerge here. The first is the relatively small number of lexical
bundles in KCL Apprentice Writing Corpus which also occur in either all three com-
parator corpora (8): on the other hand/in the form of/as a result of/it is important to/at
the end of/on the basis of/as well as the/the end of the, or in two of these corpora (3): that
there is a/in the context of/a wide range of. Although these lexical bundles are promi-
nent in any analysis of lexical bundles in written academic English (Biber 2006:
158–159), the small number which occur in KCL Apprentice Writing Corpus is note-
worthy. The smaller size of the corpus partly explains the higher proportion of topic
related lexical bundles in the KCL Apprentice Writing Corpus list (the early appear-
ance of topic related lexis is a typical feature of any word/lexical bundle list for small
corpora), and the contrasting institutional roles of apprentices and experts in academ-
ic writing events can account for the absence of stance markers such as: it is possible to/
it is clear that. However, the absence of core academic framing markers such as: in
terms of the/in the case of/the way in which/the extent to which is striking, and indicates
an area where the apprentice writers in King’s may need support to extend the ways in
which they develop arguments and comment on findings.
3.3.2 Applied Linguistics Corpus vs. KCL Apprentice Writing Corpus – shared
and unshared lexical bundles
The first analysis is a direct comparison between the exemplar corpus (Applied Lin-
guistics Corpus) and the apprentice corpus (KCL Apprentice Writing Corpus). As can
be seen in Figure 6, 16 of the top 40 lexical bundles are shared between these two col-
lections (Research Orientation, R_: 6; Text Orientation, T_: 6; Participant orientation,
P_: 4). The three shared topic related lexical bundles give an indication of the align-
ment of the content of the two corpora.
Of the 16 shared lexical bundles in the two corpora, eight are common to all four
of the corpora included in the main study: on the other hand/in the form of/as a result
of/it is important to/at the end of/on the basis of/as well as the/the end of the, and can be
seen as having core functions in academic registers (see Biber 2006: 158–159).
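The breakdown of the 16 shared lexical bundles by orientation can be verified by tallying the functional labels assigned to them in Figure 6. The sketch below simply re-enters those labels; the variable names are hypothetical and the code is offered as a check on the arithmetic, not as part of the original analysis.

```python
from collections import Counter

# The 16 shared bundles with the functional labels given in Figure 6
shared = {
    "on the other hand": "T_transition",
    "on the basis of": "T_framing",
    "the end of the": "R_location",
    "in the context of": "T_framing",
    "at the end of": "R_location",
    "a wide range of": "R_quantification",
    "as well as the": "T_framing",
    "it is important to": "P_stance",
    "as a result of": "T_resultative",
    "in the form of": "T_framing",
    "native speakers of english": "R_topic",
    "as a lingua franca": "R_topic",
    "english as a lingua": "R_topic",
    "i would like to": "P_stance",
    "to be able to": "P_stance",
    "it should be noted": "P_engagement",
}

# The prefix before the underscore is the broad orientation (R, T or P)
orientation_counts = Counter(label.split("_")[0] for label in shared.values())
```

The tally reproduces the figures reported above: six Research-oriented, six Text-oriented and four Participant-oriented bundles.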
no. | APPLING | NORM | Func | KCL_AWC | NORM | Func
1 | ON THE OTHER HAND | 22 | T_transition | IT SHOULD BE NOTED | 5 | P_engagement
2 | ON THE BASIS OF | 14 | T_framing | I WOULD LIKE TO | 11 | P_stance
3 | THE END OF THE | 13 | R_location | IT IS IMPORTANT TO | 7 | P_stance
4 | IN THE CONTEXT OF | 12 | T_framing | TO BE ABLE TO | 5 | P_stance
5 | AT THE END OF | 12 | R_location | AT THE END OF | 6 | R_location
6 | A WIDE RANGE OF | 11 | R_quantification | THE END OF THE | 5 | R_location
7 | AS WELL AS THE | 10 | T_framing | A WIDE RANGE OF | 5 | R_quantification
8 | IT IS IMPORTANT TO | 9 | P_stance | AS A LINGUA FRANCA | 9 | R_topic
9 | AS A RESULT OF | 9 | T_resultative | ENGLISH AS A LINGUA | 7 | R_topic
10 | IN THE FORM OF | 8 | T_framing | NATIVE SPEAKERS OF ENGLISH | 6 | R_topic
11 | NATIVE SPEAKERS OF ENGLISH | 8 | R_topic | IN THE FORM OF | 10 | T_framing
12 | AS A LINGUA FRANCA | 6 | R_topic | ON THE BASIS OF | 6 | T_framing
13 | ENGLISH AS A LINGUA | 6 | R_topic | IN THE CONTEXT OF | 6 | T_framing
14 | I WOULD LIKE TO | 5 | P_stance | AS WELL AS THE | 5 | T_framing
15 | TO BE ABLE TO | 5 | P_stance | AS A RESULT OF | 8 | T_resultative
16 | IT SHOULD BE NOTED | 5 | P_engagement | ON THE OTHER HAND | 21 | T_transition
17 | AS CAN BE SEEN | 5 | P_engagement | I AM GOING TO | 7 | P_engagement
18 | THE FACT THAT THE | 6 | P_stance | WHEN IT COMES TO | 8 | R_location
19 | CAN BE USED TO | 5 | P_stance | IN THIS ESSAY I | 7 | R_location
20 | AT THE SAME TIME | 16 | R_location | IN THE PROCESS OF | 6 | R_procedure
21 | THE BEGINNING OF THE | 11 | R_location | IS ONE OF THE | 7 | R_quantification
22 | AT THE BEGINNING OF | 11 | R_location | THE INVOLVEMENT LOAD HYPOTHESIS | 15 | R_topic
23 | ONE OF THE MOST | 7 | R_quantification | OF ENGLISH AS A | 8 | R_topic
24 | THE REST OF THE | 7 | R_quantification | OF TEXTS HAVE YOU | 8 | R_topic
25 | AT THE UNIVERSITY OF | 11 | R_topic | TEXTS HAVE YOU WRITTEN | 8 | R_topic
26 | OF SECOND LANGUAGE WRITING | 7 | R_topic | ENGLISH LANGUAGE OF INSTRUCTION | 7 | R_topic
27 | OF ENGLISH AS A | 6 | R_topic | IN THE FIELD OF | 6 | R_topic
28 | THE USE OF CORPORA | 6 | R_topic | ELT AND APPLIED LINGUISTICS | 6 | R_topic
29 | NATIVE AND NON NATIVE | 6 | R_topic | NATIVE LIKE USE OF | 6 | R_topic
30 | OF SPOKEN AND WRITTEN | 5 | R_topic | LIKE USE OF ENGLISH | 6 | R_topic
31 | IN THE CASE OF | 19 | T_framing | AS A FOREIGN LANGUAGE | 6 | R_topic
32 | THE EXTENT TO WHICH | 10 | T_framing | IMPLICIT AND EXPLICIT KNOWLEDGE | 6 | R_topic
33 | IN THE USE OF | 10 | T_framing | IN THE EXPANDING CIRCLE | 6 | R_topic
34 | THE USE OF THE | 9 | T_framing | ENGLISH AS A SECOND | 5 | R_topic
35 | THE NATURE OF THE | 7 | T_framing | MEANING OF A WORD | 5 | R_topic
36 | THE WAYS IN WHICH | 7 | T_framing | OF THE INVOLVEMENT LOAD | 5 | R_topic
37 | IN TERMS OF THE | 7 | T_framing | THAT THERE IS A | 8 | T_framing
38 | THE RESULTS OF THE | 7 | T_resultative | THE ROLE OF THE | 7 | T_framing
39 | IN THE PRESENT STUDY | 7 | T_structuring | TO LOOK AT THE | 5 | T_framing
40 | ON THE ONE HAND | 6 | T_transition | IN THE NEXT SECTION | 6 | T_structuring
Figure 6. Applied Linguistics Corpus vs. KCL Apprentice Writing Corpus:
Shared/unshared lexical bundles4
4. All counts in this analysis have been normalised to counts per 100,000 in order to facilitate
meaningful comparisons across the differently sized data sets used in this study.
Amongst the unshared lexical bundles, the high number of R_topic lexical bundles
in KCL Apprentice Writing Corpus (15) is not surprising given the relative corpus sizes
(Applied Linguistics Corpus 939,923 vs. KCL Apprentice Writing Corpus 509,891).
Other contrasts can be accounted for by the genre contrast between the MA assignments
and dissertations and published research articles. Specific indicators of this contrast
are the 4-word lexical bundles listed in Figure 7, which are found in KCL Apprentice
Writing Corpus alone (i.e. they do not occur in any of the other corpora in this study).
What is surprising is that a significant number of Research-, Text- and
Participant-oriented lexical bundles found in Applied Linguistics Corpus are absent from
the top 40 of KCL Apprentice Writing Corpus, and many are also absent from its top 100
lexical bundle list. In Figure 8 an “x” indicates that an item does not appear in the
top 100 lexical bundle list for KCL Apprentice Writing Corpus. Where a lexical bundle
was present outside the top 40, a number indicates its rank position.
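The convention used in Figure 8 (a rank number if the bundle is in the top 100, otherwise an “x”) can be stated precisely in code. The sketch below is illustrative: the function name is hypothetical and the ranked list is a placeholder built so that one real bundle falls at rank 62; it is not the actual KCL Apprentice Writing Corpus list.

```python
def rank_marker(bundle, ranked_bundles, limit=100):
    """Return the 1-based rank of `bundle`, or "x" if it is not in the top `limit`."""
    top = ranked_bundles[:limit]
    return top.index(bundle) + 1 if bundle in top else "x"

# Placeholder ranking: 61 dummy entries followed by one real bundle
hypothetical_ranking = ["bundle %d" % i for i in range(1, 62)] + ["the extent to which"]

marker_present = rank_marker("the extent to which", hypothetical_ranking)
marker_absent = rank_marker("in terms of the", hypothetical_ranking)
```

Under these assumptions the first lookup returns a rank and the second returns "x", matching the two kinds of entry that appear in Figure 8.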
I AM GOING TO | P_engagement
WHEN IT COMES TO | R_location
IN THIS ESSAY I | R_location
IS ONE OF THE | R_quantification
THAT THERE IS A | T_framing
THE ROLE OF THE | T_framing
TO LOOK AT THE | T_framing
IN THE NEXT SECTION | T_structuring
Figure 7. Unshared lexical bundles (KCL Apprentice Writing Corpus)
RESEARCH oriented | # | TEXT oriented | # | PARTICIPANT oriented | #
THE BEGINNING OF THE | x | IN THE USE OF | x | AS CAN BE SEEN | x
THE REST OF THE | x | THE USE OF THE | x | CAN BE USED TO | x
AT THE SAME TIME | 74 | IN TERMS OF THE | x | THE FACT THAT THE | 63
ONE OF THE MOST | 50 | THE RESULTS OF THE | x | |
 | | IN THE PRESENT STUDY | x | |
 | | ON THE ONE HAND | x | |
 | | IN THE CASE OF | 88 | |
 | | THE WAYS IN WHICH | 87 | |
 | | THE EXTENT TO WHICH | 62 | |
 | | THE NATURE OF THE | 59 | |
Figure 8. Applied Linguistics Corpus lexical bundles

When one considers the role that these lexical bundles have in the development of ar-
gument, the presentation and evaluation of data, and the guiding of the reader, their
lower frequency of occurrence in the KCL Apprentice Writing Corpus texts indicates a
contrast between the ways in which apprentice writers comment on results and organ-
ise their texts and the approaches adopted by experts in Applied Linguistics Corpus.
The examples below give an indication of the importance of these lexical bundles.
Sentence concordances have been used here rather than the standard KWIC format.
The examples for the extent to which show how the expert writers limit their evalua-
tions of their own or others’ research findings:
<the extent to which>
This has implications for the extent to which academic writing teachers can fruit-
fully adopt the concept of discourse community to contextualise student writing.
The distinctive ways in which discourse communities construct knowledge raise
questions about the extent to which their discourses can exclude the prior experi-
ences of novice participants, particularly those who may not share the same beliefs
and values of the discourse community.
The extent to which each of the criteria was fulfilled was graded on a four-point
scale from ‘Very much’ to ‘Not at all’.
Our next task is to review what kinds of advice or models are provided in pub-
lished materials for EAP students, and to assess the extent to which these might
need complementing.
The use of the results of the as a sentence theme demonstrates how expert writers use
such a device to assist in foregrounding the results of research processes:
<the results of the>
The results of the questionnaire survey thus showed that the course design was
quite satisfactory to the students.
The results of the study indicate that in Conservation Biology abstracts include
some moves that have been ascribed to research article introductions.
The results of the current study can be used to teach advanced level students pur-
suing master’s and doctoral degrees the structure of research article introductions
and abstracts in their disciplines.
To make the two groups as similar as possible from a proficiency-level point of
view, we also used the results of the diagnostic test which all students take at the
beginning of the semester.
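Sentence concordances of the kind shown above can be approximated with a simple search that returns whole sentences rather than fixed-width KWIC lines. The sketch below is a rough illustration under stated assumptions: the function name is hypothetical, the sample text is invented, and the sentence splitter is naive (it would mis-split abbreviations such as "e.g.").

```python
import re

def sentence_concordance(text, bundle):
    """Return every sentence in `text` that contains `bundle` (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(bundle) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Invented sample text, for illustration only
sample_text = (
    "The results were clear. This has implications for the extent to which "
    "teachers can adopt the concept. We stop here."
)
hits = sentence_concordance(sample_text, "the extent to which")
```

Returning the full sentence keeps the bundle's position (for example, as sentence theme) visible, which a truncated KWIC line can obscure.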
3.3.3 4-word lexical bundles in Acta Tropica

Acta Tropica was included in the study to demonstrate another value of analysing
lexical bundles. Setting aside topic specific lexical bundles (e.g. enzyme linked immu-
nosorbent assay, for the diagnosis of), the lexical bundles derived from this corpus also
strongly contrast with those derived from Applied Linguistics Corpus (a direct coun-
terpart), as well as BAWE and BNC Baby – Academic. However, unlike the case of
KCL Apprentice Writing Corpus, where the contrast may arise from limitations in the
performance repertoire of apprentice writers, the contrast in Acta Tropica arises from
contrasting textual and disciplinary practices. This becomes clear the moment we con-
sider the Text oriented lexical bundles which occur uniquely in Acta Tropica. Nearly
all of these relate in one way or another to the reality of engaging in and reporting the
results of natural sciences. Thus we find traces of the major need to report experimen-
tal or observational results (often in tabular form): the results of the/are shown in table/
were found to be/has been shown to/been shown to be, to report on collaborative re-
search: in this study we, and the need to specify the specific bio-chemical conditions
under which experiments were carried out: in the presence of/in the absence of.
3.3.4 Interim conclusion: 4-word lexical bundles

Thus far, this study confirms the value of 4-word lexical bundles in differentiating be-
tween text collections by author status (expert/apprentice) and disciplinary area. From
a pedagogic perspective, the study offers valuable insights for teachers and students
with an interest in how expert writers in a specific field construct allowable contribu-
tions at a high level. In the next section we will see how 3-word lexical bundles can be
used to extend this analysis.
3.4 Findings: 3-word lexical bundles
Key findings for the 3-word lexical bundles in the main research corpora are presented
in Figure 9. Again, a more legible version of the table is provided in the appendices.
The distribution of 3-word lexical bundles across the corpora is similar to that
observed for 4-word lexical bundles. Thirteen 3-word lexical bundles are shared between
all four main corpora (as opposed to eight shared 4-word lexical bundles), and
the contrast between Acta Tropica and the other four corpora is similarly marked
(see Appendix 0 below).
When the KCL Apprentice Writing Corpus and Applied Linguistics Corpus
3-word and 4-word lexical bundle lists are compared (see Figures 6 and 10) we find
that the same level of sharing (16) occurs across the two corpora, and that Applied
Linguistics Corpus contains an important group of Research and Text oriented lexical
bundles which do not occur with high frequency in KCL Apprentice Writing Corpus
(see Figure 11).
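The comparison of two bundle lists described here amounts to a set operation. As a minimal sketch in Python (the function name is illustrative, and the four-item lists are small samples drawn from Figure 10 rather than the full lists used in the study), shared and corpus-specific bundles might be separated as follows:

```python
def compare_bundle_lists(list_a, list_b):
    """Split two bundle lists into shared items and items unique to each list."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Small samples taken from Figure 10 (not the complete lists).
appling = ["THE FACT THAT", "THE USE OF", "IN THE CORPUS", "THE RESULTS OF"]
kcl_awc = ["THE FACT THAT", "THE USE OF", "IN THE CLASSROOM", "BE ABLE TO"]

comparison = compare_bundle_lists(appling, kcl_awc)
print(comparison["shared"])
print(comparison["only_a"])
print(comparison["only_b"])
```

A comparison of this kind only records whether a bundle appears in a list; it says nothing about how the frequency thresholds behind each list were set.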
Revisiting apprentice texts 
BNC Baby Academic Norm BAWE Norm Applied Linguistics Norm KCL Apprentice Writing Norm Acta Tropica Norm
IN TERMS OF 34 IN ORDER TO 58 THE USE OF 80 IN ORDER TO 56 AS WELL AS 782
PART OF THE 30 AS WELL AS 36 IN ORDER TO 43 THE USE OF 51 THE USE OF 614
THERE IS A 30 THE FACT THAT 30 IN TERMS OF 40 IN TERMS OF 41 IN ORDER TO 575
ONE OF THE 27 THE USE OF 29 AS WELL AS 36 ONE OF THE 34 ONE OF THE 429
A NUMBER OF 27 THERE IS A 29 ONE OF THE 35 AS WELL AS 34 THE ROLE OF 323
AS WELL AS 26 ONE OF THE 27 A NUMBER OF 32 THERE IS A 30 A NUMBER OF 307
SOME OF THE 24 IN TERMS OF 26 THE FACT THAT 26 THE FACT THAT 29 THE FACT THAT 295
THE USE OF 23 AS A RESULT 19 ON THE OTHER 25 SOME OF THE 27 THE NUMBER OF 982
THE FACT THAT 21 PART OF THE 18 THE OTHER HAND 23 A NUMBER OF 23 THE EFFECT OF 458
IN ORDER TO 19 A NUMBER OF 17 SOME OF THE 23 ON THE OTHER 22 IT HAS BEEN 307
ON THE OTHER 16 ON THE OTHER 15 PART OF THE 17 THE OTHER HAND 21 THE PRESENCE OF 1188
THE OTHER HAND 13 THE OTHER HAND 14 THERE IS A 17 AS A RESULT 20 IN THIS STUDY 654
AS A RESULT 11 SOME OF THE 13 AS A RESULT 15 PART OF THE 18 THE PRESENT STUDY 527
IT IS NOT 25 THE NUMBER OF 18 THE CASE OF 25 THE IMPORTANCE OF 31 THE DEVELOPMENT OF 524
THE NUMBER OF 24 THAT IT IS 18 THE NUMBER OF 24 IT IS NOT 25 THE PREVALENCE OF 486
THAT IT IS 17 IT IS NOT 18 THE END OF 22 THAT IT IS 23 OF PLASMODIUM FALCIPARUM 465
THE END OF 17 THE IMPORTANCE OF 15 IN THE CASE 20 THE ROLE OF 27 OF THE DISEASE 463
THE CASE OF 13 THE END OF 15 THE IMPORTANCE OF 15 DUE TO THE 19 DUE TO THE 453
IN THE CASE 12 THE CASE OF 14 IN WHICH THE 18 THAT THERE IS 17 ACCORDING TO THE 441
THERE IS NO 24 IN THE CASE 12 THE ROLE OF 18 THERE IS NO 16 IN THE PRESENT 430
IN WHICH THE 18 DUE TO THE 38 THE BASIS OF 16 BE ABLE TO 16 MATERIALS AND METHODS 429
THE BASIS OF 13 THERE IS NO 22 END OF THE 15 IN THE CLASSROOM 36 OF TRYPANOSOMA CRUZI 404
THAT THERE IS 13 CAN BE SEEN 21 SUCH AS THE 15 ENGLISH AS A 22 OF T CRUZI 389
PER CENT OF 24 BE ABLE TO 18 CAN BE SEEN 14 IN THE UK 19 OF THE PARASITE 389
TERMS OF THE 22 SUCH AS THE 17 IN THE CORPUS 22 THE CONCEPT OF 19 THE TREATMENT OF 386
IT MAY BE 18 NEED TO BE 17 IN ACADEMIC WRITING 21 THE INVOLVEMENT LOAD 19 SEE FRONT MATTER 381
IN THIS CASE 17 THAT THERE IS 15 USE OF THE 19 IN OTHER WORDS 19 WAS CARRIED OUT 371
IT HAS BEEN 16 IT IS A 13 OF IN THE 18 NON NATIVE SPEAKERS 18 WORLD HEALTH ORGANIZATION 371
AND IT IS 15 IT HAS BEEN 12 THE PRESENT STUDY 17 NEED TO BE 18 FOUND TO BE 361
LIKELY TO BE 15 END OF THE 11 AT THE SAME 17 INVOLVEMENT LOAD HYPOTHESIS 18 A TOTAL OF 359
THE EFFECT OF 13 IT CAN BE 21 THE CONTEXT OF 17 THE TARGET LANGUAGE 18 THE ABSENCE OF 355
TO BE A 13 TO BE A 15 THE BEGINNING OF 17 THE MEANING OF 17 BASED ON THE 352
AND SO ON 13 A RESULT OF 13 THE SAME TIME 16 THE RELATIONSHIP BETWEEN 17 RIGHTS RESERVED DOI 349
THE HOUSE OF 13 THE DEVELOPMENT OF 13 ENGLISH AS A 16 WOULD LIKE TO 17 E MAIL ADDRESS 328
IT IS A 13 IN THIS CASE 13 THE TEACHING OF 16 VARIETIES OF ENGLISH 16 CARRIED OUT IN 313
MANY OF THE 12 IT WOULD BE 12 IN THIS STUDY 16 BASED ON THE 16 IN REVISED FORM 312
BUT IT IS 12 BE USED TO 12 OF THE CORPUS 15 MOST OF THE 16 RECEIVED IN REVISED 312
IS LIKELY TO 12 THE PRESENCE OF 11 OF THE STUDENTS 15 THE NOTION OF 16 T B GAMBIENSE 303
IN THE FIRST 12 IT IS IMPORTANT 11 THE RESULTS OF 15 OF THE LANGUAGE 15 ANALYSIS OF THE 299
IT IS POSSIBLE 11 IT IS THE 11 ANALYSIS OF THE 15 THE PROCESS OF 15 IN THE PRESENCE 299
Figure 9. 3-word lexical bundles across all corpora
A further feature of the 3-word lexical bundles in Applied Linguistics Corpus and KCL
Apprentice Writing Corpus is illustrated in Figure 12.
This figure shows the fourteen 3-word lexical bundles which are not entailed by
4-word lexical bundles in Applied Linguistics Corpus. This group constitutes a further
kind of resource, which is either limited to a three-word horizon (in this paper) or is
less phrasally stable, but which nevertheless collocates with a range of discoursally or
disciplinarily important lexis. Examples include:
– DISCOURSE: the immediate right collocates of IN ORDER TO: understand/identify/gain/avoid/determine/ensure/investigate/control/function/provide/achieve/assess/explore/facilitate/illustrate/improve/better/complete/express/help/see/address/establish
– DISCIPLINE: the immediate right collocates of THE IMPORTANCE OF: learning/linguistic/writing/academic/frequent/genre/language/student/understanding/appropriate/aware/awareness/based/corpora
As mentioned earlier, these 3-word lexical bundle PLUS collocate combinations would
not necessarily be sufficiently frequent to figure as 4-word lexical bundles, yet they
constitute an important resource for learners to draw on during Phase 2 (modelling
and deconstructing) of the teaching and learning cycle discussed above (Section 0).
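For readers who wish to experiment with bundle PLUS collocate combinations on their own texts, the following is a minimal sketch in Python. It is not the procedure used in this study (which relied on WordSmith Tools); the tokeniser, the function names and the normalisation base of 100,000 tokens are all assumptions made for illustration, and the sample text is invented.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bundle_counts(tokens, n):
    """Count every contiguous n-word sequence in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def normalise(count, corpus_size, per=100_000):
    """Frequency per `per` tokens (the base is an assumption, not the study's)."""
    return count * per / corpus_size


def right_collocates(tokens, bundle):
    """Count the word immediately following each occurrence of `bundle`."""
    n = len(bundle)
    return Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tuple(tokens[i:i + n]) == bundle
    )


# Invented sample text; a real study would read in a whole corpus.
sample = (
    "In order to understand the data we coded it twice. "
    "In order to ensure reliability, in order to understand bias, "
    "we compared the codings."
)
tokens = tokenize(sample)
counts = bundle_counts(tokens, 3)
target = ("in", "order", "to")
collocates = right_collocates(tokens, target)

print(counts[target], normalise(counts[target], len(tokens)))
print(collocates.most_common())
```

Note that this sketch lets bundles run across sentence boundaries and applies no minimum frequency or text-dispersion threshold, both of which a fuller analysis would need to address.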
 Christopher Tribble
# APPLING Norm Func KCL_AWC Norm Func

1 THE FACT THAT 26 P_engagement THE FACT THAT 29 P_engagement
2 THE IMPORTANCE OF 15 P_engagement THE IMPORTANCE OF 31 P_engagement
3 PART OF THE 17 R_description PART OF THE 18 R_description
4 SOME OF THE 23 R_description SOME OF THE 27 R_description
5 AS WELL AS 36 R_location AS WELL AS 34 R_location
6 THE ROLE OF 18 R_procedure THE ROLE OF 27 R_procedure
7 THE USE OF 80 R_procedure THE USE OF 51 R_procedure
8 A NUMBER OF 32 R_quantification A NUMBER OF 23 R_quantification
9 ONE OF THE 35 R_quantification ONE OF THE 34 R_quantification
10 ENGLISH AS A 16 R_topic ENGLISH AS A 22 R_topic
11 AS A RESULT 15 T_framing AS A RESULT 20 T_framing
12 IN ORDER TO 43 T_framing IN ORDER TO 56 T_framing
13 IN TERMS OF 40 T_framing IN TERMS OF 41 T_framing
14 THERE IS A 17 T_framing THERE IS A 30 T_framing
15 ON THE OTHER 25 T_transition ON THE OTHER 22 T_transition
16 THE OTHER HAND 23 T_transition THE OTHER HAND 21 T_transition
1 OF IN THE 18 fragment BE ABLE TO 16 P_engagement
2 CAN BE SEEN 14 P_engagement NEED TO BE 18 R_stance
3 AT THE SAME 17 R_location WOULD LIKE TO 17 R_stance
4 END OF THE 15 R_location THE PROCESS OF 15 R_procedure
5 THE BEGINNING OF 17 R_location MOST OF THE 16 R_quantification
6 THE END OF 22 R_location IN THE CLASSROOM 36 R_topic
7 THE PRESENT STUDY 17 R_location IN THE UK 19 R_topic
8 THE SAME TIME 16 R_location INVOLVEMENT LOAD HYPOTHESIS 18 R_topic
9 THE NUMBER OF 24 R_quantification NON NATIVE SPEAKERS 18 R_topic
10 THE RESULTS OF 15 R_resultative OF THE LANGUAGE 15 R_topic
11 ANALYSIS OF THE 15 R_topic THE CONCEPT OF 19 R_topic
12 IN ACADEMIC WRITING 21 R_topic THE INVOLVEMENT LOAD 19 R_topic
13 IN THE CORPUS 22 R_topic THE MEANING OF 17 R_topic
14 OF THE CORPUS 15 R_topic THE NOTION OF 16 R_topic
15 OF THE STUDENTS 15 R_topic THE RELATIONSHIP BETWEEN 17 R_topic
16 THE TEACHING OF 16 R_topic THE TARGET LANGUAGE 18 R_topic
17 IN THE CASE 20 T_framing VARIETIES OF ENGLISH 16 R_topic
18 IN WHICH THE 18 T_framing BASED ON THE 16 T_framing
19 SUCH AS THE 15 T_framing IT IS NOT 25 T_framing
20 THE CASE OF 25 T_framing THAT IT IS 23 T_framing
21 THE CONTEXT OF 17 T_framing THAT THERE IS 17 T_framing
22 USE OF THE 19 T_procedure THERE IS NO 16 T_framing
23 THE BASIS OF 16 T_resultative DUE TO THE 19 T_resultative
24 IN THIS STUDY 16 T_structuring IN OTHER WORDS 19 T_transition
Figure 10. 3-word lexical bundles (KCL Apprentice Writing Corpus vs. Applied
Linguistics Corpus)
CAN BE SEEN 14 P_engagement IN THE CASE 20 T_framing

AT THE SAME 17 R_location IN WHICH THE 18 T_framing
END OF THE 15 R_location SUCH AS THE 15 T_framing
THE BEGINNING OF 17 R_location THE CASE OF 25 T_framing
THE END OF 22 R_location THE CONTEXT OF 17 T_framing
THE PRESENT STUDY 17 R_location USE OF THE 19 T_procedure
THE SAME TIME 16 R_location THE BASIS OF 16 T_resultative
THE NUMBER OF 24 R_quantification IN THIS STUDY 16 T_structuring
THE RESULTS OF 15 R_resultative
Figure 11. Research and textual lexical bundles in Applied Linguistics Corpus which do
not appear in KCL Apprentice Writing Corpus
APPLING Norm Func

A NUMBER OF 32 R_quantification
ANALYSIS OF THE 15 R_topic
CAN BE SEEN 14 P_engagement
IN ORDER TO 43 T_framing
IN THIS STUDY 16 T_structuring
IN WHICH THE 18 T_framing
PART OF THE 17 R_description
SOME OF THE 23 R_description
SUCH AS THE 15 T_framing
THE BEGINNING OF 17 R_location
THE IMPORTANCE OF 15 P_engagement
THE NUMBER OF 24 R_quantification
THE ROLE OF 18 R_procedure
THERE IS A 17 T_framing
Figure 12. 3-word lexical bundles not subsumed by 4-word lexical bundles in Applied
Linguistics Corpus
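Whether a 3-word bundle is subsumed by a 4-word bundle can be checked mechanically: it is subsumed if it occurs as a contiguous sequence inside any 4-word bundle on the list. The Python sketch below illustrates the idea with a handful of Applied Linguistics Corpus bundles taken from Figure 9 and the appendix tables. The function name is illustrative, and the sketch does not attempt to reproduce Figure 12, which depends on the full lists and frequency thresholds used in the study.

```python
def is_subsumed(short_bundle, long_bundles):
    """Return True if `short_bundle` occurs contiguously inside any longer bundle."""
    short = tuple(short_bundle.split())
    n = len(short)
    for candidate in long_bundles:
        words = tuple(candidate.split())
        if any(words[i:i + n] == short for i in range(len(words) - n + 1)):
            return True
    return False


# Illustrative subsets only (Applied Linguistics Corpus columns).
four_word = ["THE END OF THE", "AT THE END OF", "ON THE OTHER HAND", "IN THE CASE OF"]
three_word = ["THE END OF", "ON THE OTHER", "IN THE CASE", "IN ORDER TO", "THE IMPORTANCE OF"]

not_subsumed = [b for b in three_word if not is_subsumed(b, four_word)]
print(not_subsumed)
```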
4. From description to application
An example of how lexical bundles have been used in pedagogy might be instructive
as a final step in this article. Cortes (2006) provides an account of some of the issues
that can arise when lexical bundles are taught in the context of academic writing
instruction in the disciplines. Working with a group of native English-speaking third
and fourth year History students at Iowa State University, and with the active partici-
pation of a member of the History faculty, she attempted to teach a salient set of n-
grams drawn from research articles in their disciplinary area. Through pre- and post-
course analysis of the students’ writing, Cortes’ study attempted to assess the extent to
which relevant lexical bundles could be acquired and used by participants in the study.
Results were disappointing, with only a small number of the target bundles being incorporated
into texts which students wrote following the two-week course of five 20-minute
sessions.
In a subsequent analysis of the students’ writing and published research articles,
Cortes notes that one of the main reasons for this apparent failure is that students
“generally favored simple conjunctions, conjuncts, and adverbs to express
functions which published authors frequently convey by using lexical bundles”
(Cortes 2006: 399). She concludes by commenting:
All in all, the differences in linguistic exponents (lexical bundles, adverbs, con-
junctions, etc.) used to convey academic-related functions by published authors
and university students present a gap that seems difficult to bridge. On the one
hand, expressions like lexical bundles, which are extremely frequent in the pro-
duction of published authors in history are extremely rare in student production.
On the other hand, students seem to favor structurally simple expressions or sin-
gle words to convey certain functions, expressions and words which, in general,
are frequently used in spoken registers and do not seem to be published authors’
first choices. (ibid: 401)
Cortes herself admits that part of the reason for this apparent failure might lie in the
instructional methodology she used (which depended on the introduction of a set of
de-contextualised lexical bundles and gap-fill and matching activities as a means of
consolidating learning), and also in a mismatch between the present level of the students’
cognitive development as writers and their engagement with the discipline, and
the target language behaviour they were expected to achieve. The lesson which I draw
from Cortes’ study is that it is essential: (a) to align the exemplar corpus as closely to
the needs of the learners as possible; (b) to recognise that lexical bundles express epistemologies
and modes of reasoning which students may not yet be able to access; and
(c) to ensure that instruction is not simply a process of presentation, practice and production,
but involves the kinds of critical engagement which are implicit in genre
approaches to language instruction as outlined in Section 0 above.
5. Conclusions
From my perspective, the key insight of Granger’s pioneering work on learner lan-
guage (from Granger 1993 through to Granger & Paquot 2009) has always been that
the production of those who are on the way to gaining fuller control of linguistic sys-
tems is as legitimate a focus for linguistic research as the hitherto privileged produc-
tion of native-speaking users of a language. Indeed, in language pedagogy I would
argue that unless we have a clear idea of what aspects of the language system our stu-
dents use, fail to use, underuse and overuse when their production is set against that
of relevant comparators, we will be hard pressed to develop useful curricula and learn-
ing programmes.
In this present study, I hope that I have demonstrated that there is a value in inves-
tigating the contrasts between what advanced students in a disciplinary area are able
to do, and how this compares with the language use of experts in the same field. Clearly,
lexical bundles are only part of the story; as Cortes’ study highlights, matching the
exemplar corpus to learners’ current needs is also a major issue. One of the things that
is now required is a serious effort to better understand how lexical bundles operate
across different stages in, for example, the development of an argument, the reporting
of results, and the citing of authorities, and how apprentice writers at different levels of
engagement with their disciplinary areas realise such literacy practices. Despite these
challenges, I would contend that lexical bundles offer a valuable starting point for a
better understanding of how apprentice writers write − and for how we as teachers can
help students to develop expertise in disciplinary writing.
References
Bazerman, C. 1994. Constructing Experience. Carbondale IL: Southern Illinois University Press.
Biber, D. 2006. University Language: A Corpus-based Study of Spoken and Written Registers.
Amsterdam: John Benjamins.
Biber, D. & Conrad, S. 1999. Lexical bundles in conversation and academic prose. In Out of
Corpora: Studies in Honor of Stig Johansson, H. Hasselgård & S. Oksefjell (eds), 181–189.
Amsterdam: Rodopi.
Cortes, V. 2004. Lexical bundles in published and student disciplinary writing: Examples from
history and biology. English for Specific Purposes 23: 397–423.
Cortes, V. 2006. Teaching lexical bundles in the disciplines: An example from a writing intensive
history class. Linguistics and Education 17(4): 391–406.
Feez, S. 1998. Text-based Syllabus Design. Sydney: NCELTR, Macquarie University.
Flowerdew, J. 1993. An educational or process approach to the teaching of professional genres.
ELT Journal 47(4): 305–316.
Granger, S. 1993. The international corpus of learner English. In English Language Corpora:
Design, Analysis and Exploitation, J. Aarts, P. de Haan & N. Oostdijk (eds), 57–69.
Amsterdam: Rodopi.
Granger, S. 1994. The learner corpus: A revolution in applied linguistics. English Today 39(10/3):
25–29.
Granger, S. & Tribble, C. 1998. Exploiting learner corpus data in the classroom: Form-focused
instruction and data-driven learning. In Learner Language on Computer, S. Granger (ed.),
199–209. Harlow: Longman.
Granger, S. & Paquot, M. 2009. In search of General Academic English: A corpus-driven study.
In Options and Practices of L.S.P Practitioners Conference Proceedings [University of Crete
Publications, E-media, 94–108], K. Katsampoxaki-Hodgetts (ed.). <http://cecl.fltr.ucl.ac.
be/Downloads/In_search_of_a_general_academic_english.pdf> (12 July, 2009).
Halliday, M.A.K. 1994. An Introduction to Functional Grammar, 2nd edn. London: Edward
Arnold.
Hyland, K. 2002. Specificity revisited: How far should we go now? English for Specific Purposes
21: 385–395.
Hyland, K. 2005. Stance and engagement: A model of interaction in academic discourse. Dis-
course Studies 7(2): 173–91.
Hyland, K. 2008a. Academic clusters: Text patterning in published and postgraduate writing.
International Journal of Applied Linguistics 18(1): 41–62.
Hyland, K. 2008b. As can be seen: Lexical bundles and disciplinary variation. English for Spe-
cific Purposes 27: 4–21.
Jenkins, J. 2000. The Phonology of English as an International Language. Oxford: OUP.
Johns, T. 1988. Whence and whither classroom concordancing? In Computer Applications in Lan-
guage Learning, T. Bongaerts, P. de Haan, S. Lobbe & H. Wekker (eds). Dordrecht: Foris.
Johns, T. 1994. From printout to handout: Grammar and vocabulary learning in the context of
data-driven learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–313.
Cambridge: CUP.
Lea, M.R. & Street, B.V. 1998. Student writing in higher education: An academic literacies
approach. Studies in Higher Education 23(2): 157–172.
Lillis, T.M. 2001. Student Writing: Access, Regulation, Desire. London: Routledge.
Meara, P. & English, F. 1987. Lexical errors and learners’ dictionaries. London: Birkbeck College,
Applied Linguistics Group. <http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/con-
tent_storage_01/0000019b/80/1c/4b/98.pdf> (16 November, 2009).
Rampton, M.B.H. 1990. Displacing the ‘native speaker’: Expertise, affiliation, and inheritance.
ELT Journal 44(2): 97–101.
Scott, M. 2008. WordSmith Tools version 5. Liverpool: Lexical Analysis Software.
Scott, M. & Tribble, C. 2006. Textual Patterns: Key Words and Corpus Analysis in Language Edu-
cation [Studies in Corpus Linguistics 22]. Amsterdam: John Benjamins.
Seidlhofer, B. 2000. Mind the gap: English as a mother tongue vs. English as a lingua franca.
Views 9(1): 51–58.
Seidlhofer, B. (ed.). 2003. Controversies in Applied Linguistics. Oxford: OUP.
Sinclair, J.M. (ed.). 1987. Looking Up. An Account of the COBUILD project. London: Collins.
Sinclair, J.M. 2004. New evidence, new priorities, new attitudes. In How to Use Corpora in Lan-
guage Teaching, J.M. Sinclair (ed.), 271–297. Amsterdam: John Benjamins.
Spack, R. 1988. Initiating ESL students into the academic discourse community: How far should
we go? TESOL Quarterly 22(1): 29–52.
Stevens, V. 1991. Classroom concordancing: Vocabulary materials derived from relevant, authentic
text. English for Specific Purposes 10: 10–15.
Summers, D. & Rundell, M. 1990. Longman Dictionary of Contemporary English, 2nd edn.
Harlow: Longman.
Thurston, J. & Candlin, C.N. 1997. Exploring Academic English: A Workbook for Student Essay
Writing. Sydney: NCELTR.
Tribble, C. 1989. The use of text structuring vocabulary in native and non-native speaker writ-
ing. MUESLI News. Jun-89: 11–13.
Tribble, C. 1999. Genres, keywords, teaching: Towards a pedagogic account of the language of
project proposals. In Rethinking Language Pedagogy from a Corpus Perspective: Papers from
the Third International Conference on Teaching and Language Corpora, [Lodz Studies in
Language], L. Burnard & T. McEnery (eds). Frankfurt: Peter Lang.
Tribble, C. 2001. Corpora and corpus analysis: New windows on academic writing. In Academic
Discourse, J. Flowerdew (ed.), 131–149. Harlow: Addison Wesley Longman.
Tribble, C. & Jones, G. 1990. Concordances in the Classroom. Harlow: Longman.
Widdowson, H.G. 1990. Aspects of Language Teaching. Oxford: OUP.
Widdowson, H.G. 1991. The description and prescription of language. In Georgetown University
Round Table on Languages and Linguistics 1991, J.E. Alatis (ed.), 11–24. Washington DC:
Georgetown University Press.
Willis, D. 1990. The Lexical Syllabus. London: Collins.
Willis, J. & Willis, D. 1988. Collins COBUILD English Course, Part 1. Birmingham: Collins
COBUILD.

Appendices
4-word clusters
BNC Baby Academic Norm BAWE Norm Applied Linguistics Norm KCL Apprentice Writing Norm Acta Tropica Norm
ON THE OTHER HAND 13 ON THE OTHER HAND 26 ON THE OTHER HAND 22 ON THE OTHER HAND 21 ON THE BASIS OF 5
THE END OF THE 9 AS A RESULT OF 22 ON THE BASIS OF 14 IN THE FORM OF 10 AS WELL AS THE 4
ON THE BASIS OF 9 THE END OF THE 18 THE END OF THE 13 AS A RESULT OF 8 AS A RESULT OF 4
AS A RESULT OF 8 IT IS IMPORTANT TO 17 AT THE END OF 12 IT IS IMPORTANT TO 7 THE END OF THE 6
AT THE END OF 7 AS WELL AS THE 15 AS WELL AS THE 10 AT THE END OF 6 IN THE CASE OF 6
IT IS IMPORTANT TO 6 IN THE FORM OF 15 IT IS IMPORTANT TO 9 ON THE BASIS OF 6 AT THE END OF 5
IN THE FORM OF 5 AT THE END OF 13 AS A RESULT OF 9 AS WELL AS THE 5 ON THE OTHER HAND 9
AS WELL AS THE 4 ON THE BASIS OF 6 IN THE FORM OF 8 THE END OF THE 5 IS ONE OF THE 3
IN TERMS OF THE 11 IN THE CASE OF 19 IN THE CASE OF 19 THAT THERE IS A 8 IN THE PRESENT STUDY 10
IN THE CASE OF 10 AT THE SAME TIME 15 AT THE SAME TIME 16 IN THE CONTEXT OF 6 AT THE TIME OF 5
THE WAY IN WHICH 8 THE FACT THAT THE 13 IN THE CONTEXT OF 12 A WIDE RANGE OF 5 THE RESULTS OF THE 4
THE EXTENT TO WHICH 7 CAN BE USED TO 12 A WIDE RANGE OF 11 TO BE ABLE TO 5 IN THE PRESENCE OF 9
IN THE CONTEXT OF 7 ONE OF THE MOST 11 THE EXTENT TO WHICH 10 I WOULD LIKE TO 11 IN THE ABSENCE OF 7
AT THE SAME TIME 7 THAT THERE IS A 10 ONE OF THE MOST 7 AS A LINGUA FRANCA 9 FOR THE TREATMENT OF 6
THAT THERE IS A 5 THE WAY IN WHICH 9 THE REST OF THE 7 OF ENGLISH AS A 8 IN THE TREATMENT OF 5
A WIDE RANGE OF 5 THE REST OF THE 9 IN TERMS OF THE 7 IS ONE OF THE 7 FOR THE DIAGNOSIS OF 5
ONE OF THE MOST 5 IN TERMS OF THE 9 THE FACT THAT THE 6 ENGLISH AS A LINGUA 7 ENZYME LINKED IMMUNOSORBENT ASSAY 4
THE REST OF THE 5 THE EXTENT TO WHICH 8 CAN BE USED TO 5 NATIVE SPEAKERS OF ENGLISH 6 WERE FOUND TO BE 4
CAN BE USED TO 4 IT IS POSSIBLE TO 12 TO BE ABLE TO 5 THE INVOLVEMENT LOAD HYPOTHESIS 15 FOR THE DETECTION OF 4
THE FACT THAT THE 4 TO BE ABLE TO 10 THE BEGINNING OF THE 11 WHEN IT COMES TO 8 WAS FOUND TO BE 4
IT IS POSSIBLE TO 8 IS ONE OF THE 10 THE USE OF THE 9 OF TEXTS HAVE YOU 8 FOR THE PRESENCE OF 4
IT IS CLEAR THAT 5 THE NATURE OF THE 10 NATIVE SPEAKERS OF ENGLISH 8 TEXTS HAVE YOU WRITTEN 8 HAS BEEN SHOWN TO 4
THE SIZE OF THE 4 IT IS CLEAR THAT 9 THE NATURE OF THE 7 I AM GOING TO 7 USED IN THIS STUDY 4
IN RELATION TO THE 4 IT IS DIFFICULT TO 8 OF ENGLISH AS A 6 ENGLISH LANGUAGE OF INSTRUCTION 7 RESEARCH AND TRAINING IN 4
IT IS DIFFICULT TO 4 THE USE OF THE 7 AS A LINGUA FRANCA 6 IN THIS ESSAY I 7 AND TRAINING IN TROPICAL 4
THE HOUSE OF LORDS 8 THE BEGINNING OF THE 6 ENGLISH AS A LINGUA 6 THE ROLE OF THE 7 THE TOTAL NUMBER OF 4
PER CENT OF THE 7 THE SIZE OF THE 6 I WOULD LIKE TO 5 IN THE FIELD OF 6 UNDP WORLD BANK WHO 4
IN THE UNITED STATES 5 IN RELATION TO THE 6 THE WAYS IN WHICH 7 ELT AND APPLIED LINGUISTICS 6 FOR RESEARCH AND TRAINING 4
AT THE TIME OF 5 CAN BE SEEN IN 11 AT THE UNIVERSITY OF 11 NATIVE LIKE USE OF 6 WAS CARRIED OUT IN 3
THE HOUSE OF COMMONS 5 IT CAN BE SEEN 9 AT THE BEGINNING OF 11 IN THE PROCESS OF 6 TRAINING IN TROPICAL DISEASES 3
AS SHOWN IN FIG 5 TO THE FACT THAT 9 IN THE USE OF 10 LIKE USE OF ENGLISH 6 IN THE NUMBER OF 3
IN THE ABSENCE OF 4 A RESULT OF THE 8 IN THE PRESENT STUDY 7 AS A FOREIGN LANGUAGE 6 ARE SHOWN IN TABLE 3
RULE IN RYLANDS V 4 CAN BE SEEN THAT 7 THE RESULTS OF THE 7 IMPLICIT AND EXPLICIT KNOWLEDGE 6 BEEN SHOWN TO BE 3
IS LIKELY TO BE 4 IT IS NECESSARY TO 7 OF SECOND LANGUAGE WRITING 7 IN THE EXPANDING CIRCLE 6 WORLD BANK WHO SPECIAL 3
THE RULE IN RYLANDS 4 IN THE SAME WAY 6 THE USE OF CORPORA 6 IN THE NEXT SECTION 6 THIS WORK WAS SUPPORTED 3
IN THIS CASE THE 4 THE ROLE OF THE 6 ON THE ONE HAND 6 ENGLISH AS A SECOND 5 IN AN AREA OF 3
THE NATURE OF THE 4 IS DUE TO THE 6 NATIVE AND NON NATIVE 6 TO LOOK AT THE 5 AS WELL AS IN 3
THE COURT OF APPEAL 4 DUE TO THE FACT 6 AS CAN BE SEEN 5 IT SHOULD BE NOTED 5 WORK WAS SUPPORTED BY 3
THAT THERE IS NO 4 THAT THERE IS NO 6 IT SHOULD BE NOTED 5 MEANING OF A WORD 5 IN THIS STUDY WE 3
AS WE HAVE SEEN 4 CAN BE SEEN AS 6 OF SPOKEN AND WRITTEN 5 OF THE INVOLVEMENT LOAD 5 SPECIAL PROGRAMME FOR RESEARCH 3

3-word clusters
BNC Baby Academic Norm BAWE Norm Applied Linguistics Norm KCL Apprentice Writing Norm Acta Tropica Norm
IN TERMS OF 34 IN ORDER TO 58 THE USE OF 80 IN ORDER TO 56 AS WELL AS 26
PART OF THE 30 AS WELL AS 36 IN ORDER TO 43 THE USE OF 51 THE USE OF 21
THERE IS A 30 THE FACT THAT 30 IN TERMS OF 40 IN TERMS OF 41 IN ORDER TO 19
ONE OF THE 27 THE USE OF 29 AS WELL AS 36 ONE OF THE 34 ONE OF THE 14
A NUMBER OF 27 THERE IS A 29 ONE OF THE 35 AS WELL AS 34 THE ROLE OF 11
AS WELL AS 26 ONE OF THE 27 A NUMBER OF 32 THERE IS A 30 A NUMBER OF 10
SOME OF THE 24 IN TERMS OF 26 THE FACT THAT 26 THE FACT THAT 29 THE FACT THAT 10
THE USE OF 23 AS A RESULT 19 ON THE OTHER 25 SOME OF THE 27 THE NUMBER OF 33
THE FACT THAT 21 PART OF THE 18 THE OTHER HAND 23 A NUMBER OF 23 THE EFFECT OF 15
IN ORDER TO 19 A NUMBER OF 17 SOME OF THE 23 ON THE OTHER 22 IT HAS BEEN 10
ON THE OTHER 16 ON THE OTHER 15 PART OF THE 17 THE OTHER HAND 21 THE PRESENCE OF 40
THE OTHER HAND 13 THE OTHER HAND 14 THERE IS A 17 AS A RESULT 20 IN THIS STUDY 22
AS A RESULT 11 SOME OF THE 13 AS A RESULT 15 PART OF THE 18 THE PRESENT STUDY 18
IT IS NOT 25 THE NUMBER OF 18 THE CASE OF 25 THE IMPORTANCE OF 31 THE DEVELOPMENT OF 18
THE NUMBER OF 24 THAT IT IS 18 THE NUMBER OF 24 IT IS NOT 25 THE PREVALENCE OF 16
THAT IT IS 17 IT IS NOT 18 THE END OF 22 THAT IT IS 23 OF PLASMODIUM FALCIPARUM 16
THE END OF 17 THE IMPORTANCE OF 15 IN THE CASE 20 THE ROLE OF 27 OF THE DISEASE 16
THE CASE OF 13 THE END OF 15 THE IMPORTANCE OF 15 DUE TO THE 19 DUE TO THE 15
IN THE CASE 12 THE CASE OF 14 IN WHICH THE 18 THAT THERE IS 17 ACCORDING TO THE 15
THERE IS NO 24 IN THE CASE 12 THE ROLE OF 18 THERE IS NO 16 IN THE PRESENT 14
IN WHICH THE 18 DUE TO THE 38 THE BASIS OF 16 BE ABLE TO 16 MATERIALS AND METHODS 14
THE BASIS OF 13 THERE IS NO 22 END OF THE 15 IN THE CLASSROOM 36 OF TRYPANOSOMA CRUZI 14
THAT THERE IS 13 CAN BE SEEN 21 SUCH AS THE 15 ENGLISH AS A 22 OF T CRUZI 13
PER CENT OF 24 BE ABLE TO 18 CAN BE SEEN 14 IN THE UK 19 OF THE PARASITE 13
TERMS OF THE 22 SUCH AS THE 17 IN THE CORPUS 22 THE CONCEPT OF 19 THE TREATMENT OF 13
IT MAY BE 18 NEED TO BE 17 IN ACADEMIC WRITING 21 THE INVOLVEMENT LOAD 19 SEE FRONT MATTER 13
IN THIS CASE 17 THAT THERE IS 15 USE OF THE 19 IN OTHER WORDS 19 WAS CARRIED OUT 12
IT HAS BEEN 16 IT IS A 13 OF IN THE 18 NON NATIVE SPEAKERS 18 WORLD HEALTH ORGANIZATION 12
AND IT IS 15 IT HAS BEEN 12 THE PRESENT STUDY 17 NEED TO BE 18 FOUND TO BE 12
LIKELY TO BE 15 END OF THE 11 AT THE SAME 17 INVOLVEMENT LOAD HYPOTHESIS 18 A TOTAL OF 12
THE EFFECT OF 13 IT CAN BE 21 THE CONTEXT OF 17 THE TARGET LANGUAGE 18 THE ABSENCE OF 12
TO BE A 13 TO BE A 15 THE BEGINNING OF 17 THE MEANING OF 17 BASED ON THE 12
AND SO ON 13 A RESULT OF 13 THE SAME TIME 16 THE RELATIONSHIP BETWEEN 17 RIGHTS RESERVED DOI 12
THE HOUSE OF 13 THE DEVELOPMENT OF 13 ENGLISH AS A 16 WOULD LIKE TO 17 E MAIL ADDRESS 11
IT IS A 13 IN THIS CASE 13 THE TEACHING OF 16 VARIETIES OF ENGLISH 16 CARRIED OUT IN 10
MANY OF THE 12 IT WOULD BE 12 IN THIS STUDY 16 BASED ON THE 16 IN REVISED FORM 10
BUT IT IS 12 BE USED TO 12 OF THE CORPUS 15 MOST OF THE 16 RECEIVED IN REVISED 10
IS LIKELY TO 12 THE PRESENCE OF 11 OF THE STUDENTS 15 THE NOTION OF 16 T B GAMBIENSE 10
IN THE FIRST 12 IT IS IMPORTANT 11 THE RESULTS OF 15 OF THE LANGUAGE 15 ANALYSIS OF THE 10
IT IS POSSIBLE 11 IT IS THE 11 ANALYSIS OF THE 15 THE PROCESS OF 15 IN THE PRESENCE 10
Automatic error tagging of spelling
mistakes in learner corpora
Paul Rayson and Alistair Baron
Manual error tagging of learner corpus data is time consuming and creates
a bottleneck in the analysis of learner corpora. This has led researchers to
apply techniques from the area of natural language processing to assist in the
automatic analysis of such data. This chapter presents the novel application of a
hybrid approach to the detection of spelling errors in learner data. The Variant
Detector (VARD) software was developed to match historical spelling variants to
modern equivalents with the intention of improving the accuracy and robustness
of corpus linguistics techniques when applied to historical corpora. Here, we
describe its application to detect spelling errors in written learner corpora
consisting of 50,000 words from each of three learner backgrounds (French,
German and Spanish).
1. Introduction
As witnessed by the contributions in this book and elsewhere, computer learner cor-
pus (CLC) research originating from the Louvain-la-Neuve team led by Sylviane
Granger has contributed significantly to the description and analysis of learner errors,
second language acquisition research and beyond (Dagneaux et al. 1998, Granger
1999, Granger & Thewissen 2005a, 2005b, Meunier & Granger 2008). One of the key
elements of CLC research, in addition to the collection of real language output from
learners, is the marking of learner errors directly in the resulting corpora. The annota-
tion of learner errors, also known as error tagging, enables mistakes to be counted,
sorted in specific ways as well as viewed in context. Error tagging has previously been
carried out manually or semi-automatically using software assistance in terms of com-
puter-aided error analysis (Dagneaux et al. 1998). However, even when using intelli-
gent editors, manual error tagging is time consuming and creates a bottleneck in the
analysis of learner corpora. Recently, researchers in Intelligent Computer-Aided Lan-
guage Learning (ICALL) have begun to apply results from the area of natural language
processing (NLP) to learner corpora with two workshops bringing together research
on the automatic analysis of learner language in 2008¹ and 2009². The research presented
in this paper continues this trend.
Spelling mistakes are one of the basic errors that learners make in their writing. In
other areas of corpus linguistic research, the consideration of spelling variants is also
an important issue, for example in corpora of online or internet varieties such as chat
language or email communication (Gries & Myslin 2009) where novel variants (such
as gr8 for great) are emerging. In applying corpus linguistic techniques to the analysis
of Early Modern English corpora, spelling variants (such as avysyd instead of advised)
are found to degrade the performance and robustness of techniques such as key word
analysis (Baron et al. 2009b), part-of-speech tagging (Rayson et al. 2007)³ and semantic
tagging (Archer et al. 2003). Hence, techniques have been developed to detect historical
spelling variants and insert modern equivalents alongside the original forms.
The corpus techniques can then be applied to the modern forms while retaining the
original spellings. Our contribution in this paper is to apply the Variant Detector
(VARD) software (Baron & Rayson 2008), originally designed to address this issue in
historical corpora, to learner data. Our aim is to evaluate VARD’s potential for the
automatic detection of learner spelling errors and the insertion of corrections within
the learner corpora. Patterns of learner spelling errors are more diverse (e.g. across
mother-tongue backgrounds) than spelling errors resulting from typos and other na-
tive errors. The hybrid approach taken by the VARD tool is therefore expected to be of
particular value in this area. Our research contributes to the understanding of auto-
matic analysis of learner language, and if successful, will partly address the bottleneck
of manual error analysis of learner corpora because at least one type of error can be
marked up automatically.
The remainder of the paper begins (in Section 2) with an introduction to the
VARD tool and a description of previous work on spelling errors in learner data. In
Section 3, we describe the experimental setup and the data used for the study. Section 4
presents the results and we conclude the paper in Section 5 with some suggestions for
further work.
2. Background
In addition to learner data, there are other varieties of language with significant
amounts of spelling variation. One such area is historical corpora. Over recent years,
vast digitisation efforts have been undertaken to create textual resources, for example
1. https://www.calico.org/p-364-CALICO%2008%20Workshop.html
2. https://calico.org/p-420-AALL09.html
3. A few similar studies have been carried out on the effect of learner errors on
part-of-speech taggers (van Rooy & Schäfer 2002).
the Open Content Alliance⁴, Google Books⁵ and Early English Books Online⁶. Much
of this data is out of copyright material from the Early Modern English period. In ad-
dition, historical corpora have been compiled containing texts from the same time
period; these include the Helsinki, ARCHER, ZEN, the Corpus of Early English Cor-
respondence (CEEC), the Corpus of English Dialogues (CED) and the Early Modern
English Medical Texts (EMEMT) corpus⁷. Our research on the detection of spelling
variants was driven by the need to take natural language processing (NLP) tools that
were trained on modern language varieties, and apply them to historical corpora.
When applied to historical data, the accuracy and robustness of existing NLP tools
(e.g. part-of-speech taggers and semantic taggers) are severely reduced by a number
of factors, the most prominent of which is historical spelling variation and the resultant
mismatch with the modern lexicons embedded within the tools. In addition, even the
most basic corpus linguistic techniques and methods such as frequency profiling, key
words, concordances and collocations are affected due to the dispersal of frequency
counts of a given word across a number of different orthographic forms in the corpus
(see Baron & Rayson 2009, for a summary).
Our solution to the problems of historical corpus analysis was to develop the Vari-
ant Detector (VARD) software, to act as a pre-processor for NLP and corpus tools. The
initial aim of the software was to process the corpus data and insert modern equiva-
lents alongside the historical variants. The historical variants were preserved within an
XML tag, but the subsequent NLP or corpus tools ‘saw’ only the modern equivalent for
tagging or searching, for example:
<replaced orig="companie">company</replaced>
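As a minimal sketch of how a downstream tool might ‘see’ only the modern equivalents
while the originals are retained, the following Python fragment reads the markup
shown above. The function names modern_view and original_view and the sample
sentence are hypothetical illustrations and are not part of VARD.

```python
import re

# Matches VARD-style markup such as <replaced orig="companie">company</replaced>.
REPLACED = re.compile(r'<replaced orig="([^"]*)">(.*?)</replaced>', re.DOTALL)

def modern_view(text: str) -> str:
    """Return the text as a tagger would see it: modern equivalents only."""
    return REPLACED.sub(lambda m: m.group(2), text)

def original_view(text: str) -> str:
    """Return the text with the original spellings restored."""
    return REPLACED.sub(lambda m: m.group(1), text)

# Made-up sample sentence, for illustration only.
sample = 'a goodly <replaced orig="companie">company</replaced> of men'
print(modern_view(sample))    # a goodly company of men
print(original_view(sample))  # a goodly companie of men
```

Because both views are derived from the same file, frequency counts can be taken
over the modern forms without discarding the original spellings.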
The original version of the VARD tool (Rayson et al. 2005) exploited a large manually
compiled list of c. 45,000 entries each consisting of an historical variant linked to its
modern equivalent. The pre-processing consisted of a search and replace operation on
the corpus data. This first version was useful for corpora on which it had been trained
but suffered due to its fixed list of variants when applied to previously unseen corpora.
Due to the nature of historical spelling variation, listing all of the possible variants was
shown not to be a scalable solution. VARD2⁸ (Baron & Rayson 2008) was then developed
to address this limitation by incorporating a hybrid of other methods for detecting
variants and finding candidate modern equivalents using techniques embedded in spell
4. http://www.opencontentalliance.org/
5. http://books.google.com
6. http://eebo.chadwyck.com/home
7. See the Corpus Resource Database for more information: http://www.helsinki.fi/varieng/CoRD/
8. VARD 2 is freely available for academic research from http://www.comp.lancs.ac.uk/~barona/vard2/
checkers such as those contained in word processors (e.g. Microsoft Word). The process
incorporated in the current version of VARD (2.2) employs the following steps:
1. Compare each word in the input text to a large and broad coverage modern word
list derived from the British National Corpus and the Spell Checking Oriented
Word List (SCOWL)⁹. If the input text word is not found in the modern list then
mark it as a potential variant.
2. For each potential variant, produce a list of candidate modern equivalents using
the four techniques below and rank the resulting list with a confidence score based
on the weighted combination of techniques used to find each candidate:
a. Known variants list, i.e. the manually created list of historical variants and
modern equivalents
b. Phonetic matching algorithm, adapted from the Soundex phonetic algorithm
that is used to assign the same code to homophones in order to match them
despite small spelling differences
c. Letter replacement rules, representing common patterns of alternative spell-
ings, e.g. ‘replace u with v’
d. Edit distance, which records the number of edits (insertions, deletions and
substitutions) required to transform the variant to its equivalent
3. In the interactive version of VARD2, present the resulting ranked list to the user
alongside each variant, allowing the user to choose the best modern replacement (in a
similar way to how corrections are displayed in word processors). A non-interactive
‘batch’ version of the tool also exists which can perform automatic insertion of the
highest ranked modern equivalents that have scores above a certain threshold.
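The steps above can be sketched in Python as follows. This is an illustration under
stated assumptions rather than VARD’s implementation: the word list, known variants,
letter rules, weights, the edit-distance cut-off of 2 and the threshold are toy
values chosen for the example; standard Soundex stands in for VARD’s adapted
phonetic algorithm; and the way the four methods are combined (summing the weights
of the methods that propose a candidate) is only one possible reading of ‘weighted
combination’.

```python
# Toy resources (assumptions); the real tool uses a BNC/SCOWL word list.
MODERN_WORDS = {"advised", "company", "great", "have", "love", "upon"}
KNOWN_VARIANTS = {"companie": "company"}
LETTER_RULES = [("u", "v"), ("v", "u"), ("ie", "y")]   # e.g. 'replace u with v'
WEIGHTS = {"known": 0.4, "phonetic": 0.2, "rules": 0.25, "edit": 0.15}  # hypothetical

SOUNDEX_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Standard four-character Soundex code."""
    word = word.lower()
    code, prev = word[0].upper(), SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":          # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def apply_rules(word):
    """All forms reachable by applying one letter rule at one position."""
    forms = set()
    for old, new in LETTER_RULES:
        start = word.find(old)
        while start != -1:
            forms.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return forms

def candidates(word):
    """Step 2: rank candidate modern equivalents by summed method weights."""
    found = {}                      # candidate -> methods that proposed it
    if word in KNOWN_VARIANTS:
        found.setdefault(KNOWN_VARIANTS[word], set()).add("known")
    for form in apply_rules(word):
        if form in MODERN_WORDS:
            found.setdefault(form, set()).add("rules")
    for modern in MODERN_WORDS:
        if soundex(modern) == soundex(word):
            found.setdefault(modern, set()).add("phonetic")
        if edit_distance(word, modern) <= 2:
            found.setdefault(modern, set()).add("edit")
    scored = {c: sum(WEIGHTS[m] for m in ms) for c, ms in found.items()}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

def normalise(tokens, threshold=0.3):
    """Steps 1 and 3 (batch mode): replace flagged words scoring above threshold."""
    out = []
    for tok in tokens:
        ranked = [] if tok.lower() in MODERN_WORDS else candidates(tok.lower())
        out.append(ranked[0][0] if ranked and ranked[0][1] >= threshold else tok)
    return out

print(normalise(["great", "companie", "loue"]))  # ['great', 'company', 'love']
```

In this toy setting companie is proposed by all four methods, whereas loue is
proposed only by the letter rule and edit distance, so it survives only because the
threshold is set low.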
The confidence score that is used to rank the candidate modern equivalents is based on
a weighted combination of the four methods listed above (see 2.a-d). Initial weights
are assigned to each method based on our previous experience with the tool, but when
applied to text the tool recalculates these weights based on the number of times that a
method is used to find a candidate replacement that is subsequently chosen by the user
in the interactive tool. Hence, during a training phase these weights can change
substantially over time. This might happen, for example, where the pre-built list of
known variants is less suitable than expected for a particular historical corpus and
the letter replacement technique more often suggests the candidate that the user
chooses. It is this capability to ‘learn’ that makes VARD2 a tool worthy of
consideration for other situations where spelling variation occurs. It allows the tool to
be retrained by first applying it to sample texts from a particular corpus and then run-
ning it in a non-interactive mode over the remainder of the corpus. The interactive
version of VARD2 also permits users to customise the tool by adding to the built-in list
of known variants, replacing or extending its modern lexicon (i.e. for a different lan-
guage) and adding new letter replacement rules. The learning method is described in
9. See http://wordlist.sourceforge.net/scowl-readme
more detail in Baron & Rayson (2009) where the tool is evaluated on a child language
corpus in addition to an historical dataset.
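The re-weighting idea can be illustrated with a short sketch. The actual update
rule is the one given in Baron & Rayson (2009) and is not reproduced here; the
scheme below, in which initial weights act as pseudo-counts and each accepted
candidate credits every method that proposed it, is a hypothetical stand-in, as are
the class name MethodWeights and the parameter prior_strength.

```python
from collections import Counter

class MethodWeights:
    """Hypothetical count-based re-estimation of method weights."""

    def __init__(self, initial, prior_strength=10.0):
        # Initial weights become pseudo-counts so early choices do not swamp them.
        self.counts = Counter({m: w * prior_strength for m, w in initial.items()})

    def record_choice(self, methods):
        """Credit each method that proposed the candidate the user accepted."""
        for method in methods:
            self.counts[method] += 1

    def weights(self):
        total = sum(self.counts.values())
        return {m: c / total for m, c in self.counts.items()}

initial = {"known": 0.4, "phonetic": 0.2, "rules": 0.25, "edit": 0.15}
tracker = MethodWeights(initial)
# Suppose the user accepts 20 candidates found by letter rules and edit distance only.
for _ in range(20):
    tracker.record_choice({"rules", "edit"})
w = tracker.weights()
print(w["rules"] > w["known"])  # True
```

After such a training phase the known variants list, which started with the largest
weight, contributes least, which mirrors the situation described above where the
pre-built list suits a new corpus less well than expected.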
In addition to VARD2, we have developed a complementary tool called DICER (Dis-
covery and Investigation of Character Edit Rules). DICER takes a corpus previously stan-
dardised manually through VARD or a list of variant and equivalent mappings and ex-
tracts a database of letter replacement rules and their frequencies of use in the input text.
This frequency analysis of letter replacement patterns can be used subsequently to create
a new set of letter replacement rules for VARD2 or to find extensions or restrictions on
the existing set of rules. The aim for DICER was to improve the accuracy of VARD2 after
manual training, and we have shown that a significant increase in performance follows
(Baron et al. 2009a). In addition, DICER results can be used to study spelling variation in
a given corpus, e.g. to find changes in spelling patterns over time or in child language
data. In this paper, we will apply DICER to detect patterns of misspelling in learner data.
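A rough idea of this kind of rule extraction can be given with Python’s standard
difflib module. This is a sketch, not DICER itself: the function names are
hypothetical, the example pairs are taken from forms quoted in this paper, and
difflib’s alignment is not guaranteed to be a minimal edit script, so the rules it
yields may differ from those DICER would report.

```python
from collections import Counter
from difflib import SequenceMatcher

def edit_rules(variant, equivalent):
    """Yield (edit type, variant characters, replacement characters) tuples."""
    matcher = SequenceMatcher(None, variant, equivalent, autojunk=False)
    names = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield names[op], variant[i1:i2], equivalent[j1:j2]

def rule_frequencies(pairs):
    """Count how often each character edit rule occurs over variant/equivalent pairs."""
    counts = Counter()
    for variant, equivalent in pairs:
        counts.update(edit_rules(variant, equivalent))
    return counts

pairs = [("programm", "program"), ("companie", "company"), ("imacculate", "immaculate")]
for rule, freq in rule_frequencies(pairs).most_common():
    print(freq, rule)
```

Rules that recur frequently in such a table are the natural candidates for addition
to the letter replacement rules used in VARD2.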
Until recently, there has been a dearth of work focussing on the spelling errors of
English as a Foreign Language/English as a Second Language (EFL/ESL) language
learners. As Granger (2003) points out, error-tagged corpora of learner language are
especially useful for second language acquisition (SLA), foreign language teaching
(FLT) research and computer-assisted language learning (CALL). As well as the re-
search on learner corpora, research on spelling errors falls across multiple areas within
linguistics, second language acquisition, language learning and teaching, psycholin-
guistics, and educational research.
An early study by Ibrahim (1978) categorised the spelling errors made by a group
of Arab EFL learners in the Department of English at the University of Jordan. The size
of the group of learners was unspecified and the resulting categorisation described a
list of error types rather than tokens. Bebout (1985) compared misspellings made by
first (English speaking children) and second language learners (Spanish-speaking
adults learning English). She described how the ESL-speakers made more errors in-
volving consonant doubling. Zutell & Allen (1988) examined the spelling strategies of
108 English-Spanish bilingual children and found a Spanish phonological influence
on the spelling of English words. Wade-Woolley & Siegel (1997) examined the spelling
performance of 79 children and found that second language (ESL) speakers performed
in a similar manner to native speakers. Cook (1997) compared the spelling of adult
second language (L2) users of English with that of first language (L1) children and L1
adult users. Cook’s source data was corpus-based and taken from a mixture of exam
scripts, essay samples from assessment tests, and essays produced not under exam
conditions. Wang & Geva (2003) carried out a longitudinal study of 35 Chinese ESL
children and found an L1 transfer effect in relation to two English phonemes. Figuere-
do’s (2006) extensive review article incorporated an analysis of 27 other papers consid-
ering the influence of first language upon ESL learners’ spelling, including those men-
tioned above. The majority of the studies cited used a descriptive qualitative approach
and supported the hypothesis that there is a relationship between the first language of
the ESL learner and development of skills in English spelling.
What all of these studies have in common is the manual approach to finding spell-
ing errors in language data, the resultant small size of the data sets and the variety of
patterns observed from different learner backgrounds. Since the late 1990s, the amount
of research activity involving the collection and annotation of learner corpora has
grown significantly (Tono 2003), yet manual error analysis is still predominantly the
norm. In this paper, we aim to address this issue by piloting a hybrid approach to the
automatic discovery and correction of learner spelling errors. The
eventual aim is to allow much larger datasets to be analysed automatically, thus im-
proving the reliability and replicability of the research.
We follow in the footsteps of other corpus-based studies of spelling errors from
language learners. Lefer & Thewissen (2007) studied orthographic and morphological
errors in learner argumentative essays using samples from the Spanish, German and
French components of the International Corpus of Learner English (ICLE) represent-
ing intermediate to advanced learner writing. They compared a manual approach using
the second version of the Louvain error tagging system (Dagneaux et al. 2005) with an
automatic approach which identified unknown words as a side effect of semantic tag-
ging using the UCREL Semantic Analysis System (USAS) tagger (Rayson et al. 2004).
In order to use these automatic results, a significant amount of manual weeding out was
needed (around half of the results) since the USAS tagger marked as unknown (and
therefore as candidates for spelling errors) proper nouns and other words that were not
in its lexicon. The manual approach was shown to be better at identifying contextual
and capitalisation errors e.g. woman for women and church for Church. On average the
manual approach identified around 14% more learner errors. Lefer & Thewissen’s paper
was itself a revisitation of an earlier study by Granger & Wynne (1999) to explore the
effect of learner corpora on measures of lexical variation such as type-token ratio.
Granger & Wynne showed that such measures should be considered unsafe on learner
corpus data because it contains non-standard word forms (i.e. spelling errors and non-
standard coinages) which will unduly boost type-token ratio measures. As a side effect
of their research, they were able to extract these non-standard forms using an earlier
version of the USAS system. Milton & Chowdhury (1994) reported on the corpus-based
analysis of interlanguage which included the manual markup of spelling errors in a
written corpus of Chinese learners of English. Nicholls (2003) described the error cod-
ing in the Cambridge Learner Corpus which included the markup of spelling errors in
a very large (6-million word) component. However, this data is not publicly available.
Recently, research in Intelligent Computer-Aided Language Learning (ICALL)
has applied natural language processing techniques to learner corpora with the aim of
improving spelling checkers for L2 writers in addition to informing second language
acquisition research. Some research in this area has found its way into prototype
applications. The Microsoft Research ESL Assistant¹⁰ targets common errors made by
native speakers of East Asian languages (Chinese, Japanese and Korean) (see Gamon
10. http://research.microsoft.com/en-us/projects/msreslassistant/
et al. 2008, 2009). Check My Words¹¹ is targeted at Chinese learners of English in Hong
Kong and enables learners to check their vocabulary and grammar online (Milton, 2004).
However, further progress is still required. Rimrott & Heift (2008) evaluated the
performance of the spell checker in Microsoft Word on 1,027 spelling error types (1,808
tokens) of L2 writers in German and found that only 62% of the errors were successfully
detected and corrected. This was mainly due to L2 learner errors in their corpus
containing multiple erroneous letters rather than one single erroneous letter as is more
expected in native writing. This estimate is confirmed in a study by Hovermale (2008)
who found that one third of the errors made by Japanese learners of English were not
detected and corrected by standard spell checkers.
As with spell checking of native language, the detection and correction of spelling
errors for learner language must deal with both non-word and real-word errors. For
real-word errors, where the error is contextual (e.g. their instead of there), the
problem has extra complexity in learner data because the error most likely sits
alongside other errors in the surrounding context and there are more possible
deviations from the target word (Hovermale & Mehay 2009). Lee (2009) has extended the
work on spelling error correction to that of grammatical error correction using
syntactic analysis.
3. Experiment
Having described the previous research in this area, we now turn to the evaluation of
the detection and correction abilities of the VARD software on learner data. In this
section, we discuss the experimental methodology.
The data that we used for this experiment is an expanded version of the set that
was used to study orthographic and morphological errors in learner writing (Lefer &
Thewissen 2007). That study used 30,000 words each from three learner populations
(Spanish, German and French) with data drawn from the ICLE corpus. Since the 2007
study, this data set has been expanded by Jennifer Thewissen and incorporates c.
50,000 words per L1 background. The data is marked up for all types of learner error.
However, we focussed on the spelling and morphological errors, marked as (FS), (FM)
and (GADJN) in the corpus.
The descriptions of the relevant tags under the form (F) category in the ICLE corpus
error tagging manual (Dagneaux et al. 2005) are as follows:
(FS) includes all spelling errors. It is also used for mis-
use or omission of capital letters, word coinages (those
which do not belong to the (FM) category), borrowings,
homophones (e.g. it’s/its, their/there), doubling of con-
sonants/vowels, and misuse/omission of hyphens/blanks in
compound words.
11. http://mywords.ust.hk/
(FM) is used for morphological errors (inflectional and
derivational). Inflectional errors result from the misuse
of grammatical morphemes (plural, genitive, verb morphol-
ogy, degree of adjectives, etc.) while derivational er-
rors are due to the addition of an erroneous affix to an
existing word.
Learner errors are marked in the ICLE corpus with an error tag in round brackets fol-
lowed by the original error and then the corrected form which is enclosed in dollar
signs. In order for the data to be processed by VARD and DICER, it was converted into
an XML format. All other error tags and corrections were removed with a Perl script
leaving only the spelling errors. For example, for the original text:
If we believe the boulevard newspapers and the talk shows
and especially the commercials, the human race is on (FS)
it’s $its$ way to perfection. We are offered the right
soap for dry or greasy skin, the perfect collar for (GDO)
your $our$ dachshund or (FS) sheep dog $sheepdog$, the
(FS) imacculate $immaculate$ photo for (GDO) your $our$
wedding album, the ideal book for (GDO) your $our$ (LS)
difficulties $problems$ in maths and biology, the best fit-
ness (FS) programm $program$ for (GDO) your $our$ (LP)
life belts $excess weight$ – or rather against them.
The converted version was as follows:

If we believe the boulevard newspapers and the talk shows
and especially the commercials, the human race is on
<replaced orig="it’s">its</replaced> way to perfection. We
are offered the right soap for dry or greasy skin, the
perfect collar for your dachshund or
<replaced orig="sheep dog">sheepdog</replaced>, the
<replaced orig="imacculate">immaculate</replaced> photo
for your wedding album, the ideal book for your
difficulties in maths and biology, the best fitness
<replaced orig="programm">program</replaced> for your life belts –
or rather against them.
In addition, other spelling errors occur when a general rule of grammar is broken, in
which case the error is classified under the Grammar (G) category in the error tagging
manual e.g. (GADJN) poors $poor$ people. Here, the word poors is a spelling error, in
the sense that the attachment of the ‘s’ breaks the rule that all adjectives are invariable
in English. These errors were also included in our experimental dataset.
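The conversion described in this section was carried out with a Perl script that is
not reproduced here. As a hedged illustration only, an equivalent transformation
could be sketched in Python as follows. The regular expression assumes the simple,
non-nested layout seen in the example above (tag in round brackets, learner form,
correction between dollar signs), and the function name convert is hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Tags treated as spelling errors in this study.
SPELLING_TAGS = {"FS", "FM", "GADJN"}

# (TAG) learner form $corrected form$ ; assumes error tags are not nested.
ERROR = re.compile(r"\(([A-Z]+)\)\s+([^$()]*?)\s*\$([^$]*)\$")

def convert(text):
    """Keep spelling corrections as <replaced> markup; strip all other tags."""
    def repl(match):
        tag, learner, corrected = match.groups()
        if tag in SPELLING_TAGS:
            return f"<replaced orig={quoteattr(learner)}>{escape(corrected)}</replaced>"
        return learner  # other error types: keep the learner's wording only
    # Note: text outside the error tags is passed through without XML escaping.
    return ERROR.sub(repl, text)

line = "the (FS) imacculate $immaculate$ photo for (GDO) your $our$ wedding album"
print(convert(line))
# the <replaced orig="imacculate">immaculate</replaced> photo for your wedding album
```

Non-spelling tags such as (GDO) are dropped together with their corrections, so the
learner’s original wording is what remains in the converted text.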
The converted corpus was fed into VARD which enabled us to calculate the accu-
racy of the VARD tool. The manually marked up corpus was used as a gold-standard
for the first analysis. Subsequently, the manually marked up corpus was split into train-
ing and test sets in order to see whether VARD’s learning abilities can be used to retrain
the tool from the detection of Early Modern English variants to those produced by
language learners. In parallel to the VARD analysis, the manually marked up corpus
was loaded into the DICER tool and the results are discussed in the following section.
4. Results
Using the manually corrected corpus as a gold-standard in the DICER tool, we can
produce a variety of statistics of interest. Although these results are not the main focus
of this paper, it is worth highlighting the differences between the corpora in order to
place our later VARD analysis in context. First, we can compare the profiles for edit
distance. As mentioned in Section 2, edit distance is the number of edits (insertions,
deletions and substitutions) that are needed to change an original word form produced
by the learner into the corrected form that has been manually inserted. It should be
noted that neither DICER nor VARD takes account of corrections (i.e. edits) due to
capitalisation; these are given an edit distance of zero. Without contextual
knowledge, it is difficult to correctly predict capitalisation errors. Overall, there are
1765 spelling errors manually marked in the three corpora, of which 307 are due to
capitalisation errors. Table 1 shows the percentage of corrections (and therefore learn-
er errors) that are due to capitalisation. The rate in the German corpus (13.7%) is no-
ticeably lower than in the French and Spanish corpora.
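The percentages follow directly from the counts reported in Table 1, as the short
check below illustrates. The variable names are ours and purely illustrative.

```python
# Counts as reported in Table 1.
all_replacements = {"All": 1765, "French": 351, "German": 432, "Spanish": 982}
capitalisation = {"All": 307, "French": 67, "German": 59, "Spanish": 181}

shares = {corpus: round(100 * capitalisation[corpus] / total, 1)
          for corpus, total in all_replacements.items()}
for corpus, share in shares.items():
    print(f"{corpus:8s}{share:5.1f}%")
```

The three language-specific counts also sum to the overall figures (351 + 432 + 982
= 1765 and 67 + 59 + 181 = 307).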
Turning now to errors which are not due to capitalisation, Table 2 shows the profile
of edit distance for the overall corpus and then broken down by learner background.
Rimrott & Heift (2008) found that only 62% of the L2 misspellings in their dataset
were corrected by Microsoft Word. This was mainly due to many L2 misspellings
Table 1. Percentage of replacements due to capitalisation
                                      All     French   German   Spanish
All replacements                      1765    351      432      982
Replacements due to capitalisation    307     67       59       181
  as % of all replacements            17.4%   19.1%    13.7%    18.4%
Table 2. Edit distance profile
Edit distance   All %   French %   German %   Spanish %
1               75.0    69.7       83.1       73.0
2               14.2    16.2        8.6       16.1
3                4.0     5.3        2.9        4.1
4                2.0     2.5        1.9        1.9
5+               4.8     6.3        3.5        4.9
in their dataset containing multiple edits. Our results show that an average of 75% of
the learner errors have an edit distance of one, representing the insertion, deletion or
substitution of one character between the learner error and the corrected form. An
average of 14.2% of the errors have an edit distance of two from the corrected form,
although the German corpus shows a significantly lower percentage (8.6%). Beyond
this point, there are much smaller numbers of errors: 4.0% showing an edit distance of
three overall, 2.0% with an edit distance of four overall and then 4.8% of the correc-
tions have an edit distance of five or more.
Using DICER, we can also detect where in each word the corrections are required.
Table 3 shows the location of these changes within words of each corpus. It can be seen
that French learners are more likely than German learners to make spelling errors at
the end of words and Spanish learners even more so. By contrast, German learners are
much more prone to make spelling errors towards the middle of words than French
and Spanish learners.
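How a correction might be assigned to one of these positions can be sketched as
follows. The exact definitions used by DICER are not given in this paper, so the
mapping below is an assumption: the position is taken from the index in the learner
form at which an edit begins, with the first and last characters labelled Start and
End, their neighbours Second and Penultimate, and everything else Middle. The
alignment comes from Python’s difflib and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def position_label(index, length):
    """Map a character index in the learner form to a position category."""
    if index <= 0:
        return "Start"
    if index >= length - 1:
        return "End"
    if index == 1:
        return "Second"
    if index == length - 2:
        return "Penultimate"
    return "Middle"

def correction_positions(learner, corrected):
    """Return one position label per edit needed to reach the corrected form."""
    matcher = SequenceMatcher(None, learner, corrected, autojunk=False)
    return [position_label(i1, len(learner))
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

print(correction_positions("programm", "program"))       # ['End']
print(correction_positions("imacculate", "immaculate"))  # ['Middle', 'Middle']
```

Counting these labels over all corrections in a corpus gives a profile of the kind
reported for each learner group.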
The DICER analysis also permits us to count the types of spelling errors being
made and, in addition, to group them by type of edit, i.e. by insertion, deletion
and substitution. Table 4 shows these comparative results that are based on types of
rules. From these initial results, we can hypothesise that Spanish learners make more
substitution errors than French and German learners. It should be noted that this table
is based on numbers of types rather than tokens. Therefore, the overall figures are not
an average of the three groups. The higher overall percentage of substitution errors
(76.2%) is due to a lack of overlap between specific types of substitution errors, i.e. the
three corpora do not share as many types of substitution errors (as they do with dele-
tions and insertions). The implications of this difference between learner groups will
also emerge in the VARD analysis that is to follow.
Table 3. Position of corrections within words
Position of correction   All %   French %   German %   Spanish %
Start                     7.2     7.3        4.9        8.2
Second                    8.4     7.9        6.8        9.3
Middle                   56.7    60.6       66.8       50.8
Penultimate               8.6     7.9       10.7        7.8
End                      19.2    16.4       10.7       23.9
Table 4. Type of spelling errors made
Type of correction   All %   French %   German %   Spanish %
Deletion             12.1    15.9       17.4       12.6
Insertion            11.7    14.7       16.5       12.6
Substitution         76.2    69.4       66.1       74.8
The DICER software would also allow us to investigate further differences in terms of
specific learner spelling errors. However, such an analysis is out of scope for this study
and remains future work. Now, we turn to the results arising from the application of
the VARD tool to the data.
During our initial experiments on the learner corpus data, we made some im-
provements to the VARD software over and above those described in our recent work
(Baron & Rayson, 2009). These included improvements to how the known variants list
and letter replacement rules are used. For the known variants list, each individual vari-
ant to modern equivalent mapping is now assigned a precision and recall score which
contributes to the confidence score for a candidate variant replacement. This allows
more fine-grained training of VARD as some entries in the list will be of more value to
the current dataset than others (e.g. entries of Early Modern English are likely to be of
much less use for learner errors than new additions to the list from training). The letter
replacement technique was also improved to allow more specific additions to the rule
list from the DICER analysis; previously a rule’s application position was limited to
start, end or anywhere. Middle, second and penultimate positions have been added to
bring the letter replacement method in line with the DICER analysis.
Taking the manually corrected corpus as a gold-standard, we are able to count the
number of ‘real-word errors’ using VARD. These are learner spelling errors that match
other words in the VARD dictionary, e.g. it’s (learner) for its (corrected) and the
(learner) for they (corrected). In addition, the real word error category includes a large
number of corrections due to removing or inserting spaces or hyphens e.g. match box
(learner) for matchbox (corrected) and mountain-bikes (learner) for mountain bikes
(corrected). Without taking into account local context, VARD and other spell-check-
ing tools are unable to spot these errors. Table 5 shows the percentages of types and
tokens that are real-word errors. The Spanish learners represented in the corpus make
the lowest proportion of spelling errors that are also real words, while the German
learners make over double that proportion, although they make fewer errors
overall. If we include errors due to capitalisation, then the overall percentages increase
to 27.7% for types and 33.8% for tokens. This corresponds roughly with a third of the
tokens in the Rimrott & Heift (2008) study which Microsoft Word was unable to cor-
rect. The large number of real word errors causes a significant problem for VARD as
discussed below.
Given that we have a manually checked corpus where all the learner errors have
been marked and corrected, we can apply the VARD software to this data and calculate
Table 5. Real-word errors as a percentage of total spelling errors
         All %   French %   German %   Spanish %
Types    19.7    20.7       36.4       12.7
Tokens   21.9    22.6       35.9       15.2
Table 6. VARD recall and precision before training
            Types %   Tokens %
Recall       7.8       7.6
Precision   87.9      88.9
the number of learner spelling errors that it detects and how many it corrects. By com-
paring the automatic method as represented by VARD with the manual method from
the corpus, we can calculate how accurate the tool is. It is worth reiterating at this point
that the VARD tool was designed to detect historical spelling variants in Early Modern
English and the resources it contains in terms of known variants list and letter replace-
ment rules have been trained for this kind of data. Applying the tool as is, without any
training, we observe the results shown in Table 6.
The accuracy (precision) of the untrained tool is high with a success rate of almost
90%. However, the number of learner spelling errors that it finds and tries to correct
(recall) is low, under 10%. There is a compromise to be made between recall and preci-
sion; if we wish to lower the thresholds in the tool and attempt to correct more errors,
then the accuracy will fall. Two scenarios can be imagined. First, where the analyst will
carry out manual spot checks on the data output from VARD. In this case, lower preci-
sion and higher recall are acceptable. Second, where the analyst wishes to run a very
large amount of data without carrying out any post-editing. Here, higher precision is
preferred since we would not want the tool to introduce corrections where they are not
needed (false positives).
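The trade-off can be made concrete with a small sketch. The formulas are not spelled
out in this paper, so the definitions below are one plausible reading: precision is
the share of accepted replacements on gold-standard errors that match the manual
correction, recall is the share of gold-standard errors corrected to the manual
form, and replacements of words that the analyst did not mark are counted
separately as false positives. All data values and names are invented for
illustration.

```python
def evaluate(gold, proposals, threshold):
    """gold: token index -> manual correction.
    proposals: token index -> (top candidate, confidence score)."""
    accepted = {i: cand for i, (cand, conf) in proposals.items() if conf >= threshold}
    on_gold = {i: cand for i, cand in accepted.items() if i in gold}
    correct = sum(1 for i, cand in on_gold.items() if gold[i] == cand)
    precision = correct / len(on_gold) if on_gold else 0.0
    recall = correct / len(gold) if gold else 0.0
    false_positives = len(accepted) - len(on_gold)
    return precision, recall, false_positives

# Invented toy data: four manually marked errors and four system proposals.
gold = {3: "immaculate", 8: "program", 12: "its", 20: "which"}
proposals = {3: ("immaculate", 0.90), 8: ("programme", 0.60),
             20: ("which", 0.55), 15: ("and", 0.50)}

for threshold in (0.8, 0.5):
    p, r, fp = evaluate(gold, proposals, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} false positives={fp}")
```

In this toy example lowering the threshold doubles recall but admits one wrong
correction and one false positive, which is the pattern the two scenarios above
have to weigh.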
In addition to the learner spelling errors that are manually marked in the data set,
VARD also considers the remaining words in the text. Therefore, it may also detect oth-
er candidates for learner errors that have not been marked as such by the human analyst,
e.g. und¹². The rate of false positives is very low as reported in Table 7. From the point of
view of our experiment, these can be viewed as mistakes by VARD. However, if it can
help identify untagged learner spelling errors, VARD can then be used as a tool to assist
the human analyst with manual error tagging.
The second stage of our experiment is to use part of the manually corrected corpus
to train the VARD tool on the type of spelling errors that learners make. Our approach
has been to use three quarters of each corpus as training material and one quarter for
testing. The training and test sets are sampled randomly in sections of 500 (±10)
words. We use a replacement threshold of 80% in VARD for all experiments
Table 7. VARD false positive rate before training
                  Types %   Tokens %
False positives   5.8       4.6
12. This was deliberately left untagged by the analyst since it occurs in German book titles.
Table 8. VARD recall and precision after training
            Types %   Tokens %
Recall      13.8      16.2
Precision   87.9      90.7
reported. Following this training process, we observe the results shown in Table 8 for
the overall dataset of the three corpora. The precision values have increased slightly
while the recall values have significantly improved. This means that VARD is correct-
ing around double the number of learner spelling errors compared to before training,
without any loss of accuracy. Splitting the data into individual native languages and
training VARD on these datasets in isolation provides improved performance over
using the dataset as a whole. Figure 1 shows VARD’s recall rising to higher levels in the
individual language sets; in terms of tokens¹³, scores of 14.5% for French, 20% for
German and 16.2% for Spanish are observed. An improvement in precision was also
observed, with French and German reaching 100% accuracy throughout the training
process and Spanish beginning with 100% accuracy but dropping slightly to 96.6% by
the end of training.
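The division of each corpus into training and test material used in this stage can
be sketched as follows. This is an assumption-laden illustration: the paper states
only that sections of 500 (±10) words were sampled randomly with three quarters
used for training, so the random jitter, the seed and the function name
split_corpus are ours.

```python
import random

def split_corpus(tokens, section_size=500, jitter=10, train_share=0.75, seed=0):
    """Cut tokens into sections of roughly section_size words and split them at random."""
    rng = random.Random(seed)
    sections, start = [], 0
    while start < len(tokens):
        size = section_size + rng.randint(-jitter, jitter)
        sections.append(tokens[start:start + size])  # the final section may be shorter
        start += size
    rng.shuffle(sections)
    cut = round(len(sections) * train_share)
    return sections[:cut], sections[cut:]

tokens = [f"w{i}" for i in range(50000)]  # stand-in for one c. 50,000-word corpus
train, test = split_corpus(tokens)
print(len(train), "training sections,", len(test), "test sections")
```

Sampling whole sections rather than individual words keeps each error in its
original context in both the training and the test material.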
The third stage of our experiment links back to the DICER analysis described
earlier. We used the DICER tool to extract by hand a set of letter replacement rules
observed in the manual corrections. We extracted the most frequent (i.e. successful)
[Figure 1: line chart of French, German and Spanish recall (% tokens) plotted against sample tokens seen]
Figure 1. Training effect on VARD recall for individual languages
13. For types, recall scores of 14.5% for French, 18.4% for German and 15.8% for Spanish are
observed.
rules and added them to the set of letter replacement rules within VARD. The aim was to
improve the automatic analysis by increasing the likelihood of a correction suggested by
VARD being the same as the one introduced manually in the error tagging. The resulting
improvement can be seen in Table 9. Again, the precision is not affected, but the rate of
recall (i.e. number of spelling errors found and corrected) has improved. Around one
quarter of the learner spelling errors that are manually error tagged are now being found
by VARD and automatically corrected. Again, using each native language in isolation
improves performance; Figure 2 shows VARD’s recall during the training of each language
set, as in Figure 1. Here recall is further improved by introducing rules from specific
DICER analyses for individual native languages. Recall scores (in terms of tokens¹⁴)
of 21.8%, 29.6% and 20.8% are attained for French, German and Spanish respectively.
Precision is maintained at 100% for all languages throughout the training process.¹⁵
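The idea of letter replacement rules can be illustrated with a small sketch: each rule rewrites a letter sequence, and a rewritten form is kept as a candidate correction only if it is a known word. This is not VARD's implementation; the rules, the lexicon and the function name below are hypothetical and chosen only for illustration.

```python
def candidates(word, rules, lexicon):
    """Apply each (wrong, right) letter rule at every position in the word
    and keep only the resulting forms that appear in the lexicon."""
    found = set()
    for wrong, right in rules:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in lexicon:
                found.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(found)


# Hypothetical rules and lexicon, invented for illustration only.
rules = [("ie", "ei"), ("f", "ph"), ("l", "ll")]
lexicon = {"receive", "photo", "really"}
print(candidates("recieve", rules, lexicon))
print(candidates("realy", rules, lexicon))
```

Adding rules that reflect the errors of a particular L1 group enlarges the set of forms that can be matched, which is one way to read the recall gains reported here.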
Table 9. VARD recall and precision after manual addition of DICER rules
            Types %   Tokens %
Recall      19.1      23.4
Precision   87.7      90.8
Figure 2. Training effect on VARD recall after manual addition of DICER rules [line chart: French, German and Spanish recall (% tokens, 0–40) plotted against sample tokens seen (0–35,000)]
14. For types, recall scores of 21.8% for French, 22.5% for German and 20.6% for Spanish are
observed.
15. These perfect precision scores are tempered slightly by VARD introducing false positives
through the attempted standardisation of extra words detected as variants (as described for the
whole dataset in Table 7).
Compared with the approach used by Lefer & Thewissen (2007) and described in
Section 2, the results show that VARD is much more suitable for this automated task.
Lefer & Thewissen had to manually weed out around 58% of the unrecognised forms
suggested by the automatic process as candidate learner errors. VARD’s false positive
rate is around 5% as shown above.
5. Conclusion
In this paper, we have discussed previous work on spelling errors in learner data from
a variety of perspectives: second language acquisition, language teaching, psycholin-
guistics and educational research. The experiments described here draw on the areas
of computer-aided language learning where techniques from natural language pro-
cessing are applied to learner corpora and computer learner corpus research where
manual error tagging is still the norm.
The techniques employed by the VARD tool are highly accurate, with around 90%
precision, when correcting learner errors. Further research is required to improve the
detection of learner errors, since the recall shown is only around 23%. Specifically, new
techniques are required for the detection of learner spelling errors that can only be found
using contextual patterns. Techniques such as these do exist in spell checkers, but
learner research shows that different patterns emerge depending on the language
background and experience of the learner. We have shown the potential of NLP meth-
ods to contribute to the automatic error analysis of learner corpora. VARD can con-
tribute in at least two ways. First, it can assist a manual process of editing by suggesting
further learner spelling errors that are missed by a human analyst. Second, after the
manual correction of a sample corpus, VARD can be trained and run automatically
over the full corpus to generate larger amounts of data for analysis. Indirectly, through
computer learner corpus research, these results contribute to the improvement of spell
checking techniques for learners and allow the selection of corpus data for
L1-specific spelling and morphology exercises.
Seventeen years ago, Granger & Meunier (1994) suggested the idea of a grammar
checker for learners of English. Much more research is required, but we offer the experi-
mental results and VARD tool presented here as a partial contribution to this endeavour.
Acknowledgements
We are grateful to Jennifer Thewissen who provided the error-tagged data for our ex-
periments and commented on a draft of this paper.
References
Archer, D., McEnery, T., Rayson, P. & Hardie, A. 2003. Developing an automated semantic anal-
ysis system for Early Modern English. In Proceedings of the Corpus Linguistics 2003 Confer-
ence [UCREL Technical Paper Number 16], D. Archer, P. Rayson, A. Wilson & T. McEnery
(eds), 22 – 31. Lancaster: UCREL, Lancaster University.
Baron, A., Rayson, P. & Archer, D. 2009a. Automatic standardization of spelling for historical
text mining. In Proceedings of Digital Humanities 2009, Maryland, USA, 309–312. College
Park MD: University of Maryland.
Baron, A., Rayson, P. & Archer, D. 2009b. Word frequency and key word statistics in historical
corpus linguistics. Anglistik: International Journal of English Studies 20(1): 41–67.
Baron, A. & Rayson, P. 2008. VARD 2: A tool for dealing with spelling variation in historical
corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics. Birmingham:
Aston University.
Baron, A. & Rayson, P. 2009. Automatic standardisation of texts containing spelling variation:
How much training data do you need? In Proceedings of Corpus Linguistics 2009, University
of Liverpool, UK, July 2009. Liverpool: University of Liverpool.
Bebout, L. 1985. An error analysis of misspellings made by learners of English as a first and as a
second language. Journal of Psycholinguistic Research 14(6): 569–593.
Cook, V.J. 1997. L2 users and English spelling. Journal of Multilingual and Multicultural Devel-
opment 18(6): 474–488.
Dagneaux, E., Denness, S. & Granger, S. 1998. Computer-aided error analysis. System: An
International Journal of Educational Technology and Applied Linguistics 26(2): 163–174.
Dagneaux, E., Denness, S., Granger, S., Meunier, F., Neff, J. & Thewissen, J. 2005. Error Tagging
Manual Version 1.2. Louvain-la-Neuve: Centre for English Corpus Linguistics, Université
Catholique de Louvain.
Figueredo, L. 2006. Using the known to chart the unknown: A review of first-language influence
on the development of English-as-a-second-language spelling skill. Reading and Writing
19(8): 873–905.
Gamon, M., Gao, J., Brockett, C., Klementiev, A., Dolan, W., Belenko, D. & Vanderwende, L.
2008. Using contextual speller techniques and language modeling for ESL error correction.
In Proceedings of IJCNLP, Hyderabad, India, Asia Federation of Natural Language Process-
ing, January 2008, 449–456.
Gamon, M., Leacock, C., Brockett, C., Dolan, W., Gao, J., Belenko, D. & Klementiev, A. 2009. Using
statistical techniques and web search to correct ESL errors. CALICO Journal 26(3): 491–511.
Granger, S. 1999. Use of tenses by advanced EFL learners: Evidence from an error-tagged
computer corpus. In Out of Corpora – Studies in Honour of Stig Johansson, H. Hasselgård & S.
Oksefjell (eds), 191–202. Amsterdam: Rodopi.
Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Jour-
nal 20(3): 465–480.
Granger, S. & Meunier, F. 1994. Towards a grammar checker for learners of English. In Creating
and Using English Language Corpora: Papers from the Fourteenth International Conference
on English Language Research on Computerized Corpora, Zürich 1993, U. Fries, G. Tottie &
P. Schneider (eds), 79–91. Amsterdam: Rodopi.
Granger, S. & Wynne, M. 1999. Optimising measures of lexical variation in EFL learner corpora.
In Corpora Galore, J. Kirk (ed.), 249–257. Amsterdam: Rodopi.
Granger, S. & Thewissen, J. 2005a. Towards a reconciliation of a ‘Can Do’ and ‘Can’t Do’ approach
to language assessment. Paper presented at the Second Annual Conference of EALTA,
2nd–5th June 2005, Voss, Norway.
Granger, S. & Thewissen, J. 2005b. The contribution of error-tagged learner corpora to the
assessment of language proficiency. Paper presented at the 2005 Language Testing Research
Colloquium, July 20th–22nd 2005, Ottawa, Canada.
Gries, S. & Myslin, M. 2009. k dixez? A corpus study of Spanish Internet Orthography. In Pro-
ceedings of Corpus Linguistics 2009, Liverpool, July 2009. Liverpool: University of Liverpool.
Hovermale, D. 2008. SCALE: Spelling correction adapted for learners of English. In Proceedings
of the CALICO-08 ICALL Special Interest Group Pre-conference Workshop, San Francisco,
CA, USA.
Hovermale, D. & Mahey, D. 2009. Real-word spelling correction for CALL. Presented at the
Sixth Midwest Computational Linguistics Colloquium, MCLC-6, May 2009, Indiana Uni-
versity, Bloomington, USA.
Ibrahim, M. H. 1978. Patterns in spelling errors. English Language Teaching Journal 32: 207–212.
Lee, J.S.Y. 2009. Automatic Correction of Grammatical Errors in Non-native English Text. PhD
dissertation, Massachusetts Institute of Technology.
Lefer, M.-A. & Thewissen, J. 2007. Orthographic and morphological errors in learner writing.
Presented at ICAME 2007, Stratford-upon-Avon, UK.
Meunier, F. & Granger, S. (eds) 2008. Phraseology in Foreign Language Learning and Teaching.
Milton, J. 2004. Mark my words: Technologies for supporting, managing and responding to
student writing. In Proceedings of the Second Teaching and Learning Symposium, Hong Kong,
May 17, 2004, Senate Committee on Teaching and Learning Quality, and Center for En-
hanced Learning and Teaching, HKUST, Hong Kong.
Milton, J. & Chowdhury, N. 1994. Tagging the interlanguage of Chinese learners of English. In
Proceedings of Joint Seminar on Corpus Linguistics and Lexicology, Guangzhou and Hong
Kong, 19–22 June, 1993, Language Centre, HKUST, Hong Kong, 127–143.
Nicholls, D. 2003. The Cambridge Learner Corpus – error coding and analysis for lexicography
and ELT. In Proceedings of Corpus Linguistics 2003, Lancaster University, UK, 572–581.
Rayson, P., Archer, D., Piao, S. L. & McEnery, T. 2004. The UCREL semantic analysis system. In
Proceedings of the Workshop on Beyond Named Entity Recognition Semantic Labelling for
NLP Tasks in Association with 4th International Conference on Language Resources and
Evaluation (LREC 2004), 25th May 2004, Lisbon, Portugal, 7–12.
Rayson, P., Archer, D. & Smith, N. 2005. VARD versus word: A comparison of the UCREL vari-
ant detector and modern spell checkers on English historical corpora. In Proceedings of
Corpus Linguistics 2005, University of Birmingham, UK, July 2005.
Rayson, P., Archer, D., Baron, A., Culpeper, J. & Smith, N. 2007. Tagging the bard: Evaluating the
accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Cor-
pus Linguistics 2007, July 27–30, University of Birmingham, UK.
Rimrott, A. & Heift, T. 2008. Evaluating automatic detection of misspellings in German. Lan-
guage Learning & Technology 12(3): 73–92.
van Rooy, B & Schäfer, L. 2002. Southern African Linguistics and Applied Language Studies 20(4):
325–335.
Tono, Y. 2003. Learner corpora: Design, development and applications. In Proceedings of Corpus
Linguistics 2003, Lancaster University, UK, 800–809.
Wade-Woolley, L. & Siegel, L. 1997. The spelling performance of ESL and native speakers of
English as a function of reading skill. Reading and Writing 9(5–6): 387–406.
Wang, M. & Geva, E. 2003. Spelling acquisition of novel English phonemes in Chinese children.
Reading and Writing: An Interdisciplinary Journal 16: 325–348.
Zutell, J. & Allen, V. 1988. The English spelling strategies of Spanish-speaking bilingual children.
TESOL Quarterly 22(2): 333–340.
Data mining with learner corpora
Choosing classifiers for L1 detection
Scott Jarvis
This paper discusses the usefulness of machine-learning techniques for the
investigation of cross-linguistic influence in learner corpora, and focuses on
an approach known as supervised classification. Within this approach, one of
the challenges that researchers face is deciding which particular method – or
classifier – to use for a particular task. The classification task that this paper
deals with is the detection of native language-related patterns in samples of
second language writing. The empirical portion of
this paper compares 20 classifiers in relation to their ability to perform this
task with second language texts written by learners from 12 different native
language backgrounds on the basis of their use of words and word sequences
(or n-grams).
1. Introduction
An important characteristic of corpus analysis – including learner corpus analysis – is
its heavy reliance on computer automation for purposes of discovering patterns in the
data. Because of the size and complexity of most language corpora, it would be infea-
sible to perform comprehensive analyses of the data without computer automation.
Automated processes of searching for and extracting newly discovered information
from large databases, such as language corpora, are often referred to as data mining,
and the computer-based tools available for this type of information retrieval are be-
coming increasingly varied and sophisticated.
One of many approaches to data mining involves what is known as classification,
which can be further divided into unsupervised classification and supervised classifi-
cation. The purpose of unsupervised classification in the case of corpus analysis is to
identify clusters of texts that have similar contents in relation to a number of textual
features (or variables), such as the relative frequencies of specific letters, morphemes,
words, word classes, syntactic constructions, semantic relations, and/or ratios or other
types of indices that reflect various aspects of the contents of a text. The clustering or
identification of sets of texts with similar contents can lead to new discoveries
concerning the factors at play in the texts or among the people who produced the texts.
For example, a study by Jarvis et al. (2003) used an unsupervised classification tool
known as cluster analysis to examine the linguistic similarities and differences that can
be found across highly rated learner texts. The results of the study revealed multiple
clusters of highly rated texts, which indicated multiple profiles of effective second lan-
guage (L2) writing. One of the profiles, for instance, involved a high level of lexical
diversity, a high use of nouns, and a high use of prepositions, whereas another profile
involved a low use of nouns and prepositions but a high use of adverbials and present
tense verbs. Among other things, this study provided an indication of the combina-
tions of variables that work together in successful L2 texts, and also showed that there
are multiple alternative routes to successful L2 writing. This study used cluster analy-
sis, but some of the other tools available for unsupervised classification include statisti-
cal procedures known as Independent Component Analysis and neural network mod-
els (Hinton & Sejnowski 1999; Duda et al. 2000; Hyvärinen & Oja 2000; Kotsiantis &
Pintelas 2004). For convenience, computer programs that perform unsupervised clas-
sification are often referred to simply as clusterers (Witten & Frank 2005).
Supervised classification, in turn, is a form of machine learning where a computer
program learns to recognize patterns associated with predefined classes. The term su-
pervised refers to the fact that, in this type of machine learning, the computer program
is not designed to discover classes (i.e. groups of cases) on its own, but is told what the
relevant classes are and is directed to discover the patterns in the data that are most
distinctive of those particular classes. When used with a corpus of texts, supervised
classification performs its learning on training data (i.e. a subset of the corpus) that
include not only the types of textual features used in unsupervised classification,
but also predefined class labels associated with each text. In some
cases, the labels represent text variables, such as the topics of the texts or the genres
they represent. In other cases, the labels represent attributes of the authors who pro-
duced the texts, such as their gender, nationality, or attitude toward the topic. The
purpose of supervised classification is to discover patterns among the textual features
fed into the program that may be predictive of the class labels associated with the texts.
After the program has constructed a predictive model on the basis of the training data,
the model is applied to a further set of texts whose labels are withheld from the classi-
fier in order to determine how generalizable the model is – to determine how accu-
rately it is able to predict the class memberships of texts whose classes are unknown
(to the classifier, at least). High levels of classification accuracy are indicative of two
things: (a) that the data do indeed contain patterns associated with the class labels in
question (i.e. the program could not learn to detect these class-related patterns cor-
rectly if there were nothing in the data to learn), and (b) that the program itself is ef-
fective in discovering these patterns.
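The train-then-test procedure described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the learner is a toy one-feature threshold rule for exactly two classes, and all names and data are hypothetical. Any real classifier could take the place of `train_threshold`.

```python
def holdout_accuracy(cases, train, n_train):
    """cases: list of (features, label). Train on the first n_train cases
    and report accuracy on the remaining cases."""
    training, testing = cases[:n_train], cases[n_train:]
    predict = train(training)
    # The model sees only the features; the withheld labels are used for scoring.
    hits = sum(predict(features) == label for features, label in testing)
    return hits / len(testing)


def train_threshold(training):
    """Toy learner for two classes: cut one feature at the midpoint
    between the two class means."""
    by_label = {}
    for (value,), label in training:
        by_label.setdefault(label, []).append(value)
    (low_label, low_mean), (high_label, high_mean) = sorted(
        ((label, sum(v) / len(v)) for label, v in by_label.items()),
        key=lambda pair: pair[1],
    )
    cut = (low_mean + high_mean) / 2
    return lambda features: high_label if features[0] > cut else low_label


# Invented data: one feature value per text, two classes.
cases = [((2.0,), "A"), ((8.0,), "B"), ((3.0,), "A"), ((9.0,), "B"),
         ((1.0,), "A"), ((7.0,), "B"), ((6.0,), "A"), ((10.0,), "B")]
accuracy = holdout_accuracy(cases, train_threshold, n_train=4)
print(accuracy)  # 3 of the 4 withheld cases are classified correctly: 0.75
```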
A recent study by Crossley & McNamara (2009) provides a clear example of how
supervised classification can be used to discover patterns associated with different
groups of writers. The study examined argumentative essays written in English by
both English-speaking and Spanish-speaking university students. The purpose of the
study was to determine the ways in which “L2 writers of English differ from L1 writ-
ers in their use of lexical cohesive devices and other lexical features” (p. 123). The
predefined classes in this study were thus native versus nonnative, and the authors set
out to determine how accurately the class membership of each text could be pre-
dicted on the basis of the use of various lexical and cohesion-related features
(e.g. average levels of hypernymy and polysemy, argument overlap, the use of causal
verbs and motion verbs). To conduct their analysis, the researchers used a supervised
classification tool known as Discriminant (Function) Analysis. During the training
phase of their analysis, they fed the class labels (i.e. native or nonnative) and the fea-
ture values (e.g. hypernymy values, polysemy values) for half of the texts into the
Discriminant Analysis program. During this phase, the computer program created a
statistical model of the relationship between features and classes. Then, during the
testing phase, the researchers applied that statistical model to the other half of the
texts to determine how well it could predict whether they were written by native or
nonnative speakers of English. The results showed that a statistical model based on
just 7 features provided the highest degree of classification accuracy, classifying cor-
rectly 79.10% of the texts that were held back for the testing phase. The strength and
clarity of these results pointed to relatively reliable differences between English-
speaking and Spanish-speaking university students in relation to the levels of lexical
depth of knowledge, lexical variation, and lexical sophistication found in their argu-
mentative writing.
I mentioned earlier that computer programs used for unsupervised classification
are often referred to simply as clusterers. By contrast, computer programs used for
supervised classification purposes are often referred to as classifiers. One such classi-
fier is Discriminant Analysis – mentioned in the preceding paragraph – and other
classifiers include statistical programs such as Support Vector Machines, Bayesian
classification models, rule-based models, and decision-tree models, among many
others (Witten & Frank 2005; Kotsiantis 2007). Classifiers are used not only for text
classification purposes, but also for a vast range of classification purposes in many
other fields, such as medical research, where they are used in the identification of
predictors of diseases like cancer (e.g. Shen et al. 2007), and geophysical research,
where they are used to create maps of land use based on data from satellite-borne sensors
(e.g. Liu et al. 2003). A number of the discoveries and developments regarding
classifiers correspondingly come from other fields (e.g. Molinaro et al. 2005), and several
studies referred to in this paper thus come from outside of linguistics and
language-related research.
The topic of the present chapter is the value of (supervised) classifiers for second
language research with learner corpora. Such tools are already in widespread use in
fields such as stylometry, literary analysis, and information science, and they are
likely to become increasingly common in the analysis of learner corpora, too.
Whereas traditional second language research tends to examine individual language
features (e.g. subject-verb agreement) in relation to the central tendencies of groups
of learners defined according to a given criterion (e.g. proficiency), classifiers high-
light the ways in which multiple language features work together in the language use
of individuals sharing particular background characteristics. In other words, tradi-
tional second language research tends to examine (a) one language feature at a time
(b) in relation to group tendencies, whereas classifier-driven research examines (c)
constellations of features in (d) the language use of individuals belonging to certain
groups. Regarding (b) and (d), where traditional second language research and
learner corpus research tend to rely on group means and/or overall frequencies of
occurrence, classifier-driven research tends to rely instead on classification accuracy
– i.e. the percentage of individuals within each group whose use of a particular bun-
dle of features is group-distinctive in the sense of being both representative of a
particular group (i.e. group-representative) and also relatively unique to that group
(i.e. group-specific). The classifier-driven approach offers a clearer picture of how
well learners’ group memberships can be predicted on the basis of their language
behaviors.
In recent papers, I have emphasized the value of classifiers for the investigation
of cross-linguistic influence (or language transfer), and I have referred to the use of
such means for the collection of evidence for cross-linguistic effects as the detection-
based approach (Jarvis 2010; Jarvis forthcoming). As I describe in these papers, the
detection-based approach complements the more traditional comparison-based ap-
proach by offering an alternative argument for cross-linguistic influence, which is
that the ability to identify learners’ native languages (L1s) accurately from their pat-
terns of language use is prima facie evidence for L1 effects. Such evidence is not
necessarily incontestable, as even in the legal sense the term prima facie evidence
means that the case is not closed but that the evidence is judged to be strong enough
to be presented to a jury, or alternatively that the evidence compels a particular
conclusion unless counterevidence is available to rebut it (e.g. Herlitz 1994–1995:
395). Accordingly, a fully rigorous detection-based argument for the presence of L1
effects requires not only L1 detection accuracies that are significantly above the
level of chance, but also the presentation of facts that rule out the possibility that the
observed L1 detection accuracies may have resulted from factors other than L1 in-
fluence. When such confounds cannot be ruled out, other means will be needed to
achieve argumentative rigor. Yet, even when argumentative rigor is not achieved
through detection-based argumentation alone, the detection-based approach still
plays an important role in leading researchers to possible cases of cross-linguistic
influence that can later be scrutinized from the perspectives of both detection-based
and comparison-based evidence (see Jarvis 2000; Jarvis & Pavlenko 2008; Jarvis
2010) – similar to how prima facie evidence in a legal context is the impetus for a
full trial.
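One way to check whether a detection accuracy is above the level of chance is a one-sided binomial test. The sketch below assumes equally sized L1 groups, so that chance accuracy is one divided by the number of groups; the counts are hypothetical and serve only to show the arithmetic.

```python
from math import comb


def binomial_tail(hits, n, p):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


n_groups = 12                  # number of L1 backgrounds
chance = 1 / n_groups          # assumes equally sized groups
n_texts, n_correct = 120, 30   # hypothetical counts, for illustration only
p_value = binomial_tail(n_correct, n_texts, chance)
print(f"chance={chance:.3f} accuracy={n_correct / n_texts:.3f} p={p_value:.2e}")
```

A small p-value shows only that accuracy exceeds chance. As noted above, it does not rule out factors other than L1 influence.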
The detection-based approach to transfer research is highlighted in a forthcoming
collection of studies dealing with issues of argumentative rigor and the strength of
evidence for L1 influence derivable from classifiers applied to learner corpora (see
Jarvis & Crossley forthcoming). Given that those issues are addressed at length in that
volume, I will not deal with them further in this paper. However, those studies do raise
a separate question that I will attempt to address relatively comprehensively in the
present chapter. It is the question of which of the many available classifiers is the most
useful for this type of research in terms of the levels of classification accuracy it
achieves, the interpretability of its output, and the practicality of its use. As I will dis-
cuss shortly, the answer to this question is complicated by the fact that the usefulness
of classifiers appears to be heavily dependent on how many and which specific lan-
guage features are being investigated, what the specific relationship is between those
features, and what their specific relationship is with the class variable that the re-
searcher is trying to predict. The studies in the Jarvis & Crossley (forthcoming) vol-
ume all use Linear Discriminant Analysis (LDA) for classification purposes, and the
main purpose of the present paper is to determine how LDA compares with other clas-
sifiers in terms of its ability to learn to recognize L1-related patterns among the fea-
tures under investigation. Because each study in the Jarvis & Crossley volume investi-
gates a different set of language features in relation to learners’ L1 backgrounds,
however, it is possible that the best performing classifier could be different for each
study. Given that it is impractical in this paper for me to conduct a thorough com-
parison of classifiers with the data used in each of those studies, I will restrict my focus
to the study in that volume that deals with the greatest number of features and the
greatest number of L1 backgrounds. The study in question is an investigation of the
ability of LDA to identify L1-related patterns in the use of 722 word n-grams (or multi-
word sequences) in L2 essays extracted from the International Corpus of Learner Eng-
lish (ICLE) that were produced by learners of English from 12 L1 backgrounds (Jarvis
& Paquot forthcoming). The relevant details of that study will be described in
Section 3.2, after I have discussed the types of classifiers that are available and how
they work, and after I have reviewed empirical studies that have compared their per-
formance in general, as well as studies that have examined their ability to detect learn-
ers’ L1 backgrounds in particular.
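For readers unfamiliar with word n-grams as features, the following minimal sketch counts multi-word sequences in a tokenized text and converts the counts to relative frequencies. The normalisation per 1,000 tokens, the whitespace tokenization and the function name are assumptions made for illustration and do not describe how the features in the study were computed.

```python
from collections import Counter


def ngram_relative_frequencies(tokens, n, per=1000):
    """Relative frequency of each word n-gram per `per` tokens of text."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count * per / len(tokens) for gram, count in grams.items()}


# Invented example text.
tokens = "on the other hand it is on the other side".split()
freqs = ngram_relative_frequencies(tokens, n=2)
print(freqs[("on", "the")])  # 2 occurrences in 10 tokens: 200.0 per 1000
```

Each text is then represented by one such value per n-gram, and these values are the features passed to the classifier.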
2. Classifiers
It is important to point out at the beginning of this section that the term classifier is
alternately used in the literature with two different meanings: a general sense and a
specific sense. In its general sense, the term refers to a computer program (e.g. a pro-
gram that performs LDA) that has been designed to construct a predictive model of
the relationship between features (e.g. language features) and a class variable (e.g. L1
background). In its specific sense, it refers to a specific model that has been construct-
ed by such a program. In the present paper, I will use this term in its general sense, and
will use the term model to refer to the more specific meaning.
2.1 Types of classifiers
The number of existing classifiers is large and quickly expanding, which makes it in-
feasible to present a full catalog. In this section, I will briefly describe the major types
of classifiers in relation to the kinds of algorithms they rely on. This information is also
summarized in Table 1, along with examples of some of the prominent classifiers with-
in each category. More information about each type of classifier can be found in
Appendix 1.
Centroid-based classifiers create statistical models in which each case (e.g. text) is
mathematically represented as if it existed as a point in multi-dimensional space. The
classifier also determines the exact center, or centroid, of all points belonging to the
same class (e.g. L1 group). During the testing phase, the classifier measures the dis-
tance between each case and each centroid, and classifies each case as a member of the
class whose centroid it is closest to. Boundary-based classifiers are similar to centroid-
based classifiers, except that instead of relying on group centroids, they attempt to
determine boundaries between each cluster of points, and then classify each case as
belonging to the class whose boundaries it falls within.
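The centroid-based principle described above can be sketched as a nearest-centroid classifier. This is a simplified illustration: it uses unweighted Euclidean distance, whereas methods such as LDA also take the covariance structure of the features into account. The labels and feature values are invented.

```python
from math import dist


def fit_centroids(training):
    """training: list of (feature_vector, label) -> {label: centroid}."""
    grouped = {}
    for vector, label in training:
        grouped.setdefault(label, []).append(vector)
    # The centroid is the mean of each feature across the texts of a class.
    return {
        label: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def classify(vector, centroids):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


# Invented two-feature data for two hypothetical L1 groups.
training = [((1.0, 2.0), "L1_X"), ((2.0, 1.0), "L1_X"),
            ((8.0, 9.0), "L1_Y"), ((9.0, 8.0), "L1_Y")]
centroids = fit_centroids(training)
print(classify((2.0, 3.0), centroids))
print(classify((7.0, 7.0), centroids))
```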
Bayesian classifiers use a simple algorithm for calculating the probability that a
particular case (e.g. text) is a member of a particular class (e.g. L1 group). These
Table 1. Major types of classifiers
Type of classifier           Examples
Centroid-based               Linear Discriminant Analysis (LDA)
                             Quadratic Discriminant Analysis (QDA)
                             Nearest Shrunken Centroids (NSC)
Boundary-based               Support Vector Machines (SVM)
                             Optimal Separating Hyperplanes (OSH)
                             Sequential Minimal Optimization (SMO)
Bayesian classifiers         Naïve Bayes (NB)
                             Naïve Bayes Multinomial (NBM)
                             Complement Naïve Bayes (CNB)
Artificial neural networks   Multilayer Perceptron Analysis (MPA)
                             Radial Basis Function (RBF)
Decision trees               Simple CART (SC)
                             Random Forest (RF)
Rule-based                   Conjunctive Rule Classifier (CRC)
                             RIPPER (RIP)
Means-based                  Delta
                             Delta Prime (DP)
Composite                    Naïve Bayes Tree (NBT)
                             Classification via Regression (CVR)
                             LogitBoost (LB)
probabilities are calculated from feature values, such that a higher or lower frequency
of a particular feature (e.g. the definite article) will either raise or lower the probability
that a case belongs to a particular class. Classification is performed by aggregating the
probabilities for all features, and by classifying a case as belonging to the class for
which it has the highest probability. Whereas Bayesian classifiers treat features as if
they are independent of one another, artificial neural networks treat them as being
interrelated. Artificial neural networks also assign different weights to different fea-
tures in order to improve classification accuracy.
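The Bayesian principle of aggregating per-feature probabilities can be sketched as a small multinomial Naïve Bayes classifier. The add-one smoothing and the use of log probabilities are common implementation choices assumed here and are not taken from the text. The data and labels are invented.

```python
from collections import Counter
from math import log


def train_nb(training):
    """training: list of (tokens, label).
    Returns class counts, per-class feature counts and the vocabulary."""
    class_counts = Counter(label for _, label in training)
    feature_counts = {label: Counter() for label in class_counts}
    for tokens, label in training:
        feature_counts[label].update(tokens)
    vocabulary = {t for counts in feature_counts.values() for t in counts}
    return class_counts, feature_counts, vocabulary


def classify_nb(tokens, model):
    """Return the class with the highest summed log probability."""
    class_counts, feature_counts, vocabulary = model
    total_cases = sum(class_counts.values())
    scores = {}
    for label, counts in feature_counts.items():
        denominator = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        score = log(class_counts[label] / total_cases)        # class prior
        for token in tokens:
            if token in vocabulary:
                # Each feature contributes independently of the others.
                score += log((counts[token] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training texts for two hypothetical L1 groups.
training = [(["the", "the", "a", "of"], "L1_X"), (["the", "of", "the", "the"], "L1_X"),
            (["a", "a", "of", "in"], "L1_Y"), (["in", "a", "of", "a"], "L1_Y")]
model = train_nb(training)
print(classify_nb(["the", "the", "of"], model))
print(classify_nb(["a", "in"], model))
```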
Decision trees and rule-based classifiers are similar to each other in that both tend
to rely on the flow-chart principle, where individual cases are sorted into classes
through sequences of decisions. The decisions at each step can usually be stated as if-
then statements (e.g. if the relative frequency of the definite article in this text is be-
tween 95 and 110 occurrences per 1000 words, then proceed to branch/rule X). Means-
based classifiers, on the other hand, are more similar to Bayesian classifiers in the sense
that they involve adding up values associated with individual features, and classifying
cases as belonging to the group with the most similar aggregate value. The difference
between Bayesian classifiers and means-based classifiers is that the former rely on
probabilities calculated for each feature, whereas the latter rely on z-scores (or stan-
dardized values) associated with each feature. Finally, composite classifiers (which are
sometimes referred to as meta classifiers or ensemble classifiers) involve a combina-
tion of more than one type of classification, such as the combination of Bayesian prob-
abilities with a decision-tree algorithm.
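The means-based idea of comparing standardized feature values can be sketched as follows. The example is written in the spirit of Delta-type measures: it averages the absolute differences between the z-scores of a text and those of each group. The exact procedures behind Delta and Delta Prime may differ, and all names and data here are hypothetical.

```python
from statistics import mean, pstdev


def delta_classify(text, groups):
    """text: feature -> relative frequency; groups: label -> list of such dicts.
    Returns the label with the smallest mean absolute z-score difference,
    together with the score for every label."""
    features = sorted(text)
    pooled = [t for texts in groups.values() for t in texts]
    # Standard deviation of each feature across all training texts (1.0 if constant).
    sd = {f: pstdev([t[f] for t in pooled]) or 1.0 for f in features}
    deltas = {}
    for label, texts in groups.items():
        group_mean = {f: mean(t[f] for t in texts) for f in features}
        # The difference of two z-scores cancels the overall mean: |a - b| / sd.
        deltas[label] = mean(abs(text[f] - group_mean[f]) / sd[f] for f in features)
    return min(deltas, key=deltas.get), deltas


# Invented relative frequencies for two hypothetical L1 groups.
groups = {
    "L1_X": [{"the": 60.0, "of": 30.0}, {"the": 64.0, "of": 28.0}],
    "L1_Y": [{"the": 40.0, "of": 38.0}, {"the": 44.0, "of": 36.0}],
}
text = {"the": 58.0, "of": 31.0}
label, deltas = delta_classify(text, groups)
print(label, deltas)
```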
2.2 Feature selection and parameter tuning
Classification accuracy is often enhanced by reducing the number of features in the
model. This not only improves the efficiency of the classifier, but also removes unnec-
essary complexity from the model. In many cases, overall classification accuracy can
be improved by removing features that do not contribute to the predictive power of the
model. Some classifiers also have restrictions on the number of cases that are needed
per feature. For example, Linear Discriminant Analysis (LDA) rests on parametric as-
sumptions for multivariate statistical tests, which, according to the old rule of thumb,
require at least 10 cases per variable (i.e. 10 texts per feature). Other scholars have
proposed even more rigid requirements for LDA, calling for at least five times as many
cases per class as there are features (Burns & Burns 2008: 591) or at least 20 times as
many cases in the overall training data as there are features (Field 2005). Under all of
these circumstances, it is desirable if not absolutely necessary to limit the number of
features on which a classifier will build its model of class membership.
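As a rough worked example, the three rules of thumb just mentioned can be turned into upper limits on the number of features. The sketch below uses 2,033 texts overall and 86 texts in the smallest class, which are the figures reported for the Jarvis & Paquot data later in this chapter. Reading the per-class rule as applying to the smallest class is an assumption, and the function name is hypothetical.

```python
def max_features(n_cases, smallest_class_size):
    """Upper limits on the number of features under three rules of thumb."""
    return {
        "10 cases per feature": n_cases // 10,
        "5 cases per class per feature": smallest_class_size // 5,
        "20 cases per feature": n_cases // 20,
    }

limits = max_features(n_cases=2033, smallest_class_size=86)
for rule, limit in limits.items():
    print(f"{rule}: at most {limit} features")
```

The stricter rules reduce the permissible feature set considerably, which is one reason automated feature selection is attractive with classifiers such as LDA.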
Ultimately, features should be selected and prioritized on the basis of theoretical
criteria. However, sometimes this is not possible or desirable, such as in the case of
exploratory research whose purpose is the pre-theoretical, empirical discovery of
which features (or combination of features) are most predictive of membership in
given classes. Multiple automated options for feature selection have been developed
for these types of exploratory purposes. With respect to LDA, statistical software ap-
plications such as SPSS include stepwise options for selecting features according to
how much a particular feature adds to the strength of the model. The stepwise method
first chooses the feature that is most strongly correlated with the class (or grouping)
variable, then it chooses the feature with the next highest unique correlation with the
class variable, and so on until no further feature adds significantly to the model
(McLachlan 2004: 412; Burns & Burns 2008: 604–605). Options also exist for adjusting
the alpha level used to define significance and for choosing different indices of model
strength. Choosing the best set of features for a particular analysis usually involves a
good deal of experimentation with different options, as well as the use of cross-valida-
tion, which I will discuss in the following section.
There are three general methods for performing feature selection: wrappers, filters,
and embedded methods. According to Guyon & Elisseef (2003: 1166), “wrappers utilize
the learning machine of interest as a black box to score subsets of variable [sic] accord-
ing to their predictive power. Filters select subsets of variables as a pre-processing step,
independently of the chosen predictor. Embedded methods perform variable selection
in the process of training and are usually specific to given learning machines” (empha-
ses in the original). I have only rarely encountered the use of wrappers in the classifica-
tion literature, but such methods are available in classification software such as Weka
(see Witten & Frank 2005).1 Examples of the use of filters are much more frequent, in-
cluding the use of measures such as information gain (a measure of entropy changes) or
the Gini index (a measure of statistical dispersion) for feature selection before the data
are submitted to a decision-tree classifier (see Raileanu & Stoffel 2004; Kotsiantis 2007:
252). An example of an embedded method is the use of the stepwise procedure in Linear
Discriminant Analysis (LDA), which I described earlier. There appears to be a consen-
sus that no one feature-selection method is consistently better than all others; instead,
different methods tend to work best in different contexts (e.g. Guyon & Elisseef 2003:
1178; Kotsiantis 2007: 252). For the researcher, this unfortunately means that a good
deal of experimentation with different methods may be in order. Under certain circum-
stances, some classifiers may perform optimally without removing any features at all.
This is particularly likely with classifiers such as Support Vector Machines (SVM) that
do not have strict restrictions on the ratio of features to cases and that deal well with
multidimensionality (i.e. the complexity resulting from a high number of features).
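To illustrate what a filter does, the following sketch ranks features by information gain before any classifier is trained. It is a simplified illustration rather than the procedure of any study cited here: splitting each continuous feature at its median is an assumption made to keep the example short, and the feature names and values are invented.

```python
from collections import Counter
from math import log2
from statistics import median

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting the cases at the feature's median."""
    cut = median(values)
    low = [y for x, y in zip(values, labels) if x <= cut]
    high = [y for x, y in zip(values, labels) if x > cut]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

def rank_features(rows, labels, names):
    """Order feature names from most to least informative."""
    scores = {name: information_gain(list(col), labels)
              for name, col in zip(names, zip(*rows))}
    return sorted(scores, key=scores.get, reverse=True)

# Invented relative frequencies for two hypothetical features.
rows = [[95, 3], [100, 9], [60, 4], [65, 8]]
labels = ["A", "A", "B", "B"]
print(rank_features(rows, labels, ["the", "of"]))
```

Because the ranking is computed independently of the classifier, the same ordered list can be passed to any of the classifier types discussed above.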
Regardless of whether or how feature selection is carried out, the classification ac-
curacy of a classifier can often be improved by adjusting – or tuning – its parameters.
An important parameter for LDA, for example, is prior probabilities – i.e. whether the
classifier should treat each class as being equal or whether it should calculate prior
probabilities from class sizes (e.g. the number of texts from each L1 background).
SVM, for its part, includes a complexity parameter that can be adjusted, and also dif-
ferent kernel options that determine how boundaries between classes are calculated.
Random Forest, in turn, includes parameter options for setting the maximum depth of
a tree (i.e. how many sequential decisions it will include), the number of randomly
selected features that will be included in each tree, and the number of trees it will gen-
erate (see, e.g. Witten & Frank 2005; Shen et al. 2007). In many cases, a classifier’s de-
fault settings will produce optimal results, but researchers need to be aware of what the
settings are and how they might affect the results. Experimenting with different pa-
rameter settings is often useful, of course, and fortunately some statistical software
suites (e.g. Weka) include automated means for testing multiple parameter settings
and choosing the one that results in the highest classification accuracy.
1. Wrappers determine all possible subsets of features from the given feature pool, and then
follow an algorithm for sampling these subsets and submitting them to the chosen classifier in
order to determine which combination of features results in the highest classification accuracy
for that particular classifier (see e.g. Kohavi & John 1997).
2.3 Cross-validation
One of the most important components of classification analysis is the use of cross-
validation (CV). In the simplest case, this involves splitting one’s data into a training
set and a testing set. During the training phase, the classifier is given both the feature
values and the class labels for all of the cases (e.g. texts) in the training set. The training
set is thus used to build the model of the relationship between features and classes.
Afterwards, during the testing phase, that model is applied to the feature values of each
of the cases in the testing set in order to determine how accurately the model is able to
predict the class memberships of new cases (i.e. cases that were not used to build the
model, and whose class labels are withheld from the model). The model’s classification
accuracy with the testing set is used as an indication of the model’s generalizability as
a predictor of class membership in the classes in question.
If both the training set and testing set are truly large, representative of their popu-
lations, and normally distributed, then this form of CV should be perfectly reliable.
When these conditions are not met, however, the specific way in which the data are
divided into training and testing sets can result in an accidental bias and an unstable
model. This means that the classification accuracy results would be different if the data
were divided differently, or if the testing set were used as the training set and vice
versa (Molinaro et al. 2005: 3306). A useful way to compensate for this problem is by
dividing the data into multiple sets and by iterating through several steps where each
data set is given its own turn of being held back as the testing set while all other sets
are combined to form the training set. This method is referred to as k-fold CV
(sometimes as v-fold CV), where k refers to the number of sets the data have been di-
vided into and likewise the number of training-testing iterations the CV will progress
through. The most common form of k-fold CV is 10-fold CV, which has been found to
be optimal with respect to both reliability and efficiency (cf. Molinaro et al. 2005: 3306;
Lecocke & Hess 2006: 315). In a 10-fold CV, the data are divided into 10 equally sized
subsets. In the first fold of the CV, the first subset is held back as the testing set, and the
other nine subsets are used as the training set. During the second fold of the CV, the
second subset is held back as the testing set, and the other nine are used as the training
set, and so forth. Each fold of the CV produces results concerning the number or per-
centage of cases in the testing set that were classified accurately. After all 10 folds of the
CV are completed, the final cross-validated accuracy is the overall percentage of cases
classified correctly across all 10 folds.
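The k-fold procedure can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the nearest-class-mean classifier is only a stand-in for whichever classifier is being evaluated, the data are invented, and the function names are hypothetical.

```python
import random

def kfold_indices(n_cases, k, seed=0):
    """Shuffle the case indices and deal them into k near-equal subsets."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, labels, k, fit, predict, seed=0):
    """Overall proportion of held-out cases classified correctly across all folds."""
    correct = 0
    for test_idx in kfold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)

# Stand-in classifier (an assumption): assign each case to the nearest class mean.
def fit_means(rows, labels):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {lab: [sum(c) / len(c) for c in zip(*rs)] for lab, rs in groups.items()}

def predict_means(model, row):
    return min(model,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, model[lab])))

# Invented, well-separated data: 10 cases per class.
rows = [[float(i), 0.0] for i in range(10)] + [[float(i) + 100.0, 0.0] for i in range(10)]
labels = ["A"] * 10 + ["B"] * 10
accuracy = cross_validate(rows, labels, k=10, fit=fit_means, predict=predict_means)
print(accuracy)
```

Each case is held out exactly once, so the returned value is the overall percentage described above rather than an average of per-fold percentages.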
In k-fold CV, the highest value that k can take is the number of cases in the data-
base. K-fold CV that treats each case as a separate subset of the data is referred to as
leave-one-out CV (LOOCV). In LOOCV, the number of folds in the CV is equal to the
number of cases. In each fold of LOOCV, only a single case is held back as the testing
set while all others are used as the training set to build the model. The final result of
LOOCV is, as with other forms of CV, the overall percentage of testing cases that have
been classified correctly. Although some researchers have pointed to potential prob-
lems with LOOCV (e.g. Gavin & Teytaud 2002), empirical comparisons between dif-
ferent forms of CV have generally shown 10-fold CV and LOOCV to be similarly su-
perior to other forms of CV, with 10-fold CV being the most efficient computationally
(e.g. Molinaro et al. 2005: 3306; Lecocke & Hess 2006: 315).
The process of CV takes on an extra degree of complexity in cases where feature
selection is performed in conjunction with classification. This is because automated
feature selection capitalizes on even randomly occurring strong statistical relation-
ships in the data. This results in a model that is overfitted to the training set in relation
to the features included in the model. In order to avoid bias in the selection of features
for the model, and in order to avoid overly optimistic CV results regarding the ability
of the model to predict the class memberships of new cases, it is therefore necessary to
embed feature selection within each fold of the CV. This will not only give more real-
istic results regarding the predictive power of the model, but will also show, for ex-
ample, which features are selected across multiple folds of the CV, and therefore which
features are truly generalizable predictors of class membership. Feature selection that
is embedded within the folds of a CV has alternatively been referred to as honest, com-
plete, embedded, and nested CV (e.g. Molinaro et al. 2005: 3303; Lecocke & Hess 2006:
316). In this paper, I will refer to it as embedded CV, and it is noteworthy that both
embedded 10-fold CV and embedded LOOCV have been found to be reliable and opti-
mally unbiased estimates of classification accuracy (e.g. Molinaro et al. 2005).
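The essential point of embedded CV is that feature selection is repeated inside every fold and sees only that fold's training cases. The sketch below illustrates this with a deliberately simple filter and classifier. Both are stand-ins chosen for brevity (an assumption), not the stepwise LDA procedure discussed in this chapter, and the data are randomly generated.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def select_features(rows, labels, m):
    """Pick the m features whose class means differ most (in SD units)."""
    scores = []
    for j, col in enumerate(zip(*rows)):
        class_means = [mean(x for x, y in zip(col, labels) if y == lab)
                       for lab in set(labels)]
        scores.append(((max(class_means) - min(class_means)) / (pstdev(col) or 1.0), j))
    return [j for _, j in sorted(scores, reverse=True)[:m]]

def fit(rows, labels, keep):
    """Class means computed on the selected features only."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append([row[j] for j in keep])
    return {lab: [mean(c) for c in zip(*rs)] for lab, rs in groups.items()}

def predict(model, row, keep):
    x = [row[j] for j in keep]
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))

def embedded_cv(rows, labels, k, m, seed=0):
    """k-fold CV with feature selection redone inside every fold."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct, chosen = 0, Counter()
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tr_rows, tr_labels = [rows[i] for i in train], [labels[i] for i in train]
        keep = select_features(tr_rows, tr_labels, m)  # training cases only
        chosen.update(keep)                            # tally selections across folds
        model = fit(tr_rows, tr_labels, keep)
        correct += sum(predict(model, rows[i], keep) == labels[i] for i in fold)
    return correct / len(rows), chosen

# Random toy data: feature 0 separates the classes, features 1 and 2 are noise.
rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
rows += [[rng.gauss(10, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
labels = ["A"] * 20 + ["B"] * 20
accuracy, chosen = embedded_cv(rows, labels, k=10, m=1)
print(accuracy, dict(chosen))
```

The `chosen` tally shows how many folds selected each feature, which corresponds to the check on generalizable predictors mentioned above.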
3. Previous research
3.1 Which classifier is best?
A number of studies have performed comparisons of classifiers in order to determine
which classifier (with which set of features and which parameter settings) produces
the highest level of cross-validated classification accuracy. The results of these studies
have varied a great deal. One of the broadest comparisons of classifiers is a study by
Shen et al. (2007), which does not deal with language at all, but rather compares the
classification accuracy of nine classifiers in relation to their ability to predict the oc-
currence of liver cancer. Seven of the nine classifiers in the study are among those
listed in Table 1. These include Linear Discriminant Analysis (LDA), Quadratic Dis-
criminant Analysis (QDA), and Nearest Shrunken Centroids (NSC) (i.e. centroid-
based classifiers), Support Vector Machines (SVM) with a linear kernel and SVM with
a radial kernel (boundary-based classifiers), Simple CART (SC) and Random Forest
(RF) (decision trees), and LogitBoost (composite classifier). The other two classifiers
included in the study were K Nearest Neighbor (KNN) classification and a type of ar-
tificial neural network referred to by the authors as NNET. The data included 88 cases
(people), two classes (59 people with liver cancer, 29 people without liver cancer), and
30 features (mass spectrometry measures of biological samples taken from the partici-
pants). Because of the relatively small size of the database, the authors divided it into
only four subsets for CV purposes (i.e. 4-fold CV), and they ran the classifiers on the
full set of 30 features as well as on a smaller set of 17 features derived through a strict-
er threshold criterion (see p. 332). The results showed that SVM with a radial kernel
produced the highest cross-validated classification accuracies with both 30 features
and 17 features. In both cases, SVM with a radial kernel classified 67% of the partici-
pants correctly with respect to whether they had been diagnosed with liver cancer.
Because of restrictions regarding the necessary proportion of cases to features with
some classifiers, QDA could not be used with 30 features, but with 17 features it turned
out to be nearly as powerful as SVM, producing a classification accuracy of 66%. NSC
and RF also performed quite well, whereas LogitBoost, KNN, and SC performed quite
poorly in this particular classification task.
In the study just described, SVM produced the most predictive model of class mem-
bership, but the relative usefulness of this and other classifiers can differ a good deal
from one study to the next, and this is true not only between fields but also within fields
of research. In the literature on text classification, a study by Jockers & Witten (2010)
shows that SVM is the least effective out of five classifiers in identifying the authors of
texts in the Federalist Papers corpus. The five classifiers in the comparison were SVM,
KNN, NSC, Delta, and a form of Discriminant Analysis referred to as Regularized Dis-
criminant Analysis (RDA). There were three classes (i.e. Jay, Madison, and Hamilton,
the three authors of the Federalist Papers) and 2,907 features consisting of the relative
frequencies of the words and word bigrams (i.e. combinations of two words) that oc-
curred at least once in the training texts written by Jay, Madison, and Hamilton. A small-
er set of 298 features was also used, consisting of all words and word bigrams that have
an overall relative frequency of at least 0.05% (or 5 occurrences per 10,000 words) in the
Federalist Papers corpus as a whole. On the testing set of 70 texts of known origin, the
best 10-fold cross-validated performance was 100% classification accuracy, which was
achieved by (a) NSC using 718 of the original 2,907 features, (b) NSC using 199 of the
truncated set of 298 features, and (c) RDA using 312 of the original 2,907 features. The
lowest cross-validated accuracy was 86%, achieved by SVM using all 2,907 features, but
SVM achieved 94% accuracy when run on just the set of 298 features.
The differences between the two studies just described in relation to the ranking of
classifiers such as SVM could be due to many factors, such as differences in (a) the
number of features they dealt with, (b) the sizes of their training sets, and (c) the spe-
cific characteristics of the features and classes under investigation. From the perspec-
tive of (c), a study that is especially interesting and also directly relevant to the present
chapter is a study by Estival et al. (2007). In this study, the authors performed a com-
parison of a number of classifiers available in the Weka toolkit (see Witten & Frank
2005) in order to determine which worked best in each of a number of classification
tasks involving a database of 9,836 email messages written in English by 1,033 people
from 5 regions of the world who speak three different L1s. The eight classifiers in the
study included decision trees, a so-called lazy learner, a rule-based classifier, bound-
ary-based classifiers, and so-called meta classifiers. The authors extracted 689 features
from the data, which were divided into three categories: character-related (e.g. punc-
tuation frequencies, word length indices), lexical (e.g. relative frequencies of function
words and part-of-speech categories), and structural (e.g. paragraph breaks, presence
or absence of various HTML tags). The feature values for each text were fed into all
eight classifiers, and the classifiers were compared in relation to their ability to predict
class memberships for the following class variables: age, gender, L1 (Arabic, English,
Spanish), level of education, country of origin, agreeableness, conscientiousness, ex-
traversion, neuroticism, and openness. The classifiers were used in combination with
feature selection, parameter tuning, and 10-fold cross-validation in order to arrive at a
reliable and optimal solution for each classifier in each classification task. The results
showed that the very highest level of classification accuracy was achieved by Random
Forest (RF, a decision-tree classifier) in the task involving L1 prediction (84%) when it
was combined with an information-gain criterion for feature selection and when it also
involved removing features that were highly correlated with L1 (e.g. function words
that were used by speakers of only one L1). The second highest classification accuracy
was achieved by Sequential Minimal Optimization (SMO – a boundary-based adapta-
tion of Support Vector Machines) in the task involving country prediction (81%), and
SMO also performed best in relation to gender (69%) and age (56%); in all cases, SMO
achieved its best results when using all 689 features. The only other classifier to
achieve a cross-validated classification accuracy above 60% was Bagging (a meta clas-
sifier), which achieved an accuracy rate of 80% in the task involving education predic-
tion. It achieved this result when using all features except function words.
Together, these studies show that most of the major classifiers perform quite well
in certain circumstances, but no classifier is the best pattern learner in all classification
tasks. Unfortunately, it is not possible to predict in advance which classifier will per-
form best in a particular classification task, but an understanding of the characteristic
strengths and weaknesses of each classifier can nevertheless help in deciding which to
use for a particular purpose. The advantages and disadvantages of each type of classi-
fier are summarized by Kotsiantis (2007: 262–263). The author points out, for example,
that boundary-based classifiers (e.g. SVM, SMO) and neural networks (e.g. MPA) tend
to deal best with large numbers of features, especially when the features are continuous
variables. With fewer variables, Bayesian classifiers are particularly useful, and in cases
where the features involve a combination of discrete, binary, and continuous variables,
decision trees tend to be superior. Decision trees, Bayesian classifiers, and rule-based
classifiers are also advantageous with respect to interpretational transparency (i.e. how
easy it is to see which features lead to which predictions). When the researcher’s goal
is strictly classification accuracy and not interpretational transparency, Kotsiantis
points out that it may be best to rely on the majority vote of an ensemble of classifiers
(ibid.: 263). This is one of the methods that will be used in the empirical portion of this
chapter (see Section 5).
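A majority vote over an ensemble can be sketched as follows. This is a minimal illustration only: the three predictions are invented, and breaking ties in favour of the classifier listed first is an assumption, since the tie-breaking rule is not specified here.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers for a single case.

    Ties go to the tied label that appears first in the list (an assumption).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return next(label for label in predictions if counts[label] == top)

# Invented predictions from three hypothetical classifiers for one text.
votes = ["French", "Spanish", "French"]
print(majority_vote(votes))
```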
3.2 Previous studies on L1 detection
The first investigation of L1 classification I am aware of is Mayfield Tomokiyo & Jones
(2001). This study used a simple Bayesian classifier (NB) to examine whether lexical
features (word n-gram and part-of-speech frequencies) could be used to distinguish
English spoken texts produced by native speakers versus those produced by nonnatives
(Chinese and Japanese speakers). The authors also tested whether the classifier could
correctly distinguish the Chinese from the Japanese speakers. Due to the small sample
and apparent lack of controls, the results are probably overly optimistic, but they show
that the classifier was able to achieve 100% accuracy in distinguishing between the
samples produced by Chinese (n = 6) and Japanese speakers (n = 31). The study also
showed that the NB classifier performs much better with a carefully selected subset of
fewer than 100 features (selected on the basis of the information gain index and various
stopword lists) than with the full set of 4,800 features.
Other studies on L1 classification include, in chronological order, Jarvis et al.
(2004), Koppel et al. (2005), Tsur & Rappoport (2007), the Estival et al. (2007) study
described earlier, Wong & Dras (2009), and the studies in the Jarvis & Crossley (forth-
coming) volume. In the remainder of this section, I will focus on the four studies that
are most relevant to the purposes of the present chapter. These include Koppel et al.
(2005), Tsur & Rappoport (2007), Wong & Dras (2009), and Jarvis & Paquot
(forthcoming). All four studies examine a large number of texts drawn from the ICLE
corpus, and all four studies perform L1 prediction on the basis of a large number of
textual (including lexical) features.
The study by Koppel et al. (2005) used the Support Vector Machines (SVM) classi-
fier with a linear kernel in an attempt to predict the L1 backgrounds of English texts
written by speakers of five different L1s: Bulgarian, Czech, French, Russian, and Spanish.
The texts represent only a sample of the texts available in the ICLE, but are nevertheless
quite numerous, consisting of 258 texts per L1 group, for a total of 1,290 texts. The
features examined are also quite numerous, consisting of 1,035 features of the following
types: 400 function words, 200 frequent letter n-grams, 185 error types, and 250 rare
part-of-speech bigrams identified by Francis & Kucera (1982). The 10-fold cross-validat-
ed classification accuracy with all 1,035 features was 80%. Tests run with subsets of the
features showed that the 400 function words alone result in 75% classification accuracy,
and the 200 letter n-grams alone result in 71% classification accuracy. These results –
particularly the 80% achieved with all 1,035 features – are quite remarkable in light of the
fact that the number of classes (or L1s) is five, meaning that the level of chance is only
20% accuracy (i.e. given that the L1 groups are equally represented in the sample).
The study by Tsur & Rappoport (2007) replicates several aspects of the Koppel et
al. study but focuses especially on the ability of letter n-grams to distinguish L1 groups.
Tsur & Rappoport used the same classifier (SVM) with the same five L1 backgrounds,
but drew their own random sample of texts from the ICLE, which included 238 (instead
of 258) texts per L1, for a total of 1,190 texts. The researchers also used several of the
same features as were used by Koppel et al., but focused mainly on the effects of letter
bigrams and trigrams. Tsur & Rappoport appear not to have performed an overall clas-
sification using all features at the same time, but their results with different sets of
features show a 10-fold cross-validated accuracy of 67% with 460 function words, 66%
with the 200 most frequent letter bigrams, and 60% with the 200 most frequent letter
trigrams. Given that Tsur & Rappoport used essentially the same data and the same
classifier as Koppel et al. did, it is unclear why Tsur & Rappoport found lower accuracy
rates for similar sets of features. Three possible explanations are that (a) Tsur &
Rappoport may have used a different kernel with SVM (they did not report which
kernel they used), (b) SVM parameters may have been tuned slightly differently in the
two studies, and (c) the slightly smaller number of texts used in the Tsur & Rappoport
study may have hindered the classifier’s construction of an optimal model of the relationship
between the features in question and the learners’ L1 backgrounds.
A further follow-up to the Koppel et al. (2005) study is found in Wong & Dras
(2009). Like the two previous studies, Wong & Dras also used SVM as their classifier.
However, they used SVM with a radial kernel instead of the linear kernel that Koppel et
al. used. Wong & Dras also extended their study to seven L1 backgrounds represented in
the ICLE, adding Chinese and Japanese to the five L1s in the previous two studies. De-
spite the increased number of L1 groups, however, the sample in Wong & Dras is small-
er than that of the previous two studies, consisting of only 70 texts per group (490 total)
as the training set and 25 texts per group (175 total) as the testing set. As this implies, the
authors also appear to have used a simple-split CV rather than a k-fold CV. Wong & Dras
also used a slightly different set of features, which nevertheless overlaps a great deal with
the other two studies. The features in this study include 400 function words, 500 charac-
ter n-grams, and 650 part-of-speech n-grams. The highest level of classification accuracy
achieved was 74%, which was attained with two different sets of features: a combination
of all three types of features, and a combination of just the function words and part-of-
speech n-grams. The 74% accuracy achieved in this study is perhaps equally remarkable
as the accuracy rate of 80% in the Koppel et al. (2005) study given the larger number of
L1s in the Wong & Dras study and the concomitantly lower level of chance (i.e. 14% vs.
20%). At the same time, the smaller sample and the use of a simple-split CV in the Wong
& Dras study do cast some doubt on the generalizability of their results.
Some of the questions left unanswered by these studies are (a) whether high levels
of classification accuracy can also be achieved when the number of L1s is extended
substantially beyond seven, (b) whether highly frequent words that include not just
function words but also content words will facilitate L1 classification, and (c) what lev-
els of L1 classification accuracy can be attained with n-grams made up of word se-
quences instead of letter sequences and part-of-speech sequences. These are some of
the questions that Jarvis & Paquot (forthcoming) set out to address while taking advan-
tage of the wealth of data available in the newest version of the ICLE. The newest ver-
sion of the ICLE includes argumentative and literary texts written in English by learners
from 16 different L1 backgrounds, but Jarvis & Paquot chose to focus on just 12 of these
because the texts written by Chinese, Japanese, Tswana, and Turkish speakers include a
relatively high proportion of lower proficiency texts (cf. Table 6 in Granger et al. 2009).2
Following the conventions of the previous three studies, Jarvis & Paquot used only
those texts that were between 500 and 1,000 words in length. This was done for reasons
of inter-class comparability, and a further criterion that the authors imposed on their
selection of texts was to use only argumentative texts. The resulting breakdown of the
number of texts per L1 group that were included in the study is shown in Table 2.
Table 2. Texts included in the Jarvis & Paquot (forthcoming) study
L1           Number of texts
Bulgarian      140
Czech          116
Dutch          125
Finnish        121
French         200
German         182
Italian         86
Norwegian      270
Polish         288
Russian        144
Spanish        144
Swedish        217
TOTAL        2,033
2. The fact that Wong & Dras (2009) included ICLE texts written by Chinese and Japanese
speakers raises additional questions about the reliability of their results.
The features used in the Jarvis & Paquot study included four categories of word n-
grams extracted from the data: unigrams (single words), bigrams (two-word se-
quences), trigrams (three-word sequences), and quadrigrams (four-word sequences).
The selected n-grams were the most frequent 200 n-grams in each category that oc-
curred at least 35 times in the data and were not prompt-induced. This latter criterion
meant that n-grams were manually disqualified if they included content words
(and their families) that were used in the essay prompts that the essays were written
in response to. Examples of such prompt-induced words that were disqualified in-
clude society, prison, science, technology, television, religion, imagination, and dream,
which appear in essay prompts such as ‘Some people say that in our modern world,
dominated by science and technology and industrialisation, there is no longer a place
for dreaming and imagination. What is your opinion?’ or ‘Marx once said that reli-
gion was the opium of the masses. If he was alive at the end of the 20th century, he
would replace religion with television’. The final feature set for the study consisted of
200 unigrams, 200 bigrams, 200 trigrams, but only 122 quadrigrams because only 122
quadrigrams met the criteria for inclusion. The total number of features was therefore
722, and these included frequent n-grams made up of both content words and func-
tion words.
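The selection criteria just described (a frequency threshold, a cap on the number of n-grams per category, and the exclusion of prompt-induced items) can be sketched as follows. This is a simplified illustration, not the authors' procedure: the exclusion in the study was manual and covered word families, whereas the code matches exact word forms only. Expressing feature values as occurrences per 1,000 words is also an assumption, and the two sample texts are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word sequences of length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_ngrams(texts, n, top=200, min_count=35, prompt_words=frozenset()):
    """Most frequent n-grams meeting the threshold and free of prompt words."""
    totals = Counter()
    for tokens in texts:
        totals.update(ngrams(tokens, n))
    eligible = [g for g, c in totals.most_common()
                if c >= min_count and not prompt_words & set(g.split())]
    return eligible[:top]

def feature_vector(tokens, selected, n):
    """Frequency of each selected n-gram per 1,000 words of one text."""
    counts = Counter(ngrams(tokens, n))
    return [1000 * counts[g] / len(tokens) for g in selected]

# Invented mini-corpus; thresholds are lowered so that the toy data yield output.
texts = [t.lower().split() for t in [
    "on the other hand television is the opium of the masses",
    "on the other hand I think that the fact is clear",
]]
bigrams = select_ngrams(texts, 2, top=3, min_count=2, prompt_words={"television"})
print(bigrams)
print(feature_vector(texts[0], bigrams, 2))
```

With the thresholds reported for the study, the call would use `top=200` and `min_count=35` for each value of n from 1 to 4.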
Unlike the previous three studies, which used an SVM classifier, Jarvis & Paquot
chose LDA as their classifier because of the relatively clear interpretability of the re-
sults it produces and for reasons of consistency with the other studies in the same
collection. One of the disadvantages of this choice, however, is that LDA has stricter
statistical assumptions than SVM and most other classifiers, requiring at least 10 cases
per feature. Given that there were 2,033 texts in their dataset, Jarvis & Paquot were able
to perform LDA classification using only approximately 200 features instead of the full
set of 722 features. In order to comply with this limitation while nevertheless taking
advantage of the full set of 722 features, they performed stepwise feature selection us-
ing a relatively strict criterion for feature entry (p < .01) and removal (p > .05), which
resulted in the selection of 200 features representing a combination of unigrams, big-
rams, trigrams, and quadrigrams that contributed the most to the strength of the mod-
el. Because they combined feature selection with classification, it was also necessary to
embed the stepwise procedure within their 10-fold CV.
Their final embedded 10-fold cross-validated classification accuracy for predict-
ing the L1 backgrounds of the 2,033 texts representing 12 L1 backgrounds was 53.57%.
This is lower than the accuracy levels achieved by the previous three studies, but this is
of course to be expected in light of the substantially higher number of L1s in this study.
Jarvis & Paquot set their LDA classifier to treat all L1 groups as having equal prior
probabilities, which means that the level of chance for correct L1 prediction was only
8% (compared with 20% for five L1 groups and 14% for seven L1 groups). A more
conservative baseline is the level of accuracy that would have been attained if all texts
had been classified as belonging to the biggest L1 group (i.e. the Polish group, n = 288),
in which case the baseline is 14% accuracy. In either case, the result of 53%
cross-validated classification accuracy with 12 L1 groups is quite high and clearly
points to group-distinctive behaviors in the learners’ use of n-grams.
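The two baselines mentioned above follow directly from the group sizes in Table 2, as the short calculation below shows. The variable names are illustrative only.

```python
# Number of texts per L1 group, copied from Table 2.
texts_per_l1 = {
    "Bulgarian": 140, "Czech": 116, "Dutch": 125, "Finnish": 121,
    "French": 200, "German": 182, "Italian": 86, "Norwegian": 270,
    "Polish": 288, "Russian": 144, "Spanish": 144, "Swedish": 217,
}

total = sum(texts_per_l1.values())

# Chance level when all 12 groups are given equal prior probabilities.
chance = 1 / len(texts_per_l1)

# Accuracy obtained by assigning every text to the largest group.
largest_group = max(texts_per_l1, key=texts_per_l1.get)
majority_baseline = texts_per_l1[largest_group] / total

print(f"total texts: {total}")
print(f"chance: {chance:.1%}")
print(f"majority baseline ({largest_group}): {majority_baseline:.1%}")
```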
Although the higher number of L1s is certainly largely responsible for the lower
classification accuracies found in the Jarvis & Paquot study vis-à-vis those found in the
previous three studies, other factors that may also have contributed to the difference are
differences in (a) the strictness of the researchers’ criteria for text inclusion, (b) the spe-
cific features that were used in the analysis, and (c) the classifier that was used. In the
study presented in the following sections of this chapter, I address the third possibility,
and specifically consider whether other classifiers can achieve higher cross-validated
classification accuracies with precisely the same texts and features that Jarvis & Paquot
used. This is especially relevant in light of the fact that many other classifiers do not have
strict restrictions on the ratio of texts to features. Accordingly, an important question is
whether classifiers that are able to create a model that includes all 722 of the features
used in the Jarvis & Paquot study would achieve superior classification accuracies.
4. Method
As just mentioned, the data used in the present study are identical to those in the Jarvis
& Paquot (forthcoming) study. A breakdown of the 2,033 texts included in the present
study was shown in Table 2, and the features are, again, 722 n-grams made up of the
most frequent unigrams (n = 200) (e.g. the, to, of, and, a, is, in, that, it, are, for, be, not,
they, have, we), bigrams (n = 200) (e.g. of the, on the, there is, I think, we are), trigrams (n
= 200) (e.g. a lot of, in order to, the fact that, one of the, on the other, in my opinion), and
quadrigrams (n = 122) (e.g. on the other hand, at the same time, I would like to, to be able
to, is one of the) that occur at least 35 times in the data and are not prompt-induced.
The purpose of the present study is to compare the performance of a number
of classifiers in relation to their ability to produce accurate cross-validated predictions
of the L1 backgrounds of the 2,033 texts in the data, with all relying on the same pool of
n-gram features. The classifiers selected for this study include Linear Discriminant
Analysis (used by Jarvis & Paquot), Support Vector Machines (used by Koppel et al.
2005, Tsur & Rappoport 2007, Wong & Dras 2009, and found to be a superior classi-
fier by Shen et al. 2007), Random Forest (found to be superior for L1 detection by
Estival et al. 2007), Sequential Minimal Optimization (also found to be useful by Esti-
val et al. 2007), Nearest Shrunken Centroids (found to be superior for authorship at-
tribution by Jockers & Witten 2010), Delta and Delta Prime (also found to be useful by
Jockers & Witten 2010), Naïve Bayes (used by Mayfield Tomokiyo & Jones 2001 in
their study of L1 detection), and a number of classifiers included in the Weka toolkit.
Some classifiers that are available in the Weka toolkit were not included because of
how much time they took to process the data (e.g. the MultilayerPerceptron classifier
took over two hours to complete its initial, pre-CV analysis), or because their catego-
ries were considered to be already well enough represented.
In all, I used 20 classifiers and attempted to find the optimal classification accu-
racy for each by experimenting with various parameter settings and feature-selection
methods. Of course, it was not feasible to test all possible parameter settings and fea-
ture-selection methods with these 20 classifiers, so it is possible that the optimal ac-
curacy rates for some classifiers may be higher than what I have found. Nevertheless, I
am confident in the results for the best-performing classifiers because most of these
achieved their highest levels of classification accuracy without any parameter tuning
or feature selection at all. In all cases, the final classification accuracies were deter-
mined through 10-fold CV. In the case of LDA, where the optimal result was obtained
through the reduction of features, feature selection was embedded in the 10-fold CV.
In the case of the classifiers available in Weka, however, only non-embedded 10-fold
CV was used. This means that the classifiers run in Weka that relied on feature
selection may have produced somewhat overly optimistic classification accuracies, but
this is unlikely to affect the conclusions, because these classifiers produced the
lowest accuracy rates in any case.
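
The difference between embedded and non-embedded CV can be made concrete with a small sketch. In the embedded version below, feature selection is repeated inside every training fold, so the held-out texts never influence which features are kept. Everything here is a stand-in chosen for brevity: the selector (spread of class means), the classifier (nearest class mean) and the interleaved fold assignment are assumptions, not the stepwise LDA procedure run in SPSS.

```python
from collections import defaultdict


def class_means(X, y):
    """Mean feature vector for each class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def select_features(X, y, k):
    """Toy selector: keep the k features whose class means are most spread out."""
    means = list(class_means(X, y).values())
    spread = [max(col) - min(col) for col in zip(*means)]
    ranked = sorted(range(len(spread)), key=lambda j: spread[j], reverse=True)
    return sorted(ranked[:k])


def nearest_mean_predict(train_X, train_y, test_X):
    """Toy classifier: assign each test row to the class with the closest mean."""
    means = class_means(train_X, train_y)

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(means, key=lambda label: dist(row, means[label])) for row in test_X]


def embedded_cv_accuracy(X, y, n_folds, k_features):
    """k-fold CV in which feature selection is redone inside every training fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Selection sees only the training fold, never the held-out texts.
        keep = select_features(train_X, train_y, k_features)

        def project(row):
            return [row[j] for j in keep]

        preds = nearest_mean_predict([project(r) for r in train_X], train_y,
                                     [project(X[i]) for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)
```

A non-embedded design would instead call the selector once on all texts before the loop, which lets information from the held-out fold leak into the model and is the source of the optimism mentioned above.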
5. Results
The results of the classifier comparison are shown in Table 3, where the 20 classifiers
are listed according to their optimal accuracy rates with respect to the data under in-
vestigation. The table also shows the classifier type that each classifier represents, as
well as the software application (and the software package, in the case of R) that was
used to run the classifier. The final two columns in the table show the number of fea-
tures that was selected for each classifier’s optimal model and the optimal 10-fold
cross-validated accuracy rates for each classifier.
As these results show, the best-performing classifiers for the present classification
task are Linear Discriminant Analysis (LDA), Sequential Minimal Optimization
(SMO), Naïve Bayes Multinomial (NBM), and Nearest Shrunken Centroids (NSC),
with relatively little difference between them. LDA was restricted to only 200 features,
but it performed as well as or better than all classifiers that could and did include all 722
features in their models. Aside from LDA, the other 10 classifiers with the highest clas-
sification accuracies all made use of the full set of 722 features, although this statement
needs to be qualified with respect to Delta Prime, which in the present case was found
to achieve optimal results when it ignored differences between texts and the means of
particular L1 groups that did not exceed 0.8 standard deviations. All 20 classifiers
achieved classification accuracies that are substantially higher than chance (8%) and,
with the exception of Random Tree, also substantially higher than the more conserva-
tive baseline of 14%. Nevertheless, a noticeable gap exists between the seven classifiers
with the best performance (accuracies of 44.66% and higher) and the 13 classifiers
with the worst performance (accuracies of 39.10% and lower).
Table 3. Classifiers ordered by accuracy in detecting L1 background

Classifier                           Class. Type  Application  Features        Accuracy (10-fold CV)
Lin. Discriminant Analysis (LDA)     Centroid     SPSS         200 (stepwise)  53.57%
Seq. Minimal Optimization (SMO)      Boundary     Weka         722             53.22%
Naïve Bayes Multinomial (NBM)        Bayesian     Weka         722             52.29%
Nearest Shrunken Centroids (NSC)     Centroid     R (pamr)     722             51.45%
Delta Prime (>.8) (DP)               Means        Perl script  722             47.37%
Complement Naïve Bayes (CNB)         Bayesian     Weka         722             44.76%
Support Vector Machines (SVM)        Boundary     R (e1071)    722             44.66%
Naïve Bayes Tree (NBT)               Composite    Weka         722             39.10%
Bayes Net (BN)                       Composite    Weka         722             38.81%
Logit Boost (LB)                     Composite    Weka         722             37.48%
Naïve Bayes (NB)                     Bayesian     Weka         722             34.92%
Classification via Regression (CVR)  Composite    Weka         69 (infogain)   34.38%
Bagging (Bag)                        Tree         Weka         69 (infogain)   32.66%
Delta                                Means        Perl script  722             30.20%
Random Forest (RF)                   Tree         Weka         55 (g. stepw.)  29.32%
Simple Cart (SC)                     Tree         Weka         69 (infogain)   26.36%
J48 Graft (J48G)                     Tree         Weka         55 (g. stepw.)  23.81%
J48                                  Tree         Weka         55 (g. stepw.)  23.02%
Jrip                                 Rule         Weka         69 (infogain)   23.02%
Random Tree (RT)                     Tree         Weka         55 (g. stepw.)  18.74%

A further question that is addressed by the results is whether the majority-vote meth-
od with an ensemble of classifiers might produce higher classification accuracies
than any one classifier alone (cf. Kotsiantis 2007). To find out, I created several differ-
ent ensembles of classifiers, beginning with the two to five classifiers with the highest
accuracy rates in Table 3, and then successively including four additional classifiers
that had relatively high classification rates but are relatively distinct with respect to
the algorithms they rely on. I considered the diversity of algorithms to be an advan-
tage for the majority-vote (or ensemble) method in order to avoid a situation where
similar algorithms lead to the same misclassifications. In each ensemble of classifiers,
I extracted the L1 predictions that each classifier made regarding each text, and then
determined each text’s ensemble classification as the L1 prediction made by a plural-
ity of classifiers. The L1 prediction made by a classifier was thus treated as the classi-
fier’s vote, and the final L1 classification for a text was determined by a plurality of
votes. In some cases, however, two or more L1s received the same number of winning
votes (i.e. two or more L1s tied for first place). There were a number of cases, for
example, where the highest number of votes that any L1 received was only two or
three, and in such cases it was common for more than one L1 to receive this number
of votes. When this happens, it is difficult to say whether the ensemble method has
truly identified the correct L1, but it is clear that this may nevertheless be a useful
way of narrowing the range of possibilities when the L1 truly is not known. Table 4
shows the classification accuracies that were achieved with each ensemble of classi-
fiers. The results include both the percentage of correct classifications where there
was no tie for first place and the percentage of correct classifications that included
ties for first place. The former involves unambiguously correct classifications, where-
as the latter is a combination of unambiguously correct classifications plus those
cases where the ensemble vote identified the correct L1 as one of two or more equal
possibilities.
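
The voting procedure described above can be sketched in a few lines. Each classifier contributes one predicted L1 per text; a text counts as a clear-winner hit when the correct L1 alone receives the most votes, and as a tied-for-first hit when the correct L1 shares the top vote count with at least one other L1. The function names and the toy predictions are hypothetical.

```python
from collections import Counter


def plurality_vote(votes):
    """Return the set of labels that share the highest number of votes."""
    counts = Counter(votes)
    top = max(counts.values())
    return {label for label, c in counts.items() if c == top}


def ensemble_accuracies(predictions, gold):
    """predictions: one list of predicted labels per classifier; gold: true labels.

    Returns (clear-winner accuracy, clear-winner plus tied-for-first accuracy).
    """
    clear = tied = 0
    for i, truth in enumerate(gold):
        winners = plurality_vote([clf[i] for clf in predictions])
        if winners == {truth}:
            clear += 1          # correct L1 is the sole winner
        elif truth in winners:
            tied += 1           # correct L1 is one of several tied winners
    n = len(gold)
    return clear / n, (clear + tied) / n


# Toy example: three classifiers, four texts (hypothetical predictions).
gold = ["FI", "SV", "DE", "PL"]
lda = ["FI", "SV", "NL", "RU"]
smo = ["FI", "FI", "DE", "RU"]
nbm = ["FI", "SV", "FR", "PL"]
print(ensemble_accuracies([lda, smo, nbm], gold))  # prints (0.5, 0.75)
```

In the toy data the third text receives one vote for each of three L1s, so the correct L1 is only tied for first place, which is exactly the ambiguous case that separates the two accuracy columns.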
It is not completely clear from these results whether the ensemble method is
superior to the use of LDA or SMO alone. On the basis of clear-winner voting, en-
sembles of at least three classifiers produce results very similar to those of LDA and
SMO. When ties are taken into consideration, the ensembles produce higher classi-
fication accuracies, but ties ultimately still need to be resolved. Assuming that ties
will not be settled completely successfully, the true predictive power of ensemble
voting is likely to be somewhere between the clear-winner and tied-for-first accu-
racy rates. Interestingly, both the clear-winner and tied-for-first accuracy rates ap-
pear to become stable with ensembles of five or more classifiers, remaining in the
narrow range of 53.32–53.76% for clear winners, and staying within the narrow
range of 59.67–60.21% for ties. In light of these results, it seems doubtful that the
inclusion of more classifiers in the ensemble would result in higher accuracy rates,
especially since the accuracy rates of most of the remaining available classifiers are
rather low.
Table 4. Ensemble classification by majority vote

Ensemble of classifiers                   Accuracy (clear winner)  Accuracy (clear winner + tied for first place)
LDA, SMO                                  38.71%                   68.08%
LDA, SMO, NBM                             53.96%                   65.17%
LDA, SMO, NBM, NSC                        51.50%                   61.49%
LDA, SMO, NBM, NSC, DP                    53.32%                   59.67%
LDA, SMO, NBM, NSC, DP, LB                53.47%                   60.16%
LDA, SMO, NBM, NSC, DP, LB, CVR           53.76%                   59.81%
LDA, SMO, NBM, NSC, DP, LB, CVR, Bag      53.37%                   60.21%
LDA, SMO, NBM, NSC, DP, LB, CVR, Bag, SC  53.52%                   59.76%

6. Discussion and conclusions
The primary research question addressed in the present study concerns which of the
many available classifiers is best able to learn to recognize the relationship between n-
gram patterns in ICLE texts and the L1 group membership of the learners who pro-
duced those texts. For the classification task used in the present study, LDA showed the
strongest ability to learn these patterns, but SMO, NBM, and NSC produced compara-
bly high levels of classification accuracy. One important difference, however, is that
LDA achieved this result with far fewer features than these other classifiers did. In
some situations, such as where the number of texts available is only a few dozen or even
a few hundred, LDA’s restrictions on the ratio of features to texts may severely hinder
its usefulness. This was not the case in the present analysis, however. Thus, it appears
that LDA was indeed one of the best options – if not the best option – for the particular
purposes of the Jarvis & Paquot (forthcoming) study. According to the results of the
present study, not even the use of the ensemble method would have led to more than a
marginal improvement over LDA (at most 53.96% versus 53.57%) in the number of
unambiguously correctly classified cases.
Potential uses of the ensemble method do seem intriguing, however, especially
with respect to how they might help to determine the true percentage of texts in a
corpus that contain the relevant group-related pattern. The question of which classifier
is best is actually secondary to this higher aim of discovering and capturing as much of
the true group-related patterning (or signal) as possible. It seems that the ensemble
method is a good place to start in order to obtain an estimate of the percentage of texts
in which the true signal may be found. On the basis of the results of the ensemble
method, the researcher could then select a specific classifier that accounts for as much
of that signal as possible, and which also has other characteristics (e.g. interpretational
transparency) that are conducive to the purposes of the study. In the present case, it
appears that LDA was able to capture most of the true L1-related signal embedded
within the features it was given, although the tied-for-first results of the ensemble
method suggest that some of that signal may be intertwined with other signals and
may therefore be difficult if not impossible to tease apart fully.
A critical question regarding L1-related signals is whether a classifier’s ability to
identify the correct L1 backgrounds of L2 texts necessarily means that the signal that
the classifier has tuned into really is being produced by the L1 itself, or whether it may
be produced by other factors that happen to coincide with L1 group divisions. For
example, if the L1 groups represented in the data are not at equivalent levels of L2
proficiency, if they have had different types and amounts of L2 instruction, and/or if
they have experienced the L2 in differing environments with different types and
amounts of input and exposure and differing opportunities to use the L2, then these
factors by themselves, in combination with one another, and/or in combination with
L1 influence, may be the source of the signal that allows the classifier to achieve such
high levels of L1 classification accuracy. Regarding the ICLE data used in the present
study, it is in fact certain that the different L1 groups have differing ranges of L2 abilities
(see Bestgen et al. forthcoming), and there are also some indications of potential ef-
fects of training and instruction on the ICLE texts that coincide with L1 divisions (see
e.g. Paquot 2010). Nevertheless, there is also a great deal of evidence of direct L1 influ-
ence in the ICLE data, such as the French writers’ distinctively high use of on the con-
trary (see also Paquot 2007) and the Finnish writers’ distinctively high use of all the
time (see Jarvis & Paquot forthcoming), whose counterparts in the respective L1s oc-
cur with correspondingly high frequency rates. The very real effects of cross-linguistic
influence are also underscored by the finding by Jarvis & Paquot (forthcoming) that,
when texts are misclassified, there is a strong tendency for them to be classified into
the correct language families (e.g. Dutch as German, Italian as French or Spanish,
Norwegian as Swedish). An important direction for future research in this area will be
to combine the power of classifiers with principled methods for teasing apart the ef-
fects of the L1 from other potential factors (cf. Jarvis 2000; Jarvis 2010).
In this study, I have highlighted the use of classifiers for investigating L1-related
effects, but it is important to recognize that classifiers can be used similarly for the in-
vestigation of the relationship between language features and other class variables,
such as text type, task type, topic, learners’ L1 writing proficiency, learners’ L2 profi-
ciency in general or in specific ability areas, learners’ educational backgrounds, the
context of their L2 learning, the number of years of L2 instruction they have received,
and so forth. Concerning cross-linguistic influence, classification could and probably
should also be used to investigate influences of prior languages besides just the L1.
One of the interesting results in Jarvis & Paquot (forthcoming), for example, is that
when Finnish speakers’ texts are misclassified, they are more frequently identified with
Swedish than with any other L1 background. Interestingly, Swedish is unrelated to
Finnish but is a language that all Finnish speakers are required to study in school. In-
fluences of nonnative languages on each other may, in fact, be some of the strongest
signals intertwined with the L1 signal. Other promising avenues for the future of this
area of research include the development of classifiers that perform cross-language
comparisons and retrieval (cf. Sorg & Cimiano 2010), which could considerably en-
hance the ability of classifiers to detect and verify direct influences of one language on
another. Finally, despite the exciting technological developments in this area of
research, it is of course important to make sure that we use them to enhance rather than to
supplant our expertise in qualitative linguistic analysis, which is ultimately what al-
lows us to make sense of learner data.
References
Bestgen, Y., Granger, S. & Thewissen, J. Forthcoming. Error patterns and automatic L1 identifi-
cation. In Approaching Transfer through Text Classification: Explorations in the Detection-
based Approach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilingual Matters.
Burns, R.B. & Burns, R.A. 2008. Business Research Methods and Statistics Using SPSS. London:
Sage.
Burrows, J.F. 2002. ‘Delta’: A measure of stylistic difference and a guide to likely authorship.
Literary and Linguistic Computing 17: 267–287.
Crossley, S.A. & McNamara, D.S. 2009. Computational assessment of lexical differences in L1
and L2 writing. Journal of Second Language Writing 18: 119–135.
Duda, R.O., Hart, P.E. & Stork, D.G. 2000. Pattern Classification, 2nd edn. New York NY: Wiley.
Estival, D., Gaustad, T., Pham, S.B., Radford, W. & Hutchinson, B. 2007. Author profiling for
English emails. Proceedings of the 10th Conference of the Pacific Association for Computa-
tional Linguistics (PACLING 2007), Melbourne, Australia, 31–39.
Field, A. 2005. Discovering Statistics Using SPSS. London: Sage.
Francis, W. & Kucera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar.
Boston MA: Houghton Mifflin.
Gavin, G. & Teytaud, O. 2002. Lower bounds for training and leave-one-out estimates of the
generalization error. In ICANN 2002, LNCS 2415, J.R. Dorronsoro (ed.), 583–588. Berlin:
Springer.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. 2009. The International Corpus of Learner
English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses universitaires de
Louvain.
Guan, H., Zhou, J. & Guo, M. 2009. A class-feature-centroid classifier for text categorization. In
Proceedings of the 18th International Conference on the World Wide Web, 201–210. New
York NY: Association for Computing Machinery.
Guyon, I. & Elisseeff, A. 2003. An introduction to variable and feature selection. Journal of
Machine Learning Research 3: 1157–1182.
Herlitz, G.N. 1994–1995. The meaning of the term prima facie. Louisiana Law Review 55:
391–408.
Hinton, G. & Sejnowski, T.J. (eds). 1999. Unsupervised Learning: Foundations of Neural Compu-
tation. Cambridge MA: The MIT Press.
Hoover, D.L. 2004a. Testing Burrows’s Delta. Literary and Linguistic Computing 19: 453–475.
Hoover, D.L. 2004b. Delta prime? Literary and Linguistic Computing 19: 477–495.
Hyvärinen, A. & Oja, E. 2000. Independent component analysis: Algorithms and applications.
Neural Networks 13: 411–430.
Jarvis, S. 2000. Methodological rigor in the study of transfer: Identifying L1 influence in the in-
terlanguage lexicon. Language Learning 50: 245–309.
Jarvis, S. 2010. Comparison-based and detection-based approaches to transfer research. In
EUROSLA Yearbook 10, L. Roberts, M. Howard, M. Ó Laoire & D. Singleton (eds), 169–
192. Amsterdam: John Benjamins.
Jarvis, S. Forthcoming. Introduction. In Approaching Transfer through Text Classification: Explo-
rations in the Detection-based Approach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilin-
gual Matters.
Jarvis, S., Castañeda-Jiménez, G. & Nielsen, R. 2004. Investigating L1 lexical transfer through
learners’ wordprints. Paper presented at the 2004 Second Language Research Forum. State
College, Pennsylvania.
Jarvis, S. & Crossley, S.A. (eds). Forthcoming. Approaching Transfer through Text Classification:
Explorations in the Detection-based Approach. Bristol, UK: Multilingual Matters.
Jarvis, S., Grant, L., Bikowski, D. & Ferris, D. 2003. Exploring multiple profiles of highly rated
learner compositions. Journal of Second Language Writing 12: 377–403.
Jarvis, S. & Paquot, M. Forthcoming. Exploring the role of n-grams in L1 identification. In Ap-
proaching Transfer through Text Classification: Explorations in the Detection-based Ap-
proach, S. Jarvis & S. Crossley (eds). Bristol, UK: Multilingual Matters.
Jarvis, S. & Pavlenko, A. 2008. Crosslinguistic Influence in Language and Cognition. London:
Routledge.
Jockers, M.L. & Witten, D.M. 2010. A comparative study of machine learning methods for au-
thorship attribution. Literary and Linguistic Computing 25: 215–223.
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. & Murthy, K.R.K. 2001. Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation 13: 637–649.
Kohavi, R. & John, G.H. 1997. Wrappers for feature subset selection. Artificial Intelligence 97:
273–324.
Koppel, M., Schler, J. & Zigdon, K. 2005. Determining an author’s native language by mining a text
for errors. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowl-
edge Discovery in Data Mining, 624–628. Chicago IL: Association for Computing Machinery.
Kotsiantis, S. 2007. Supervised machine learning: A review of classification techniques. Infor-
matica Journal 31: 249–268.
Kotsiantis, S. & Pintelas, P. 2004. Recent advances in clustering: A brief survey. WSEAS Transac-
tions on Information Science and Applications 1: 73–81.
Lecocke, M. & Hess, K. 2006. An empirical study of univariate and genetic algorithm-based fea-
ture selection in binary classification with microarray data. Cancer Informatics 2: 313–327.
Liu, J.Y., Zhuang, D.F., Luo, D. & Xiao, X. 2003. Land-cover classification of China: Integrated
analysis of AVHRR imagery and geophysical data. International Journal of Remote Sensing
24: 2485–2500.
Mayfield Tomokiyo, L. & Jones, R. 2001. You’re not from ‘round here, are you? Naive Bayes detec-
tion of non-native utterance text. In Proceedings of the Second Meeting of the North American
Chapter of the Association for Computational Linguistics (NAACL ‘01), unpaginated elec-
tronic document. Cambridge MA: The Association for Computational Linguistics.
McCallum, A. & Nigam, K. 1998. A comparison of event models for Naïve Bayes text classification.
In AAAI/ICML-98 Workshop on Learning for Text Categorization [Technical Report WS-98-
05], 41–48. Menlo Park CA: Association for the Advancement of Artificial Intelligence.
McLachlan, G.J. 2004. Discriminant Analysis and Statistical Pattern Recognition. Hoboken
NJ: Wiley.
Millet-Roig, J., Ventura-Galiano, R., Chorro-Gascó, F.J. & Cebrián, A. 2000. Support Vector
Machine for arrhythmia discrimination with wavelet-transform-based feature selection.
Computers in Cardiology 27: 407–410.
Molinaro, A.M., Simon, R. & Pfeiffer, R.M. 2005. Prediction error estimation: A comparison of
resampling methods. Bioinformatics 21: 3301–3307.
Paquot, M. 2007. EAP Vocabulary in EFL Learner Writing. A Phraseology-oriented Approach.
PhD dissertation, Université Catholique de Louvain.
Paquot, M. 2010. Academic Vocabulary in Learner Writing: From Extraction to Analysis. London:
Continuum.
Prinzie, A. & Van den Poel, D. 2007. Random multiclass classification: Generalizing Random
Forests to random MNL and random NB. In DEXA 2007, LNCS 4653, R. Wagner, N. Revell
& G. Pernul (eds), 349–358. Berlin: Springer.
Raileanu, L.E. & Stoffel, K. 2004. Theoretical comparison between the Gini Index and Informa-
tion Gain criteria. Annals of Mathematics and Artificial Intelligence 41: 77–93.
Shen, C., Breen, T.E., Dobrolecki, L.E., Schmidt, C.M., Sledge, G.W., Miller, K.D. & Hickey, R.J.
2007. Comparison of computational algorithms for the classification of liver cancer using
SELDI mass spectrometry: A case study. Cancer Informatics 3: 329–339.
Sorg, P. & Cimiano, P. 2010. An experimental comparison of explicit semantic analysis imple-
mentations for cross-language retrieval. In Natural Language Processing and Information
Systems: NLDB 2009, LNCS 5723, H. Horacek, E. Metais & R. Munoz (eds), 36–48.
Berlin: Springer.
Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. 2003. Class prediction by Nearest Shrunken
Centroids, with applications to DNA microarrays. Statistical Science 18: 104–117.
Tsur, O. & Rappoport, A. 2007. Using classifier features for studying the effect of native language
on the choice of written second language words. Proceedings of the Workshop on Cognitive
Aspects of Computational Language Acquisition, 9–16. Cambridge MA: The Association for
Computational Linguistics.
Witten, I.H. & Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques,
2nd edn. Amsterdam: Elsevier.
Wong, S.-M. J. & Dras, M. 2009. Contrastive analysis and native language identification. In Pro-
ceedings of the Australasian Language Technology Association, 53–61. Cambridge MA: The
Association for Computational Linguistics.
Appendix 1. Types of classifiers
Centroid-based classifiers. This type of classifier creates a vector space model in which
each case (e.g. text) is represented as a vector in multidimensional space. The vectors
are created by entering the values for each feature (e.g. relative frequencies for various
language features found in the text) into formulas that combine these values with nu-
merical weights that maximize the similarities within classes and the differences be-
tween classes. The vector space model also creates a prototype vector for each class
(e.g. each L1 background), which is referred to as the class centroid. Classification is
performed by comparing the vector for a text with each of the class centroids, and by
classifying the text as belonging to the class whose centroid is closest, or most similar,
to the text’s own vector. Centroid-based classifiers are among the most popular classi-
fiers because of their computational efficiency, and Linear Discriminant Analysis
(LDA) is probably the most widely used classifier of this type. Despite their computa-
tional efficiency, however, LDA and other traditional centroid-based classifiers have
often been found to perform less well than other types of classifiers, such as
boundary-based classifiers. Their lower performance is believed to be due to the fact
that the traditional algorithms for determining class centroids do not produce initial
values that are optimally representative of their classes. To compensate for this defi-
ciency, new classifiers have been developed for adjusting centroids to make them more
predictive of the classes they represent (see e.g. Guan, Zhou & Guo 2009). One such
classifier that does this is referred to as Nearest Shrunken Centroids (NSC). NSC uses
a threshold function for adjusting each class centroid a certain distance toward the
middle of all class centroids, which in many cases has been found to improve classifi-
cation accuracy (see e.g. Tibshirani et al. 2003).
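
The idea of shrinking centroids can be illustrated with a deliberately simplified sketch. Each class centroid is moved toward the overall centroid by a fixed amount per feature, and differences smaller than that amount are set to zero, so weakly discriminating features stop contributing. The published NSC method additionally standardizes each difference by a within-class standard deviation; that step is omitted here, and the threshold value and toy centroids are hypothetical.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta; values within delta of zero become zero."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0


def shrink_centroids(centroids, overall, delta):
    """Shrink each class centroid toward the overall centroid, feature by feature."""
    return {
        label: [o + soft_threshold(c - o, delta) for c, o in zip(vec, overall)]
        for label, vec in centroids.items()
    }


def nearest(centroids, x):
    """Label of the centroid with the smallest squared Euclidean distance to x."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))


# Two hypothetical L1 centroids over two features, and their overall centroid.
centroids = {"FI": [2.0, 5.1], "SV": [6.0, 4.9]}
overall = [4.0, 5.0]
shrunken = shrink_centroids(centroids, overall, delta=0.5)
print(shrunken)                        # the second feature no longer separates the classes
print(nearest(shrunken, [3.0, 7.0]))   # prints FI
```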
Boundary-based classifiers. These classifiers are similar to centroid-based classifi-
ers in the sense that both represent texts as vectors in a vector space model. The differ-
ence is that boundary-based classifiers do not use centroids, but instead use mathe-
matical means for determining boundaries between classes. These boundaries are
referred to as hyperplanes, and the cases that lie along the margins of the hyperplanes
are referred to as support vectors. Classification takes place by determining which side
of a hyperplane a case’s vector falls on, and by classifying the case as a member of the
class whose side it is on. The most common boundary-based classifier is Support Vec-
tor Machines (SVM), but adaptations of SVM have been developed to correct the po-
sitioning of the hyperplane between support vectors (e.g. Optimal Separating Hyper-
planes, OSH, see Millet-Roig et al. 2000) and to optimize the SVM algorithm for
efficiency (e.g. Sequential Minimal Optimization, SMO, see Keerthi et al. 2001).
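
The classification step of a boundary-based classifier reduces to asking on which side of a hyperplane a vector falls. The sketch below shows only that step for two classes, with a hand-picked weight vector and offset that are purely hypothetical; finding the hyperplane from training data is what SVM and SMO actually do and is not shown. A task with 12 classes would in practice combine several such binary decisions.

```python
def decision_value(w, b, x):
    """Signed score w·x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x, positive_class, negative_class):
    """Assign the class on whose side of the hyperplane the vector x falls."""
    return positive_class if decision_value(w, b, x) >= 0 else negative_class


# Hypothetical hyperplane separating two L1 groups in a two-feature space.
w, b = [1.0, -1.0], 0.0
print(classify(w, b, [3.0, 1.0], "FR", "FI"))  # prints FR
print(classify(w, b, [1.0, 3.0], "FR", "FI"))  # prints FI
```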
Bayesian classifiers. The most common Bayesian classifier is referred to as a Naïve
Bayes classifier, which relies on a relatively simple algorithm for using feature values to
determine the probability that a particular case belongs to a particular class. One of the
assumptions underlying this approach is that the presence or absence of one feature is
independent of the presence or absence of other features. With respect to language
features, this assumption is generally wrong, and for this reason, a Naïve Bayes classi-
fier tends not to classify texts as accurately as more sophisticated types of classifiers do.
However, attempts have also been made to modify the Bayesian model in order to cre-
ate alternative Bayesian classifiers that do take into account the relationships among
features (Kotsiantis 2007). Naïve Bayes Multinomial (NBM) and Complement Naïve
Bayes (CNB) are two such classifiers (see e.g. McCallum & Nigam 1998).
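
A minimal sketch of a multinomial Naïve Bayes classifier may help to show how feature values are turned into class probabilities. Each text is represented as a list of the features it contains (repeated once per occurrence), and class scores are sums of log probabilities, which is where the independence assumption enters. The add-one smoothing, the handling of unseen features and the toy data are assumptions made for illustration; this is not the Weka implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities for each class."""
    vocab = sorted({f for doc in docs for f in doc})
    class_docs = Counter(labels)
    feature_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        feature_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(feature_counts[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n_docs / len(docs)),
            {f: math.log((feature_counts[label][f] + alpha) / total) for f in vocab},
        )
    return model


def predict(model, doc):
    """Pick the class with the highest posterior log probability for a text."""
    def score(label):
        log_prior, log_lik = model[label]
        # Features never seen in training are skipped (a simplifying assumption).
        return log_prior + sum(log_lik[f] for f in doc if f in log_lik)
    return max(model, key=score)


# Hypothetical training texts reduced to the n-gram features they contain.
docs = [["on the contrary", "the", "the"], ["all the time", "the"],
        ["on the contrary", "of"], ["all the time", "all the time"]]
labels = ["FR", "FI", "FR", "FI"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["on the contrary", "the"]))  # prints FR
```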
Artificial neural networks. An artificial neural network is, as the name implies, a
mathematically represented network of interconnected artificial neurons, or nodes. In
learning tasks, such as supervised classification, an artificial neural network is trained
on input-output pairs in order to allow it to assign appropriate weights to the input that
will result in the correct output. Artificial neural networks usually consist of multiple
input nodes and at least one layer of so-called hidden nodes between input and output
nodes. In this type of model, initial input values are sent to each of the intermediate
hidden nodes with which they are connected, and then each of the hidden nodes uses
that information with its own set of weights to calculate its own activation value, which
is then passed on to the output nodes with which it is connected. As with real neural
networks, individual input units can have effects throughout the entire network, but
each neuron (or node) interprets and converts information in its own particular way.
Some of the more sophisticated neural network classifiers include Multilayer Percep-
tron Analysis (MPA) and Radial Basis Function (RBF) (e.g. Kotsiantis 2007).
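
The flow of information from input nodes through hidden nodes to output nodes can be sketched as a single forward pass. The sigmoid activation and all weight values below are assumptions for illustration; the training step that adjusts the weights from input-output pairs is not shown.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into an activation value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each node combines all inputs with its own weights, then applies the activation."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass input values through one hidden layer and on to the output nodes."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Two input nodes, two hidden nodes, one output node (hypothetical weights).
hidden_w = [[0.5, -0.2], [0.1, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]
print(forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b))
```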
Decision trees. As described by Kotsiantis (2007: 251), “decision trees are trees that
classify instances [i.e. texts, in this case] by sorting them based on feature values. Each
node in a decision tree represents a feature in an instance to be classified, and each
branch represents a value that the node can assume. Instances are classified starting at
the root node and sorted based on their feature values”. The root node of a decision
tree represents the feature that best divides the training data into correct classes, and
the tree’s construction progresses from one node to the next in accordance with which
further features best separate the classes in the training data. One problem with deci-
sion tree classifiers is that they are prone to overfitting the training data, which means
that they tend to become so overly complex in accounting for the specific characteris-
tics of the training data that they do not generalize well to future cases. In order to
avoid overfitting, decision-tree classifiers often include pruning algorithms that re-
move leaves and even full branches that do not improve the classification accuracy.
Simple CART (SC) is an example of a decision-tree classifier that uses a pruning algo-
rithm. Another approach to reducing dimensionality is to limit the number of features
used to build a tree. The Random Forest (RF) classifier, for example, builds decision
trees based on a random subset of features. However, given that a single subset of fea-
tures might result in an unstable model, RF constructs several random trees (hence, a
random forest) and classifies texts according to the majority vote of all the trees in the
forest (Kotsiantis 2007, Prinzie & Van den Poel 2007).
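The majority-vote principle can be sketched in Python as follows. This is a deliberately simplified illustration and not the full Random Forest procedure: each "tree" here is a one-level tree (a stump) chosen from a random subset of features, and other components of Random Forest, such as re-sampling of the training texts, are left out. The feature values (relative frequencies of three words per 1,000 words) and the class labels are invented for the example.

```python
import random
from collections import Counter

def best_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) among feature_ids with fewest training errors.

    A stump is a one-level decision tree that tests one feature against one
    threshold. Assumes at least one candidate feature has two distinct values.
    Returns (feature, threshold, label_if_at_or_below, label_if_above).
    """
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def build_forest(rows, labels, n_trees=5, n_features=2, seed=0):
    """Build n_trees stumps, each restricted to a random subset of features."""
    rng = random.Random(seed)
    all_features = range(len(rows[0]))
    return [best_stump(rows, labels, rng.sample(all_features, n_features))
            for _ in range(n_trees)]

def classify(forest, row):
    """Let every tree vote and return the class with the most votes."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical training texts: frequencies of three words per 1,000 words.
rows = [[60, 30, 25], [62, 31, 27], [58, 29, 24],
        [45, 38, 33], [47, 40, 35], [44, 39, 32]]
labels = ["A", "A", "A", "B", "B", "B"]

forest = build_forest(rows, labels)
prediction = classify(forest, [59, 30, 26])
```

Because no single tree sees all of the features, an individual tree may be unstable, but the vote across the forest smooths out such instability.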
Rule-based classifiers. A decision tree can be converted into a rule-based classifier
by formulating each possible path from the root node to each leaf as a separate rule,
but other types of algorithms can also be used for creating a rule-based classifier. Rules
can be thought of as IF-THEN statements, where the IF part of the statement often
includes multiple AND (conjunction) and OR (disjunction) conditions related to the
features that are used to predict class membership. The predicted class membership is
the THEN part of the statement. There can be several rules associated with each class,
but an excessive number of rules generated by the classifier “is usually a sign that the
learning algorithm is attempting to ‘remember’ the training set, instead of discovering
the assumptions that govern it” (Kotsiantis 2007: 253). This would be a matter of over-
fitting the training data, and one way of minimizing overfitting is to use pruning pro-
cedures, similar to what is done with decision-tree classifiers. RIPPER is a prominent
rule-based classifier that proceeds repeatedly through a series of growing and pruning
phases until it arrives at an optimal set of rules (see Kotsiantis 2007: 253–254).
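The IF-THEN structure of such rules can be made concrete with a short Python sketch. The rules, thresholds, feature names, and class labels below are hypothetical and hand-written for illustration; they are not the output of RIPPER or of any other rule-learning algorithm.

```python
def classify_by_rules(text_features, rules, default="unknown"):
    """Return the class of the first rule whose IF part holds for the text."""
    for condition, predicted_class in rules:
        if condition(text_features):
            return predicted_class
    return default

# Hypothetical rules over relative word frequencies (per 1,000 words).
rules = [
    # IF the > 55 AND of > 28 THEN class "L1-A"
    (lambda f: f["the"] > 55 and f["of"] > 28, "L1-A"),
    # IF the <= 55 AND (of > 35 OR and > 30) THEN class "L1-B"
    (lambda f: f["the"] <= 55 and (f["of"] > 35 or f["and"] > 30), "L1-B"),
]

text = {"the": 46, "of": 33, "and": 34}
predicted = classify_by_rules(text, rules)
```

A rule set that needed one rule for nearly every training text would be "remembering" the training set in the sense of the quotation above, which is what pruning is meant to prevent.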
Means-based classifiers. This is perhaps the simplest type of classifier, as it is based
simply on the mean difference between the value of each feature in a particular text
and the mean values of those same features for each class. Because different features
will usually have different ranges of values, the values are first converted into z scores,
which use a standardized scale to represent how many standard deviations above or
below the mean a particular value is. To take a simple example, if the features in ques-
tion are the relative frequencies of the words the, of, and and, then the first step is to
calculate the overall means and standard deviations for these words in the entire
training set. These overall means and standard deviations are then used as the basis
for calculating z scores for each of these words for every text and for every class
(e.g. every L1 group). Classification is carried out by determining the mean difference
between the z scores for these features in a text and the corresponding z scores for
each class, and by classifying the text as a member of the class for which it shows the
smallest mean difference. This type of classification was first introduced by Burrows
(2002) as a way of determining the authorship of a specific text, where the classes in
question are individual authors rather than groups. This method has been referred to
alternately as Delta and Burrows’ Delta (e.g. Hoover 2004a), and has proven to be
quite effective for identifying the authors of specific texts. However, under certain
circumstances, it has been found to be even more effective when it is combined with
an algorithm that ignores small differences (e.g. < 0.2 standard deviations) between
the z scores of texts and classes. This modified method is sometimes referred to as
Delta Prime (Hoover 2004b).
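The procedure described above can be sketched in Python under a few stated assumptions. The "mean difference" is taken here to be the mean of the absolute differences between z scores, the sample standard deviation is used, and the training frequencies are invented. The `ignore_below` parameter is only one possible reading of "ignoring small differences" (differences under the threshold are counted as zero) and should not be taken as a faithful implementation of Hoover's modification.

```python
from statistics import mean, stdev

def z_scores(values, means, sds):
    """Express each value as standard deviations above or below the overall mean."""
    return [(v - m) / s for v, m, s in zip(values, means, sds)]

def delta(text_z, class_z, ignore_below=0.0):
    """Mean absolute difference between a text's and a class's z scores.

    Differences smaller than ignore_below are counted as zero (an assumption
    made for this sketch).
    """
    diffs = [abs(t - c) for t, c in zip(text_z, class_z)]
    return mean(d if d >= ignore_below else 0.0 for d in diffs)

# Hypothetical relative frequencies (per 1,000 words) of "the", "of", "and".
training = {
    "L1-A": [[60, 30, 25], [62, 31, 27], [58, 29, 24]],
    "L1-B": [[45, 38, 33], [47, 40, 35], [44, 39, 32]],
}

# Step 1: overall means and standard deviations across the whole training set.
all_texts = [text for texts in training.values() for text in texts]
columns = list(zip(*all_texts))
means = [mean(col) for col in columns]
sds = [stdev(col) for col in columns]

# Step 2: z scores for each class, based on the class's mean frequencies.
class_z = {label: z_scores([mean(col) for col in zip(*texts)], means, sds)
           for label, texts in training.items()}

def classify(text, ignore_below=0.0):
    """Assign the text to the class with the smallest mean difference."""
    text_z = z_scores(text, means, sds)
    return min(class_z, key=lambda label: delta(text_z, class_z[label], ignore_below))

predicted = classify([59, 30, 26])
predicted_thresholded = classify([46, 39, 34], ignore_below=0.2)
```

Converting to z scores first means that a very frequent word such as the does not dominate the comparison simply because its raw frequencies span a wider range.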
Composite classifiers. Not all classifiers fit neatly into one of the categories just de-
scribed, and in fact some classifiers combine elements from more than one category.
For example, a Naïve Bayes Tree is constructed as a normal decision tree but then
implements Naïve Bayes classification at its leaves. Other composite classifiers com-
bine regression methods with classification procedures by giving binary values to
classes, and by creating a separate regression model for each class value. This is re-
ferred to as Classification via Regression. A variation of this is the LogitBoost classifier,
which creates a logistic model of the probability that a case belongs to one of two com-
peting classes. The LogitBoost classifier also re-samples the data adaptively so that it
prioritizes the selection of cases that are most often misclassified, which then makes
the classifier better able to account for unusual cases. Through re-sampling, Logit-
Boost creates multiple possible models of the training data. The classification of cases
in LogitBoost is carried out through the majority-vote principle, similar to what was
described in relation to the decision tree RF classifier (see e.g. Shen et al. 2007). Logit-
Boost and several other composite classifiers are referred to as meta classifiers in the
Weka toolkit (see Witten & Frank 2005).
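The idea behind Classification via Regression can be illustrated with a small Python sketch. It assumes, purely for brevity, a single feature and ordinary least-squares regression as the base method; the toolkit's own implementation may use a different regression learner. The frequencies and class labels are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def train(xs, labels):
    """Fit one regression model per class, with 1/0 targets for member/non-member."""
    models = {}
    for c in sorted(set(labels)):
        targets = [1.0 if lab == c else 0.0 for lab in labels]
        models[c] = fit_line(xs, targets)
    return models

def classify(models, x):
    """Predict the class whose regression model gives the highest output."""
    return max(models, key=lambda c: models[c][0] + models[c][1] * x)

# Hypothetical relative frequency of "the" per 1,000 words in six training texts.
xs = [60, 62, 58, 45, 47, 44]
labels = ["L1-A", "L1-A", "L1-A", "L1-B", "L1-B", "L1-B"]

models = train(xs, labels)
predicted = classify(models, 59)
```

Giving each class its own binary target is what allows a method designed to predict numerical values to be used for assigning class membership.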
Learners and users – Who do we
want corpus data from?
Anna Mauranen
Learner corpora and lingua franca corpora differ in important ways in
social and interactional aspects. Yet in the cognitive domain of language
processing they have much in common, as reflected in lexicogrammatical
and phraseological features. They can therefore be seen as complementary
takes on second language research. We can expect advanced second/foreign
language learners to show similar linguistic features to lingua franca speakers,
and supporting evidence is accumulating. This paper suggests that although
some features in an English as a lingua franca (ELF) corpus can be explained
on cognitive principles similar to those likely to operate in learners, such as
economy of effort, others cannot. For instance the common use of not quite
native-like phraseological units requires a use-based rather than learning-based
explanation. On the whole, the major differences between learner and ELF
corpora make it necessary to keep them separate. At the same time, both can
contribute results of considerable mutual interest.
1. Introduction
In a world that is increasingly globalised, learning and using second and foreign lan-
guages has become everyday reality for a growing number of people. Bi- or multilin-
gualism has always been a normal feature of human life, but powerful linguistic theo-
ries were erected in the last century on the idea that the ordinary speaker is monolingual.
This speaker was assumed to possess a virtually infallible intuition about the gram-
maticality of his or her own language, and theoretically modelling such an intuitive
competence would involve just a little extra step of abstraction and idealization. Well,
reality has struck back and we have accepted not only the fallibility of the native speak-
er, but also their multilingualism. The majority of people have some knowledge of
more than one language.
The new realism in linguistics has shifted our interest increasingly towards
analysing the ways in which languages are used, and building models on the basis of
large databases of actual usage. Corpora soon proved to be not merely convenient
repositories of authentic examples for grammarians and lexicographers to draw on but
powerful instruments for discovering new facets of language. Collecting corpora start-
ed with native speakers, and this already revolutionised second/foreign-language
teaching: instead of trusting the native’s intuition, the learner could now consult the
natives’ attested language practices by accessing corpora. Tim Johns’s “Data-driven
learning” (e.g. Johns 1991, 1994) was an innovative application to language learning of
the insights that John Sinclair’s Cobuild project had uncovered in lexical patterning
(e.g. Sinclair 1987, 1991).
The route of corpus-based research from one application (lexicography) to revela-
tions of theoretical significance (phraseological patterning in language) and from there
to a different application (language learning) showed in an intriguing way how closely
intertwined practical and theoretical interests can be. In the same spirit of merging
practical and theoretical interests, another step was not a long way off, which put cor-
pora and learners together in yet a new configuration: language learners themselves
became sources of linguistic data. Second/foreign-language corpora, or computer
learner corpora (CLC), began to make their way to the corpus world and even make
inroads into the Second Language Acquisition (SLA) research world in the mid-nine-
ties. Sylviane Granger was the primus motor in this with her now world-famous Inter-
national Corpus of Learner English (ICLE; see Granger et al. 2009) of English as a
Foreign Language (EFL) essays, which has subsequently been followed by other kinds
of learner corpora, both in Louvain and elsewhere: the Polish and English Language
Corpora for Research and Applications (PELCRA) corpus, the Japanese EFL Learner
corpus (JEFLL), and the Hong Kong TELEC Secondary Learner Corpus (TSLC)
among others. The remarkable applicability of corpus findings to language learning
was testified by a lexicographic extension (Macmillan English Dictionary 2007), which
made use of learner corpus results by Granger’s team (see e.g. Gilquin et al. 2007).
Learner corpora have made a difference to language research on two fronts: one is
the break away from corpora of native speakers only. There is no reason to assume that
the only speaker groups of linguistic interest should be native speakers. How languag-
es are learned, whether first or later languages, is a central theoretical question in lin-
guistic enquiry. The other direction where learner corpora are beginning to make an
impact is the field of second language acquisition. SLA has traditionally been domi-
nated by an experimental approach, and thereby necessarily small-scale studies. With
sizable corpora of learner language, general patterns of L2 can be found (assuming
‘second language’, L2, as a cover term here for foreign, second, third, nth and any non-
first language), and importantly, corpus data can provide a powerful source in investi-
gating the influence of first language transfer, along with other kinds of data (see, for
example, Granger et al. 2002; Jarvis & Pavlenko 2008). Evidence of both has been
found in the many studies that learner corpora have given rise to all over the world.
What is the next step in this chain of corpus linguistic development? It seems to
me that non-native users in their own right are the group that we need corpus data
from. Corpora of English as a Lingua Franca (ELF) are already in existence: English as
a Lingua Franca in Academic Settings (ELFA) in Helsinki, and the Vienna-Oxford International
Corpus of English (VOICE) in Vienna. These corpora are not confined to the
language classroom, but constitute the new avenue that is now open to the research
community: authentic language of second language speakers in multilingual environ-
ments where the language is naturally spoken as a lingua franca. If learner corpora
liberated L2 learners from the laboratory, corpora of lingua franca and second lan-
guage use liberate L2 speakers from the confines of the classroom.
2. How are learner and L2 user corpora different?
When people use English as a lingua franca, that is, a contact language between speak-
ers who do not share a first language, they are L2 users but not learners. We can draw
a line between second language acquisition (SLA) and second language use (SLU) and
look at its consequences on corpus compilation and interpretation. To appreciate the
contrast between learner and L2 user corpora, it is useful to sketch out basic dissimi-
larities between L2 learners and L2 users. At the same time, we should not lose sight of
the fact that they also have very much in common. Although there is reason to believe
that fundamental processes in language use must be essentially the same for a speaker’s
languages whether they are an early bilingual’s languages, an initial monolingual’s first
and later languages, or a plurilingual’s complex mixture of language resources, it is also
reasonable to assume that there is bifurcation at stages closer to the ‘surface’ of the
actual reception and production processes among languages differentially entrenched
in speakers’ repertoires – such as their first and other languages. Thus, even though
processes such as memory storage and retrieval are likely to be basically similar in
terms of neural pathway formation, information chunking, or simultaneous process-
ing and monitoring at different levels, things like ease and speed of retrieval, access to
alternative expressions, and mapping linguistic and social repertoires effectively onto
each other are probably rather different in a speaker’s different languages. The differ-
ences between first and second languages are in many respects shared by second lan-
guage learners and users, and therefore any research findings from either group are of
interest to those studying the other. However, what I want to do at this point is to
throw light on matters where these two groups differ.
The differences are perhaps best illuminated by reference to certain social, cogni-
tive, and interactive parameters.
In social terms, an immediately obvious difference is that ELF speakers do not
share a cultural background or a first language: by definition a lingua franca means a
vehicular language between speakers who do not share a first language (L1). In most
classrooms around the world, especially where English is learned as a foreign lan-
guage, students share an L1. As many textbooks and other pedagogical materials tes-
tify, much pedagogical effort rests on the assumption that learners who share a mother
tongue will have similar problems and are therefore offered similar remedies. Mixed
classrooms also exist, of course, and in those, English can appear as a lingua franca
along with its role as the object of study, but these may be more typical of
English-speaking countries than of the many non-English-speaking countries where students
who share a language are taught together. In an environment of shared linguistic and
cultural assumptions the social orientation to the new language is also shared, and along with it,
cultural identities and expectations relative to target language speakers.
For learners, the principal English-speaking countries constitute ‘target cultures’,
which can be seen against their own cultural background for comparison, contrast,
and for models of social appropriateness. In a lingua franca environment, the scene is
quite different, because the vehicular language is chosen as a matter of convenience or
necessity, and interlocutors may have very little idea of each other’s cultural back-
grounds or familiarity with Anglo-American cultures. English-speaking cultures with
their conventions may be far from appropriate in situations where communicative ef-
fectiveness hinges upon dealing with various cultures and cultural mixes as they come
up in particular situations or tasks. Often the target is simply an international or glob-
al audience. But a definable, let alone national, target culture is an irrelevant concept.
A classroom is a social environment of a particular kind. It imposes particular
social positions on learners that do not hold outside its own context. This matters
because the learner position overrules other social parameters in a classroom setting,
whereas outside the classroom other social parameters relevant to positioning people
override the learner status. A learner position is one from which educational and class-
room targets are viewed as relevant, and they regulate the norms of interaction in ev-
ery respect. This is particularly obvious in giving and receiving feedback, providing
and following models of behaviour and practices of assessing performance. Out of the
bounds of educational settings interactional parameters do not follow classroom rules
– there are even pedagogical genres that are specific to educational settings only, such
as particular question-answer sequence types, fill-in exercises, or ‘composition’. In
principle it is possible to transcend such borderlines at times; they are negotiable to an
extent, and we can make a learner position relevant in an ordinary encounter outside
the classroom by for example asking our interlocutors about the correctness of our
language. We can also invoke learners’ other identities such as professional or gender
identities in class. But it seems these borderlines are not often transcended, and even
in native/non-native situations where a non-native speaker is assuming a learner posi-
tion, native speakers tend to orient to them as speakers, not correcting their language
but orienting to the contents of what is being said (see e.g. Kurhila 2003).
It has often been pointed out that people can alternate in these roles – in the class-
room they are learners, but as soon as they get outside it they may turn into users of the
same language. Therefore, the argument continues, the identities are inseparable be-
cause the people are the same. However, we do not have single or simple identities, but
assume them situationally as is relevant, foregrounding and backgrounding our differ-
ent identities and their elements, and drawing on them as the need arises in response
to a social environment. It is important to be sensitive to the situational demands on
social identity also as an analytical principle: when people enter an educational context
as language learners, their position shifts from that of a user to whom a given language
is the relevant means of communication. For example I am writing this in Denmark
and despite my virtually nonexistent Danish do not have any inclination to position
myself as a learner of Danish outside my “Danish for beginners” class. To get by, I use
Swedish, English, or rudimentary Danish according to how I judge the demands of the
situation, but rarely if at all invoke a learner identity outside the classroom.
A “learner” identity can also be seen as reductive and limiting, as has been pointed out
by Firth & Wagner (1997), who criticize the SLA research paradigm for doing just that:
learners are seen as deficient communicators, and their output as a struggle with diffi-
culties. The target set for them is an idealised native speaker, which is beyond reach for
learners, given that ideal models just do not fit in with the contingencies of reality. Firth
& Wagner called for a broadening of the basis of SLA studies to embrace the everyday
use of a second language outside classroom settings, and the inclusion of L2–L2 com-
munication. ELF research has done this, expanding the perspective for L2 research.
There is also the wider issue of the potential influence of these different kinds of L2
– learner or user – on the English language. Learner language cannot influence the
target language by definition, because it orients to acquiring the native norm; learners
get corrected for their errors, and because they are learners, they will accept the correc-
tion as far as they can, and the target language remains intact. ELF is used to achieve
communication in international environments, and it does not have a ‘target language’
but is an ‘instrument language’. The forms that ELF assumes may not be very influential
in fleeting encounters between strangers at airports, but it constitutes the working lan-
guage of many more permanent and important communities of practice in business,
academia, research, and so on. In the absence of linguistic authority other than com-
municative efficiency in a community of practice, group norms evolve without the ex-
ternal intervention of the standard language norms that guide learners and teachers.
Speakers mediating norms may even deliberately appeal to practices that are interna-
tional but not in accordance with British or American norms (Hynninen 2011). Since
ELF use is more widespread than the use of English in communities where it is a
native language (ENL), and since many ELF-using communities command high international
prestige in, for instance, multinational companies, international politics and
science, they hold the key to the future of English.
Changes in English as a result of its increasing use as a global lingua franca are
likely to arise from common but complex units such as phraseological sequences
which consist of structural and lexical elements and which have variable as well as in-
variant parts. ELF speech shows many kinds of phraseological sequences. As pointed
out in the previous paragraph, ELF speech has the potential to influence English by its
new developments, most likely when the same features appear independently and re-
peatedly in different places. As the interactively co-constructed group norm evolves,
there is no intervening external authority to correct ELF speakers, but whatever works
well is likely to be strengthened and diffused. Phraseological sequences may be a point
where ELF begins to impact English more widely, because they seem to serve their
communicative purposes quite well without being target-like in ENL terms. An ex-
ample of a phraseological frame with such potential is -ly speaking. It is a partly flexible
unit, where the adverb ending can in principle be attached to any adjective. In practice
this possibility is constrained by conventional preference, so that although the unit is
productive, it is also restricted. I looked at the Michigan Corpus of Academic Spoken
English (MICASE) (http://lw.lsa.umich.edu/eli/micase/index.htm), more precisely, its
ENL parts, since it is the most closely comparable ENL corpus to ELFA. In MICASE,
the -ly speaking frame appears fifteen times in the whole 1.8-million-word corpus (i.e. 8.3
per million words), and the overwhelmingly preferred expression is generally speaking
with six occurrences. Strictly speaking appears twice, and all other cases just once
(Table 1).
With fifteen occurrences in MICASE, the expected number in the first part of
ELFA (ELFA(i), the first 0.6 million words of the database which was finished earlier
than the rest) is five, assuming that the rate of occurrence is the same. However, the
actual number of occurrences is 19 (Table 2), i.e. 31.7 per million words. In SLA terms,
we might want to speak of ‘overuse’ of the frame in ELFA. For ELF, this is hardly a
relevant characterisation. Clearly, the frame is salient and it is being utilised as a con-
veniently productive frame.
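The normalisation behind these figures can be retraced in a few lines of Python. The sketch is offered only as a check on the arithmetic; the function and variable names are illustrative, while the counts and corpus sizes are those reported above.

```python
def per_million(count, corpus_words):
    """Normalise an absolute frequency to occurrences per million words."""
    return count / corpus_words * 1_000_000

micase_count, micase_words = 15, 1_800_000   # -ly speaking in MICASE
elfa_count, elfa_words = 19, 600_000         # -ly speaking in ELFA(i)

micase_rate = per_million(micase_count, micase_words)
elfa_rate = per_million(elfa_count, elfa_words)

# Occurrences expected in ELFA(i) if the frame were as frequent as in MICASE.
expected_in_elfa = micase_count * elfa_words / micase_words

# How many times more common the frame is in ELFA(i), proportionally.
ratio = elfa_rate / micase_rate

print(f"MICASE: {micase_rate:.1f} per million words")
print(f"ELFA(i): {elfa_rate:.1f} per million words")
print(f"Expected in ELFA(i): {expected_in_elfa:.0f}, observed: {elfa_count}")
print(f"Ratio: {ratio:.1f}")
```

Normalising to a common base is what makes the two corpora comparable despite the threefold difference in their size.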
Thus in ELFA the frame is proportionally almost four times as common as in MICASE,
with more occurrences even in absolute terms. In spite of this, the most frequent
item (historically speaking) occurred only three times, followed by four others,
each with two instances (basically/formally/frankly/generally speaking). Thus, we can
detect in ELF what in SLA terms would be both ‘overuse’ (of the frame) and ‘underuse’
(of the preferred item), together with total absence of the second-ranking item in ENL
(strictly speaking). The more general implication, however, is that such shifts in
ELF preferences affect English usage. If we take ‘English usage’ to refer to all of the
English being used in the world, this must be the case. Even if native speakers should
maintain their previous usage, they are in a minority, and the breaking down of
conventionally preferred forms affects the relative frequencies of English if all its global
use is taken into account. Frequency patterns in turn affect anyone using the language.

Table 1. Partly flexible phraseological frame: -ly speaking in MICASE

Expression                 Abs. frequency
generally speaking          6
strictly speaking           2
morally speaking            1
objectively speaking        1
properly speaking           1
relatively speaking         1
roughly speaking            1
simply speaking             1
stylistically speaking      1
Total                      15

Table 2. Partly flexible phraseological frame: -ly speaking in ELFA(i)

Expression                 Abs. frequency
historically speaking       3
basically speaking          2
formally speaking           2
frankly speaking            2
generally speaking          2
comfortably speaking        1
honestly speaking           1
largely speaking            1
legally speaking            1
linguistically speaking     1
realistically speaking      1
relatively speaking         1
seriously speaking          1
Total                      19

The expression -ly speaking was partly variable to begin with, and it is possible that
L2 speakers perceive such expressions as even more freely variable. Yet even fully fixed
expressions can become subject to similar fracturing: in the first half of the ELFA cor-
pus, the form as the matter of fact occurred more often than the ENL as a matter of fact.
ELF generates its own patterning on the basis of standard varieties of English, and
gradually makes inroads into its use.
In cognitive terms, ELF speakers as L2 users do not orient to their linguistic envi-
ronment as a setting for language learning, but focus their efforts on making sense and
making themselves understood. Many ELF scholars have noted a strong orientation to
content over form in ELF discourse (e.g. Karhukorpi 2006; Ehrenreich 2009), and so
have researchers working on authentic L1–L2 interaction in real-life conversations
(Kurhila 2003). In contrast, the cognitive orientation of learners is far more towards
language form. This is a consequence of the pedagogical setting, where the immediate
focus is necessarily on learning grammar, vocabulary, phonology, and phraseology in
the new language. Other aspects of language, such as textual organisation, style, regis-
ter, and pragmatics come in as well, but the principle remains the same. Feedback and
evaluation are based on mastering elements of the language, and it is hard to imagine
an educational setting where this would not be so. Long-term objectives of SLA cur-
ricula are typically defined in real-life communicative terms, but those objectives
remain outside the classroom context. They can be simulated in the classroom, but not
performed there. Thus they cannot be assessed in class for their success or effective-
ness in achieving their goals outside it. Communicative simulations may be pedagogi-
cally useful, as Widdowson often points out (e.g. Widdowson 2000), even realistic and
meaningful, but not authentic in its basic sense of being real (see also Mauranen 2004).
Success in SLA and SLU contexts thus depends on different criteria, and the situated
cognitive orientations in SLA and SLU diverge in consequence.
The particular demands on ELF speakers are often exacerbated by the unpredict-
ably varying language parameters they need to cope with: their interlocutors’ accents,
transfer features, and proficiency levels. The reality for classroom learners may also be
more varied in multilingual circumstances than in shared-L1 classes, but the variation
becomes familiar in the classroom, as speakers get used to each other’s ways of speak-
ing (see Smit 2010). Reduced predictability is therefore basically an SLU feature that
affects cognition as well as interaction.
From an interactional perspective, users of ELF typically find themselves in situa-
tions where discourse norms are not clear or given. Terms of appropriate interaction
must be negotiated by participants. In contrast, while learners may not master the
discourse conventions of the target culture, they orient to modelling their behaviour
on those, and take the lead from native speakers. Native speaker authority and supe-
rior expertise are axiomatic for learners of foreign languages, and therefore native
preferences at all levels of language use are to be emulated for improved proficiency. In
a lingua franca context, linguistic authority is not given, and participants are not seek-
ing to learn the language from each other. As mutual understanding is constructed,
any linguistic solutions that serve the purpose may be adopted by common consent –
the best solutions need not be the most standard-like or native-like (see Hülmbauer
2009). Things that are ruled out in SLA classrooms, like language mixing, can be effec-
tive strategies in ELF communication (see e.g. Klimpfinger 2009). In this way, there is
far greater symmetry in the interaction of lingua franca users, in addition to the open-
ness and negotiability of discourse conventions.
As ELF speakers orient to mutual comprehensibility, they engage in interactive
strategies in support of this, such as enhanced explicitness (Mauranen 2007). An im-
portant aspect of ELF communication is that speakers seem to be prepared for the
possibility of misunderstanding (Mauranen 2006a), and take steps to pre-empt that,
which in effect results in few misunderstandings (Mauranen 2006a; Kaur 2009). Clear-
ly, learners attend to comprehensibility as well in certain types of cooperative com-
munication tasks, and again this is not a categorical divide between learners and users,
but rather a difference in emphasis: for users it is a constant determiner of behaviour,
whereas for learners it is less vital, even if desirable, as the classroom safety net will
prevent major disasters ensuing from communication breakdown.
The above distinctions are reflected in compiling corpora of learner English and
ELF. For example the ELFA corpus of English as a Lingua Franca in Academic Settings
(www.eng.helsinki.fi/elfa/elfacorpus), which was the first ELF corpus (finished 2008),
differs radically from learner corpora at the outset because it has deliberately avoided
collecting any data from learners of English. In other words, it has not recorded any
data from classes where English would be the object of study.
This choice was prompted by the general considerations just discussed, and also in
view of more directly corpus-related issues, among which two major factors separate
learner and ELF corpora. One is speaker proficiency and the other is participants’
mother tongue. Learner corpora are compiled following the principle familiar from
SLA data of controlling for learner proficiency as well as possible. While this cannot in
practice be too narrowly defined, as students even in the same classrooms or in com-
parable stages in their studies vary in their proficiency levels (see e.g. Granger et al.
2009), the criteria are set with awareness of proficiency levels in mind. Either all data
is gathered from the same proficiency level or the data is organised in terms of devel-
opmental stages. In an ELF corpus this would be an untenable solution, because it is in
the nature of lingua franca communication that speakers’ proficiencies vary, some-
times considerably. This is one of the unpredictable factors in a lingua franca environ-
ment. There is not only a straightforward variability of level, such as obtains between
stronger and weaker students, but a huge diversity of previous learning environments
and earlier experiences of English. The earlier language experiences that usually count
in learner corpora relate to time spent in English-speaking countries, but in the case of
ELF, earlier experience often comes from non-English speaking countries, which is
increasingly typical among internationally mobile university students. Thus any idea
of an even tolerably unidimensional scale of proficiency is alien to ELF and should not
be attempted in an ELF corpus.
The second corpus compilation principle where SLA and SLU part company is in
terms of speakers’ first language. It makes sense for a learner corpus to keep first lan-
guages separated – either in different corpora or in separate sections or subcorpora.
Moreover, each L1 background needs to be represented in a sufficient and similar way
to enable comparisons. For practical as well as theoretical reasons, any research on
learner corpora is interested in the effects of a particular L1 on the target L2, even
though this is not the only kind of research such corpora lend themselves to (see e.g.
Ädel 2008). This is clearly useful for teaching applications, and there is continued
interest in such language-specific information, as testified by the popularity of
guidebooks for teachers that contain typical learner errors from different L1 backgrounds
along with their preferred target forms in ENL (e.g. Swan’s hugely popular, repeatedly
reprinted and re-edited guidebook, 2005). Textbooks of this kind would certainly benefit from
the systematic corpus-based studies of learner language now increasingly available
(see e.g. Nesselhauf 2004, 2005), and it is to be hoped that the evidence corpora can
provide on the distribution of error and problem types according to learners’ first
languages will find its way into teaching and learning materials on a scale comparable to
that of evidence from ENL corpora. However, pedagogical applicability does not exhaust the
potential of L1-based divisions in learner corpora; by keeping L1s clearly separate a learner corpus
also lends itself to testing more theoretical predictions about the errors of learners with
a particular language background (e.g. Ringbom 1992, 2007; papers in Granger 1998).
It is therefore important to capitalise on the deeper understanding of the deviations
from target use that can be derived from learner corpora.
It makes sense, then, to maintain dividing lines according to language background
in learner corpora – and they need not be limited to first languages, but can incorpo-
rate bilingual backgrounds where English is a third language, and so on. Insofar as
sufficiently large groups with similar language backgrounds can be found, the case for
keeping them in separate subcorpora can be made. But this is not the path for ELF. If
we wish to investigate a lingua franca, we need to focus on environments where it is
typically used, and gather data from those situations. This may mean a proliferation of
first languages in unpredictable mixes. So, for example, ELFA has speakers of 51 different
first languages in a million words of speech (as opposed to the 16 carefully
controlled ones in the second version of ICLE), which appear in their authentic mixes,
and in different quantities. It is hardly possible to even try to control for language
backgrounds so as to gather equal amounts from all languages represented, but just as
in learner corpora, it is important to keep track of the L1s. In ELF, the focus is on
ensuring a good mix so as not to get excessive dominance from one language group. Such
dominance might skew findings towards L1 transfer features from that group – and as learner
corpora among other evidence (see e.g. Jarvis & Pavlenko 2008) tell us, transfer is a
ubiquitous feature of learner language and therefore extremely likely to surface in SLU.
Dominance of one L1 group or closely related L1s, say, Nordic languages, may also af-
fect the propensity to code-switch, to rely on shared cultural knowledge, and other
interactive strategies. ELFA has therefore made a special effort to keep the majority
language of the matrix culture, Finnish, to a reasonable proportion, and at just over a
quarter of the data it is a good achievement (see Mauranen et al. 2010).
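The kind of balance check described above can be made concrete with a minimal sketch. It assumes that word counts per first language are available from the corpus metadata; the function names, the 30% ceiling and the figures in the toy dictionary are all illustrative assumptions, not values or procedures taken from ELFA or any other corpus.

```python
def l1_shares(words_by_l1):
    """Return (L1, share of all words) pairs, largest share first."""
    total = sum(words_by_l1.values())
    if total == 0:
        raise ValueError("corpus contains no words")
    shares = [(l1, n / total) for l1, n in words_by_l1.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


def over_represented(words_by_l1, ceiling=0.30):
    """List the L1s whose share of the corpus exceeds the chosen ceiling."""
    return [l1 for l1, share in l1_shares(words_by_l1) if share > ceiling]


# Invented word counts per first language (hypothetical, not from a real corpus).
toy_corpus = {"L1_A": 450_000, "L1_B": 250_000, "L1_C": 200_000, "L1_D": 100_000}

for l1, share in l1_shares(toy_corpus):
    print(f"{l1}: {share:.1%}")
print("Above ceiling:", over_represented(toy_corpus))
```

Where the ceiling is set is a matter of judgement for the compilers. The sketch only shows that the proportion of each L1 can be monitored while the data is being collected, so that dominance of one group is noticed early.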
What about native speakers and ELF? ENL speakers also find themselves every
now and then in situations where English is a lingua franca. While this is true, it is not
of central relevance to ELF use. The VOICE corpus (http://voice.univie.ac.at/) has
drawn the line between ELF and native/non-native situations where a non-native
majority begins: dyadic conversations between ENL and non-native speakers do not
count as lingua franca use. While there is a certain arbitrariness in this, it is a workable
solution to a dilemma that otherwise might linger on forever. ELFA has included ENL
speech as part of polylogic conversations, and in all, ENL talk amounts to about 5% of
the corpus.
In all, learner and ELF corpora have fundamental differences that require keeping
them clearly separate. The main distinction boils down to language as an object of
study vs. language as a means of achieving particular objectives in real environments.
In spite of this, or perhaps because the two strands of non-native corpus studies differ
in their very starting points, they can be of use to each other in important ways. While
learner corpora can inform ELF study about what linguistic deviations from ENL are
common in SLA and thus likely to have learning-based explanations if found in ELF,
ELF can in return enlighten SLA research by showing how L2 actually works in
ordinary life outside educational contexts and what might be worth focusing on in
teaching. Together, learner and ELF language research can get deeper into the nature
of languages other than the first.
3. How are learner and L2 user corpora similar?
The most obvious affinity between learner and lingua franca corpora is that both
collect data from speakers using a non-native language. Their social background
and language identity are not those of a native speaker on account of their primary
socialisation in other environments. Cultural conventions known widely among
ENL speakers are largely unfamiliar to them. Such social and cultural similarities
are perhaps the most obvious but at the same time somewhat trivial, because the
differences between learners and users in sociocultural identity and position, dis-
cussed in the previous section, are so fundamental. Where cultural aspects most
probably converge in their influence on ELF speakers as well as learners is around
problems both groups experience with ENL speech. These are above all allusions to
aspects of major ENL cultures that their members are likely to share but non-members
far less so, and especially certain culture-specific linguistic expressions,
particularly what Seidlhofer (2002) has termed “unilateral idiomaticity” – that is,
one speaker using idiomatic expressions the interlocutor might not know. These
phenomena may be even more problematic to learners, whose relation to ‘target
culture’ and ‘target language’ goes beyond the needs of effective communication on
the spot.
Where the similarity of learners and L2 users is the most relevant is in the domain
of cognitive processes. They all use a linguistic repertoire where items are inclined to
be less deeply entrenched than in an L1 repertoire, and where different stored systems
are likely to compete (see e.g. Riionheimo 2009). It is here that we can expect most
fruitful cross-fertilization between the two kinds of corpora.
If we try to understand how second languages differ from first languages linguisti-
cally, and the kinds of changes that languages undergo in the hands of L2 speakers,
learner and user corpora are both important data sources. If we look at some of the
most typical examples of non-standard lexicogrammatical features from ELFA, the
similarities are immediately clear.
One commonly occurring non-standard feature is article use. Articles in ELF are
often missing (it was absolutely in spirit of the time), superfluous (not in a principle) or
just different from those expected in Standard English (I have written down here a
word chronic liver diseases). Likewise, prepositions are very often used in non-standard
ways (discuss about; obsession in; we’re dealing what is science; on this stage). Similar
departures from Standard English article and preposition use have been found in
learner English (e.g. Jarvis & Odlin 2000), with article use remaining shaky even in
near-native speakers (Ringbom 1993).
In morphology, two tendencies can be distinguished: regularisation of irregular
forms (teached) and overproductive or nonstandard morphology (interpretate;
maximalise; introducted). The tendency to regularise is probably also what underpins shifts
from uncountable to countable nouns (offsprings). Morphology tends to be overpro-
ductive in its possibilities in natural languages. While convention and acquired prefer-
ence keep it in check in native language communities, its possibilities are liberally uti-
lised by non-natives. This has also been observed in learners, and referred to as
‘overgeneralisation’ (e.g. Master 1997) or ‘elaborative simplification’ (Meisel 1980; see
also Winford 2003). ELF speakers also resort to well-known word formation practices
like ‘back-formation’, which leads to forms like interpretate. Syntactically, lack of
concord or agreement is usual (each sciences; the main ideas is), as are, for example,
non-standard word order in interrogative clauses, or embedded inversions (Ranta 2010,
forthcoming) and the very high frequency of the -ing form of the verb, especially in its
progressive use (Ranta 2006). Thus, while some of these processes can easily be seen as
facets of simplification (like regularisation), they cannot all be comfortably thrown
into that category (like morphological overproductivity or lack of concord), at least,
unless we adopt simplification as an overarching cover term in Meisel’s (1980) way,
and add modifiers like ‘elaborative’ to keep it alive.
The similarity of the above examples of recurrent lexicogrammatical ELF features
to what are regarded as typical learner errors is clear. It does not seem too far-fetched
to suggest that such features result from speaking a second language, and the weak or
unstable entrenchment of lexicogrammar in the L2, perhaps involving online choices
from competing systems. Such cognitive aspects of dealing with an L2 should cover
much of the common ground between learners and users.
A similar explanation would seem plausible in the case of phraseological units,
which are notorious in SLA for presenting difficulties for learners (e.g. Nattinger &
DeCarrico 1992; see also Seidlhofer 2002; Mauranen 2006b). While the SLU angle
departs from the ‘difficulty’ or ‘error’ conceptualization of such expressions, the lin-
guistic phenomenon in itself is the same. Examples of approximations to ENL phrase-
ology abound in ELFA (to put the end on it, take closer look to the world, on the end, the
hen or the egg...). Many of the phraseological units have a different preposition or ar-
ticle from the ENL phrase, but also lexical and structural substitutions occur. These are
not dissimilar to the findings that have been made in learner language studies
(e.g. papers in Schmitt 2004), notably in the corpora compiled at the Centre for Eng-
lish Corpus Linguistics (CECL) in Louvain (e.g. papers in Meunier & Granger 2008).
What is worth noticing about these units is not only the now generally recognised fact
that second-language speakers tend to get them slightly wrong even at high levels of
proficiency, but perhaps more importantly that L2 speakers use them with great fre-
quency. The fact that people use them outside educational environments is important
for efforts to understand their significance to L2 users. Educational settings may re-
ward learners for their use (and penalise them for getting them slightly wrong), where-
as in second language use such as lingua franca communication their use must be
explained by other means. In other words, we need use-based explanations along with
learning-based ones.
As pointed out by Wray (2002) among others, schematic units reduce a speaker’s
processing load because they are relatively predictable and make processing faster.
Therefore when we speak second languages, it would seem a good strategy to try to
resort to those just as we do in our first languages. Such units may be less readily avail-
able in an L2, at least in their accurate form, than they are in a well-entrenched first
language where at least monolinguals face no competition from other stored systems,
but since they are useful building blocks, their approximate forms may work just as
well for the purposes of facilitating communication. In this, learner language and ELF
should be essentially similar.
Some distributional patterns reveal other tendencies that can also easily be related
to SLA findings. One example is the distribution of ‘announced self-repairs’ studied by
Marx & Swales (2005) in the MICASE corpus. By these they meant
phrases that a speaker might use when he or she wanted to tell the interlocutors
that an attempt to fix a speech mistake, clarify an idea, or rephrase an ambiguous
utterance was coming up. (Marx & Swales 2005)
On inspection, it turned out that these are very different in MICASE and ELFA: in
sheer quantitative terms, there was strikingly more announced self-rephrasing in
ELFA (Mauranen 2007). As to the actual expressions, the favourite ELF items were not
the same as those in ENL. The overwhelmingly most common ELF way of announcing
a self-rephrase was I mean, while the corresponding top preference fell on in other
words in the ENL material. I mean was the second most common in MICASE, but
nowhere near in other words, which was more than four times as frequent. None of the
other expressions found by Marx & Swales were much repeated in the ELFA data, and
some did not appear at all. This is in line with what is often found in learner data:
learners use a small variety of expressions for a particular function, but the ones they
use are extremely frequent (see e.g. Altenberg & Granger 2002). This is of course an
economical strategy for any L2 user: ‘make good use of the items you know’. Lingua
franca communication is akin to learner strategies in this respect. Resources have to go
far, so speakers economise on their cognitive effort. One meaning for one form, or
isomorphism, is what Winford (2003) suggests as a central principle in SLA. It would
indeed seem like an economical strategy to hang on to one form (I mean) for one
sense, rather than learn several (in other words, that is to say, etc), and this goes for
learners and L2 users alike. The explanation seems to fit this instance fairly well, but it
is hardly likely to suffice as the overarching principle of L2 use.
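The comparison just described, of how often each announcing expression occurs and how strongly one expression dominates, can be sketched in a few lines. This is only an illustration under stated assumptions: the marker list contains just the three expressions named in the text, the transcript is invented, and normalising per 10,000 words is a common convention rather than the procedure reported in the studies cited. The count is also purely formal, so it will include uses of I mean that do not announce a self-repair.

```python
import re
from collections import Counter

# The three expressions named in the text; a real study would use a longer list.
MARKERS = ("i mean", "in other words", "that is to say")


def marker_counts(transcript, markers=MARKERS):
    """Count whole-phrase, case-insensitive occurrences of each marker."""
    lowered = transcript.lower()
    counts = Counter()
    for marker in markers:
        pattern = r"\b" + re.escape(marker) + r"\b"
        counts[marker] = len(re.findall(pattern, lowered))
    return counts


def per_10k(count, corpus_size):
    """Normalise a raw count to occurrences per 10,000 words."""
    return 10_000 * count / corpus_size


def top_marker_share(counts):
    """Return the most frequent marker and its share of all marker tokens."""
    total = sum(counts.values())
    if total == 0:
        return None
    marker, count = counts.most_common(1)[0]
    return marker, count / total


# Invented snippet (hypothetical, not taken from ELFA or MICASE).
toy = "I mean it is hard. In other words, yes. So, I mean, we try, I mean, again."
counts = marker_counts(toy)
print(counts)
print(top_marker_share(counts))
print(per_10k(counts["i mean"], corpus_size=len(toy.split())))
```

Normalised rates allow corpora of different sizes to be compared, while the share of the top marker gives a rough indication of how far speakers concentrate on a single form for the function.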
We already saw above cases that would not seem to be readily explained in this
way: in many cases ELF introduced new variability into ENL preferences, and the
frame -ly speaking showed little preferential patterning. And even if I mean was the
clearly preferred expression, it was by no means the only one used for the ‘announced
self-repair’ function. The finding also presents a further question: why this expression?
Why I mean and not in other words, which in comparable ENL circumstances is the
most frequent? The preferred ELF expression is not the most bookish one either, as one
might quite reasonably expect in an academic, text-dominated environment, but one
which is typical of everyday speech. This would seem to point to spontaneous acquisi-
tion in social interaction rather than classroom learning or to a strong written lan-
guage bias. Interestingly, there is also evidence suggesting that L2 learners tend to
overuse expressions from informal spoken mode even in their writing, and following
a primarily academic education (see e.g. Gilquin & Paquot 2008). There is much left to
investigate here, and these questions certainly fall within the common ground shared by
all those who take an interest in understanding L2, whether it occurs in a learning or
use environment.
4. Conclusion
Learner language corpora such as ICLE and others in the by now impressive CECL
collection are a great step forward in studying learner performance, because they have
a wide international coverage and capture learner language in large quantities. The
corpora complement experimentally and qualitatively oriented SLA studies in a very
important way, reaching far beyond the small-scale studies typical in the field. They
have been compiled in a number of countries in a reasonably comparable way and
consist of learners’ extended products as part of their normal language studies. The
resulting corpora are not entirely without their problems (cf. e.g. Ädel 2006), but some
faults can always be found in large databases. They are definitely a major contribution
to language learning research and applications to learning and teaching.
Learner corpora also provide highly useful material for comparison with lingua
franca studies, because advanced learners can reasonably be expected to show many
similar language features to speakers who are in natural out-of-classroom situations
– and there is already evidence that they do. This paper has shown some examples
of common features with plausibly similar explanatory bases. Others again seemed to
point to more use-based than learning-based explanations. Looking at ELF in real-life
contexts gives us a view of second language in action. It is precisely the commonalities
between the learner and the user perspectives that hold the most theoretical promise:
what remains constant in different social contexts, and conversely, what is different
according to social and interactional context?
The most fruitful areas of common interest are clearly lexicogrammatical and
phraseological aspects of language, which are of course at the heart of much corpus
work. To understand what second languages are about, and what they can tell us about
human language in general, we need research into second language learning as well as
second language use. Corpora in both domains give the best opportunities for seeing
the big picture of the kind of language that we are dealing with.
In all, there are principled differences between learner and ELF corpora, and good
reasons for keeping them separate. At the same time, they share certain features which
make them yield results which are of great mutual interest.
References
Ädel, A. 2006. Metadiscourse in L1 and L2 English [Studies in Corpus Linguistics 24]. Amster-
dam: John Benjamins.
Ädel, A. 2008. Involvement features in writing: Do time and interaction trump register aware-
ness? In Linking up Contrastive and Learner Corpus Research, G. Gilquin, M.B. Díez Bedmar
& S. Papp (eds), 35–53. Amsterdam: Rodopi.
Altenberg, B. & Granger, S. 2002. The grammatical and lexical patterning of make in native and
non-native student writing. Applied Linguistics 22(2): 173–189.
Ehrenreich, S. 2009. English as a lingua franca in multinational corporations – Exploring busi-
ness communities of practice. In English as a Lingua Franca: Studies and Findings, A.
Mauranen & E. Ranta (eds), 126–151. Newcastle: Cambridge Scholars.
Firth, A. & Wagner, J. 1997. On discourse, communication, and (some) fundamental concepts
in SLA research. Modern Language Journal 81(3): 285–300.
Gilquin, G., Granger, S. & Paquot, M. 2007. Learner corpora: The missing link in EAP pedagogy.
Journal of English for Academic Purposes 6(4): 319–335.
Granger, S. (ed.). 1998. Learner English on Computer. London: Addison Wesley Longman.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (eds). 2009. The International Corpus of
Learner English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses universi-
taires de Louvain.
guage Acquisition and Foreign Language Teaching [Language Learning & Language Teach-
Hülmbauer, C. 2009. “We don’t take the right way. We just take the way that we think you will
understand” – The shifting relationship between correctness and effectiveness in ELF. In
English as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 323–347.
Newcastle: Cambridge Scholars.
Hynninen, N. 2011. The practice of ‘mediation’ in English as a lingua franca interaction. Journal
of Pragmatics. 965–977.
Jarvis, S. & Odlin, T. 2000. Morphological type, spatial reference, and language transfer. Studies
in Second Language Acquisition 22: 535–566.
Jarvis, S. & Pavlenko, A. 2008. Crosslinguistic Influence in Language and Cognition. London:
Routledge.
Johns, T. 1991. Should you be persuaded – Two examples of data-driven learning materials.
English Language Research Journal 4: 1–16.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of
data-driven learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–313.
Cambridge: CUP.
Karhukorpi, J. 2006. Negotiating Opinions in Lingua Franca E-mail Discussion Groups,
Discourse Structure, Hedges and Repair in Online Communication. Licentiate thesis,
University of Turku.
Kaur, J. 2009. Pre-empting problems of understanding in English as a Lingua Franca. In English
as a Lingua Franca: Studies and Findings, A. Mauranen & E. Ranta (eds), 107–123. New-
castle: Cambridge Scholars.
Klimpfinger, T. 2009. “She’s mixing the two languages together” – Forms and functions of code-
switching in English as a Lingua Franca. In English as a Lingua Franca: Studies and Find-
ings, A. Mauranen & E. Ranta (eds), 344–371. Newcastle: Cambridge Scholars.
Kurhila, S. 2003. Co-constructing Understanding in Second Language Conversation. Helsinki:
University of Helsinki.
Macmillan English Dictionary for Advanced Learners, 2nd edn. 2007. Basingstoke: Macmillan.
Marx, S. & Swales, J.M. 2005. Announcements of self-repair: “all i’m trying to say is, you’re un-
der an illusion”. <http://micase.elicorpora.info/micase-kibbitzers/8-announcements-of-
self-repair>
Master, P. 1997. The English article system: Acquisition, function, and pedagogy. System 25:
215–232.
Mauranen, A. 2004. Spoken corpus for an ordinary learner. In How to Use Corpora in Language
Teaching [Studies in Corpus Linguistics 12], J.McH. Sinclair (ed.), 89–105. Amsterdam:
John Benjamins.
Mauranen, A. 2006a. Signalling and preventing misunderstanding in English as lingua franca
communication. International Journal of the Sociology of Language 177: 123–150.
Mauranen, A. 2006b. Speaking the discipline. In Academic Discourse Across Disciplines, K. Hy-
land & M. Bondi (eds), 271–294. Bern: Peter Lang.
Mauranen, A. 2007. Hybrid voices: English as the Lingua Franca of academics. In Language and
Discipline Perspectives on Academic Discourse, K. Fløttum, T. Dahl & T. Kinn (eds), 244–
259. Newcastle: Cambridge Scholars.
Mauranen, A., Hynninen, N. & Ranta, E. 2010. English as an academic lingua franca: The ELFA
project. English for Specific Purposes 29(3): 183–190.
Meisel, J. 1980. Linguistic simplification. In Second Language Development: Trends and Issues, S.
Felix (ed.), 13–40. Tübingen: Gunter Narr.
Nattinger, J.R. & DeCarrico, J. 1992. Lexical Phrases and Language Teaching. Oxford: OUP.
Nesselhauf, N. 2004. Learner corpora: Learner corpora and their potential for language teach-
ing. In How to Use Corpora in Language Teaching [Studies in Corpus Linguistics 12], J.McH.
Sinclair (ed.), 125–152. Amsterdam: John Benjamins.
Ranta, E. 2006. The ‘attractive’ progressive – why use the -ing form in English as a lingua franca?
Nordic Journal of English Studies 5(2): 95–116.
Ranta, E. 2010. Models for English grammar at school? Paper given at the International ELF 3
Conference, 22–25 May 2010, University of Vienna, Austria.
Ranta, E. Forthcoming. Universals in a Universal Language? – Study into the Verb-Syntactic
Features of English as a Lingua Franca. PhD dissertation, University of Tampere.
Riionheimo, H. 2009. Interference and attrition in inflectional morphology: A theoretical
perspective. In Language Contact Meets English Dialects: Studies in Honour of Markku Filppula,
E. Penttilä & H. Paulasto (eds), 83–106. Newcastle: Cambridge Scholars.
Ringbom, H. 1992. On L1 transfer, L2 comprehension and L2 production. Language Learning
42(1): 85–112.
Ringbom, H. 1993. Near-Native Proficiency in English. Turku: English Department Publications
Åbo Akademi University.
Ringbom, H. 2007. Cross-Linguistic Similarity in Foreign Language Learning. Clevedon: Multi-
lingual Matters.
Schmitt, N. (ed.). 2004. Formulaic Sequences: Acquisition, Processing and Use [Language Learn-
ing & Language Teaching 9]. Amsterdam: John Benjamins.
Seidlhofer, B. 2002. The shape of things to come? Some basic questions about English as a Lin-
gua Franca. In Lingua Franca Communication, K. Knapp & C. Meierkord (eds), 269–302.
Frankfurt: Peter Lang.
Sinclair, J. (ed.). 1987. Looking Up. Account of the Cobuild Project in Lexical Computing. London:
Collins Cobuild.
Smit, U. 2010. English as a Lingua Franca in Higher Education. Berlin: Mouton de Gruyter.
Swan, M. 2005. Practical English Usage, 3rd edn. Oxford: OUP.
Widdowson, H. 2000. On the limitations of linguistics applied. Applied Linguistics 21(1): 3–25.
Winford, D. 2003. An Introduction to Contact Linguistics. Oxford: Blackwell.
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: CUP.
References to corpora
ELFA: <http://www.helsinki.fi/elfa>
ICLE: <http://www.uclouvain.be/cecl-icle.html>
MICASE: <http://lw.lsa.umich.edu/eli/micase/index.htm>
VOICE. 2009. The Vienna-Oxford International Corpus of English (Version 1.0 online):
<http://voice.univie.ac.at/>
Learner knowledge of phrasal verbs
A corpus-informed study
Norbert Schmitt and Stephen Redwood
This study analyses whether a group of learners’ productive and receptive
knowledge of some of the most common phrasal verbs (PVs) is related to the
frequency of those PVs. Secondly, we look at factors which may have affected the
learners’ PV knowledge. The learners completed two tests (productive, receptive)
and were also required to complete a biodata questionnaire containing questions
about age, gender and nationality, and items relating to the language instruction
they received and the incidental exposure they had to English. The analysis of
the data shows that there is a relationship between learner knowledge and PV
frequency, and that extensive reading and watching English language films and
TV programmes appear to have a positive effect on the acquisition of PVs.
1. Introduction
Phrasal verbs are one of the most productive areas of the English language (Konishi
1958, Bolinger 1971), consisting of many thousands of items (Gardner & Davies
2007), and with new ones regularly coming into use (e.g. chill out, freak out, log off/on,
max out, scroll up/down, sex up, space out). They are a key feature of both spoken and
written language, with Gardner & Davies (ibid.: 347) estimating that phrasal verbs
occur, on average, every 192 words, that is almost 2 phrasal verbs per page of written
text. Language coursebooks are now belatedly giving much more attention to these
items, and a growing number of dictionaries and other publications devoted exclu-
sively to phrasal verbs have been published in recent years, for example: Longman
Dictionary of Phrasal Verbs (Courtney 1983), The Ultimate Phrasal Verb Book
(Hart 2009), English Phrasal Verbs in Use: Advanced (McCarthy & O’Dell 2007), Dic-
tionary of English Phrasal Verbs and their Idioms (McArthur & Atkins 1990), and Col-
lins COBUILD Dictionary of Phrasal Verbs (Sinclair 2002). Despite their frequency in
spoken and written language, phrasal verbs are often perceived as ‘difficult’ by both
English as a Foreign/Second language (EFL/ESL) teachers, and their learners. There
appear to be a number of reasons for this. Much of the language that we use is both
idiomatic and formulaic and cannot be interpreted simply by looking at the individual
words (Moon 1997). Phrasal verbs as multi-word units are no exception and many are
opaque, making them difficult to decipher and understand. They often consist of a
high frequency, monosyllabic, delexicalised verb (e.g. get, give, go, make, take) and
one of a fixed number of particles (e.g. down, in, off, on, out, over, up), and the prob-
lem for learners is that these frequent and apparently simple components may come
together to form units which are specialised, emotive, and idiomatic (e.g. the situation
is really getting her down; I can’t make out what this says; don’t give up now; it was too
much to take in).
The opaque and idiomatic nature of some phrasal verbs presents obvious diffi-
culties for learners and these problems are compounded when we take into account
the significant number of phrasal verbs that are also polysemous. Sometimes there is
a degree of transparency, and a semantic link may be made between the different
senses (cf. fill in a hole, fill in a form, fill in somebody on something, fill in for some-
body). However, in other instances the connection is more tenuous (cf. put up a
fence, put up a fight, put somebody up for the night), and the meanings more difficult
to interpret.
In addition to the semantic complexity of phrasal verbs, particle movement can
also present difficulties for learners. We may think of phrasal verbs as holistic multi-
word units, but with most transitive and a number of intransitive phrasal verbs, par-
ticles may be separated from their verbs by pronouns, adverbs or noun phrases (e.g.
she put her new fur coat on; he picked her up from the station; I’ll come straight over to
see you; we tried to calm the old woman down). Learners not only have to decide
whether a phrasal verb is separable (cf. I stayed up late last night; *I stayed late up last
night) but also what it can be separated by (adverb, pronoun, short noun phrase, long
noun phrase). For example, it is acceptable to say he gave all of his vast fortune away,
but not *the rebels are putting a huge amount of resistance up. This decision is not
always based on transitivity or other grammatical considerations, but often depends
on stylistic and syntactic conventions, context, prosody and intended meaning
(see Bolinger 1971).
2. The acquisition of phrasal verbs
The widespread use of phrasal verbs means that learners need to know them, but their
semantic, syntactic, and pragmatic complexities lead to learning difficulties. So how
can researchers and teachers help learners master this important linguistic feature?
One way is to better understand the factors that lead to the learning of phrasal verbs.
SLA research has identified a wide range of factors that influence language learning
in general (see Dörnyei 2009 and Ellis 2008 for overviews), but recent research and
theorizing have highlighted exposure to the target language as the driving force of
language learning (e.g. Ellis 2003; Tomasello 2003). This exposure can come from the
naturalistic environment, or from classroom input. In both cases, frequency is an
essential factor, because all things being equal, the more frequent an item is, the more
a learner will be exposed to it. This is certainly true for vocabulary learning, where
frequency is widely accepted as one of the best predictors of whether individual words
will be known or not (Nation 2001; Schmitt 2008, 2010). However, phrasal verbs have
some important differences from individual words as we have seen above, and it is
not obvious whether frequency is such a clear predictor of their learning as it is for
individual words. If not, teachers and materials writers will have to look for other
characteristics to guide the sequencing of the phrasal verbs they wish to teach. How-
ever, if frequency in the learning environment does prove to be predictive, then prac-
titioners could tentatively assume that learners know the highest frequency phrasal
verbs from exposure, and would need to focus on teaching the somewhat less fre-
quent ones.
So in a naturalistic environment, frequency is important, because the more fre-
quently items occur, the better they are generally learned. This has been shown in a
number of studies into incidental vocabulary learning from reading, perhaps the most
important source of outside input (e.g. Horst 2005; Rott 1999). But there are a number
of other ways that learners can gain exposure to the target language, including film,
television, radio, music, and social networking sites. We do not know yet how much
effect these kinds of exposure have on language acquisition, but some believe that they
can help significantly in the learning process (e.g. Pemberton & Fallahkair 2008;
Sjöholm 2004).
For explicit instruction, frequency is essential for selecting the phrasal verbs that
will be the most beneficial for learners. There are many thousands of phrasal verbs (e.g.
Gardner & Davies 2007), but as with other vocabulary items, some occur more fre-
quently in language than others. Lexical items that are in common use are more often
than not those which are the most useful, and as such their acquisition should be a
priority for both teachers and learners (Leech, this volume; Nation 2001; Nation &
Waring 1997). Unfortunately, some of the ‘most frequent phrasal verb’ lists in text-
books and dictionaries appear to be based more on intuition and tradition than on
solid corpus data. As a result of this somewhat arbitrary selection process, students
may be learning low frequency phrasal verbs which are rarely used in the real world,
and worse, not acquiring those which are most frequent and useful (Darwin & Gray
1999: 67). Good frequency information would indicate which phrasal verbs are the
most common, and therefore the ones to prioritise.
So frequency is an important factor in learning in both naturalistic environments
and formal instruction contexts. For determining the frequency of occurrence of
lexical items in both, corpus analysis is the essential tool. Before the advent of corpora,
intuition was the only guide to assessing lexical frequency, and while it may be a useful
tool, it is not always a reliable guide (Hunston 2002: 20–21; Schmitt 2008: 333). But
with the development of large and accessible corpora (multi-million word corpora are
now common), it has become possible to determine the most common words and
phrases, and their most frequent uses (Biber et al. 1999; Gardner & Davies 2007;
 Norbert Schmitt and Stephen Redwood
Miller 2005). This is particularly true with formulaic language, which is probably the
linguistic category that phrasal verbs can best be conceptualized as belonging to. For
example, Sylviane Granger and her research unit (Centre for English Corpus Linguis-
tics) at the Université catholique de Louvain have shown how corpus evidence can il-
lustrate L2 learners’ acquisition and use of various kinds of formulaic language (see De
Cock 2000; De Cock et al. 1998; Granger 1998; Granger & Meunier 2008; Learner
Corpus Bibliography 2010; Meunier & Granger 2008).
From the above discussion we see that there are good reasons to expect that fre-
quency should be a strong factor in the learning of phrasal verbs. However, there has
been little direct research on the relationship between the two. This study will focus on
comparing the frequency of phrasal verbs (as determined by corpus evidence) with the
degree to which L2 learners know them (receptively and productively), which leads to
the first research question:
1. How well do learners know, productively and receptively, some of the most fre-
quently occurring phrasal verbs in the English language?
There are also a number of other factors which may affect how successfully a
learner masters common phrasal verbs, and we will also explore a limited number
of these:
2. Does overall language proficiency have a significant effect on phrasal verb
knowledge?
3. Do gender and age have a significant effect on phrasal verb knowledge?
4. Do the amount and mode of language instruction have a significant effect on
phrasal verb knowledge?
5. Does incidental learning through exposure to the target language outside the
classroom have a significant effect on phrasal verb knowledge?
3. Methodology
3.1 Participants
Our participants consisted of 68 EFL/ESL students from three private language schools
in the Nottingham and Eastbourne areas; 23 students at intermediate level and 45 at
upper intermediate level. Their levels had been assessed initially by their schools’
placement tests and confirmed, after a number of lessons and by further progress
checks, by their EFL/ESL teachers. The participants were made up of 47 females and 21
males, ranging in age from 14 to 55, from 14 countries, with 10 mother tongues, the
largest group being the Italians (32). Table 1 shows a breakdown of the participants’
nationalities, genders, ages and language levels.
Table 1. The Participants
Nationality   N   M   F  Age    Intermediate  Upper Intermediate
Italian      32   5  27  14–21            18                  14
Colombian     9   6   3  18–26             0                   9
Spanish       6   1   5  33–46             0                   6
Polish        5   0   5  23–29             5                   0
Saudi         5   4   1  20–33             0                   5
German        2   0   2  19–55             0                   2
Libyan        2   0   2  17–30             0                   2
Chilean       1   0   1  31                0                   1
Chinese       1   1   0  21                0                   1
Kazak         1   0   1  28                0                   1
Portuguese    1   1   0  22                0                   1
Taiwanese     1   0   1  15                0                   1
Turkish       1   0   1  38                0                   1
Vietnamese    1   1   0  25                0                   1
Totals       68  19  49  –                23                  45
3.2 Target phrasal verbs
The study included 60 phrasal verbs. The majority (50) were taken from Gardner &
Davies’ (2007) list of the 100 most frequently occurring phrasal verbs in the British
National Corpus (BNC 2007). Because phrasal verbs are considered difficult to acquire,
we concentrated on high frequency examples: we wished to see how well
our learners knew the type of phrasal verb they would presumably have had the most
exposure to. However, we also wished to have a range of frequencies on the list, so we
included ten less frequent phrasal verbs, which were selected from student coursebooks
and grammar reference books.
In addition to investigating the relationship between overall phrasal verb frequen-
cy and learner knowledge, we also wanted to find out whether there were differences
in knowledge levels between those phrasal verbs found more often in written language
and those more frequent in spoken language. To do this, we consulted the BNC (2007)
for phrasal verb frequencies. We chose the BNC because it is one of the largest corpora
(100 million words) publicly available, and because it represents a cross-section of
written and spoken language from a wide range of late twentieth century sources (BNC
Homepage). Another key advantage is that the complete corpus can be bought and
downloaded, which made our phrasal verb analysis possible. Gardner & Davies used
the following definition as the basis for the identification and tagging of phrasal verbs:
“all two-part verbs in the BNC consisting of a lexical verb ... proper ... followed by an
adverbial particle ... that is either contiguous ... to that verb or non-contiguous ...”
(ibid.: 341). Our definition of a phrasal verb was rather broader in that we included
verbs followed by prepositional as well as adverbial particles (see Biber et al. 1999: 403;
Bolinger 1971: 6; Collins Cobuild English Grammar 2005; McArthur & Atkins 1990).
First we looked up the phrasal verbs’ overall frequency in the complete BNC. We then
repeated the process using only the spoken section (10 million words). Finally, by sub-
tracting the spoken frequency results from those for the complete BNC we arrived at
figures for the written section (90 million words). Each phrasal verb and its inflections
(come off, came off, coming off) was tagged for contiguous (verb + particle) and non-
contiguous (verb + word(s) + particle) occurrences, up to a limit of 4 words separating
verb and particle. We found that most phrasal verbs were either contiguous, or sepa-
rated by a single word. Very few phrasal verbs were separated by 4 words (228 occur-
rences in the whole of the BNC) and there were many lexical strings that were not
phrasal verbs (the Carry On films; the meeting was held on the 28th of January; pay me
when you get back). The occurrences of phrasal verbs separated by 5 or more words
were so infrequent that we did not consider these in the calculations. When we com-
pared our findings for the complete BNC with those of Gardner & Davies (see Appendix
A for comparison) we found that the majority of our frequency figures were higher, on
occasions significantly so (e.g. get in, go in, put on), with a small number lower
(e.g. carry on, carry out). These differences may be partly due to our use of a broader
phrasal verb definition, or the tagging methods used; but we can only speculate as we
do not know exactly how Gardner and Davies’ figures were calculated. The results
showing the frequency figures for the 60 target phrasal verbs are shown in Table 2.
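The counting procedure described above can be sketched in code. What follows is only a minimal illustration under stated assumptions, not the procedure actually used in the study: it works on a plain list of word tokens rather than the tagged BNC, and the function names (tokenize, count_phrasal_verb), the sample sentences and the inflection sets are all hypothetical. Because it matches word forms only, and does not respect sentence boundaries, it would also count strings that are not phrasal verbs (such as the Carry On films), which is why tagging or manual checking is needed in practice. The last lines show the subtraction used to arrive at a written figure, using the Table 2 values for go on.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for corpus tokenization)."""
    return re.findall(r"[a-z']+", text.lower())


def count_phrasal_verb(tokens, verb_forms, particle, max_gap=4):
    """Count verb + particle matches, keyed by the number of intervening words.

    Key 0 holds contiguous matches; keys 1..max_gap hold non-contiguous ones.
    For each verb token, only the nearest particle inside the window is counted.
    """
    counts = {gap: 0 for gap in range(max_gap + 1)}
    for i, token in enumerate(tokens):
        if token not in verb_forms:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == particle:
                counts[gap] += 1
                break
    return counts


# Hypothetical sample text, for illustration only.
sample = "She came off the stage. The label is coming off. He took his coat off."
tokens = tokenize(sample)

come_off = count_phrasal_verb(tokens, {"come", "comes", "came", "coming"}, "off")
take_off = count_phrasal_verb(tokens, {"take", "takes", "took", "taken", "taking"}, "off")
print("come off:", come_off)
print("take off:", take_off)

# Written frequency = complete BNC frequency minus spoken frequency (go on, Table 2).
go_on_complete, go_on_spoken = 16228, 3637
go_on_written = go_on_complete - go_on_spoken
print("go on, written:", go_on_written)
```

In this toy sample both occurrences of come off are contiguous, while took ... off is separated by two words, so it is counted under a gap of 2.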
Table 2. BNC Target Phrasal Verb Frequency
 #  Phrasal verb    BNC  Written  Spoken     #  Phrasal verb    BNC  Written  Spoken
 1  go on         16228    12591    3637    31  get in         4671     3221    1450
 2  pick up       10884    10147     737    32  hold on        1797     1493     304
 3  come in        9777     7700    2077    33  go over        1732     1173     559
 4  take up        9450     6548    2902    34  move in        1377     1134     243
 5  go out         7765     5008    2757    35  turn down      1355     1182     173
 6  hold on        6977     4444    2533    36  look around    1350     1268      82
 7  put on         6760     5484    1276    37  come over      1262      916     346
 8  find out       6329     5605     724    38  come off       1191      803     388
 9  work out       5257     4732     525    39  sit up         1181     1040     141
10  make up        5231     3369    1862    40  put off        1075      851     224
11  come out       5190     3922    1268    41  make out       1067      937     130
12  sit down       5022     4610     412    42  turn off       1057      650     407
13  take on        4717     3886     831    43  pick out        905      732     173
14  carry on       4695     1994    2701    44  hold back       862      811      51
15  set up         4199     3981     218    45  take down       849      521     328
16  go in          3892     3449     443    46  give back       654      404     250
17  get up         3637     2338    1299    47  move up         616      537      79
18  get on         3441     1949    1492    48  move back       603      527      76
19  carry out      3406     2283    1123    49  move out        594      468     126
20  come down      3083     2301     782    50  give out        550      404     146
21  get out        3010     2367     643    51  dig up          383      318      65
22  get in         2587     2466     121    52  pay back        295      230      65
23  bring in       2565     1928     637    53  pin down        249      229      20
24  put up         2386     2118     268    54  tear up         224      202      22
25  go over        2152     1810     342    55  think over      219      206      13
26  go off         1728     1326     402    56  fall behind     156      145      11
27  break down     1469     1031     438    57  pass away       104       89      15
28  take back      1430     1272     158    58  chat up         100       87      13
29  move on        1415      809     606    59  take after       84       58      26
30  put out        1251      753     498    60  cool off         65       59       6
BNC = token frequency BNC complete
Written = token frequency BNC written
Spoken = token frequency BNC spoken
3.3 Receptive and productive measurement instruments
One of the goals of the study was to establish learners’ knowledge about the target
phrasal verbs, and it seemed important to assess both receptive and productive mastery
(Schmitt 2010). The productive test used a cloze technique in which the participants
had to produce the target vocabulary themselves, requiring a higher level of
mastery than would a receptive word recognition test (Groot 2000: 76). Cloze tests are
used extensively as a testing procedure, and are seen, especially in the area of vocabulary,
as a good measure of lexical knowledge (Read 1997). An example for set up is
given below:
The police s__________ u__________ roadblocks to stop people driving into the city
centre. (build, erect)
To be consistent with the aim of testing both productive and receptive knowledge of
the same target language, we used similar items in both tests. The differences between
the two tests were that the first-letter prompts were omitted from the receptive test,
multiple-choice options were added to the receptive test, and the items in each test were
in different orders. To help reduce guessing, a fifth ‘Don’t know’ option was included in
the receptive test. The productive test, being the one in which the participants had to
recall and produce the target language, was administered first. If the productive
test was given second there would be the possibility of testees remembering some of
the multiple-choice answers from the receptive test. Obviously, learners who knew the
answer to an item productively would also be likely to know it receptively as receptive
knowledge usually precedes productive knowledge (Melka 1997; Schmitt 2010). The
receptive version of the above item is illustrated below. See Appendices B & C for the
complete productive and receptive tests.
The police __________ __________ roadblocks to stop people driving into the city
centre. (build, erect)
A. set in B. set up C. set on D. set at E.?
3.4 Biodata questionnaire
In addition to establishing the relationship between frequency and learner knowledge,
we were also interested in gathering information about some of the other factors which
may have had an effect on phrasal verb acquisition. We already had a rough idea of our
participants’ language proficiency from the schools’ in-house assessments, so we produced
a 10-item questionnaire which contained items on basic biodata information
(age, gender, nationality), and items relating to language exposure through classroom
instruction, and exposure through incidental learning, that is, extensive reading, the
media and entertainment, and social networking (see Appendix D for the complete
questionnaire).
3.5 Procedure
Having written the tests, we needed to pilot them thoroughly to check their validity
and reliability (Dörnyei 2007), and importantly, to confirm that they could be completed
in the time available (Schmitt 2010). We first asked ten educated native speakers
to complete the tests and comment on any difficulties they had with any of the test
items. Subsequent feedback showed that most of the native speakers took 15 to 25
minutes to finish the productive test and 10 to 15 minutes for the receptive test. They
reported no serious problems with the items, but we listened carefully to the com-
ments they made and as a consequence rewrote several of the items to make them
clearer, and modified the definitions/synonyms for others to improve their perfor-
mance. We repeated the exercise with 8 other native speakers and made further modi-
fications. We were satisfied that the instruments worked well with native speakers but
we also required confirmation from non-native speakers. We asked a number of up-
per-intermediate and advanced level learners to complete the tests, and in response to
the feedback received a number of minor alterations were made.
The tests were then given to the participants in a single session in their intact
classes with a short break between the tests. Instructions explaining the purpose of the
test, its format, and what the participants had to do were printed at the beginning of
each test, together with example items. To make sure the participants knew exactly
what to do, we led them through all the instructions, paying particular attention to
the example items, and the amount of time they had to complete the test. In addition,
we explained that they could answer each question with the base form of the phrasal
verb, but that any of its inflections would be accepted as correct (work out, worked out,
working out). The productive test was given first, then the receptive version after a
10 minute break, and finally the biodata questionnaire. All the participants had suffi-
cient time to complete both tests and fill in the biodata questionnaire.
4. Results and discussion
4.1 Phrasal verb frequency and knowledge
The main aim of the study was to explore the link between phrasal verb frequency and
how well they are learned by L2 learners. In other words, do learners tend to know
more about the most frequently occurring phrasal verbs than the less frequent ones? In
addition we wanted to discover whether there was a link between mode (spoken vs.
written) and phrasal verb knowledge. That is, do learners know more about those
phrasal verbs more frequently found in written language, those used more often in
spoken language, or do they have a broad knowledge extending across the two modes?
To explore this link, we first carried out correlations comparing the results of the
productive and receptive tests with our phrasal verb frequency rankings from the BNC
complete, BNC written, and BNC spoken. The Pearson coefficients indicated a significant
positive correlation between mean test scores and phrasal verb frequencies, as
shown in Table 3. The strengths of the correlation coefficients were moderate for the
productive test, and relatively low for the receptive test. To achieve a better understanding
of the strengths of the correlations, the correlation coefficients were squared
(r²), which gives the percentage of the variance in the test
scores that can be related to frequency. These figures are shown in parentheses, and
they indicate that for the BNC complete, 20% of the variance in the productive scores
was attributable to frequency, but for the receptive scores only 9% of the variance was
related to frequency. Thus we find that the learning of phrasal verbs is related to their
frequency of occurrence, just as it is with individual words. However, the
relationship is not particularly strong, and varies according to productive and receptive
knowledge. As for the difference between phrasal verbs in written and spoken
discourse, there was virtually no difference in terms of receptive knowledge, and only
a small difference in terms of productive knowledge. These results suggest it is proba-
bly sufficient to use overall corpus frequency figures (i.e. combined written and spo-
ken) when thinking about the likely acquisition of phrasal verbs, as there seems to be
no real advantage to distinguishing between spoken and written frequencies.
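The step from a correlation coefficient to a percentage of variance can be made concrete with a short sketch. It is an illustration only: the function names pearson_r and shared_variance_pct are hypothetical, the toy vectors are invented to exercise the function, and the coefficients are the rounded values reported in Table 3, since the per-item scores are not reproduced here. Squaring rounded coefficients can differ in the last decimal from the printed percentages (for example .45 gives 20.25 rather than 20.3), presumably because the published figures were computed from unrounded values.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def shared_variance_pct(r):
    """Square a correlation coefficient and express it as a percentage."""
    return 100 * r ** 2


# Rounded coefficients as reported in Table 3: (productive, receptive).
table3_r = {
    "BNC complete": (0.45, 0.30),
    "BNC written": (0.41, 0.29),
    "BNC spoken": (0.46, 0.31),
}

for corpus, (r_prod, r_rec) in table3_r.items():
    print(f"{corpus}: productive {shared_variance_pct(r_prod):.1f}%, "
          f"receptive {shared_variance_pct(r_rec):.1f}%")

# Invented toy data: a perfectly linear relationship gives r = 1.
toy_r = pearson_r([1, 2, 3], [2, 4, 6])
print("toy r:", toy_r)
```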
Table 3. Correlations between Test Scores and Phrasal Verb Frequencies (BNC)
Phrasal verb frequencies   Productive test   Receptive test
BNC complete               .45** (20.3)a     .30* (9.0)
BNC written                .41** (16.8)      .29* (8.4)
BNC spoken                 .46** (21.2)      .31* (9.6)
*p < .05, **p < .01
a. r² reported as a percentage
To better understand the frequency-knowledge relationship, it is useful to look at the
data in graphic form. Figure 1 shows the frequencies (BNC complete) of the 60 phrasal
verbs used in the study. They range from the most frequent, go on (16,228 tokens) to
the least frequent, cool off (65 tokens).
If the participants’ phrasal verb knowledge was related closely to phrasal verb fre-
quency, then test results should have shown a similar curve to that in Figure 1. Figures 2
(BNC complete), 3 (BNC written) and 4 (BNC spoken) are graphic representations of
the relationship between phrasal verb knowledge and phrasal verb frequency accord-
ing to the three corpora. The phrasal verbs are arranged in frequency groups of 5 to
reduce the effect of individual item variation.
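The grouping into clusters of five can be sketched as follows. This is a minimal illustration with a hypothetical helper, group_means; because the per-item test scores are not reproduced in this section, the sketch applies the same averaging to the ten highest BNC complete frequencies from Table 2 simply to show the mechanics, whereas the figures apply it to mean test scores.

```python
def group_means(values, size=5):
    """Average consecutive runs of `size` values (the last run may be shorter)."""
    means = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        means.append(sum(chunk) / len(chunk))
    return means


# BNC complete token frequencies for ranks 1-10, taken from Table 2.
bnc_top10 = [16228, 10884, 9777, 9450, 7765, 6977, 6760, 6329, 5257, 5231]

cluster_means = group_means(bnc_top10)
print(cluster_means)
```

Averaging within each cluster smooths out item-level peaks and dips, at the cost of hiding how much individual phrasal verbs inside a cluster differ from one another.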
Figure 1. Target Phrasal Verb (PV) Frequency BNC Complete [line graph; x-axis: PV frequency ranking; y-axis: PV token frequency (thousands)]
Figure 2. Test Scores BNC Complete [line graph; x-axis: 60 PVs by frequency (grouped in 5s); y-axis: mean test scores %; series: productive, receptive]
Several points are immediately evident when viewing these curves. First, none of
the curves match that in Figure 1 very closely, so learning does not seem to be highly
dependent on the absolute frequency of a phrasal verb. Second, there is a considerable
amount of variation in knowledge of the phrasal verbs (the curves bounce up and
down), even though the phrasal verbs have been clustered in groups of five to even out
this variation. Thus we find that learning does not smoothly follow rank order frequency
either. Third, despite the previous two observations, there is clearly some overall rela-
tionship between frequency and knowledge. This is most obvious with the productive
tests, where there is a clear decline in knowledge as frequency decreases, with the ex-
ception of a blip at the 46–50 frequency ranking (see below). The receptive trend is
harder to discern, with a fairly clear decline in the first ten or so phrasal verbs, but
thereafter a great deal of variation in what is essentially a plateau. Fourth, as might be
expected, the receptive scores were usually higher than the productive scores, as recall-
ing language in order to use it productively is more difficult and requires a greater
depth of knowledge than being able to recognize it receptively (e.g. Groot 2000;
Nation 2001). Overall, the learners scored 17% higher on the receptive test than the
productive test on average and this difference was significant (Pearson, t = 12.01,
p<.001, Eta squared = .69).
Figure 3. Test Scores BNC Written [line graph; y-axis: mean test scores %]
Figure 4. Test Scores BNC Spoken [line graph; y-axis: mean test scores %]
Overall, the evidence points to a general trend of higher frequency leading to a
greater chance of learning phrasal verbs to a productive degree of mastery. The relationship
is not strongly linear, but higher frequency phrasal verbs were clearly learned
by a greater number of our participants than lower frequency phrasal verbs. Conversely,
with the exception of the very highest frequency phrasal verbs, there does not
seem to be a very reliable relationship between the frequency of phrasal verbs and
mastery of receptive knowledge. Thus, it seems that in order to develop the more advanced
productive mastery of phrasal verbs, the repeated exposure that comes from
higher frequency is necessary. Receptive mastery, which is presumably easier to ac-
quire, does not seem so dependent on this exposure. We might speculate that this is
because only a few exposures might lead to receptive mastery. This would be congru-
ent with findings from incidental vocabulary acquisition studies, where it has been
found to take something like 8–10 exposures to learn words from reading, but where
productive mastery is seldom achieved (Schmitt 2008).
Furthermore, the relationship between frequency and learning may be stronger
than demonstrated here. We used mainly the highest frequency phrasal verbs (50/60)
in this study to see if our participants knew these high-exposure items. If we had used
a group of phrasal verbs with a wider range of frequencies, we may well have found a
clearer frequency-knowledge trend.
Another point to keep in mind is that the frequency information was from occur-
rence in general, as indicated by the sources included in the BNC. We assume that
higher frequencies in the BNC also indicate higher levels of exposure among our participants.
However, this assumption may be unfounded to some extent. Learners
may well receive quite different exposure to the L2, especially in a classroom, than
is indicated by a native corpus. If we were able to use their actual exposure rates as our
frequency figures, the correlation would probably be higher.
This leads to the question of whether other corpora may better predict the learning
of phrasal verbs. One suitable candidate is the Corpus of Contemporary American English
(COCA). It has in excess of 400 million words of text, divided equally
among spoken texts, fiction, popular magazines, newspapers, and academic texts
(Davies 2008). We did not have the resources to carry out a full lemmatized and non-
contiguous analysis as we did with the BNC, but were able to do a simplified analysis
based only on the contiguous base forms of the target phrasal verbs as shown in Table 2.
The correlations for frequency and productive mastery, using the data from our test
results, were very similar to the BNC results, but the correlations for frequency and
receptive mastery were marginally higher than the BNC figures (Table 4). The similar
results from the two main large-scale, accessible corpora give us confidence in concluding
that the frequency of phrasal verbs as shown by large corpora predicts phrasal verb
acquisition (productive mastery) at somewhere around the 16–20% shared-variance level,
and receptive mastery at around the 8–14% level.
Table 4. Correlations between Test Scores and Phrasal Verb Frequencies (COCA)
Phrasal verb frequencies   Productive test   Receptive test
COCA complete              .42** (17.6)a     .36** (13.0)
COCA written               .40** (16.0)      .34** (11.6)
COCA spoken                .42** (17.6)      .38** (14.4)
**p < .01
a. r² reported as a percentage
The data also gives us a chance to look at how well our participants knew the phras-
al verbs in real terms. The majority of the participants were able to recognize most of the
phrasal verbs receptively (average score 65.2%), and were able to produce 48.2% of them
on average. Thus we find that despite being a relatively difficult type of lexical item, the
participants had a good knowledge of the target phrasal verbs relative to their language
levels. However, there were a number (18) of phrasal verbs that fewer than half of the
learners knew either receptively or productively. A number of these were relatively in-
frequent (cool off, dig up, pin down) which we expected would have low scores, but a
number were some of the most frequent in the BNC (e.g. carry out, go in, take up, work
out). The low scores may in part be attributable to learners being unfamiliar with some
of the contexts and meaning senses presented in the tests, or the wording of some of the
test items themselves, but even if we make some allowance for these anomalies, there
would still remain a number of moderately high to high frequency phrasal verbs that
were relatively unknown to at least half of the students. There do not appear to be any
particular semantic or syntactic features that distinguish these phrasal verbs from others
in the tests. In fact, some were relatively transparent (come off, give out, go in, hold
back), and we can speculate that some students’ lack of receptive or productive knowl-
edge of these items was due to the absence or paucity of exposure to these phrasal verbs,
even though they occurred relatively frequently in the corpora. One factor that may
partly account for this lack of exposure is the fact that often a learner’s primary source
of exposure to English is in the language classroom, through the medium of student
coursebooks, which are frequently the core resource of the language syllabus. Although
a number of coursebooks purport to be corpus-based, research shows that often the
language presented in these publications appears to have been selected in an intuitive or
arbitrary fashion without reference to corpus data (e.g. Koprowski 2005). Furthermore,
the phrasal verbs are often presented on a single page in large numbers in test-like for-
mats, which give little or no opportunity for learners to use the target phrasal verbs
productively. Additionally, once these phrasal verbs have been ‘covered’ on these pages,
quite frequently no attempt is made to re-cycle these items in subsequent parts of the
book. Another factor that may influence phrasal verb exposure is the fact that the ma-
jority of learners around the world are taught by non-native teachers who themselves
may not use, or even be aware of, those phrasal verbs most commonly used.
Finally we looked at the phrasal verbs at the lower end of the frequency range to
see if we could explain the blip which occurred at the 46–50 rank level. The blip is
largely a result of the spoken frequency curve, and so we looked at the five phrasal
verbs in this cluster in terms of spoken frequency. They include: carry on, look around,
move up, move back, and pay back. These phrasal verbs were considerably better
learned than the phrasal verbs in adjoining frequency clusters (41–45: break up, give
out, sit up, make out, move out; 51–55: dig up, hold back, take after, tear up, pin down).
It is difficult to pinpoint the reason for this, although one potential explanation is that
at least four of the phrasal verbs can be interpreted literally (look around, move up,
move back, pay back), while in the adjoining clusters there are more phrasal verbs
which cannot be (break up = *breaking something in an upwards direction; take after
= *taking something subsequently to something else). Regardless of the phrasal verb
characteristic(s) at play here, our frequency/knowledge curves suggest that phrasal
verbs are idiosyncratic in terms of learning burden, and that a purely frequency-based
explanation can never fully explain their acquisition.
4.2 Individual differences factors in the acquisition of phrasal verbs
We have seen that frequency is a factor in the acquisition of phrasal vocabulary, but
that it only explains 10–20% of the variance in test scores. Unsurprisingly, other fac-
tors must also be at play. One area that has been well documented is that of first lan-
guage (L1) influence on language acquisition. Phrasal verbs are found predominantly
in English and a few other cognate languages. German, for example, whilst not having
phrasal verbs as such, does have particle verbs which are superficially similar
(see Waibel 2007: 38–40). L1 influence is certainly an important factor in language
acquisition, and the absence of a feature, like phrasal verbs, from a learner’s L1, can
affect the way a second language (L2) is acquired (e.g. Dagut & Laufer 1985; Hulstijn
& Marchena 1989; Laufer & Eliasson 1993; Liao & Fukuya 2004; Siyanova & Schmitt
2007; Swan 1997). However, as only 2 of our 68 learners had an L1 (German) that contained
an equivalent to phrasal verbs, we decided not to take this factor into consideration,
and concentrated instead on other individual differences to determine if they
had any effect on the acquisition of phrasal verbs.
4.2.1 Language proficiency

Does phrasal verb knowledge increase as overall language proficiency rises? To answer
this question we compared the scores of the intermediate and upper-intermediate
learners to see if there were significant differences.¹ Table 5 shows the results of the
independent sample t-tests, indicating that the upper-intermediate learners scored on
average higher than their intermediate counterparts. The differences in scores were
significant (p<.05), and the effect sizes (eta squared)² large, accounting for 20%
(productive) and 23% (receptive) of the differences between the intermediate and
upper-intermediate scores, confirming that learners’ knowledge of phrasal verbs appears
to be related to overall language proficiency.
Table 5. Proficiency Level Comparisons (Independent Samples T-Tests)
                                 Ma      SD     df   t         Effect sizeb
productive                                      66   –4.079*   .20
  intermediate (n = 23)          22.91    5.15
  upper-intermediate (n = 45)    33.65   10.00
receptive                                       66   –4.482*   .23
  intermediate (n = 23)          32.00    5.80
  upper-intermediate (n = 45)    41.91    7.79
*p < .001
a. Max score = 60
b. Eta squared
1. The proficiency assessments of the different schools involved in the study are idiosyncratic,
and so cannot be directly compared. This makes the distinction between intermediate and
upper intermediate proficiency levels in the study somewhat tenuous. Although it is difficult to
quantify what these proficiency levels mean in absolute terms, we had personal experience of all
participants, and feel that the distinction accurately reflects a noticeable difference in relative
levels of proficiency.
2. Effect size is a measure of the strength of the relationship between two variables. Eta squared
is the proportion (.01 = small effect, .06 = moderate effect, .14 = large effect) of the total variance
that is attributed to an effect and is usually expressed as a percentage by multiplying it by 100.
The differences in the phrasal verb knowledge of intermediate and upper-interme-
diate students may also be related to the language level at which learners are first ex-
posed to phrasal verbs. Very few coursebooks below intermediate level have any ex-
plicit or implicit reference to phrasal verbs, and whilst there may be valid pedagogic
reasons for this, it does mean that phrasal verb acquisition may lag behind other areas
of language at lower proficiency levels.
4.2.2 Gender
There has been much debate about the role of gender in language learning and acquisi-
tion, and research has examined a number of areas such as language proficiency, atti-
tudes, motivation, and learning, cognitive and metacognitive strategies (e.g. Kobayashi
2002; Tercanlioglu 2004). We were interested to know if gender was also a factor in the
acquisition of phrasal verbs. The results from our t-tests indicate that, although males
scored higher in both tests, the differences in scores were not statistically significant,
and for these participants at least, gender did not appear to be a factor in their knowl-
edge of phrasal verbs.
4.2.3 Age
We were also interested in whether age had any influence on the participants’ produc-
tive and receptive knowledge of phrasal verbs. The ages of the learners ranged between
14 and 55 and for the purpose of the analyses we divided them into 3 age groups
(under 18, 18–25, over 25). The results from one-way ANOVAs indicate that the older
learners scored higher in both the productive and receptive tests, but the differences in
scores were not significant, suggesting that age was not an influential factor for these
participants.
4.3 Exposure to target language inside and outside the classroom
The second type of factor we explored was the amount and type of exposure our par-
ticipants had to English both inside and outside the language classroom.
4.3.1 Formal language instruction

Achieving proficiency in a second language is dependent on a number of factors, not
least the quantity and quality of language instruction. The biodata questionnaire in-
cluded items relating to the length of time the participants had been learning English,
where they took their lessons, and how many hours of instruction they received each
week. The results of the comparison of test scores, perhaps surprisingly, indicated that
overall the type of instruction and hours of classroom input that the learners received
did not have a significant effect on their test scores.
4.3.2 Extensive reading

Research indicates that extensive reading can improve vocabulary knowledge and have
a positive effect on language proficiency overall. Using data collected from the biodata
questionnaire, we divided students into those who read in English 0–1 hour per week,
1–2 hours, and 2+ hours. One-way ANOVAs were significant (productive: F(2,
65)=4.46, p<.05; receptive: F(2, 65)=4.04, p<.05), and Least Significant Difference
(LSD) post-hoc tests showed the difference (p<.05) existed between those who read
the least (0–1 hour = 27.0 productive and 37.4 receptive) and those who read the most
(2+ hours = 36.7 productive and 45.2 receptive). The effect sizes were moderately high
(.12 productive, .11 receptive). So while differences in classroom input did not signifi-
cantly affect acquisition of phrasal verbs, the amount of input from reading did have
an effect.
4.3.3 Watching English language films and television

As reading had an effect on the acquisition of phrasal verbs, we wanted to see whether
other types of non-classroom input did as well. Another way of increasing one’s exposure
to the target language is through watching English language films and TV shows,
and we included an item in the biodata questionnaire asking participants how much
time they spent on these activities. Using the same methodology as for reading, we
came up with nearly identical findings. The ANOVAs were significant (productive:
F(2, 65)=4.54, p<.05; receptive: F(2, 65)=3.83, p<.05) and the LSD post-hoc tests
(p<.05) showed that learners who spent more than 2 hours per week watching English
language films and TV shows knew more phrasal verbs (33.2 productive, 42.9 recep-
tive) than those who only watched for an hour or less (25.7 productive, 36.7 receptive).
The effect sizes are the same as for reading (.12 productive, .11 receptive). These results
indicate that this type of exposure is also effective in promoting the acquisition of
phrasal vocabulary.
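The effect sizes reported for extensive reading (4.3.2) and film/TV watching (4.3.3) can be checked in the same spirit from the F ratios, using the standard conversion eta² = (F × df_between) / (F × df_between + df_within). This is a minimal sketch under that assumption; the helper name is hypothetical and the chapter does not say which formula was used.

```python
def eta_squared_from_f(f: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f * df_between) / (f * df_between + df_within)


# F ratios as reported in the text, all with df = (2, 65).
reported_f = {
    "reading, productive": 4.46,
    "reading, receptive": 4.04,
    "film/TV, productive": 4.54,
    "film/TV, receptive": 3.83,
}

for label, f in reported_f.items():
    print(f"{label}: {eta_squared_from_f(f, 2, 65):.2f}")
# reading, productive: 0.12
# reading, receptive: 0.11
# film/TV, productive: 0.12
# film/TV, receptive: 0.11
```

The recovered values match the reported .12 (productive) and .11 (receptive) for both activities, and the degrees of freedom (2, 65) are consistent with three exposure groups and 68 participants.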
4.3.4 Listening to English language music

Another type of input that many learners partake of is listening to English music out-
side the classroom. English language popular music has a worldwide appeal and some
research has indicated that incidental listening can have a positive effect on language
acquisition (Sjöholm 2004). We used the same type of analysis as for reading and film/
TV watching, but whilst those learners who listened to English language music for 1 to
2 hours per week scored higher than those who listened less, the differences in their
scores were not significant. It seems therefore that the amount of listening to English
music does not affect the acquisition of phrasal vocabulary. This may be because lis-
tening to music requires much less attention and concentration than watching films or
TV programmes.
4.3.5 Social networking

Social networking sites have become extremely popular in recent years (e.g. Facebook,
MySpace, Twitter), and together with other forms of electronic communication
(Skype, SMS) have allowed millions to interact and socialise on a global scale. English
is often the lingua franca of the Internet and we were interested to see how many of the
participants took advantage of these forms of communication to practise their lan-
guage skills, and whether it had any effect on their vocabulary knowledge. Half of the
participants spent more than two hours each week using English as a lingua franca on
social networking sites. However, those who used these sites the most did not score
significantly higher than those who used these sites the least.
5. Conclusion
Our study set out to explore what ESL/EFL learners knew about relatively frequent
phrasal verbs, and how that knowledge was acquired. We found that frequency
(as indicated by large General English corpora) predicted phrasal verb knowledge to a
considerable degree in terms of productive mastery (r² ≈ 20%), but not in terms of
receptive mastery (r² ≈ 9%). Corpus frequency figures will always be useful in identifying
the phrasal verbs that need to be known, as high frequency phrasal verbs undoubtedly
have great utility for students. However, the same frequency figures seem to have dif-
ferential ability in predicting whether phrasal verbs are known or not. They produce
strong enough correlations to predict productive knowledge to some extent, but seem
to lack the capacity to do the same for receptive mastery. Clearly, the acquisition of
phrasal verbs relies on more than just frequency of exposure. Interestingly, our results
showed no effect for formal-instruction-based variables, but did show that more out-
of-class exposure (in the form of outside reading, film/TV watching) facilitated the
learning of phrasal verbs. It is interesting to note that not all outside exposure was
beneficial though; the amount of listening to English language music and social net-
working did not have an effect. Perhaps the most encouraging outcome of the study
was the relatively good knowledge our participants demonstrated of the target phrasal
verbs. Overall, they knew about two-thirds of them receptively, and about one-half
productively. While admittedly the target phrasal verbs were mostly among the most
frequent in English, this knowledge is a good start, and the quest continues to find
ways of helping students/learners master the rest of the phrasal verb inventory. A bet-
ter understanding of the ways frequency interacts with learner knowledge and acquisi-
tion can only aid this pursuit.
References
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. 1999. Longman Grammar of Spoken
and Written English. Harlow: Longman.
Bolinger, D.L.M. 1971. The Phrasal Verb in English. Cambridge MA: Harvard University Press.
British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium. <http://natcorp.ox.ac.uk>
Collins Cobuild English Grammar. 2005. 2nd edn. Glasgow: HarperCollins.
Courtney, R. 1983. Longman Dictionary of Phrasal Verbs. London: Longman.
Dagut, M., & Laufer, B. 1985. Avoidance of phrasal verbs: A case for contrastive analysis. Studies
in Second Language Acquisition 7(1): 73–79.
Darwin, C. M. & Gray, L. S. 1999. Going after the phrasal verb: An alternative approach to clas-
sification. TESOL Quarterly 33(1): 65–83.
Davies, M. 2008. The corpus of contemporary American English (COCA): 400+ million words,
1990-present. <http://www.americancorpus.org>
De Cock, S. 2000. Repetitive phrasal chunkiness and advanced EFL speech and writing. In Corpus
Linguistics and Linguistic Theory, C. Mair & M. Hundt (eds), 51–68. Amsterdam: Rodopi.
De Cock, S., Granger, S., Leech, G. & McEnery, T. 1998. An automated approach to the phrasicon
of EFL learners. In Learner English on Computer, S. Granger (ed.), 67–79. London:
Addison Wesley Longman.
Dörnyei, Z. 2007. Research Methods in Applied Linguistics. Oxford: OUP.
Dörnyei, Z. 2009. The Psychology of Second Language Acquisition. Oxford: OUP.
Ellis, N. C. 2003. Constructions, chunking, and connectionism: The emergence of second language
structure. In The Handbook of Second Language Acquisition, C.J. Doughty & M.H.
Long (eds), 63–103. Oxford: Blackwell.
Ellis, R. 2008. The Study of Second Language Acquisition. Oxford: OUP.
Gardner, D., & Davies, M. 2007. Pointing out frequent phrasal verbs: A corpus-based analysis.
TESOL Quarterly 41(2): 339–359.
Granger, S. 1998. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In
Phraseology: Theory, Analysis and Applications, A.P. Cowie (ed.), 145–160. Oxford: OUP.
Granger, S. & Meunier, F. (eds). 2008. Phraseology: An Interdisciplinary Perspective. Amsterdam:
John Benjamins.
Groot, P.J.M. 2000. Computer assisted second language acquisition. Language Learning and
Technology 4(1): 60–81.
Hart, C.W. 2009. The Ultimate Phrasal Verb Book, 2nd edn. Hauppauge NY: Barron’s Educa-
tional Series.
Horst, M. 2005. Learning L2 vocabulary through extensive reading: A measurement study. The
Canadian Modern Language Review 61(3): 355–382.
Hulstijn, J.H. & Marchena, E. 1989. Avoidance: Grammatical or semantic cause? Studies in Sec-
ond Language Acquisition 11(3): 241–255.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: CUP.
Kobayashi, Y. 2002. The role of gender in foreign language learning attitudes: Japanese female
students’ attitudes towards English learning. Gender and Education 14(2): 181–197.
Konishi, T. 1958. The growth of the verb-adverb combination in English: A brief sketch. In Studies
in English Grammar and Linguistics: A Miscellany in Honour of Takanobu Otsuka, K.
Araki & T. Otsuka (eds). Tokyo: Kenkyusha.
Koprowski, M. 2005. Investigating the usefulness of lexical phrases in contemporary course-
books. ELT Journal 59(4): 322–332.
Laufer, B. & Eliasson, S. 1993. What causes avoidance in L2 learning: L1–L2 difference, L1–L2
similarity, or L2 complexity? Studies in Second Language Acquisition 15(1): 35–48.
Learner Corpus Bibliography. 2010. Centre for English Corpus Linguistics.
<http://www.uclouvain.be/en-cecl-lcBiblio.html>
Liao, Y. & Fukuya, Y.J. 2004. Avoidance of phrasal verbs: The case of Chinese learners of English.
Language Learning 54(2): 193–226.
McArthur, T. & Atkins, B. 1990. Dictionary of English Phrasal Verbs and Their Idioms. London:
Collins.
McCarthy, M. & O’Dell, F. 2007. English Phrasal Verbs in Use: Advanced: 60 Units of Vocabulary
Reference and Practice; Self-study and Classroom Use. Cambridge: CUP.
Melka, F. 1997. Receptive vs. productive aspects of vocabulary. In Vocabulary: Description, Ac-
quisition and Pedagogy, N. Schmitt & M. McCarthy (eds), 84–102. Cambridge: CUP.
Miller, G. 2005. WordNet, Version 2.1. Princeton University. <http://wordnet.princeton.edu/>
Moon, R. 1997. Vocabulary connections: Multi-word items in English. In Vocabulary: Descrip-
tion, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds). Cambridge: CUP.
Nation, I.S.P. 2001. Learning Vocabulary in Another Language. Cambridge: CUP.
Nation, P. & Waring, R. 1997. Vocabulary size, text coverage and word lists. In Vocabulary: Descrip-
tion, Acquisition and Pedagogy, N. Schmitt & M. McCarthy (eds), 6–19. Cambridge: CUP.
Pemberton, L. & Fallahkhair, S. 2008. Interactive television as a vehicle for language learning. In
Interactive Digital Television: Technologies and Applications, G. Lekakos, K. Chorianopoulos
& G.I. Doukidis (eds), 18–32. Hershey PA: IGI Publishers.
Read, J. 1997. Vocabulary and testing. In Vocabulary: Description, Acquisition and Pedagogy, N.
Schmitt & M. McCarthy (eds), 303–320. Cambridge: CUP.
Rott, S. 1999. The effect of exposure frequency on intermediate language learners’ incidental
vocabulary acquisition and retention through reading. Studies in Second Language Acqui-
sition 21(4): 589–619.
Schmitt, N. 2008. Instructed second language vocabulary learning. Language Teaching Research
12(3): 329–363.
Schmitt, N. 2010. Researching Vocabulary: A Vocabulary Research Manual. Basingstoke:
Palgrave.
Sinclair, J.M. 2002. Collins COBUILD Dictionary of Phrasal Verbs. London: HarperCollins.
Siyanova, A. & Schmitt, N. 2007. Native and nonnative use of multi-word vs. one-word verbs.
International Review of Applied Linguistics in Language Teaching 45: 119–139.
Sjöholm, K. 2004. The complexity of the learning and teaching of EFL among Swedish-minority
students in bilingual Finland. Journal of Curriculum Studies 36(6): 685–696.
Swan, M. 1997. The influence of the mother tongue on second language vocabulary acquisition
and use. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt & M. McCarthy
(eds). Cambridge: CUP.
Tercanlioglu, L. 2004. Exploring gender effect on adult foreign language learning strategies. Is-
sues in Educational Research 14(2): 181–193.
Tomasello, M. 2003. Constructing a Language: A Usage-based Theory of Language Acquisition.
Cambridge MA: Harvard University Press.
Waibel, B. 2007. Phrasal Verbs in Learner English: A Corpus-based Study of German and Italian
Students. Freiburg: Albert-Ludwigs-Universität.
Appendix A. BNC phrasal verb frequency: Comparison of results

 #  Phrasal verb   G&D    S&R      #  Phrasal verb   G&D    S&R
 1  go on        14903  16228     31  sit up        1158   1181
 2  carry out    10798   4199     32  get in        1127   4671
 3  set up       10360  10884     33  make out      1105   1067
 4  pick up       9037   9777     34  turn down     1051   1355
 5  go out        7688   7765     35  come over     1004   1262
 6  find out      6619   6760     36  go over        991   1732
 7  make up       5469   6329     37  hold on        908   1797
 8  come out      5022   5231     38  pick out       856    905
 9  come in       4814   9450     39  hold back      823    862
10  work out      4703   5190     40  move in        790   1377
11  take up       4608   5257     41  look around    779   1350
12  sit down      4478   4717     42  take down      775    849
13  take on       4199   5022     43  put off        742   1075
14  get up        3936   3892     44  turn off       594   1057
15  carry on      3869   2587     45  move out       573    594
16  get out       3545   3406     46  move back      566    603
17  come down     3305   3637     47  give out       532    550
18  put up        2835   3083     48  come off       518   1191
19  get on        2696   3441     49  give back      507    654
20  bring in      2505   3010     50  move up        477    616
21  break down    2199   2386     51  dig up           –    383
22  go off        2104   2565     52  pay back         –    295
23  go in         1974   4695     53  pin down         –    249
24  put out       1660   1728     54  tear up          –    224
25  take back     1628   1469     55  think over       –    219
26  get down      1538   1415     56  fall behind      –    156
27  put on        1428   6977     57  pass away        –    104
28  move on       1419   2152     58  chat up          –    100
29  put back      1369   1251     59  take after       –     84
30  break up      1286   1430     60  cool off         –     65

G&D = Gardner & Davies token frequency
S&R = Schmitt & Redwood token frequency
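One simple way to read the comparison in Appendix A is to look at the ratio of the two counts for each verb. The sketch below does this for a handful of rows copied from the table. The function names and the twofold threshold are arbitrary assumptions made for illustration, not part of the study.

```python
# Token frequencies copied from a few rows of Appendix A: verb -> (G&D, S&R).
counts = {
    "go on": (14903, 16228),
    "carry out": (10798, 4199),
    "come in": (4814, 9450),
    "get up": (3936, 3892),
    "put on": (1428, 6977),
}


def count_ratio(gd: int, sr: int) -> float:
    """Ratio of the S&R count to the G&D count for one phrasal verb."""
    return sr / gd


def flag_discrepancies(table: dict, threshold: float = 2.0) -> list:
    """Return verbs whose two counts differ by at least `threshold` times.

    The threshold is an illustrative assumption; it has no basis in the source.
    """
    flagged = []
    for verb, (gd, sr) in table.items():
        ratio = count_ratio(gd, sr)
        if ratio >= threshold or ratio <= 1 / threshold:
            flagged.append(verb)
    return flagged


print(flag_discrepancies(counts))  # ['carry out', 'put on']
```

Verbs such as go on and get up have nearly identical counts in the two analyses, whereas carry out and put on diverge sharply in opposite directions.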
Appendix B. Productive phrasal verb test
Student: _____________________________________ Level: _________

We are carrying out a study of students’ receptive and productive knowledge of phrasal
verbs. To help us in our research please complete this productive knowledge test.
Read each question carefully, and then write what you think the missing words
(a phrasal verb) are, in the space next to the question. To help you, the first letter of
each word is shown. We have also given a definition for each phrasal verb after every
sentence. There are 60 questions and each one uses a different phrasal verb.
You have 40 minutes to finish the test. Good luck!
Example questions:
#    Question (Answer)
i    This is a really good piece of work. You must have p________ i________ a lot of effort. (make an effort, spend time)  Answer: put in
ii   I don’t have enough money to pay the tuition fees. I need to ask the bank if I can t________ o________ a loan to pay for them. (get, obtain)  Answer: take out
iii  We spent the afternoon at the airport watching the planes t________ o________ and land. (leave the ground and fly)  Answer: take off / taking off

 1   Mike needs a lift from the station. Can you go and p________ him u________? (collect, give a lift)
 2   I think we’ve spent enough time talking about this. We should m________ o________ to the next item. (continue, proceed)
 3   P________ the book b________ on the shelf when you’ve finished with it. (return, replace)
 4   I don’t like that picture on the wall there. I think I’ll t________ it d________ and hang it somewhere else. (remove, move to a lower position)
 5   I don’t want to stay in and cook tonight. Let’s g________ o________ for a meal. (leave your house for a special reason)
 6   I know you’re tired, but we can’t stop now. We have to g________ o________ until we finish. (continue, proceed)
 7   She was relaxing reading a book when a loud crash made her s________ u________ straight in her chair. (seated with a straight back)
 8   It rained all morning, but in the afternoon the sky cleared and the sun c________ o________. (appear, become visible)
 9   I can’t find my phone anywhere. Please l________ a________ the house and see if you can find it. (search, view)
10   The police s________ u________ roadblocks to stop people driving into the city centre. (build, erect)
11   Derek has got the keys to his new flat and I’m going to help him m________ i________ tomorrow. (occupy, start to live in)
12   I was extremely sorry to hear that John’s father p________ a________ yesterday. I understand that he had been very ill for a long time. (die)
13   I wonder where Pete is today. Jim, could you f________ o________ what’s happened to him? (discover, check)
14   I have been offered a really good job in London, but I don’t want to move, so I’m going to t________ the offer d________. (reject, refuse, say no)
15   Please c________ i________, take a seat and make yourself comfortable. (enter)
16   We heard this really loud explosion, and found out later that a bomb had g________ o________ in the city centre. (explode)
17   I think I t________ a________ my mother. We look very similar and we like the same kind of things. (similar to, be like)
18   As I was running across the field one of my shoes got stuck in the mud and c________ o________. (be detached, separate from)
19   There are plenty of chairs. Let’s all s________ d________ together and have a nice long chat. (take a chair)
20   It’s a problem finding a job now because companies are just not t________ o________ new staff at the moment. (employ, recruit, accept)
21   Don’t let go of the rope. H________ o________ tight and I’ll try and pull you out. (grasp, grip firmly)
22   I thought this question was difficult at first but I managed to w________ o________ the right answer in the end. (learn, discover, calculate)
23   What time does this train g________ i________ to Manchester? (arrive, enter the station)
24   I am trying to get Peter to tell me when he wants to go on holiday, but it’s very difficult to p________ him d________ to an exact date. (make him decide)
25   There are a lot more girls than boys in the English Department. In fact, they m________ u________ 85% of the students. (comprise, form)
26   There’s more milk in the fridge. Can you g________ some o________ please? (remove, take from)
27   We could fit more people on the bus if everybody m________ u________ a bit. (change position to make more space)
28   I should go to bed. I’ve got to g________ u________ early in the morning. (rise from bed)
29   They c________ o________ from Italy every summer to stay with us in London. (travel)
30   Jean was so angry with Ray that she took all his photos, t________ them u________, and threw the pieces on the fire. (rip apart, shred)
31   Henry has m________ o________ of his flat and gone back to live with his parents. (leave, vacate)
32   I need more time to decide what to do. Can you give me a few days to t________ it o________? (consider, contemplate, ponder)
33   I didn’t mean to stop you working. Please c________ o________ with what you were doing. (continue)
34   When searchers saw the floating wreckage they knew the missing plane had c________ d________ in the sea. (crash, land, fall)
35   I have got your new English books here. Maria, can you g________ them o________ to the class? (distribute, hand to)
36   The writing was very difficult to read and it was hard to m________ o________ what it said. (see, recognise, distinguish)
37   When are you going to p________ me b________ the money I lent you? (return)
38   Can you g________ me b________ my pen? I need it now. (return)
39   There’s no room for my things on the shelf. Your books t________ u________ all the space. (occupy, use, fill)
40   The doctors aren’t sure what’s wrong with her and they need to c________ o________ more tests. (do, complete)
41   Don’t climb up there. It’s dangerous. G________ d________ at once before you fall! (move to a lower position, descend)
42   Let’s p________ u________ some posters on the notice board to advertise our concert. (fix/attach somewhere they can be seen)
43   When we first met we didn’t like each other much but now we g________ o________ really well. (have a good relationship, be friends)
44   Mary missed a lot of lessons and has f________ b________ the rest of the class. She will have to work hard to catch up. (fail to keep level with)
45   When I have a long piece of writing to do I find it easier if I b________ it d________ into small parts. (divide, separate, take apart)
46   I can’t hear what you’re saying. Can you t________ that music o________? (stop by using a switch)
47   Trevor was working in his garden the other day, putting in some new plants, when he d________ u________ an old box full of silver coins. (remove from the ground)
48   Do the plates g________ i________ this cupboard? I’m not sure where to put them. (be stored, be put)
49   It’s always a good idea to g________ o________ your answers to check you haven’t made any silly mistakes. (check, examine, survey)
50   The staff, using buckets of water, managed to p________ the fire o________ before the fire crew arrived. (extinguish, stop from burning)
51   The crowd rushed forward and the riot police were unable to h________ them b________. (stop, contain, check)
52   Mark thinks he is a bit of a romeo. He is always trying to c________ u________ the girls. (talk to in a friendly way)
53   They’ve p________ o________ their trip to Australia until next year to give them more time to save up some money. (postpone, cancel until a later date)
54   Lots of people applied for the job but Mary was p________ o________ as the best candidate. (choose, select)
55   Quick, p________ your coat o________. We’re going now. (wear, clothe yourself)
56   Are you m________ b________ to Scotland after you’ve finished your work here? (return)
57   When the food and drink ran out the party b________ u________ and everyone went home. (come to an end, finish)
58   It’s so hot. Let’s go for a swim in the lake to c________ o________. (lose heat, get colder)
59   The football club sacked their manager and b________ i________ a new man in the hope of improving results. (introduce, employ)
60   This phone’s still not working properly. I’ll have to t________ it b________ to the shop where I bought it. (return)
Thank you very much for completing the first part of the study.
Appendix C. Receptive phrasal verb test
Student: __________________________________ Level: __________

For the second part of our study we would like to know about your receptive knowl-
edge of phrasal verbs. To help us, please complete this multiple choice test.
Read each question carefully, and then choose the best answer (A, B, C, D) to go
in the spaces. There is only one correct answer for each question. If you do not know
the answer write E. To help you there is a definition for each phrasal verb after every
sentence. You have 30 minutes to finish. Good luck!
Example question:
#   Sentence
    Options A, B, C, D, E; Answer

 0  When we tried to buy tickets for the concert we were told there they had ______ ______ within a couple of hours. (all had been bought and there were none left)
    A sold down   B sold out   C sold up   D sold in   E ?   Answer: B

 1  We heard this really loud explosion, and found out later that a bomb had ______ ______ in the city centre. (explode)
    A gone back   B gone in   C gone off   D gone up   E ?   Answer: ____
 2  We could fit more people on the bus if everybody _____ _____ a bit. (change position to make more space)
    A broke up   B looked up   C turned up   D moved up   E ?   Answer: ____
 3  I am trying to get Peter to tell me when he wants to go on holiday, but it’s very difficult to ______ him ______ to an exact date. (make him decide)
    A pin in   B pin on   C pin up   D pin down   E ?   Answer: ____
 4  I think we’ve spent enough time talking about this. We should _____ _____ to the next item. (continue, proceed)
    A move in   B move down   C move out   D move on   E ?   Answer: ____
 5  Mike needs a lift from the station. Can you go and ______ him ______? (collect, give a lift)
    A pick out   B pick up   C pick at   D pick on   E ?   Answer: ____
 6  I wonder where Pete is today. Jim, could you ______ ______ what’s happened to him? (discover, check)
    A find in   B find up   C find on   D find out   E ?   Answer: ____
 7  I should go. I’ve got to _____ _____ early in the morning. (rise from my bed)
    A work up   B stand up   C get up   D take up   E ?   Answer: ____
 8  What time does this train _____ _____ to Manchester? (arrive, enter the station)
    A take in   B give in   C get in   D bring in   E ?   Answer: ____
 9  I don’t like that picture on the wall there. I think I’ll ______ it _______ and hang it somewhere else. (remove, move to a lower position)
    A turn down   B stand down   C take down   D hold down   E ?   Answer: ____
10  It rained all morning, but in the afternoon the sky cleared and the sun _____ _____. (appear, become visible)
    A took out   B came out   C made out   D passed out   E ?   Answer: ____
11  Henry’s _____ _____ of his flat and gone back to live with his parents. (left, vacate)
    A moved out   B moved off   C moved back   D moved on   E ?   Answer: ____
12  When searchers saw the floating wreckage they knew the missing plane had _____ _____ in the sea. (crash, land, fall)
    A come down   B come across   C come out   D come up   E ?   Answer: ____
13  There’s no room for my things on the shelf. Your books _____ _____ all the space. (occupy, use, fill)
    A take in   B take on   C take out   D take up   E ?   Answer: ____
14  Can you ______ me ______ my pen? I need it now. (return)
    A give off   B give up   C give out   D give back   E ?   Answer: ____
15  It’s always a good idea to _____ _____ your answers to check you haven’t made any silly mistakes. (check, examine, survey)
    A show over   B take over   C go over   D give over   E ?   Answer: ____
16  There are plenty of chairs. Let’s all _____ _____ together and have a nice long chat. (take a chair)
    A sit down   B sit on   C sit over   D sit off   E ?   Answer: ____
17  The football club sacked their manager and _____ _____ a new man in the hope of improving results. (introduce, employ)
    A held in   B brought in   C turned in   D came in   E ?   Answer: ____
18  Lots of people applied for the job, but she was _____ _____ as the best candidate. (choose, select)
    A picked out   B picked back   C picked over   D picked in   E ?   Answer: ____
19  The police _____ _____ roadblocks to stop people driving into the city centre. (build, erect)
    A set in   B set up   C set on   D set at   E ?   Answer: ____
20  This phone’s still not working properly. I’ll have to ______ it ______ to the shop where I bought it. (return)
    A take back   B set back   C turn back   D look back   E ?   Answer: ____
21  Jean was so angry with Ray that she took all his photos, ______ them ______, and threw the pieces on the fire. (rip apart, shred)
    A took up   B tore up   C set up   D looked up   E ?   Answer: ____
22  Let’s _____ _____ some posters to advertise our concert. (fix, attach somewhere they can be seen)
    A go up   B put up   C give up   D sit up   E ?   Answer: ____
23  When the food and drink ran out the party _____ _____ and everyone went home. (come to an end, finish)
    A broke up   B broke in   C broke over   D broke out   E ?   Answer: ____
24  Please _____ _____, take a seat and make yourself comfortable. (enter)
    A put in   B come in   C give in   D bring in   E ?   Answer: ____
25  I was extremely sorry to hear that John’s father _____ _____ yesterday. I understand that he had been very ill for a long time. (die)
    A passed about   B passed back   C passed away   D passed up   E ?   Answer: ____
26  I know you’re tired, but we can’t stop now. We have to _____ _____ until we finish. (continue, proceed)
    A put on   B look on   C go on   D take on   E ?   Answer: ____
27  Do the plates _____ _____ this cupboard? I’m not sure where to put them. (be stored, be put)
    A come in   B give in   C take in   D go in   E ?   Answer: ____
28  They _____ _____ from Italy every summer to stay with us in London. (travel)
    A come about   B come on   C come off   D come over   E ?   Answer: ____
29  I need more time to decide what to do. Can you give me a few days to ______ it ______? (consider, contemplate, ponder)
    A think over   B think under   C think up   D think back   E ?   Answer: ____
30  I can’t find my phone anywhere. Please _____ _____ the house and see if you can find it. (search, view)
    A look across   B look down   C look on   D look around   E ?   Answer: ____
31  When I have a long piece of writing to do I find it easier if I ______ it ______ into small parts. (divide, separate, take apart)
    A break back   B break off   C break out   D break down   E ?   Answer: ____
32  I thought this question was difficult at first but I managed to _____ _____ the right answer in the end. (learn, discover, calculate)
    A work in   B work up   C work out   D work off   E ?   Answer: ____
33  Don’t let go of the rope. _____ _____ tight and I’ll try and pull you out. (grasp, grip firmly)
    A Hold in   B Hold off   C Hold on   D Hold up   E ?   Answer: ____
34  I have got your new English books here. Maria, can you ______ them ______ to the class? (distribute, hand to)
    A give off   B give out   C give up   D give on   E ?   Answer: ____
35  There’s more milk in the fridge. Can you _____ some _____ please? (remove, take from)
    A go out   B hold out   C make out   D get out   E ?   Answer: ____
36  I don’t want to stay in and cook tonight. Let’s _____ _____ for a meal. (leave your house for a special reason)
    A go under   B go out   C go up   D go in   E ?   Answer: ____
37  I didn’t mean to stop you working. Please _____ _____ with what you were doing. (continue)
    A carry off   B carry back   C carry on   D carry up   E ?   Answer: ____
38  It’s so hot. Let’s go for a swim in the lake to _____ _____. (lose heat, get colder)
    A cool on   B cool in   C cool off   D cool up   E ?   Answer: ____
39  There are a lot more girls than boys in the English Department. In fact, they _____ _____ 85% of the students. (comprise, form)
    A make on   B make up   C make in   D make off   E ?   Answer: ____
40  ______ the book ______ on the shelf when you’ve finished with it. (return, replace)
    A Put off   B Put in   C Put under   D Put back   E ?   Answer: ____
41  The crowd rushed forward and the riot police were unable to ______ them ______. (stop, contain, check)
    A hold under   B hold back   C hold on   D hold over   E ?   Answer: ____
42  The doctors aren’t sure what’s wrong with her and they need to _____ _____ more tests. (do, complete)
    A carry down   B carry in   C carry up   D carry out   E ?   Answer: ____
43  Don’t climb up there. It’s dangerous. _____ _____ at once before you fall! (move to a lower position, descend)
    A Get down   B Take down   C Look down   D Put down   E ?   Answer: ____
44 When are you going to _____ me _____ the money I lent you? (return)
   pay back   pay on   pay after   pay down   ?
45 The staff, using buckets of water, managed to ______ the fire ______ before the fire crew arrived. (extinguish, stop from burning)
   put out   put up   put off   put in   ?
46 It’s a problem finding a job now because companies are just not _____ _____ new staff at the moment. (employ, accept)
   taking on   going on   looking on   getting on   ?
47 I can’t hear what you’re saying. Can you _____ that music _____? (stop by using a switch)
   turn out   turn back   turn in   turn off   ?
48 She was relaxing reading a book when a loud crash made her _____ _____ straight in her chair. (seated with a straight back)
   sit off   sit over   sit up   sit on   ?
49 I think I _____ _____ my mother. We look very similar and we like the same kind of things. (similar to, be like)
   take in   take up   take after   take back   ?
50 Are you _____ _____ to Scotland after you’ve finished your work here? (return)
   standing back   looking back   moving back   bringing back   ?
51 Mary missed a lot of lessons and has _____ _____ the rest of the class. She will have to work hard to catch up. (fail to keep level with)
   looked behind   turned behind   fallen behind   put behind   ?
52 I have been offered a really good job in London, but I don’t want to move, so I’m going to ______ the offer ______. (reject, refuse, say no)
   turn over   turn up   turn down   turn off   ?
53 Trevor was working in his garden the other day, putting in some new plants, when he _____ _____ an old box full of silver coins. (remove from the ground)
   dug down   dug up   dug off   dug on   ?
54 Derek has got the keys to his new flat and I’m going to help him _____ _____ tomorrow. (occupy, start to live in)
   give in   move in   make in   work in   ?
55 When we first met we didn’t like each other much but now we _____ _____ really well. (have a good relationship, be friends)
   take on   look on   bring on   get on   ?
56 The writing was very difficult to read and it was hard to _____ _____ what it said. (see, recognise, distinguish)
   make off   make out   make up   make in   ?
57 They’ve _____ _____ their trip to Australia until next year to give them more time to save up some money. (postpone, cancel until a later date)
   put off   put up   put over   put in   ?
58 As I was running across the field one of my shoes got stuck in the mud and _____ _____. (be detached, separate from)
   took off   came off   turned off   put off   ?
59 Quick, ______ your coat ______. We’re going now. (wear, clothe yourself)
   look on   put on   hold on   make on   ?
60 Mark thinks he is a bit of a romeo. He is always trying to ____ _____ girls. (talk to in a friendly way)
   chat up   chat off   chat in   chat out   ?
Appendix D. Biodata questionnaire
Finally, we would like to know how much exposure you have to English. Please spend
a few minutes filling in this brief questionnaire.
How long have you been learning English?
   less than 1 year   1 – 2 years   3 – 5 years   over 5 years
Where do you have English lessons? (you can mark more than one box)
   school   language school   private lessons
How many hours of English lessons do you have each week?
   1 – 2 hours   2 – 4 hours   more than 4 hours
How much time do you spend reading books, magazines and newspapers in English, or visiting
English language websites each week?
   0 – 1 hour   1 – 2 hours   more than 2 hours
How much time do you spend watching films, videos or TV in English each week?
   0 – 1 hour   1 – 2 hours   more than 2 hours
How much time do you spend listening to music in English each week?
   0 – 1 hour   1 – 3 hours   more than 3 hours
Do you use English to make new friends and keep in contact with people? (Facebook, MySpace,
Twitter, Skype, email, instant messaging, SMS [texts] etc)
   never   1 – 2 hours a week   more than 2 hours a week
Your age
Your gender
   male   female
Your nationality (country)
Many thanks for your help. If you would like to know your scores please fill in your
email address below.
Corpora and the new Englishes
Using the ‘Corpus of Cyber-Jamaican’ to explore
research perspectives for the future
Christian Mair1
Contrasts between British and American usage were an important topic in
computer-aided corpus linguistics from the very start. The present contribution
shows how from these beginnings the scope of corpus-based research was
successively extended to cover standard varieties of the New Englishes (e.g.
in the International Corpus of English) and eventually also non-standard and
vernacular varieties, so that today the corpus-linguistic approach has become
an important complement to sociolinguistics in the study of variation in the
New Englishes. From a general discussion of this development, the contribution
moves on to present the ‘Corpus of Cyber-Jamaican’ (CCJ), a large web-derived
corpus of diasporic Jamaican web forums, and shows in a number of exploratory
studies how this new resource can be used to investigate the globalisation of
vernacular features.
1. The corpus-based documentation of the New Englishes: A brief historical survey
An interest in the description of regional varieties of Standard English, especially the
pluricentric standardisation of the world language, was one of the driving forces behind
the rise of computer-aided corpus linguistics from the very start. When W. Nelson
Francis and Henry Kučera had completed the Standard Corpus of Present-Day Edited
American English, for Use with Digital Computers (subsequently known as the Brown
corpus) in 1964 (sampling date of texts: 1961), this inspired the creation of its British
analogue, the Lancaster-Oslo/Bergen (or LOB) corpus, albeit with a delay of more
than ten years (completion of corpus: 1978; sampling date: 1961). To this pair were
eventually added three corpora devoted to different kinds of New Englishes: the
Kolhapur Corpus of Indian English in 1986 (sampling date: 1978), the Australian Corpus
of English (ACE) and the New Zealand ‘Wellington’ Corpus of English (sampling
dates: 1986 and 1986/87 respectively). By the early nineteen-nineties corpus coverage
of the New Englishes was thus certainly not complete, but the major types – transplanted
‘settler’ Englishes (Australia, New Zealand) and second-language or ‘official’
English (India) – were represented.
1. The present paper was written while I enjoyed the extremely productive and congenial
working environment provided by FRIAS, Freiburg University’s Institute for Advanced Studies.
I am grateful for this support. My thanks are also due to the members of the CCJ team, Anastasia
Cobet, Johanna Holz, Véronique Lacoste, Andrea Moll and Larissa Teichert, for help with corpus
searches and many fruitful discussions.
In spite of its general currency, the term ‘New English(es)’ is notoriously difficult to
define. The least controversial understanding is the purely chronological one: a variety
of English which arose in the wake of the second wave of British colonial expansion,
after the loss of the thirteen North American colonies in the late 18th century. This
definition does not imply any claim about the linguistic structure or social status of
any one specific New English. A more specific understanding (advocated, for example,
in Platt et al. 1984) would restrict the term to non-native institutionalised varieties
(which, of course, would exclude Australian English or natively spoken South African
English). If defined in this way, New Englishes are first and foremost to be described
as contact Englishes, or learner Englishes which have been institutionalised in their
communities.
Since all New Englishes, however defined, result from colonialism, there is sig-
nificant overlap between this group and what Peter Trudgill has referred to as “colonial
Englishes”:
[We shall use the term] colonial as a technical term covering in principle all types
of English other than those spoken in England and the lowlands of Scotland – the
part of the world to which English was almost entirely confined until the seven-
teenth century, which is to say for most of its history. Those varieties of English
which are spoken elsewhere in the world – the colonial varieties – have resulted
from movements of people outwards from Britain, from the seventeenth century
onwards, often involving dialect mixture; the influence of other languages with
which English has come into contact; and independent developments that have
occurred subsequently in different parts of the world, some of them in response
to new environments and new uses. These colonial varieties include the forms of
English spoken in the Highlands of Scotland, in Wales, in the English county of
Cornwall (which has been entirely English speaking only since the seventeenth
or eighteenth centuries), and in Ireland, the Isle of Man, Canada, the United
States of America, Central America, South America, the Caribbean, the Bahamas,
Bermuda, St. Helena, Tristan da Cunha, the Falkland Islands, Liberia, East Africa,
South Africa, Zimbabwe, Australia, and New Zealand, as well as in many other
areas of the world where second-language and/or pidginized and creolized forms
of English are to be found. (Trudgill 1986: 127)
As can be seen, the difference between natively spoken and non-native varieties of
English plays no prominent role in this definition, and this has become the main-
stream view in World Englishes research today (with the term colonial being replaced
by postcolonial, as in the title of Schneider’s [2007] book).
Given that the boundary between native and non-native varieties is permeable
and that US independence is a historical rather than a linguistic watershed, a priori
definitions and categorisations seem of limited practical value. Singaporean English,
for example, like Malaysian English, started out as a clear case of a second-language
variety at the dusk of the colonial era in South East Asia in the 1950s. For many of its
speakers, however, it has now become a native variety (while Malaysian English, by
contrast, has undergone a process of societal disestablishment, towards being a foreign
language rather than a second or official one, during the same period – cf. Schneider
2007: 147–148). Irish English represents a chronological dilemma. Is it a New English
because its most distinctive features result from the mass shift from Irish to English in
the second half of the 19th century, or do we have to also consider its pre-history ex-
tending back into the Middle Ages?
Jamaican English, the variety which will be focussed on in the present contribu-
tion, is intractable on both counts. Chronologically, Jamaican Creole or patois, the
English-lexifier creole spoken by the mass of the population, had certainly consoli-
dated by the first half of the 18th century, before US independence. With regard to
native-speaker status, there is a clear conflict between popular opinion, according to
which the creolophone British West Indies have always seen themselves as ‘English-
speaking’, and many linguists and educators, who point out that the creoles of the re-
gion are phonologically and grammatically distinct languages from English and that a
competence in English for most speakers is acquired in the educational system, much
as is the case in second-language communities.
After reviewing the various problems, the working definition of ‘New English(es)’
I shall adopt is ‘any postcolonial variety of English which is undergoing the process of
endonormative stabilisation and standardisation that Standard British and American
English, the two global reference standards, have completed’. Australian English, for
which there have been widely accepted locally produced dictionaries and usage manu-
als since the 1980s and which has supra-national influence in the South Pacific region,
presents itself as a New English far advanced in this process. Jamaican English, whose
norms are still emerging in a force-field defined by an inherited but weakening British
Standard, increasing influence from the US, and a growing readiness to accommodate
Jamaican Creole features, is following behind at some distance.
Whichever way one defines the notion of ‘New English’, however, all early corpora
in the field had a major shortcoming, in that they were restricted to written English. In
this regard, the International Corpus of English (ICE) project, conceived in 1990 by
Sidney Greenbaum (see Greenbaum 1990, 1996), represented a major advance by
including spoken language.2 Equally important was a broadening of the base for sys-
tematic comparative research, both on the relationship between ‘Old’ and New Eng-
lishes, and on similarities and contrasts among the New Englishes themselves. From
the start, links were established between the ICE project and the International Corpus
of Learner English (ICLE) developed at around the same time by Sylviane Granger.
Apart from studying the obviously interesting question of which widespread learner
features generally tend to make it into institutionalised second-language standards, it
is possible also to look at more specific constellations: Hong Kong English, for example,
is documented as an institutionalised contact English in ICE, and similar contact phe-
nomena will no doubt be in evidence in the Chinese component of ICLE.3 The critical
question whether or not the second-language New Englishes represent a special case of
language acquisition was bravely raised in a very early paper by Williams (1987), but
not followed up systematically – no doubt partly because of the lack of suitable data
and corpora. Today, we have these data, for example by joining up ICE and ICLE in the
way envisaged by Sylviane Granger at the very inception of the ICLE project.
Directed by Sidney Greenbaum until his death and then passing on into the stew-
ardship of Bas Aarts, the British component of ICE (ICE-GB) was completed, anno-
tated for part of speech, parsed syntactically, and made searchable through a sophisti-
cated customised software package (ICECUP). When with the second release of the
corpus in 2006 the sound files were made available alongside the transcription, a
benchmark was set for other ICE ventures and the wider corpus-linguistic community.
A project designed to document as many existing and emerging regional standard
varieties of English as possible is of course beset by numerous difficulties, but it is
vivid testimony to Greenbaum’s far-reaching insight that, at the time of this writing,
ten more ICE sub-corpora have been completed (in plain-text versions) and several
more are in the making at various stages of completeness.4
As the survey of these corpus-ventures makes clear, the focus of current corpus-
based research on the New Englishes is very much on the Standard English end of the
sociolinguistic scale. ICE corpora document the English of educated users of the lan-
guage, and not of others. Thus, the Jamaican component of ICE focuses on Jamaican
English, and not on the island’s vital and thriving English-lexifier creole, or the mesolectal
span of the English-creole continuum which most residents of the island tend
to feel most at home in (Patrick 1999). This, however, is a bias which is likely to be redressed
in the foreseeable future as sociolinguists and dialectologists are becoming
increasingly receptive to the potential of corpora and corpus-related methods of investigation
(cf., e.g., Kortmann 2005; Beal et al. 2007; Tagliamonte & D’Arcy 2007a, 2007b;
Mair 2009). As for non-standard corpora specifically of the New Englishes, consider
the Corpus of Nigerian Pidgin, published as part of Deuber (2005), or the Corpus of
Written British Creole (compiled by Mark Sebba and Susan Dray, cf.
http://www.ling.lancs.ac.uk/staff/mark/cwbc/cwbcman.htm).
2. Another pioneer in the development of spoken corpora of the New Englishes who deserves
mention is Janet Holmes, who started collecting texts for the Wellington Spoken Corpus of New
Zealand English (WSC; completed 1998) in 1988.
3. For a more comprehensive survey of work taking place at the intersection of New-Englishes
and learner-English research, see Mukherjee & Hundt (forthcoming).
4. The completed sub-corpora cover Australia, Canada, East Africa, Hong Kong, India, Ireland,
Jamaica, New Zealand, Philippines and Singapore. Most prominent and eagerly awaited
among those still being compiled is ICE-USA, while work is going on to produce corpora for
Fiji, Ghana, Malaysia, Malta, Namibia, Nigeria, Pakistan, South Africa, Sri Lanka, and Trinidad
& Tobago. See http://ice-corpora.net/ice/ for further information.
While the attempt is still occasionally made (cf. Nelson 2006), as in most other
sub-disciplines of corpus linguistics, it is now impossible to compile tidy and complete
lists of available corpus resources for the study of the New Englishes. As corpora are
becoming mainstream and the vernacularisation of the World Wide Web is progress-
ing apace, corpora of the New Englishes of all sizes and degrees of specialisation are
proliferating, with many of them being derived from the Web, the almost inexhaust-
ible digital text archive (as, e.g., in Mukherjee & Hoffmann 2006, a study of Indian
newspaper language).
The rich corpus-linguistic working environment which we now have for the study
of the New Englishes encourages comparative research among these varieties (and, of
course, also among some or all of them and the ‘old’ Englishes which provided the input
at their genesis and have continued to influence them since). Which features are shared
by many or even all of the New Englishes (as ‘vernacular universals’, ‘Angloversals’, ‘New
Englishisms’)?5 Which features, on the other hand, are locally specific (and therefore
often traceable to particular language contact in multilingual settings)? Are there spe-
cific morphosyntactic profiles distinguishing natively spoken New Englishes such as
Canadian English, New Zealand English or Australian English from second-language
New Englishes such as West African or Southern Asian varieties? Can we establish re-
current diachronic trends in the emergence and stabilisation of the New Englishes? And
so on. Tentative answers to the first three questions are proposed, for example, in the
synoptic chapters included in the recent Handbook of Varieties of English (e.g. Kortmann
& Szmrecsanyi 2004), whereas the fourth question is at the centre of Schneider (2007).
2. Current challenges: The web as a data source for the study of the new Englishes
After this sketch of the thriving field of corpus-based research on the New Englishes, I
would like to move on to present first findings from an ongoing research project of my
own, the investigation of the use of Jamaican English and Jamaican Creole in a large
(> 15 million words) corpus of Jamaican web-posts which is currently being compiled
at Freiburg. The project builds on previous research based on the Jamaican component
of ICE (ICE-JA), with the obvious first goal being to position ‘cyber-Jamaican’ with
regard to spoken and written usage as documented in ICE-JA.6 As computer-mediated
communication is not restricted territorially to the island of Jamaica but involves a
large Jamaican diaspora in the US, Canada and Great Britain, it will also be interesting
to compare the strength of American influence on the web (in the Corpus of Cyber-Jamaican)
and ‘on the ground’ (in ICE-JA).
5. This is not the place to enter into a detailed discussion of contrasts and overlap between
these three terms. The notion of ‘vernacular universal’ goes back to Chambers (e.g. 2000, 2003);
‘Angloversal’ was coined by the present writer – see Mair (2003) – and was made the subject of
an extensive study by Sand (2005); ‘New Englishisms’ is used in Simo Bobda (2001, 2004).
On the methodological and theoretical plane, the project positions itself at the
intersection of web-based corpus linguistics and the emerging field of the sociolinguis-
tics of computer-mediated communication. There has been a boom in all fields of web-
related corpus-linguistic activity. For example, it is now standard practice in corpus
compilation to draw on available digitised text from the web. Mark Davies’ large web-
derived Latin American newspaper corpora produced in the late 1990s (Davies 2001)
were a pioneering venture, and his 400+-million-word generically balanced Corpus of
Contemporary American English (COCA, http://www.americancorpus.org/) is con-
vincing proof of the viability of the approach in the field of English. The Special Inter-
est Group of the Association for Computational Linguistics (ACL) on Web as Corpus
(SIGWAC, http://www.sigwac.org.uk/) is one of several initiatives set up to ensure that
the technological challenges presented by the task are tackled in a co-ordinated way.
So far, most such activity has resulted in corpora documenting standard varieties of
language. The presence of informal and vernacular features has been noted in many
web-based textual genres and has usually been explained as the result of the mediated
immediacy of digital communication (its ‘pseudo-oral’ quality). Vernacular varieties
of English, however, have not usually been made the explicit target of collection.
In addition to introducing a new type of data into web-based corpus linguistics,
the corpus also holds considerable innovative potential for the study of computer-
mediated communication. Linguistic studies of computer-mediated communication
on the other hand – whether in the mainstream inspired by discourse-analytical ap-
proaches (Beißwenger & Storrer 2008) or in the emerging sociolinguistic paradigm
(Androutsopoulos 2006a) – commonly employ the methods of qualitative ethnogra-
phy or use small corpora compiled for the purposes of a particular study. Some of the
findings from such analyses are therefore necessarily provisional in the sense that it is
not clear how far the insights are confined to the discursive constellation investigated
and how far they can be generalised. In such a situation, large corpora can be
extremely helpful. To give a simple example: in a qualitative analysis of a few ‘threads’
in a particular web-based discussion forum, the researcher can never be sure whether
all relevant conventionalised spellings of some vernacular form are actually represented
or whether the statistical preference for one spelling over the others is significant; in a
corpus drawn from millions of posts by thousands of participants, the answers to these
questions can be given on much firmer ground.
6. The e-mail messages in the small W1B-section of ICE-JA represent the only infiltration of
computer-mediated language into this corpus. They will not figure in the comparison.
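The point about spelling variants can be made concrete with a minimal sketch. It assumes that posts are available as plain strings. The variant list and the toy posts are invented for illustration and are not CCJ data (only nutten is attested in the ICE-JA material quoted in this chapter). The test shown is a simple Pearson chi-square against equal expected frequencies, one of several ways the "statistical preference" question could be approached, and not necessarily the procedure used in the project.

```python
import re
from collections import Counter

# Hypothetical spelling variants of one vernacular form (illustrative only).
VARIANTS = ("nutten", "nuttin", "notten")


def count_variants(posts, variants=VARIANTS):
    """Count whole-word, case-insensitive occurrences of each variant."""
    counts = Counter({v: 0 for v in variants})
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    for post in posts:
        for match in pattern.findall(post):
            counts[match.lower()] += 1
    return counts


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no occurrences of any variant")
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


# Invented toy posts, NOT corpus data.
toy_posts = [
    "who no know nutten",
    "Nutten to report today",
    "nuttin much here",
    "that is nutten new",
]

counts = count_variants(toy_posts)
stat = chi_square_uniform(counts)
print(dict(counts), round(stat, 2))
```

With three variants the statistic has two degrees of freedom, so it would need to exceed 5.991 to be significant at the 5% level. A handful of posts, as in the toy data, cannot settle the question, which is precisely the argument for a corpus drawn from millions of posts.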
3. The data: CCJ, a corpus of cyber-Jamaican English/Jamaican Creole
Before describing the data on which the present study is based, a brief word of clarifi-
cation is in order on the terminology employed. The term used to refer to the emerging
local norm of educated English usage in Jamaica, the most populous and culturally
most influential former British colony in the Caribbean, is ‘Jamaican English’
(JamEng). Jamaican English is considered one of the New Englishes which have arisen
in the wake of decolonisation and can be placed at various stages of endonormative
development as described in the widely used model proposed by Schneider (2007).
‘Jamaican Creole’ (JC), locally known as patois, on the other hand, is the name of the
English-lexicon Creole which consolidated on the island in the course of the transition
to a slave-based plantation economy in the early 18th century. It is easy to claim a sta-
tus for JC as a separate and independent language from English on the basis of its
distinctive grammar and phonology, but most speakers on the island do not draw a
rigid line of division between JamEng and JC. While small minorities of speakers may
be clearly dominant in ‘basilectal’ JC or ‘acrolectal’ JamEng, most cover a span on the
‘mesolectal’ range of the JC-JamEng ‘continuum’ in their own usage. Note again that,
as has been pointed out, even speakers who show strong and obvious influences from
JC still consider themselves as speakers of English, if of a more or less highly stigma-
tised variety.7
What are the problems this highly fluid and flexible continuum situation raises in
corpus-practical terms? They are manageable in written material: outside of experi-
mental and literary writing it is usually easy to distinguish between JamEng and JC in
written texts. From colonial times, there has been a strict diglossia, in which the lan-
guage appropriate for writing has been Standard English. JC has a phonemic orthog-
raphy developed by linguists Frederic G. Cassidy and Robert B. LePage which is used
in some scholarly linguistic writing and valiantly promoted by the University of the
West Indies-based Jamaican Language Unit but has made little headway among the
general public, who prefer an English-based ad hoc system. Full texts written in JC
tend to come from the domains of folklore, dialect poetry, or humour. In recent times,
the reggae movement, its derivatives (e.g. dancehall) and socio-religious movements
such as Rastafarianism, with their emphasis on affirming the African and Afro-Carib-
bean folk heritage of Jamaica, have added to this repertoire.
7. Is English We Speaking has thus come to serve as the appropriately emblematic title of a col-
lection of essays on Caribbean literature and culture by Mervyn Morris (1999) and is echoed in
two linguistic studies: Youssef (2004) and Deuber (2009).
In the written texts of ICE-JA, borrowings from JC are therefore very rare, and
usually identified as such, for example by the use of quotation marks, as in the follow-
ing instance:
(1) They just cannot afford to go to University and so their education ceases at the
high school sixth form, community college level. They often times do not have
the financial ‘backative’ needed to secure a Student Loan.
(ICE-JA, W1a, student writing)
The writer consciously chooses a JC term meaning ‘support’ in order to add local
colour to the text and draws attention to this device by using the quotation marks.
Given that we are dealing with student writing, the quotation marks may additionally
be used to prevent a teacher correcting the use of backative as a mistake. While there
are a number of plausible rhetorical or stylistic motivations for the occasional use of JC
vocabulary in JamEng written texts, contact features or JC borrowing at the morpho-
syntactic level would usually be considered erroneous:
(2) Secondly the government provides a Student Loan Bureau which lends quali-
fied students the tuition fee to pay the university and this money is paid back
after the student have recieve|receive their tertiary level education and can
pay back the loan. (ICE-JA, W1a, student writing)
There is self-correction of the spelling error in receive (as indicated by |), but the
absence of the plural in student (or of appropriate singular marking on have) and of
the regular past-participle ending -ed in received would be noted and disapproved
of by expert writers. This is different in spoken Jamaican English, where the occa-
sional use of such contact features is common even among the educated speakers
sampled for ICE-JA, not only as the result of occasional code-switching into JC but
also in informal upper-mesolectal JamEng. Compare the following three typical
specimens:
(3) Don’t get panicky about it don’t get vex when it’s our turn to be scrutinised but
just deal with it you know (ICE-JA, S1a, telephone conv.)
(4) No no but they’re not around but what you find is that the persons who are
teaching JAMALs [Jamaican Movement for the Advancement of Literacy
teaching modules] are person like me who no know nutten but are scared of
word ... (ICE-JA, S1a, face-to-face conv.)
(5) Worst if you value the person friendship and you think the person is some-
body you want to keep in touch with there’s no way you’re going to I mean let
that candle [go] out – going to always try to keep the candle burning
(ICE-JA, S1a, face-to-face conv.)
Example (3) is similar to the written example (1). One JC word appears in an other-
wise JamEng sentence. However, the sound files show that there are no intonational
distancing signals marking off vex which could be compared to the ‘scare’ quotes
found around backative. The JC synonym vex for angry is used to make the point
more emphatically and functions as a conversational cue establishing a relationship
of ethnic solidarity between speaker and listener. Examples (4) and (5) show vari-
ability in inflectional marking which is essentially of the same type as that found in
the written example (2). The evaluation, however, is entirely different. What in the
written text is perceived as an error becomes a useful stylistic signal of informality
in speech.
The presence of JC within spoken JamEng, however, is not confined to such
moderate contact influence as illustrated in examples (3) to (5). In fact if there is
one lesson ICE-JA teaches the analyst it is that at the level of spontaneous face-to-
face interaction JC is an inextricable part of the linguistic resources of the educated
Jamaican speakers sampled for the corpus. There is no diglossic situation, as in
writing, but the continuum rules: given the appropriate context of situation, a
speaker of JamEng may find it convenient to reach into the lower mesolectal range,
as is illustrated in example (6). Example (7), finally, combines the phenomena il-
lustrated in (5) and (6), in that we have a baseline style which can be defined as
informal JamEng (note the absence of inflectional marking of the 3rd person singu-
lar in the present tense in she have ...) within which we find switching to JC (marked
by preverbal negation and tense and aspect marking, as in me no know how that a
go work).
(6) she naa go school tomorrow with her hair look so <#> She feel she must be the
hottest thing so mummy did haffi carry her go hairdresser
(ICE-JA, S1a, face-to-face conv.)
[she’s not going to school tomorrow with her hair looking like this – she feels she
must be the hottest thing so mummy had to take her to the hairdresser’s]
(7) A: Uh she says she wants to be a model and a lawyer <#> Oh me no know how
that a go work
[... Oh, I don’t know how that’s going to work]
B: Mhm Hope she has found a way to model skinny
A: In fact she she have more body than me but she have a nice little shape go-
ing <#> But I just miss her you know (ICE-JA, S1a, face-to-face conv.)
Considering what we know about the informality and pseudo-orality of computer-
mediated communication, we would expect cyber-Jamaican to be more like speech
than writing. And indeed all the phenomena illustrated in examples (3) to (7) from the
spoken texts of ICE-JA were attested copiously in exploratory searches; in addition,
further contact features were observed which are unusual even in face-to-face conver-
sation. This was the reason for the compilation of CCJ, the Corpus of Cyber-Jamaican
English/Jamaican Creole.
Obtaining data from the web in the quantities envisaged here (> 15 million words)
presents a modest technical challenge but is beset with a number of ethical issues in-
volving copyright and privacy. This is why – unlike other corpora compiled at Freiburg
under the direction of the present author – public access to the material will remain
restricted for the time being.8
At the start of the CCJ project in 2008, the major issue that needed to be resolved
was to identify suitable donor sites on the web. As any Google search for terms such as
riddim (‘rhythm’, particularly reggae and related styles), bashment (‘party’, ‘celebra-
tion’), oonoo (‘you’, plural), criss (‘crisp’, general-purpose term of approval), inna yaad
(‘at home’), mek I tell (‘let me tell’), im a go (‘he is going (to)’) and so forth will testify,
JamEng/JC long ago ceased to be a marginal or locally restricted variety of English;
today it has a quantitatively impressive presence on the Web (which is, incidentally,
not restricted to English-language sites).9 However, the majority of these sites are sin-
gle-issue ventures, usually devoted to reggae music, inter-racial dating, or similarly
restricted activities. After discarding these, a number of sites remained which featured
web-based discussions which met the following criteria:
a. the discussions covered a broad range of topics;
b. participants were numerous and of diverse backgrounds;
c. participants mainly originated from Jamaica or the Jamaican diaspora in Canada,
Great Britain and the United States.
As one site, http://www.jamaicans.com, exceeded all others in size and quality, it was
decided to make it the focus of the investigations rather than mix several sources in the
compilation of CCJ. It patently met the first two criteria mentioned above; as for the
third, 1,318 from among a total of 2,141 users disclosed their place of residence in their
user profiles, and selective cross checks with the contents of the posts did not give rise to
undue suspicion regarding the reliability of these self-reports. As the history of the use
of JamEng/JC on the web is short, the plan was for the corpus to be stratified by year, so
as to make possible the real-time study of the emergence of writing norms in computer-
mediated communication for this variety (for a related study see Deuber & Hinrichs
2007). From this point of view, as well, it seemed more appropriate to document conven-
tions of one representative web-based community of practice rather than mix several.
In order to make the material amenable to analysis by conventional corpus-ana-
lytical software and to preserve the data for possible replication and testing after they
were removed from the original site, large amounts of material were automatically
downloaded and manually post-edited in 2008. Table 1 surveys the material thus
obtained (and originally produced by a total of 2,140 different contributors10 to forum
discussions) in purely quantitative terms.
Table 1. CCJ – corpus size by year
Year     Number of words
2000            354
2001        104,077
2002        697,184
2003      1,808,513
2004      1,683,851
2005      1,477,049
2006      2,127,915
2007      4,878,145
2008      3,833,655
Total    16,610,743
8. Arguably, the fact that posters contributing to web-based discussion forums operate under
pseudonyms and should in principle be aware that what they say is in the public domain makes
it legitimate to use what they produce as the data for linguistic analysis, without infringing on
informants’ privacy. On the other hand, informants clearly have not given their explicit consent,
and in fact most of them would consider the presence of the non-participating researcher as an
instance of unwanted ‘lurking’. The fact that the material needs to be downloaded for preserva-
tion and analysis raises additional issues of copyright. For a survey of positions in this ongoing
debate, see Crystal (2006: 200–201) or Mann & Stewart (2002: 39–64).
9. See, for example, sites such as www.reggae.fr or www.raggamafia.at.
To gain a first idea of the quality of the material contained in CCJ a number of
high-frequency morphosyntactic variables were looked at, such as the realisational
variants of the going to-future (cf. Table 2).
This list is not exhaustive but certainly captures the most important variants of the
variable. All those forms which would be expected to turn up in comparable web-
based discussion forums in other regions of the English-speaking world were sub-
sumed under the category ‘acrolect’. The basilectal variants conform to traditional JC
grammar, whereas the mesolectal ones are those which have arisen as compromises
between JamEng and JC norms in the continuum situation. Note that the mesolectal
forms are reductions or simplifications when compared to the corresponding standard
English ones, whereas the basilectal ones have explicit grammatical marking absent
from standard English. Psychologically, omission of marking is less salient than the
use of marking which is not part of English grammar and consequently carries less
stigma in sociolinguistic terms. Variants not listed in Table 2 include those which are
very rare (e.g. me is going – with just one attestation) or those, such as me/mi go, for
which the vast majority of attestations have past rather than future reference. Even
this very first count highlights what will turn out to be a recurrent feature – the
over-representation of basilectal variants in comparison to spoken data from ICE-JA.
10. This is based on login information which was cross-checked with information in the posts.
There are the expected occasional irregularities, such as the same individual shifting to a new
pseudonym, but such cases are very rare. More importantly for the analysis, there are drastic dif-
ferences in activity, with core members contributing thousands of posts and marginal ones only
a handful. Another factor which needs to be taken into account is that some forum members
reveal themselves to be in contact offline. For these, a real-life dimension is added to their com-
puter-mediated interaction, which will need to be taken into account in the detailed analyses.
Table 2. Realisational variants of the going to-future in CCJ (1st person sg., affirmative)
I am going to          891
I am gonna             109
I’m going to           746
I’m gonna              366
subtotal acrolect    2,112
I going (to)            65*
I gwine                 67
me/mi going            168
subtotal mesolect      300
me/mi gwine            327
me/mi a go             405
subtotal basilect      732
* The figures in Table 2 present raw concordance output as the searches usually ensure a satisfactory degree of
both precision and recall. In a preliminary study, for example, two samples of 100 attestations of am going to
were inspected and found to contain only 10 and 8 instances respectively in which go was used as a motion
verb. The figure for I going (to), however, is an exception, as 71 instances of am I going to had to be subtracted
as obvious mishits.
With ICE text category S1A (90 samples of face-to-face and 10 samples of telephone
conversation) amounting to ca. 200,000 words of text, there are only 47 relevant in-
stances for comparison, but the distribution is clear nevertheless: the vast majority,
namely 40, fall into the acrolectal band of Table 2, a further 6 into the mesolectal one,
and only one into the basilect.
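The contrast becomes easier to read once the raw counts are converted into percentage shares. The following is a minimal sketch in Python: the counts are the subtotals of Table 2 and the ICE-JA figures just cited, while the function and variable names are illustrative assumptions and not part of the original study.

```python
def band_shares(counts):
    """Return each band's share of the total as a percentage, rounded to one decimal."""
    total = sum(counts.values())
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

# Subtotals from Table 2 (CCJ) and the 47 relevant instances in ICE-JA S1A.
ccj = {"acrolect": 2112, "mesolect": 300, "basilect": 732}
ice_ja = {"acrolect": 40, "mesolect": 6, "basilect": 1}

ccj_shares = band_shares(ccj)
ice_shares = band_shares(ice_ja)

# Print the two distributions side by side.
for band in ccj:
    print(f"{band:<9} CCJ {ccj_shares[band]:5.1f}%   ICE-JA {ice_shares[band]:5.1f}%")
```

On these figures the basilectal share is roughly 23 per cent in CCJ as against about 2 per cent in ICE-JA, although the small ICE-JA sample (47 instances) calls for caution.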
The same stylistic spread is revealed when other variables are chosen as diagnos-
tics. For example, a search for realisations of the progressive reveals a very similar
picture, as appears from Table 3.
Unlike Table 2, this list requires extensive post-editing and is still incomplete in
minor ways, most importantly because the search strategy adopted here misses all in-
stances with adverbial phrases intervening between the pronoun or auxiliary and the
present participle – a problem which chiefly distorts the results for JamEng. On the
other hand, the search under-collects for JC because it disregards all those instances
among the 371 instances of im a [verb], which, in accordance with basilectal JC
grammar,11 have indeterminate or female gender reference.
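The search strategy and its main blind spot can be illustrated with a small sketch in Python. The regular expressions are assumptions modelled on the row labels of Table 3, not the queries actually used to compile the table, and the sample text is constructed for illustration.

```python
import re

# A participle is approximated as any word ending in -in or -ing ("*in(g)").
PARTICIPLE = r"\w+ing?\b"

# Hypothetical patterns modelled on the row labels of Table 3.
PATTERNS = {
    "acrolect": re.compile(rf"\bshe(?: is|['’]s|s| s) {PARTICIPLE}", re.IGNORECASE),
    "mesolect": re.compile(rf"\bshe {PARTICIPLE}", re.IGNORECASE),
    "basilect": re.compile(r"\bshe a \w+", re.IGNORECASE),
}

def count_variants(text):
    """Count raw hits per band; such output would still need manual post-editing."""
    return {band: len(pattern.findall(text)) for band, pattern in PATTERNS.items()}

sample = (
    "She is getting married. Now she spreading lies. "
    "She a gwaan bad. She is really going too far."
)
counts = count_variants(sample)
print(counts)
```

The last sentence of the sample contains a progressive with an intervening adverb (She is really going), which none of the patterns retrieves. This is the kind of instance the search strategy described above misses. Conversely, the patterns over-collect (for example, any word ending in -in counts as a participle), which is why manual post-editing remains necessary.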
The corresponding distribution for the conversational part of ICE-JA is as follows:
from a total of 37 comparable instances, 18 fall into the acrolect, 17 into the mesolect
and only 2 into the basilect. As in the case of the going-to future (Table 2), the basilect
is thus obviously over-represented in computer-mediated communication when com-
pared to face-to-face interaction.
11. There is no gender-contrast in the third-person singular pronoun at the level of the basilect.
Table 3. Realisational variants of the progressive in CCJ (3rd person sg. fem., affirmative)
                             all hits   manually post-edited conc.
she is *in(g)                     793      664
she’s *in(g)                      500      383
shes *in(g)                        21       18
she s *in(g)                       13        5
Subtotal acrolect               1,327    1,070
she *in(g) (= mesolect)           431      262
she a [verb] (= basilect)         880      596
For a final and even more drastic illustration of this general trend, consider the use of
fi, a JC grammatical marker often corresponding to English infinitival to, but addition-
ally expressing purposive and related modality. Its extremely basilectal status is re-
flected in the fact that the 200,000 words of spontaneous conversation in ICE-JA con-
tain only 9 instances (a normalised frequency of ca. 45 per million words). In CCJ, by
contrast, it occurs 34,071 times, which corresponds to a rate of 2,051 per million words.
Clearly, this degree of mismatch in the use of basilectal features between face-to-face
and computer-mediated interaction needs an explanation.
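The normalised frequencies cited here follow from a simple calculation, sketched below in Python. The function name is an illustrative assumption; the figures are those given in the text, with the CCJ corpus size taken from the total in Table 1 and the ICE-JA size taken as the approximate 200,000 words mentioned above.

```python
def per_million(hits, corpus_size):
    """Normalised frequency: occurrences per one million words."""
    return hits / corpus_size * 1_000_000

# fi in ICE-JA conversation (ca. 200,000 words) and in CCJ (16,610,743 words).
ice_ja_rate = per_million(9, 200_000)
ccj_rate = per_million(34_071, 16_610_743)

print(f"ICE-JA: {ice_ja_rate:.0f} per million words")
print(f"CCJ:    {ccj_rate:.0f} per million words")
print(f"Ratio:  {ccj_rate / ice_ja_rate:.1f}")
```

On this calculation fi is roughly 45 times more frequent in CCJ than in the conversational part of ICE-JA.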
The obvious explanation, that participants contributing to the forum discussions
are drawn from a lower social class of speakers, with a correspondingly lower average
level of education, clearly does not hold. Most posters reveal themselves to be profes-
sionals with a middle-class outlook and income and a corresponding command of
English – exactly the profile, in other words, which characterises contributors to the
spoken portions of ICE-JA. The fact to be accounted for is that the same type of speak-
er is apparently more ready to draw on JC when using the computer keyboard than
when speaking in face-to-face interaction.
To understand the causes of this phenomenon in detail, it will be necessary to focus
on the usage of the roughly 120 key participants who have each contributed more
than 1,000 posts to the material, as individual profiles and preferences differ considerably.
One generalisation which holds for almost all of them, though, is that there is a
correlation between the density of JC features and the topic of a particular thread. No
topic is completely off limits for JC. For example, while most contributions on the
topic of the International Monetary Fund (IMF) and its policies’ impact on Jamaica are
formulated in JamEng, it is not difficult to find occasional counter-examples:
(8) (borrowing)? Well di IMF never waan fi len dem afta di crash [...]
(CCJ 2006)
[borrowing? Well, the IMF did not want to lend them after the crash]
However, as is to be expected, yaad (‘yard’, or ‘home’) topics encourage shifts into JC,
and so do emotional topics, such as love and romance. For example, on the topic of
jealousy more than half the contributions show strong JC influence, as is revealed by a
lexical search for jealous. Example (9) is predominantly in standard English, with one
instance of a copula-less progressive (now she spreading) making a minimal concession
to the standard-creole continuum. The short passage in example (10), on the other
hand, is maximally packed with non-standard morphosyntactic features:
(9) I have tried helping her with this new idiot that she met, now she has jumped
on the phone telling my aunt in California that she is getting married to the
scum “just met him not even 5 months ago” On top of that now she spreading
lies that I must be jealous and some other crap. (CCJ 2008)
(10) mi jealous...mi no have no phone fi nobody fi call mi (CCJ 2004)
[I’m jealous ... I don’t have a phone for anybody to call me]
Searching for the non-standard spelling jellus, predictably, raises the likelihood of
finding other JC features to almost 100 per cent:
(11) Compry she stop by, so de whole a we a sing har Happy Birthday, dat a when
she tell we say har fren dem a work give har a Microwave she seem happy
enuff bout it but it look like Coolbeans did a get jellus caaa Compry a gwaan
over de flowers (CCJ 2004)
[Compry, she stopped by, so all of us are singing her Happy Birthday, that’s when
she told us that her friends from work gave her a Microwave; she seemed happy
enough about it, but it looked like Coolbeans was getting jealous because Com-
pry was going on about the flowers]
The following two brief sections will focus on issues raised by the data which can be
studied before the individual speaker profiles have been compiled and analysed (work
now in progress). Section 4, on ‘anti-formality’, explores the chief factor which is re-
sponsible for lowering the inhibition to use JC in computer-mediated communication
in comparison to face-to-face interaction. Section 5, on lexical borrowings from
African languages, is more exploratory in nature and deals with web-forums as a po-
tential site for language contact unlikely to be encountered in the real world. It is in-
tended to show how a sociolinguistics of computer-mediated communication can be
developed into a sociolinguistics of globalisation.
4. Anti-formality
As has been pointed out above, CCJ shows speakers who are generally very proficient
in English incorporating considerable amounts of JC features into their texts – much
more than would be occasioned by normal ‘conversational’ informality. They thus
present very clear instances of what lexicographer Richard Allsopp has defined as the
“anti-formal” tendency in Caribbean English usage. He defines the three stylistic lev-
els12 he distinguishes as follows:
Formal: “Accepted as educated; belonging or assignable to IAE [internationally
acceptable English]; also any regionalism which is not replaceable by
any other designation. No personal familiarity is shown when such items
are used.”
Informal: “Accepted as familiar; chosen as part of usually well-structured, casual,
relaxed speech, but sometimes characterized by morphological and
syntactic reductions of English structure and by other remainder fea-
tures of decreolization. Neither inter-personal tenseness nor intimacy is
shown when such items are used, and the speaker is usually capable of
switching to the upper level when necessary but more easily to the lower.”
Anti-formal: “Deliberately rejecting Formalness; consciously familiar and intimate;
part of a wide range from close and friendly through jocular to coarse
and vulgar; any Creolized or Creole form or structure surviving or con-
veniently borrowed to suit context or situation.
When such items are used an absence or a wilful closing of social distance
is signalled.
Such forms survive profusely in folk-proverbs and sayings, and are widely
written with conjectural spellings in attempts at realistic representations
of folk-speech in Caribbean literature.” (Allsopp 1996: lvi-lvii)
Apparently, it is not just informality but a wilful and conscious “closing of social dis-
tance” which is encouraged in web-based communication. What in Allsopp’s database
was mostly confined to folk-speech, proverbs and literary representations thereof has
apparently become general practice on the web. In conversation, anti-formality is not
without risks. As a conscious closing of social distance it is by definition face-threaten-
ing and may be perceived as rude and aggressive. In the pseudo-conversational envi-
ronment provided by the web forum, this risk is curtailed, and participants engage in
anti-formal linguistic behaviour in a spirit of playfulness.
This playfulness is apparent even at the spelling level, where legibility often takes a
back seat to expressiveness: maddaratar, maddarator, maddarater, maddrayta and mad-
rater are just a selection of the variants used to refer to the moderator of the list. Ladda-
massi is used to write ‘Lord have mercy’. The playfulness also involves the use of local or
yaad stereotypes and allusions, as in the following stretch of match-making banter:
(12) [RollinCalf] Well ep. yuh si mi a chat bout now, mi fraid a married like puss
cause di firs one was not pleasant.
[Well, EP, you see I’m talking about the present. I’m afraid of marriage like a
tomcat cause the first one was not pleasant]
[3281] e_p_11_4_08 Pray tell..wat did happen?
12. A fourth level, “erroneous”, is not relevant to the present discussion.

[RollinCalf] Well afta wi leff yard, shi go a university, get har degree an sud-
denly was too stush fi mi. Satdeh time...braps,13 beef soup done, seh shi neva
go a college fi bwile no yam. One mawnin mi a hat up little mackrel an some
pepper an shi tell mi seh mi a tink up di house. Nex ting mi know, when wi
have party, some breed a people weh no look like mi or come from weh mi
come fram pack up mi house. Anyway mek mi tap yaw cause mi blood a start
fi hat up already.
[Well after we left home, she went to university, got her degree and suddenly was
too refined for me. Saturday time, and all of a sudden the beef soup is done and
she says she didn’t go to college to cook yams. One morning I’m heating up a
little mackerel and some peppers and she tells me that I’m stinking up the house.
Next thing I know, when we had a party, a kind of people that didn’t look like me
or came from where I came from fill my house. Anyway, let me stop here cause
my blood is starting to heat up already]
[...]
Well it was a looong time before mi did check fi certain woman. Don’t get me
wrong, mi can speaky spokey wid di bess a dem wen mi ready but mi neva
grow soh.
[Well it was a long time before I understood the woman for sure. Don’t get me
wrong: I can speak posh with the best of them when I’m ready to, but I didn’t feel
like it]
[e_p_11_4_08] JAH Know rolling calf yuh soun like wan taxi man weh did a
tell mi him bout fi him situation out of the blue sky o/basically di sed ting..him
seh him file fi him wife an sen har guh college fi 6 years an afta she graduate
she was a different ooman all together...wats up with this trend? Not to say
that all women from Jamaica are like that..but I have heard a few stories...
[You know, Rolling Calf, you sound like this taxi driver that was telling me about
his situation out of the blue sky. Basically the same thing: he said he applied for
papers for his wife and sent her to college for 6 years and after she graduated she
was a different woman altogether. What’s happening with this trend? ...]
(CCJ 2004)
There is some uncertainty about the status of parts of the text. What did happen?
(second turn) looks like an instance of Standard English emphatic do but is probably
better regarded as mesolectal JC, with invariable did replacing the basilectal anterior-
ity marker ben/(w)en, with the passage thus simply translating as ‘what happened?’.
The one unusual expression is the literary archaism pray tell, also used in the second
turn. Combining 18th-century literary English and mesolectal 21st-century JC in one
utterance challenges RollinCalf to live up to expectations in his response, which he
does by styling himself as the bwoy from yaad, increasingly alienated from his wife,
who is educated and socially ambitious, or to put it in JC terms: stush (or to give the
13. I take this to be a misspelling for baps/bragadaps, ‘suddenly/abruptly’.

more usual vernacular spelling: stoosh). Yaad is contrasted with the university; mod-
ern cuisine with boiling yams, and so on. RollinCalf even draws on the folk-linguistic
concept of speaky-spoky, which refers to a speaker’s conscious effort to use Standard
English, with the implication that the target is not necessarily reached.14
Note also that the conclusion to the story formulates a moral in informal JamEng
rather than JC, which shows that for the contributors to the forum the baseline style is
English, and JC is a resource which is mobilised for a conscious activity which com-
bines serious debate and play in about equal measure. This observation is a helpful
reminder that the greater quantitative presence of JC in web-based material does not
automatically imply a status upgrade in sociolinguistic terms.
A last important question that we must address is whether material derived from
the web can be legitimate and authentic data for a sociolinguistic analysis. This question
presents itself in a particularly pressing fashion whenever the spirit of anti-formal lin-
guistic playfulness results in forms which are highly implausible by the standards of the
JamEng-JC continuum that governs face-to-face interaction. While the examples quoted
so far were remarkable because of an extreme degree of style-shifting of the kind familiar
in real-life encounters, the following post violates the constraints of the JamEng-JC
grammatical continuum because it seems to ‘pack’ acrolectal and basilectal features into
a mix rather than rapidly shift from one level to the other. It is from Bizi_Q – at 5,512
posts one of the most active contributors – and starts off a thread on gem-cleaning:
(13) [Bizi_Q] so what mi can use fi clean gems, stones, diamonds, etc?? mi have
de silver/gold cleaning solution but it nuh seem fi do nutten fi de stone dem..
sometimes it look wussa dan when it went in. mi read up pon de net an some
sites say use dish washing liquid..others say that’s a big no no. what unu use
clean unu stones? would it be better fimi carry it go a jeweler?? how much
dat would cost?
[so what can I use to clean gems, stones, diamonds, etc.? I have the silver/gold
cleaning solution but it does not seem to do anything for the stones ... sometimes
it looks worse than when it went in. I read up on the net and some sites say ‘use
dish washing liquid’ ... others say that’s a big no no. What do you use to clean
your stones? Would it be better for me to take it to the jeweller’s? How much
would that cost?] (CCJ 2008)
Among other features, this post is characterised by the high frequency of the JC grammatical
marker fi, which in many cases corresponds to infinitival to in English translations.
In Bizi_Q’s post, fi occurs together with several other basilectal JC features: the
2nd-person plural unu, preverbal negation with no (here spelt nuh), and the optional JC plural
marker dem (e.g. in de stone-dem). This pattern of co-occurrence is expected, as all
14. The speaky-spoky style is often characterised by hypercorrections such as the over-use of
prestige pronunciations associated with stoosh [or stush?] or posh talk: filther for filter, gloss for
glass, or hassist for assist. For further information see Patrick (1999: 277–278).
these features index the same stylistic range on the creole-standard continuum. How-
ever, the use of fi in phrases with inflected plurals, as in fi clean gems, stones, diamonds,
is not. Fi indexes a basilectal style (or low social status of the speaker), while the plural
inflection is in regular use only in acrolectal style (or with educated speakers). Simi-
larly, unu stones combines a basilectal possessive pronoun with an acrolectal nominal
head, which is as incongruous at first sight as the combination of an ‘English’ main
clause (would it be better) with a dependent clause that contains a JC serial-verb con-
struction (carry it go).
Judged by the norms of face-to-face interaction, such usage is incoherent. By the
standards of classical sociolinguistics, the data are unsuitable for analysis because they
are inauthentic. By contrast, I would argue that such data are not just idiosyncratic odd-
ities but that dealing with them is a high priority in a sociolinguistics for the 21st century.
JC is no longer the ‘local’ language that it used to be in earlier stages of its development,
firmly rooted in its community of speakers and largely confined to use in face-to-face
interaction. Today, JC is among the vernaculars which are regularly heard in the media,
have gone on the move and become globally available linguistic resources. One thing
which CCJ shows beyond doubt is that JC on the web is used by real people in authentic
communication. The specific mix of JamEng and JC in CCJ is therefore in no way less
real than the different one spoken in the island. As Nikolas Coupland puts it:
At least implicitly, sociolinguistics has made strong assumptions about authentic
speech and the authentic status of (some) speakers. Sociolinguistics has often as-
sumed it is dealing with ‘real language.’ [...] But ‘real language’ is an increasingly
uncertain notion. In late-modern social arrangements and in performance frames
for talk, do we have to give up on authenticity? (Coupland 2007: 179)
Coupland’s question is rhetorical, and his answer is in the negative. I agree. Bizi_Q is
styling or performing her language, but the game works because she can mobilise real
resources shared by herself and her readers. Put briefly, the effect of ‘artificially’ or
‘unexpectedly’ packing a simple request for help on a practical matter with JC mor-
pho-syntactic features is to make a somewhat businesslike request for information
part of an exercise in recreational community-building among JC speakers in the
diaspora. On a purely utilitarian level, this is useful, because it may increase the quan-
tity and quality of the responses elicited; on a more general level, it is part of the stra-
tegic way in which Bizi_Q draws on her and her community’s linguistic resources to
create her persona in the forum.
CCJ contains:
a. passages which read as if spontaneously produced spoken JC had been transferred onto
the screen,
b. passages such as the one discussed, in which a presumably unconscious element
of stylization is evident, and
c. passages which are consciously crafted with rhetorical skill.
None of them is more authentic data than the others, but a framework is required to
make explicit how they are authentic in different ways. The concept of indexicality, as
proposed by Silverstein (2003) and refined by Eckert (2008), is a good starting point.
The traditional view in sociolinguistics is that variation reflects speakers’ membership
in pre-established social categories and that an individual speaker’s agency is confined
to increasing the proportion of standard-like variants of a given variable in more for-
mal (hence more monitored) styles. Drawing attention to the obvious limitations of
such a view, Eckert (2008: 453) argues that “meanings of variables are not precise or
fixed but rather constitute a field of potential meanings – an indexical field, or constel-
lation of ideologically related meanings, any one of which can be activated in the situ-
ated use of the variable”. Seeing things in this way, we can reconcile the apparent para-
dox posed by the CCJ data. We can both accept that in broad terms use of JC in
computer-mediated communication is derived from (and depends on) the conven-
tions that have evolved in the sociolinguistic continuum that governs face-to-face in-
teraction and still accommodate the very different ways in which individual variables
are deployed in the new medium. Some CCJ examples do not differ at all from face-to-
face data in their indexical potential. Consider, for instance, example (14), a part of
passage (9) discussed above:
(14) On top of that now she spreading lies that I must be jealous and some other
crap. (CCJ 2008)
According to Silverstein (2003: 227), first-order indexicality indexes the pragmatic
properties of a particular communicative act, reproducing social macro-categories in
language. Potentially, however, every linguistic variant used in this way can become
indexical at the second-order level and index speakers’ or writers’ metapragmatic
evaluation of the situation. What we have in (14) is an informal exchange between
equals. This is unproblematically indexed by the informal variant of the progressive
(she spreading) on the level of first-order (pragmatic) indexicality. On the level of sec-
ond-order (metapragmatic) indexicality, however, the fact that the informal variant of
the progressive adopted here is a typically mesolectal JC one (and not for example an
internationally current non-standard one such as she’s spreadn) becomes significant.
Higher-order indexicality, that is, communicative effects of a more complex nature and
possibly unique to the specific thread, is not in evidence here.
This is different in (15) and (16), which contain highly unusual spellings involving
the sequence <kkk>, which turn out to be a part of a rather complex strategy of online
linguistic identity management:
(15) bway.... but it’s important to dress 3 y/olds like this? what did blu seh but
blakkk ooman slut culture? (CCJ 2008)
[Boy ... but it’s important to dress three-year olds like this? What did Blu say
about the slut culture of black women?]
(16) only in amerikkka does this happen, only in amerikkka does a campaign go
one for so two or more years wasting good money that could help ppl. Amer-
icans are fools if dem look pan a man ar ooman spouse an base dem decisions
of dat person pan a nex smaddy. MORE FIYAH fi dem type a tinkin deh. An
mi love Obama wife still (CCJ 2008)
[... Americans are fools if they look at a man’s or a woman’s spouse and base their
decisions on that person on someone else. More shame (= fire) on that type of
thinking. And I still love Obama’s wife]
Here, the signal is purely visual. Neither blakkk nor Amerikkka indexes Jamaicanness at
any level. Even so, such experimental and sensational spellings convey a message in
their context. In general terms, they are a liberally used device for emphasis and atten-
tion-getting, and three k-s is not the upper limit in this regard. Among the words in the
material featuring six or more contiguous k-s are OK, wicked, bruck (= JC ‘break’),
back, and several others. Nor is the custom confined to <k>, as is shown by spellings
such as eeeediotttttttt (‘idiot’) or ciiiigaretttttttttes. More is at stake, however, than
merely drawing attention to key terms and sensitive topics such as blackness or
America. Recall that KKK is a widely known abbreviation for the Ku-Klux-Klan, the
racist secret society founded after the Confederacy’s defeat in the Civil War, and that
a spelling such as AmeriKKKa has long had iconic status among radical minorities opt-
ing to disaffiliate themselves from the American way of life.15 As such, it is available for
further creative development, and various web-based glossaries claim that the spelling
blakkk may be used to denote individual blacks who are worse for the community than
the Klan.16 Although the few attestations of blakkk in the CCJ data have a negative con-
notation (as certainly does the “blakkk ooman slut culture” of example [15]), they do
not corroborate this specific, narrow meaning. A fairly complex process of inferencing
allows us to conclude that, for blakkk and Amerikkka, <kkk> indexes (1) writers’ high
degree of emotional involvement when mentioning key concepts in their discourse, (2)
a consciously politicised radical stance, and (3) affiliation with the ‘hip-hop nation’ and
pop culture, which are key agents propagating sensationalist non-standard spellings in
the contemporary US media.17 Just as the linguistic indexes of African-American po-
litical radicalism and pop-cultural chic percolate into JamEng/JC, the diasporic web-
forum can be a conduit through which JamEng/JC linguistic material diffuses
15. The first OED attestations are from 1969 (s.v. Amerika).
16. One example is the Urban Dictionary, at http://www.urbandictionary.com/define.
php?term=blakkk (consulted on 16 March 2010), which defines the word as “an African Amer-
ican whose actions are more damaging to a black community than the KKK” and illustrates the
use with the following unsourced citation: “We dot [sic!] rid of KKK night riders, and got blakkk
drivebys instead”.
17. This can be proved easily even in a perfunctory web search for ‘boyzz’, ‘niggazzz’, ‘greatest
hitz’, ‘bizkit’, etc., for instance. Note that in the last two examples we are not even dealing with
phonetic respellings as the /s/ in hits and biscuit is clearly voiceless.
Corpora and the new Englishes 
internationally. This brings us to the final section of the present paper, which will ex-
plore the Web as a novel site for contact between languages and varieties.
5. The globalisation of vernacular features: A ‘Black Atlantic’ on the web?
A great deal of recent research in sociolinguistics has focussed on a phenomenon
which has been referred to as the “globalisation of vernacular features” (Meyerhoff &
Niedzielski 2003). It has been noted, for example, that the new quotative be like, which
was first recorded in the United States in the 1970s and 1980s, has spread extremely
rapidly into many other varieties of English all over the world (including, on the basis
of ICE evidence, Irish, Canadian and Jamaican English).18 This almost instantaneous
global spread of linguistic innovations raises the question whether traditional pro-
cesses of diffusion via face-to-face interaction are sufficient to account for the speed of
the developments observed or whether additional media influence needs to be invoked
to account for them.
In addition, one must be careful not to regard the use of ‘American’ quotatives as
an instance of straightforward Americanisation of the other varieties affected. Peo-
ple using American quotatives rarely do so in conscious imitation of American lan-
guage norms; rather they use a linguistic resource of American origin to achieve a
communicative effect in a different ‘local’ linguistic environment. In the Jamaican
case, for instance, the new quotative be like is preferably used by (young) women,
and up to this point JamEng is indeed like American English (cf. Höhn forthcoming).
However, while in American English it is in contrast with the standard quotation-
introducing verb say and other non-standard forms such as go or be all, in JamEng it
is the third option alongside Standard English say and JC mi/yu/im etc. seh. This
three-way choice between a formal variant, a local informal variant and an imported
informal variant is specific to the sociolinguistic situation of present-day Jamaica
and thus gives an internationally available linguistic form its specifically local func-
tional value.
An interesting phenomenon to study from a language-and-globalisation per-
spective is the use of loanwords from African languages in CCJ. Since the end of the
slave trade there has been little opportunity for such loans to spread into JC through
face-to-face interaction. There is, of course, opportunity for contact between speakers
of JC and Africans in diasporic situations, for example in multilingual metropolises
such as London, New York or Toronto. However, the typical ways in which people of
18. There is no doubt that the new quotative is widely used in contemporary British English, as
well (see, e.g., Buchstaller 2006a, 2006b). However, ICE-GB was compiled too early to record
this. What is intriguing is the absence of quotative be like from the ‘true’ second-language ICE
corpora, such as ICE-India or ICE-Singapore, which raises the issue of whether this is accidental
(due, for example, to sampling dates) or systematic.
 Christian Mair
African descent in the Americas have learned about Africa in the past 150 years have
been through the work of political activists in the anti-colonial and civil-rights move-
ments, and through the work of writers and scholars. Alongside these ‘elite’ links,
there have also been popular movements. Many African-American and Caribbean
musicians and entertainers have large followings in Africa, and a literal or metaphor-
ical return to Africa is at the heart of several politico-religious movements, such as
Rastafarianism.
In principle, there is no reason why people from, say, Nigeria or South Africa
should not join thematically relevant discussions on a forum such as Jamaicans.com
(or, conversely, why Jamaicans with an Afrocentric ideological orientation should not
take part in an immensely popular Nigerian forum such as Nairaland.com). In prac-
tice, however, to the extent that contributors can be located, such cross-over among
forums still seems to be the rare exception. The overwhelming majority of contributors
to Jamaicans.com are Jamaican, or people of Jamaican background resident in the US,
the UK or Canada. The few who are not are a mixed bag – from the (presumably white)
Canadian who would like to learn JC through the woman from Kurdistan currently
resident in Austria to the Belgian who spent five years in Jamaica, with very few
Africans among them. To put it mildly, the evidence is not sufficient to claim that the
‘Black Atlantic’ cultural region which joined West Africa, Europe and large parts of the
Americas from Virginia and Maryland to the North East of Brazil in the 17th and 18th
centuries is re-emerging in cyber-space in the 21st.
However, the absence of African contributors from forum discussions does not
mean that Africa has no impact at all. The spirit of linguistic adventurousness evident
in the posts extends to lexical borrowing from African languages, and in this sense
computer-mediated communication can become one of several avenues of lexical
innovation in contemporary JC. The use of mzungu and wahala by forum partici-
pants provides an illustration. M(u)zungu is a slightly derogatory, originally Kiswahili
term for ‘white person’ widely used in Southern Africa, whereas wahala, originally
from Hausa, is also a very common word meaning ‘trouble/problem’ in Nigerian
Pidgin and hence widely known throughout West Africa. Interestingly, these two
words already have some international profile in World English, which is reflected in
the fact that they are recorded in the Oxford English Dictionary (OED). In both cases,
however, the quotations provided show that they have been circulated largely as ex-
otica, either in writing on Southern Africa or Nigeria or through literary works by
authors from the region. Here are the crucial segments of the OED documentation
for mzungu:
[1844 J. L. KRAPF Diary 25 Sept. (Birmingham University Library: C.M.S. Ar-
chives CA5/016/28) f. 496, Sheikh Ibrahim soon after my arrival dispatched a
messenger to the nearest Wanica villages, informing the chiefs of the arrival of a
M’soongo (as a European is called in the Sooahelee tongue).]
[...]
1961 Transition No. 2. 33/2, I found myself welcomed not in spite of the fact that I
was a mzungu, not exactly because of it, but rather with the sense of being wel-
come anyway but with particular pleasure because I was white. 1975 B. KAGGIA
Roots of Freedom vii. 66 We could no longer accept the belief that a mzungu was
better than an African. 1992 Harper’s Mag. Jan. 65/1 Njoki.is almost never asked
what her family thinks of her being married to a ‘mzungu’ – a white person.
In spite of a history of almost 200 years of attestation, the OED gives the impression
that the word has never moved from specialist discourse into general circulation. For
example, there is nothing which would make us expect it to turn up in CCJ. The his-
tory of wahala is similar, if considerably shorter:
1973 W. SOYINKA Season of Anomy xii. 258 He was going to his house on the
reservation and would not step out of it until all the trouble was over. ‘I shall sim-
ply lock myself in’ he grinned, ‘stock up on stout and drink the wahala dry.’ 1982
B. EMECHETA Double Yoke (1983) x. 93 Look at all the wahala he raised about
the university forms. 1986 E. AMADI Estrangement ii. 19 The GOC came with his
wahala. 1987 C. ACHEBE Anthills of Savannah x. 137 If I for know na such big oga
de for my front for that go-slow how I go come make such wahala for am. 1988
N.Y. Times 21 Feb. VII. 26/2 The taxi driver blames the ‘wahala’ of the traffic inci-
dent on the fact that Ikem doesn’t travel around in a chauffeur-driven car that
befits his position.
All but the last citation are from works of fiction by West African authors, and even the
last quotation from the New York Times is from a book review about such work.
As in the case of mzungu, there is nothing in the OED evidence to make us expect
that wahala would occur in CCJ, because, after all, neither the human geography of
Southern and Eastern Africa nor West African writing in English are central concerns
of the contributors to the discussion. However, both words do occur. A search for muzung* and
mzung* retrieved the vast majority of instances (= 1,010).19 It is striking to note that mzungu
was not used before 2005, when a trickle of instances came in which developed into a
veritable flood from 2007. The first and most persistent user is Blugiant. Blugiant, who
joined the forum on 1 June 2003, has contributed a total of 2,121 posts and is thus one
of the more prolific authors. His contributions show him to be a social activist who is
informed by a black-consciousness ideology. In addition, he deserves attention be-
cause of his experimentalism in JC spelling. In 2005 he launches the term mzungu. The
context is a discussion of a newspaper article celebrating the fact that black earning
power per household in the New York City borough of Queens has outstripped white
earning power. For Blugiant, however, this is not a genuine sign of social advancement,
but merely represents cooking of the statistical books:
19. Alternative spellings do occur, but seem to be exceedingly rare: e.g. zungu (2 occurrences).
(17) [Blugiant] Quote: “In addition to the larger share of whites who are elderly,
said Andrew Hacker, a Queens College political scientist, black Queens fami-
lies usually need two earners to get to parity with working whites.” soo arjen [=
address to another participant] oww manee peeps inn dem caribbean house-
hold versus mzungu oousehold. iff itt tekk two ar more caribbean wage earnas
wukkinn more owa dan mzungu fii mekk more da mzungu oousehold widd
less peeps wat iss da artikkle seyinn bout da caribbean peeps qualitee aff life.
[So, Arjen: how many people are there in those Caribbean households versus
white households? If it takes two or more Caribbean wage earners working more
hours than whites in order to make more than a white household with fewer peo-
ple, what is the article saying about the Caribbean people’s quality of life? (...)]
[Jaded] its saying yu gotta do what yu gotta do... whatz the option..sit back an
keep paying rent an gripe that life isn’t fair
[Kingston20] Maybe we should a chill pon the corner block a play dice and
bawl bout how zungu a howl we dung..no true Blu. Nothing no satisfy some
people, atleast the majority of us a try a ting with what we got, nobody seh
America was easy the fact that you come to another person’s country and have
fi start over from scratch means that your quality of life ago suffer fi a long
time till you can access the factors of production, land,labor, capital goods
which lead to weath generation. This is basic economics.
[Maybe we should relax on the corner block and play dice and moan about how
the whites are holding us down, shouldn’t we, Blu? Some people are satisfied by
nothing. At least the majority of us is trying something with what we’ve got.
Nobody said America was easy. The fact that you come to somebody else’s coun-
try and have to start from scratch means that your quality of life is going to suf-
fer for a long time. (...)] (CCJ 2006)
Blugiant is not challenged for his use of the word, which allows two interpretations.
Either it is known by the other forum members, or they work out the meaning from the
context. In the end, it is taken up and used by other contributors, although it has to be
admitted that Blugiant remains the active promoter until the end of the period of ob-
servation in mid-2008, contributing 455 of the 588 posts in which the word occurs. His
main achievement is to have made the word well-known and familiar on the forum,
which may provide motivation for subsequent use by others and outside the forum.
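The retrieval step behind these figures can be illustrated with a minimal sketch. It assumes, purely for illustration, that forum posts are available as (year, author, text) records; the sample records, the author labels and the function names below are invented and are not drawn from CCJ. The wildcard queries muzung* and mzung* are rendered as a single regular expression, matching posts are tallied by year and by author, and one contributor's share of the matching posts is computed.

```python
import re
from collections import Counter

# Invented stand-in records (year, author, text); NOT actual CCJ data.
posts = [
    (2005, "user_a", "mzungu household versus ours"),
    (2006, "user_b", "how zungu hold us down"),
    (2007, "user_a", "the Muzungus said so"),
    (2007, "user_c", "nothing relevant here"),
    (2008, "user_c", "no mzungu involved"),
]

# The wildcard queries "muzung*" and "mzung*" as one case-insensitive pattern.
PATTERN = re.compile(r"\bmu?zung\w*", re.IGNORECASE)


def tally(records):
    """Count posts containing a match, grouped by year and by author."""
    by_year, by_author = Counter(), Counter()
    for year, author, text in records:
        if PATTERN.search(text):
            by_year[year] += 1
            by_author[author] += 1
    return by_year, by_author


def share(part, whole):
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


by_year, by_author = tally(posts)
print(dict(by_year), dict(by_author))
print(share(455, 588))  # the post counts reported in the text
```

With the figures reported above, share(455, 588) gives 77.4, i.e. roughly three quarters of the posts containing the word come from a single contributor. A spelling such as zungu is not caught by either wildcard query and would need a separate search.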
The situation is somewhat more complex for wahala, whose primary use is as the
pseudonym of a contributor and term of address used by others. The screen-name or
pseudonym adopted by a contributor is a central element of self-stylisation and online
identity construction. While Bechar-Israeli (1995) points out that screen names make
few references to nationality or ethnic group in general-purpose Internet Relay Chat,
Androutsopoulos (2006b: 539) reads “the strong preference for home language screen
names as an index of ethnicity” in his study of Turkish, Iranian and Indian diasporic
web forums based in Germany. Home language in CCJ is JC, and JC is indeed a direct
or indirect presence in many (though not the majority of) screen names, from the
straightforward Rasta Talk of Irie Dawta to the more subtle pun in SailorBuoy [bwaI].
Among the Dr. Dudds, Peppers, Style Divas, ChurchDudes, Championgals, etc., Wa-
halla’s use of an African moniker is thus clearly a marked and presumably meaningful
choice. In one instance, user Wahalla is even challenged as to whether he is aware of
the significance of his name:
(18) [queenvenus] by the way do you know what the word wahalla means in pi-
geon english, which is spoken in some parts of west Africa (CCJ 2007)
To which he replies:
(19) [Wahalla] Actually the only place I have heard the term used is along the coast
of Nigeria.. But then the Ibo say its not their word.. The people from Kanu
deny it is from them.. Tejh Ijaw deny it ever came from them. Personally I go
for the beleif that is a dervation from ishalla.. But edumicate mi.... Cause
Wahalla just chose his name by the radomised chance without ever having
uttered the word in anger in a sentence???? He never live in Nigeria... Guess its
cool to call how the nigerians speak English Pigeon.. Yu si if sumadi call patwa
pigeon english dem getta tongue lasjing fram mi in no uncertain terms.. Guess
its cool for people from Belgium to call Nigerian English Pigeon English...
Why is it that people from England assume that once words and syntax from
colonised comumication language pass into English it becomes pigeon Eng-
lish when spoken in the former anglohone countries????? (CCJ 2007)
If we are to trust this information, Wahalla picked up the word during a brief visit to
the Nigerian coast, misinterpreting its emotional connotations somewhat as expressing
anger rather than sorrow. The interesting thing here is that it is other forum members
who enlighten him on its meaning. The ensuing metalinguistic discussion is notewor-
thy because of its factual errors and conceptual flaws, but also because it is one of many
signs of the keen interest which participants take in the topic of language variation.
6. Conclusion and outlook
After presenting a brief sketch of the state of the art in corpus-based research on the
New Englishes, the present paper has gone on to explore innovations in two directions.
The first new departure concerns the type of data investigated. The World Wide Web has
not only become more linguistically heterogeneous over the past few years (cf. Dor 2004;
Danet & Herring 2007), but it has also become a repository of substantial amounts of
non-standard English, which provided the motivation to compile a web-derived
16-million-word corpus of Jamaican English/Jamaican Creole. When used in computer-
mediated communication, this variety shows a number of distinctive properties. Most
importantly, diasporic web-based communities of practice of the type investigated here
use JC forms far more commonly than in traditional writing – an expected result –, but
also more frequently than in spoken face-to-face interaction sampling sociologically
comparable groups of speakers. The reason for this greater readiness to use a stigmatised
variety in the new medium is that the covert prestige associated with it comes for free; that is,
users do not run the risk of being categorised as uneducated or lower-class (as they
would be in face-to-face encounters) but are able to mobilise the linguistic resources of
JC for self-stylisation, the creation of atmosphere and the construction of identity.
The second innovation proposed was to study language contact on the web from
the perspective of the sociolinguistics of globalisation, as providing an arena (1) for the
world-wide spread of originally highly localised vernaculars and (2) for contact be-
tween non-standard varieties of English beyond the sphere of face-to-face interaction
in physical space. In this way, the present study contributes to an emerging sociolinguis-
tics of computer-mediated communication (on which see Androutsopoulos 2006a).
Had we chosen Chicano-run forums in the USA or Nigeria’s popular Nairaland
forum instead of Jamaicans.com, we would have been alerted to what is probably the
most glaring lacuna in corpus-based research on the New Englishes – namely a failure
to recognise their being embedded in sometimes intensely multilingual communities.
ICE-JA aims to provide a corpus of JamEng – and yet makes clear that spoken English
in Jamaica cannot be understood without also taking into account JC. In the compilation
of CCJ, no attempt was made to separate JamEng and JC.
Future corpora of Indian, Nigerian or any other kind of second-language
English should, unlike the ICE sub-corpora, be conceived as multilingual corpora
from the very start, and the contact languages with which English interacts should
stop being treated as ‘extra-corpus’ material. As one prominent sociolinguist and ex-
pert on the New Englishes has recently put it:
For many ‘New English’ speakers, monolingualism is the marked case, a special
case outside of the multilingual prototype. Today’s ideal speaker lives in a hetero-
geneous society (stratified along increasingly globalized lines) and has to negoti-
ate interactions with different people representing all sorts of power and solidarity
positions on a regular basis. What is this ideal speaker a native speaker of, but
a polyphony of codes/languages working cumulatively (and sometimes comple-
mentarily), rather than a single, first-learned code? (Mesthrie 2006: 482)
Multilingual corpora are necessary tools to research a multilingual reality. The Web
may not be a suitable window on multilingualism in everyday face-to-face interaction,
but multilingual diasporic web forums are certainly not in short supply and await cor-
pus-linguistic exploration.
References
Allsopp, R. 1996. Dictionary of Caribbean English Usage. Oxford: OUP.
Androutsopoulos, J. (ed.). 2006a. Special issue on ‘The Sociolinguistics of Computer-mediated
Communication’. Journal of Sociolinguistics 10(5).
Androutsopoulos, J. 2006b. Multilingualism, diaspora, and the internet: Codes and identities on
German-based diasporic websites. Journal of Sociolinguistics 10: 520–547.
Beal, J., Corrigan, K.P. & Moisl, H. (eds). 2007. Creating and Digitizing Language Corpora, Vol. 1:
Synchronic Databases. Basingstoke: Palgrave Macmillan.
Bechar-Israeli, H. 1995. From <Bonehead> to <cLoNehEAd>: Nicknames, play, and identity on
Internet Relay Chat. Journal of Computer-Mediated Communication 1(2).
<http://jcmc.indiana.edu/vol1/issue2/bechar.html> (19 March 2010).
Beißwenger, M. & Storrer, A. 2008. Corpora of computer-mediated communication. In Corpus
Linguistics. An International Handbook, Vol. 1, A. Lüdeling & M. Kytö (eds), 292–308.
Berlin: Mouton de Gruyter.
Buchstaller, I. 2006a. Diagnostics of age-graded linguistic behaviour: The case of the quotative
system. Journal of Sociolinguistics 10: 3–30.
Buchstaller, I. 2006b. Social stereotypes, personality traits and regional perception displaced:
Attitudes towards the ‘new’ quotatives. Journal of Sociolinguistics 10: 362–381.
Chambers, J.K. 2000. Universal sources of the vernacular. In Die Zukunft der europäischen Sozi-
olinguistik/The Future of European Sociolinguistics/Le Futur de (la) sociolinguistique euro-
péenne, U. Ammon, K.J. Mattheier & P.H. Nelde (eds), 11–15. Tübingen: Niemeyer.
Chambers, J.K. 2003. The sociolinguistics of immigration. In Social Dialectology: In Honour of
Peter Trudgill [IMPACT: Studies in Language and Society 16], D. Britain & J. Cheshire
(eds), 97–113. Amsterdam: John Benjamins.
Coupland, N. 2007. Style: Language, Variation and Identity. Cambridge: CUP.
Crystal, D. 2006. Language and the Internet, 2nd edn. Cambridge: CUP.
Danet, B. & Herring, S.C. (eds). 2007. The Multilingual Internet: Language, Culture, and Com-
munication Online. Oxford: OUP.
Davies, M. 2001. Creating and using multimillion-word corpora from web-based newspapers.
In Corpus Linguistics in North America: Selections from the 1999 Symposium, R.C. Simpson
& J. Swales (eds), 58–75. Ann Arbor MI: The University of Michigan Press.
Deuber, D. 2005. Nigerian Pidgin in Lagos. Language Contact, Variation and Change in an African
Urban Setting. London: Battlebridge.
Deuber, D. 2009. ‘The English we speaking’: Morphological and syntactic variation in educated
Jamaican speech. Journal of Pidgin and Creole Languages 24: 1–52.
Deuber, D. & Hinrichs, L. 2007. Dynamics of orthographic standardization in Jamaican Creole
and Nigerian Pidgin. World Englishes 26: 22–47.
Dor, D. 2004. From Englishization to imposed multilingualism: Globalization, the Internet, and
the political economy of the linguistic code. Public Culture 16: 97–118.
Eckert, P. 2008. Variation and the indexical field. Journal of Sociolinguistics 12: 453–476.
Greenbaum, S. 1990. Standard English and the International Corpus of English. World Englishes
9: 79–83.
Greenbaum, S. (ed.). 1996. Comparing English Worldwide: The International Corpus of English.
Oxford: Clarendon.
Höhn, N. Forthcoming. Quotatives in Jamaican English. PhD dissertation, University of
Freiburg.
Kortmann, B. (ed.). 2005. A Comparative Grammar of British English Dialects: Agreement, Gen-
der, Relative Clauses. Berlin: Mouton de Gruyter.
Kortmann, B. & Szmrecsanyi, B. 2004. Global synopsis: Morphological and syntactic variation
in English. In A Handbook of Varieties of English, Vol. II: Morphology and Syntax, B.
Kortmann, K. Burridge, R. Mesthrie, E.W. Schneider & C. Upton (eds), 1142–1202. Berlin:
Mouton de Gruyter.
Mair, C. 2003. Kreolismen und verbales Identitätsmanagement im geschriebenen jamaikanis-
chen Englisch. In Zwischen Ausgrenzung und Hybridisierung: Zur Konstruktion von Iden-
titäten aus kulturwissenschaftlicher Perspektive [Identitäten und Alteritäten 14], E. Vogel, A.
Napp & W. Lutterer (eds), 79–96. Würzburg: Ergon.
Mair, C. 2009. Corpus linguistics meets sociolinguistics: The role of corpus evidence in the study
of sociolinguistic variation and change. In Corpus Linguistics: Refinements and Reassess-
ments – Proceedings of the 2007 ICAME Conference – Stratford-upon-Avon, A. Renouf & A.
Kehoe (eds), 1–26. Amsterdam: Rodopi.
Mann, C. & Stewart, F. 2002. Internet Communication and Qualitative Research: A Handbook for
Researching Online. London: Sage.
Mesthrie, R. 2006. Society and language: Overview. In Encyclopedia of Language and Linguistics,
Vol. 11, K. Brown (ed.), 472–484. Amsterdam: Elsevier.
Meyerhoff, M. & Niedzielski, N. 2003. The globalization of vernacular variation. Journal of
Sociolinguistics 7: 534–555.
Morris, M. 1999. Is English We Speaking and Other Essays. Kingston: Ian Randle.
Mukherjee, J. & Hoffmann, S. 2006. Describing verb-complementational profiles of new Eng-
lishes: A pilot study of Indian English. English World-Wide 27: 147–173.
Mukherjee, J. & Hundt, M. (eds). Forthcoming 2011. Exploring Second-Language Varieties of
English and Learner Englishes: Bridging a Paradigm Gap [Studies in Corpus Linguistics 44].
Nelson, G. 2006. World Englishes and corpora studies. In The Handbook of World Englishes, B.B.
Kachru, Y. Kachru & C.L. Nelson (eds), 733–750. Oxford: Blackwell.
OED = Oxford English Dictionary Online. <http://www.oed.com/>
Patrick, P. 1999. Urban Jamaican Creole: Variation in the Mesolect [Varieties of English around
the World G17]. Amsterdam: John Benjamins.
Platt, J., Weber, H. & Ho, M.L. 1984. The New Englishes. London: Routledge.
Sand, A. 2005. Angloversals? Shared Morphosyntactic Features in Contact Varieties of English.
Post-doctoral research thesis, Universität Freiburg im Breisgau.
Schneider, E. 2007. Postcolonial English: Varieties around the World. Cambridge: CUP.
Silverstein, M. 2003. Indexical order and the dialectics of sociolinguistic life. Language and
Communication 23: 193–229.
Simo Bobda, A. 2001. Taming the madness of English. Modern English Teacher 10(2): 11–18.
Simo Bobda, A. 2004. Linguistic apartheid: English language policy in Africa. English Today 77:
19–26.
Tagliamonte, S. & D’Arcy, A. 2007a. Frequency and variation in the community grammar: Track-
ing a new change through the generations. Language Variation and Change 19: 341–380.
Tagliamonte, S. & D’Arcy, A. 2007b. The modals of obligation/necessity in Canadian perspective.
English World-Wide 28: 47–87.
Trudgill, P. 1986. Dialects in Contact. Oxford: Blackwell.
Williams, J. 1987. Non-native varieties of English: A special case of language acquisition. English
World-Wide 8: 161–199.
Youssef, V. 2004. ‘Is English we speaking’: Trinbagonian in the twenty-first century. English To-
day 20(4): 42–49.
Towards a new generation of corpus-derived
lexical resources for language learning
David Wible and Nai-Lung Tsao
[N]ature is already, in its forms and tendencies,
describing its own design. Emerson
This chapter first argues that, despite their convenience compared to paper-
based resources, corpora are, by their very nature as collections of texts and
tokens, severely limited in what they can offer directly to language learners or
teachers. The focus here is on understanding these limitations with respect to
lexical knowledge, and it is suggested that overcoming them requires a different
sort of digital resource that mediates between corpora on the one hand and
teachers or learners on the other. The challenge is complicated by the fact that
such a lexical knowledge resource should capture patterns of word behaviors
that fall along a continuum between grammatically well-behaved and lexically
idiosyncratic. A knowledgebase called StringNet, designed to capture this range
of word behaviors, is described and motivated in detail.
1. Introduction
Arguably the most common reason that language educators turn to corpora is for help
in teaching vocabulary. They do this typically because corpora can readily show words
in the contexts of their actual use. A premise of this chapter is that, despite the advan-
tages, an unfortunate gap still stands between what learners need for vocabulary learn-
ing and what corpora currently provide. A further premise is that reducing this
distance calls for a new generation of corpus-derived resources. In what follows, we try
to develop an analysis of the nature of this gap, to motivate what sort of resource might
bridge it, and to illustrate this with one resource designed to help with the bridging.
The perspective we hope to develop makes sense only on a certain view of the
nature of vocabulary knowledge and the role of words within a language. That view is
eloquently distilled in the work of Dwight Bolinger (1977; 1985; inter alia). So we be-
gin with an extended quote from the opening of his five-page gem, “Defining the In-
definable”. His point here concerns lexicography. We quote him because his description
of the dictionary writer’s challenge in defining words mirrors the classroom language
teacher’s challenge in teaching them.
Lexicography is an unnatural occupation. It consists in tearing words from their
mother context and setting them in rows – carrots and onions, and beetroot and
salsify next to one another – with roots shorn like those of celery to make them fit
side by side, in an order determined not by nature but by some obscure Phoeni-
cian sailors who traded with Greeks in the long ago.1 Half of the lexicographer’s
labor is spent repairing this damage to an infinitude of natural connections that
every word in any language contracts with every other word, in a complex neural
web knit densely at the center but ever more diffusely as it spreads outward...
Undamaged definition is impossible because we know our words not as individual
bits but as parts of what Pawley and Syder (1983) call lexicalized sentence stems,
hundreds of thousands of them, conveniently memorized to repeat – and adapt –
as the occasion arises.... A speaker who does not command this array, as Pawley
and Syder point out, does not know the language, and there is little that a diction-
ary can do to promote fluency beyond offering a few hints. Bolinger (1985: 69)
As we read this now, the promise of corpora cannot help but suggest itself. “...[A]nd
there is little that a dictionary can do...” he says. “Yes, but there’s so much that corpora
can do”, we want to reply. If lexicographers must tear words from their natural habitat
to plant them in alphabetic rows, and if the resulting dictionaries are of so limited
value in light of the infinitude of interconnections lost in this process of domestication,
then corpora are the promising counter-balance. Corpora release words back into
their “mother contexts”, into the wild where teachers can guide students on highly af-
fordable digital fieldtrips into this habitat and train them to do ecological research in
vivo to balance all of the in vitro lab work with uprooted isolated words that they tra-
ditionally have done.
The urge to assign this sort of role to corpora within Bolinger’s analogy as a coun-
ter-balance to the dictionary is perfectly understandable. We want to argue, however,
that corpora in fact do not constitute this sort of natural habitat of words in the wild
that Bolinger alludes to. It is the misconception that they do so, we will suggest, which
has limited our ways of exploiting the promise of corpora for language learning. We
then describe what would be closer to the analog of such natural environments for
words, illustrate how corpora can play an essential role in creating these habitats
(or lexical ecologies), and sketch their value for language education along the way.
First, it is worth elucidating why picturing “corpora as a natural habitat of words”
within Bolinger’s metaphor is to misconstrue that metaphor, why corpora are not even
close to playing such a role in that picture. Our reasons have nothing to do with the
insufficient authenticity of corpora. There have been claims put forth that corpora lack
‘authenticity’ once their texts have been torn from larger contexts of situation
1. “Referring to the development of the alphabet by the Greeks after its invention by the Phoe-
nicians. (Ed.)” (This note appears in the original. D. W. & N.-L. T.)
(Widdowson 2000;2 Mishan 2004; inter alia). While this criticism may be warranted, it
is irrelevant to our point.
To see our point, we need to look more closely at Bolinger’s metaphor. He evokes
an ecology of tangled roots that interconnect all the lexical flora in ways that are lost
once they are uprooted for transplant into the ordered garden of the dictionary. But a
corpus is not such a natural ecology of words, nor even a sample of one. A corpus is
not where this myriad of original relations hold. A corpus is a collection of tokens. For
botanical plants in the real out of doors, the connections do indeed hold among tokens
(i.e. real and concrete instances of vegetation). But this is not the case in the meta-
phorical wild of words that Bolinger wants us to imagine. To think of the lexical con-
nections he has in mind as holding among tokens of words, say among words found in
a corpus, would render his whole metaphor incoherent. A token of a word in a par-
ticular utterance or line of text holds nothing near the ‘infinitude of natural connec-
tions’ among words he is indicating, at least not at that plane of existence, the plane of
tokens. At most a word token exhibits some syntagmatic connection with co-occur-
ring tokens in the same utterance, sentence, or text. But Bolinger is trying to evoke
something much more imposing and substantial. Extending the quote by a few words,
we see that he means “...an infinitude of connections that every word in any language
contracts with every other word” (emphasis added). This extent of interconnectivity
could be true only of words as types, not tokens, only of words meant as abstractions,
i.e., as lexemes. And if this is not enough to lift us from thinking at the level of tokens,
he tells us later in the sentence that the network of connections is “...a complex neural
web...”. It is neural. It is the language user’s own mental grasp of the relations among
words. So the crucial natural habitat of words is neither an alphabetic dictionary
(Bolinger’s point) nor massive collections of tokens of words in context as found in
corpora (our point). In the dictionary, the interconnections have been severed; in a
corpus, as we argue above, they have not yet taken hold.
In this article we try to build on two points drawn from Bolinger. First, his meta-
phor of a complex neural web is apt; it gives a useful (though, of course, partial) picture
of the mature language user’s grasp of the words of a language. Second, learners aspir-
ing to such a grasp will not find what they need in a dictionary. And we add a third
point: neither will they find it in a corpus. In our estimation, two of the more urgent
and at the same time tractable issues in corpus-based computational linguistics for
language education are: (1) constructing alternatives to the dictionary on the one hand
and the corpus on the other as lexical resources that more closely reflect Bolinger’s
picture of what learners need to master and (2) creating the means of making such
knowledge resources accessible when and where they matter to learners and teachers
(Wible 2008). We categorize the first issue as one of lexical knowledge discovery
(or extraction) and the second issue as lexical knowledge representation. We could
2. Our thanks to one of the reviewers for pointing out the relevance of the Widdowson (2000)
article to this point.
 David Wible and Nai-Lung Tsao
also call these the issues of What (knowledge) and How (to represent it). Of course,
progress on the first must be made before the second becomes an issue. We need to
have knowledge in hand before worrying about how to make it available. Thus, our
purpose in what follows centers around the first issue. Specifically we describe and
motivate a particular sort of lexical knowledgebase intended to capture at least some
of the lexical interconnectivity pictured by Bolinger.
We have little to say in this chapter about the second issue: how to make this
knowledge accessible to learners. Wible (2008) offers a view on a new generation of
lexical representation for language learning, especially for multiword expressions,
which require alternatives to both the dictionary and the corpus. The corpus-derived
resources we describe in this chapter are completely compatible with the approach to
lexical representation for learners that is proposed there. Near the end of this chapter
we touch on the issue of representation by describing how this is the case.
2. The gap between corpora and lexical knowledge
With a few incisive lines, Bolinger has made memorably clear a fundamental limita-
tion of the dictionary as a resource for learning words. Here we are interested in clari-
fying some limitations of corpora for that same role. As a resource for learning words,
a corpus can be seen as a repository of instances of words in use. The intuition behind
much of corpus-supported vocabulary learning and teaching, as we mentioned at the
outset, is that vocabulary learning depends upon exposure to instances or tokens of
words in use and corpora can provide users with just such exposure. We know of no
educator or researcher, however, who finds tokens of words to be interesting in them-
selves. Tokens are valuable only as windows onto what they betoken. And basically,
what they betoken is patterns of word behaviors. It is precisely for this reason that
corpus concordancing and KWIC searches are seen as useful. They are seen as useful
to the extent that they show conventional uses of words, that is, to the extent that they
reflect patterns of their behavior. Data-driven language learning approaches, for ex-
ample, are premised on the assumption that guided exposure to the data (the tokens)
will reveal these underlying patterns to the learner (Johns 1994; inter alia).
The point we want to make here, however, is that concordancing and KWIC
searches are not designed to find patterns. They are designed to find strings. Literally
and technically, they search by means of string matching, not by pattern matching.
Detecting patterns is left up to the user. The sorts of patterns that these searches are
good at making salient are cases where the forms in the strings coincide with a pattern.
For example, a query of the term look will show numerous cases of look at listed to-
gether if results are displayed alphabetically by the word to the right of the search term.
And this is because the two items are contiguous and the second one is a specific word
form, so all its tokens show up together under an alphabetic listing. But much of the
patterned use of words, say, as in formulaic sequences or multiword expressions, will
fall through the cracks here. As Read and Nation (2004) point out, many patterns of
word behavior involve non-contiguous strings, rendering them resistant to discovery
“whether it be by human intuition or automated computer search” (pp. 31–32).
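The contrast between string matching and pattern matching can be made concrete with a minimal Python sketch of a KWIC search. The function name `kwic`, the toy sentences, and the crude tokenizer are illustrative assumptions, not part of any concordancer discussed here.

```python
import re

def kwic(sentences, keyword, width=3):
    """Return (left, keyword, right) triples for every exact token match of `keyword`."""
    rows = []
    for sentence in sentences:
        # Crude tokenizer (an assumption): lowercase alphabetic strings only.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:  # plain string matching; no knowledge of patterns
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    # Sorting on the right context groups identical right-hand neighbours together.
    return sorted(rows, key=lambda row: row[2])

sentences = [
    "Please look at the map.",
    "They look happy today.",
    "Do not look at me.",
    "We look closely at every case.",
]
for left, kw, right in kwic(sentences, "look"):
    print(f"{left:>12} | {kw} | {right}")
```

In the sorted output the two contiguous cases of look at end up side by side, while look closely at is listed apart from them. Noticing that it belongs to the same pattern is left to the reader, which is the limitation at issue.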
There are corpus search programs that do support pattern searches, but only for
users who are willing to learn a technical language such as regular expressions or other
similarly complex notations. Once this learning curve is passed, however, the patterns
that these programs find are the patterns that the user tells them to search for. That is,
users must specify a pattern in their query. This is different from learners discovering
patterns they had not thought of looking for. In corpus applications for language learn-
ing, this point is crucial. It is precisely these sorts of facts about word behavior (those
that have not even occurred to the learner to wonder about) that underlie their most
recalcitrant errors or misconceptions in lexical knowledge (Wible 2008). So one of the
more important issues for corpus-supported language learning is how to overcome the
dilemma that corpora are best suited either for those who have a lexicographer’s nose for
lexical facts or those who already know what they are looking for before they begin.
Here, then, is the source of the gap between corpora and language learners that we
want to address. Corpora are ideal for storing instances of language in use, but the mind
is not. In the long run, the mind is better at distilling than storing. One property of
words worth distilling is the patterns of their behavior. And what Bolinger pointed out
is that, in the case of words, what the mind distills is not a list but a web. Corpora do no
such distilling; nor do they afford it in any straightforward way. Accordingly, the sort of
resource we aim for is something more akin to such a distilled web, with the dense in-
terconnections navigable not only among words but also among patterns of word uses.
3. The role of some current constructs
Researchers interested in extracting patterns of behavior of words from corpora have
relied traditionally on the construct of the n-gram. N-grams are ordered n-tuples of
grams, and the ‘grams’ of n-grams are linguistic units, typically words. So a tri-gram is
a contiguous sequence of three words; a 4-gram is such a sequence of 4 words, and so
on, with no upper limit, in principle, on the value of n. A specific n-gram is a type, and
we can search a corpus for all of its tokens. Thus, for example, put up with, taken as a
tri-gram, is a type, and we can extract all tokens of it from a corpus, using string
matching to identify each occurrence where these three word forms appear side by
side.3 Such work with n-grams lies behind much corpus-based research for second
3. Typically the grams of n-grams are specific word forms (as opposed to lexemes). Thus put
up with and putting up with would be two distinct tri-grams. This makes it possible to extract
n-grams from corpora that have not been lemmatized. But, as we try to show later, it also obscures
the obvious relationship between put up with and putting up with. StringNet encodes this rela-
tionship while still distinguishing between the two variations.
language learning. Important constructs such as ‘lexical bundles’ (Biber et al. 1999;
Biber & Conrad 1999; Biber et al. 2003; Biber et al. 2004) and ‘formulaic sequences’
(Simpson-Vlach & Ellis 2010), for example, have been operationalized as n-grams,
making it possible to identify them in large corpora, to rank them by frequency, com-
pute the strength of association of the co-occurring words with each other, determine
which n-grams are distinctive of particular genres, and in general to render them sus-
ceptible to analysis. This sort of corpus work has been valuable in helping determine
which multiword sequences are important for learners to learn and teachers to teach.
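As a minimal sketch of what operationalizing a multiword sequence as an n-gram amounts to, the following Python fragment extracts all contiguous tri-grams of word forms from a toy text and counts the tokens of each type. The helper name `ngrams` and the sample text are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-tuple of word forms in `tokens`."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Toy, pre-tokenized text (an assumption; a real corpus needs proper tokenization).
text = ("i cannot put up with this noise and "
        "she is putting up with it but "
        "they put up with worse")
tokens = text.split()
trigram_counts = Counter(ngrams(tokens, 3))

# Each distinct tuple is a type; its count is the number of its tokens.
print(trigram_counts[("put", "up", "with")])
print(trigram_counts[("putting", "up", "with")])
```

Because the grams are word forms, put up with and putting up with are counted as two unrelated types, exactly as described in note 3.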
N-grams are one-dimensional, however. They encode syntagmatic relations of co-
occurring elements. For this reason, they flatten the space available for representing
the rich network of interconnections among words that we aim to encode. This one-
dimensionality has consequences. To illustrate just one, seen as n-grams, the strings
consider himself lucky and consider yourself lucky are simply two different tri-grams, as
distinct from each other as they are from any other tri-grams, say from stroke of luck,
a fine line, up with the, or close to you. There is something counter-intuitive about this.
There are degrees of similarity and difference among n-grams worth distinguishing
and connections among them worth capturing. To make such distinctions and capture
such connections, however, we need something other than the n-gram.
It is the same one-dimensionality of n-grams that explains in part why they are
made available to users (when made available at all) in lists. Such lists may be orga-
nized in order of frequency of occurrence in a corpus or by the strength of association
of the component grams, say, by mutual information (MI) score (Simpson-Vlach &
Ellis 2010), but always they are lists and the lists are flat, ranking n-grams but showing
no relations between or among them. It is worth recalling that our approach to nar-
rowing the gap between corpora and learners is to create, as an underlying knowledge
resource, a navigable web rather than a list. The point here is that such a web will not
be composable from the traditional unit of the n-gram.
There is research that avoids the restrictions of the n-gram, some of it explicitly
acknowledging the limitations of n-grams for lexicography and language education. In
a substantial literature on collocation extraction, for example, it is common to assume
a window of proximity for the two collocating words rather than fixed sequences of
grams in which the two occupy fixed slots (Church & Hanks 1990; Church et al. 1991;
Dunning 1993; Manning & Schütze 1999; Evert & Krenn 2001; inter alia). Collocability
is computed from instances where the two words co-occur within this window, in
some cases with no requirement that they be adjacent or part of any fixed strings like
n-grams. On this approach, verb collocates for the noun mistake are extracted by
counting (and computing conditional probabilities in the presence of) verbs that oc-
cur within, say, a five-word window of mistake. Accordingly, the V-N collocation make
mistake is extracted by taking into account indiscriminately all cases where make and
mistake co-occur within the window. This includes cases of make a mistake, make lots
of mistakes, make so many mistakes, make the mistake, make the biggest mistake, and so
on, with no regard for what intervenes in the rest of the window. Word association
measures run on such data have produced important results in extracting collocations
from corpora.
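The window-based approach can be sketched in a few lines of Python. The sketch below assumes a lemmatized toy corpus and uses pointwise mutual information as one common association measure; the function name `window_pmi`, the data, and the choice of measure are illustrative assumptions rather than the procedure of any study cited above.

```python
import math
from collections import Counter

def window_pmi(sentences, node, window=5):
    """Score words co-occurring with `node` within +/- `window` tokens by pointwise MI."""
    word_freq = Counter()
    pair_freq = Counter()
    total = 0
    for tokens in sentences:
        word_freq.update(tokens)
        total += len(tokens)
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Whatever intervenes between the two words is ignored.
                    pair_freq[tokens[j]] += 1
    scores = {}
    for word, joint in pair_freq.items():
        # PMI = log2(P(word, node) / (P(word) * P(node))), using raw relative frequencies.
        scores[word] = math.log2((joint * total) / (word_freq[word] * word_freq[node]))
    return scores

# Toy corpus, assumed to be lemmatized already.
sentences = [
    "she make a mistake".split(),
    "he make the big mistake again".split(),
    "they make lots of mistake".split(),
    "we eat a sandwich".split(),
    "you see the dog".split(),
]
scores = window_pmi(sentences, "mistake")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 2))
```

Note that the result is a score for the pair make and mistake. The articles, quantifiers, and adjectives that stood between them in the individual sentences leave no trace in it.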
While collocation research has proceeded apace with little regard for the n-gram,
another literature has taken aim directly at the n-gram and its limits (Cheng et al.
2006; Wible et al. 2006a). This work, however, considers the limitations arising not
from the n-gram’s one-dimensionality (the limitation we are concerned with) but from
its contiguity, the restriction that the grams must co-occur in uninterrupted, contigu-
ous sequences. The alternatives explored in this literature, for example congrams and
skipgrams (Cheng et al. 2006; Cheng et al. 2009), loosen this restriction to allow for
discontinuous sequences of grams. This work opens up spaces horizontally, allowing a
wider range of variations in what sequences can be retrieved for a target word or pair
of co-occurring words. Skipgram searches detect both contiguous and non-contiguous
word co-occurrences within a window. The congram in addition allows variation in
the linear ordering of the co-occurring words, for example, both played a role and the
role she played. It is important to note that these researchers intend the constructs of
skipgrams and congrams for lexicographic research rather than for direct use by learn-
ers or teachers. While the skipgram and congram acknowledge the importance of
discontinuous strings in understanding word behavior and open up the spaces sur-
rounding the target words, what they extract are sentences containing the congrams
rather than patterns that include these intervening spaces (finding patterns again
would be done by the user). In this respect, the perspective on word behaviors that this
work takes is still essentially horizontal. The limitations of this perspective are difficult
to elucidate without a contrasting alternative view. For this, we turn next to the alter-
native we pursue in the design of the lexical knowledgebase called StringNet.
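The difference between an ordered search within a window and one that also allows variation in linear order can be illustrated with the following Python sketch. It is only loosely modelled on the skipgram and congram as characterized above and is not the implementation of Cheng et al.; the function name `pair_hits` and its parameters are hypothetical.

```python
def pair_hits(tokens, first, second, window=4, any_order=False):
    """Return (i, j) index pairs where `first` and `second` co-occur within `window` tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Ordered search looks only to the right; unordered search looks both ways.
        start = max(0, i - window) if any_order else i + 1
        for j in range(start, min(len(tokens), i + window + 1)):
            if j != i and tokens[j] == second:
                hits.append((i, j))
    return hits

a = "he played a major role in the film".split()
b = "the role she played was small".split()

print(pair_hits(a, "played", "role"))                  # non-contiguous, same order
print(pair_hits(b, "played", "role"))                  # ordered search misses this one
print(pair_hits(b, "played", "role", any_order=True))  # order variation allowed
```

What such a search returns is a set of positions in sentences. The intervening material (a major, she) is retrieved along with them but is not generalized into a pattern.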
4. The lexical knowledgebase4
What is missing from the one-dimensional n-gram and its extensions, we want to sug-
gest, is the paradigmatic dimension. And this we incorporate into StringNet in a
straightforward way that has wide-ranging consequences. For a construct that cap-
tures both syntagmatic and paradigmatic aspects of word behaviors, we have intro-
duced the notion of hybrid n-gram (Tsao & Wible 2009; Wible & Tsao 2010). We hope
to show that, once the paradigmatic dimension is introduced within the minimal unit
of the hybrid n-gram, it then becomes possible from a corpus-derived set of these
units to generate not simply a list, but a net. This is because the paradigmatic dimension
makes it possible not only to identify patterns of word behavior but also to create an
entirely new space where relations hold among these patterns and among the words in
them. Moreover, in this space these relations become susceptible to automatic detection
4. See Wible and Tsao (2010) for details on the computational aspects of the knowledge-
base design.
and indexing. The resulting StringNet is a massive and organic lexical knowledgebase
whose structure is not prescribed but emerges. We illustrate in what follows.5
4.1 Hybrid N-grams
Essentially, hybrid n-grams expand the inventory of gram types and allow these differ-
ent types of grams to co-occur in the same hybrid n-gram. Most notably, we add parts
of speech (POS) as a gram type. Thus, in the same hybrid n-gram, POS grams can oc-
cur alongside lexemes or word forms. For example, in the hybrid n-gram consider
[pnx] lucky, the second gram, [pnx], is the POS tag for reflexive pronouns. Thus, this
one hybrid n-gram describes consider himself lucky and consider yourself lucky and all
other instances with different reflexive pronouns in that same second slot. This both
captures the similarity between these strings and distinguishes them from other tri-
grams (e.g. from the likes of, time after time, tally up the, or so if you). Similarly, while
traditional n-grams must treat my point of view and your point of view as different
4-grams, as distinct from each other as from the 4-grams once upon a time, in case of
emergency, and a friend in need, the hybrid n-gram [dps] point of view can capture the
similarity they share and differentiate them from other, dissimilar 4-grams.
The hybrid n-grams of StringNet permit four distinct types of grams: (1) specific
word forms, such as climb, climbed, climbing; (2) lexemes (indicated in bold), such as
climb, subsuming all its word form variations; (3) fine-grained part of speech (POS)
categories (46 different categories from CLAWS 5, see Burnard 2007), indicated in
brackets like [noun sg]; and (4) coarse-grained POS categories (twelve general tags
subsuming the 46 fine-grained ones),6 also marked in brackets.
5. Another lexicographic tool that incorporates both the syntagmatic and paradigmatic di-
mensions in capturing patterns of word behavior is Sketch Engine (Kilgarriff et al. 2004). Sketch
Engine is a fundamental contribution to the ‘new generation’ of corpus-derived lexical resourc-
es referred to in the title of this chapter. StringNet and Sketch Engine differ from each other in
the approach taken to discovering and representing the syntagmatic and paradigmatic dimen-
sions. A point by point comparison of the two resources is beyond the scope of this chapter and
would be misleading because of their differing aims and approaches. One representative differ-
ence worth pointing out here is that Sketch Engine has the advantage of providing a set of im-
portant pre-determined functional slots in the context of the target word (for example, in the
case of a target verb, it clearly lays out the grammatical subject slot for that verb and shows the
various strings attested as subject of that verb, and so on). As we elaborate in the text, StringNet
takes a different tack. Specifically, its hybrid n-grams are built case by case (though automati-
cally) for a search word without targeting specified or pre-determined functional relations to the
target word that should be represented. In this and other respects, the two resources are suited
to exploring or highlighting complementary (as well as overlapping) aspects of word behavior.
6. For example, all the various verb forms distinguished in the detailed POS tag set are sub-
sumed into the single category V in the coarse-grained set.
The main restriction imposed on the co-occurrence of gram types within a hybrid
n-gram is that at least one of the co-occurring grams must be lexical. So there must be
at least one lexeme or word form in a hybrid n-gram. This ensures that hybrid n-grams
are lexically grounded and reflect word behavior. Traditional n-grams are subsumed as
one type of hybrid n-gram. Thus, the traditional 5-gram leaving aside the question of is
included in StringNet, but as just one of the numerous hybrid n-grams that also de-
scribe this same string. With four tiers of gram types available for each gram slot in
hybrid n-grams, such a single traditional 5-gram corresponds to 512 distinct hybrid
n-grams.7 Table 1 includes a small sampling for this one.
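The combinatorics of hybrid n-grams can be illustrated with a naive Python sketch that, for each slot of a tagged string, chooses one of the four tiers and discards combinations with no lexical gram. The tag labels, the tier names, and the function `hybrid_ngrams` are assumptions made for illustration. The sketch does not attempt to reproduce the counts reported here, which depend on details of the StringNet design (see Wible & Tsao 2010), and it applies no frequency threshold.

```python
from itertools import product

TIERS = ("form", "lexeme", "pos_fine", "pos_coarse")
LEXICAL_TIERS = {"form", "lexeme"}

def hybrid_ngrams(tokens):
    """Enumerate one tier choice per slot, keeping only lexically grounded combinations."""
    slots = [list(zip(TIERS, token)) for token in tokens]
    for combo in product(*slots):
        # Restriction from the text: at least one gram must be a word form or lexeme.
        if any(tier in LEXICAL_TIERS for tier, _ in combo):
            yield combo

# Each token: (word form, lexeme, fine POS, coarse POS). Tag labels are illustrative.
tokens = [
    ("considers", "consider", "[Vvz]", "[verb]"),
    ("himself", "himself", "[pnx]", "[pron]"),
    ("lucky", "lucky", "[adj0]", "[adj]"),
]
all_hybrids = list(hybrid_ngrams(tokens))
target = (("lexeme", "consider"), ("pos_fine", "[pnx]"), ("form", "lucky"))
print(len(all_hybrids), target in all_hybrids)
```

The traditional tri-gram is the combination that picks the word form in every slot, so it falls out as one member of the enumerated set.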
While the discussion here is focused on lexical knowledge discovery rather than
representation, it is worth mentioning one new possibility that hybrid n-grams create
for knowledge representation. Specifically, the hybrid n-grams of StringNet afford a
unique concordancing that answers queries not with lists of sentences but with lists of
patterns. We implement this concordancing through a web-based search interface
Table 1. Sampling of the distinct hybrid n-grams corresponding to the traditional
5-gram leaving aside the question of
Hybrid n-grams
Leaving aside the question of
Leaving aside [art] question of
Leaving aside [art] question [prp]
Leaving aside the [noun] of
Leaving aside the [noun] [prp]
Leaving aside the [noun sg] of
Leaving aside the [noun sg] [prp]
Leaving aside [art] [noun sg] [prp]
Leaving aside [art] [noun] [prp]
Leave aside [art] [noun sg] of
Leave aside [art] question of
Leave aside [art] question [prp]
[Vvg] aside the question of
[Vvg] aside the question [prp]
[verb] aside the question of
[verb] aside the question [prp]
7. Every traditional 6-gram corresponds to 2,048 hybrid 6-grams, each 7-gram to 8,192 hy-
brid 7-grams, and an 8-gram to 32,768 distinct hybrid 8-grams.
Figure 1. Sample LexChecker search results for eye
called LexChecker (see Figure 1).8 The listed patterns, in turn, are each linked to all
sentences in BNC that instantiate them (see Figure 2).9
Returning now to knowledge discovery, an important feature of hybrid n-grams is
that, while they represent patterned word uses, the patterns are not prescribed, for
example by an inventory of lexical entry templates, but emerge and are discovered by
simple computational means from the BNC. Recall that hybrid n-grams draw upon
four tiers of gram types, each type with the potential to occupy any slot in a string. The
patterns that are automatically extracted for StringNet are all those describable within
this immense space of possibilities and attested in the BNC with a threshold frequen-
cy.10 A single algorithm extracts a wide range of lexical behaviors. These include not
only fixed expressions such as tongue in cheek, by word of mouth, on foot, one way or
another, slip of the tongue, in fact, wear and tear, as a matter of fact, by trial and error,
with the possible exception of, stand up and be counted, but also quirky patterns of
highly idiosyncratic word behaviors such as [noun] and [noun] alike (‘teacher and stu-
dent alike’) or vary from [noun] to [noun] as well as grammatical constructions, such
8. www.lexchecker.org. Query results are ranked by a combination of frequency of the hybrid
n-gram and the association strength among the co-occurring grams composing it (determined
by a mutual information, MI, measure). Since StringNet is huge, exceeding four terabytes (over
4000 gigabytes), the only practical means of making it available is through a web-based service.
9. We rank the hybrid n-grams in the search results according to a normalized MI measure
that enables us to place hybrid n-grams of different lengths along the same scale. For details, see
Tsao & Wible (2009).
10. Currently, the frequency threshold is set at five; that is, any hybrid n-gram attested with five
or more tokens in BNC is included in StringNet.
Example sentences are from British National Corpus
No. Sentences
1 The mite is just visible to the naked eye and feeds on honey bees and their grubs by
sucking their body fluids.
2 But many kinds of bacteria in nature form elaborate colonies, often quite visible to the
naked eye, in which different individuals perform different functions, so that the whole
colony functions as if it were a single organism.
3 Because the creatures of the plankton on individually are small, they are not always
visible to the naked eye.
4 Because they are so faint, not a single one is visible to the naked eye.
5 Protozoa are much larger than bacteria or viruses, although still not visible to the naked
eye.
6 However, some cells, like the large eggs of frogs, are easily visible, and the human egg is
just visible to the naked eye.
7 The human and mouse eggs are about one tenth of a millimetre in diameter and are just
visible to the naked eye.
8 Quite often, olivine and pyroxene begin to crystallize out early on, so they may be present
in the final rock as quite large crystals, up to a centimetre across, many times larger than
the crystals surrounding them, and easily visible with the naked eye.
Figure 2. Sentences linked to the hybrid n-gram visible [prep] the naked eye
as the ‘It Cleft proto-construction’: it be the [noun] that [verb] (‘It’s the thought that
counts’). Such a space is much richer and more capable of registering nuances than one
using typical lexical entries that assume a discrete delineable interface between words
and grammar and that impose a priori structure on the permitted patterns or features
of words to be encoded.
Hybrid n-grams also add important refinements to collocation knowledge by de-
tecting larger patterns in which many collocations are often embedded. This is possible
because hybrid n-grams not only detect collocating words that may or may not be
adjacent to each other, but also encode them as part of the contextual patterns within
which they co-occur. Taken as collocations pure and simple, widely cited pairs such as
spend time or make mistakes leave hidden important features of the patterns in which
they conventionally appear. A common frustration for students is to have a teacher
correct a miscollocation like I pay time... to I spend time..., only to have the revised
version then marked again for a newly discovered error when the teacher notices what
follows the collocation: I spend time to clean my room every Saturday (cf. spend time
cleaning... or take time to clean...). Hybrid n-grams express the full patterns here, and a
query of StringNet for time lists them: spend time [Vvg] and take time [to V].
Similarly, learners are commonly taught the collocation make a mistake (as op-
posed to the miscollocation do a mistake) but not the patterns of its larger context. Yet
this collocation, as so many others, is simply one part of a set of wider patterns: make
the mistake of [Vvg] but not make the mistake [to V]. Nor is this the complete pattern.
The noun mistake takes both complement types, of [Vvg] and [to V], but the choice of
which forms should follow mistake is conditioned not by mistake alone but in
combination with what precedes it. Compare I made the mistake of trusting him and It
was a mistake to trust him. Notice too that even definiteness here (the mistake vs. a
mistake) is conditioned by context. These are subtle dependencies within which col-
location forms just one part. In StringNet they are distilled whole and set into clear
relief as distinct hybrid n-grams: be a mistake [to V] and make the mistake of [Vvg].
Such patterns, exhibiting lexically determined grammatical properties, suggest hybrid
n-grams’ potential for uncovering colligation (Hoey 2004) and for investigating lexi-
co-grammatical constructions.11
4.2 Relations among hybrid n-grams
The appearance of POS categories in the hybrid n-grams of StringNet not only creates
an explosion in the number of patterns of word behavior we detect, but also, as men-
tioned above, raises the possibility of detecting relations among these patterns.12 It is
in this respect that introducing the paradigmatic dimension makes possible the emer-
gence of an organic structure in the lexical knowledgebase. Consider the 5-gram dis-
cussed above, for example. The query results for the word aside include not only the
5-gram we saw above ‘leaving aside the question of’, but also a parent of that hybrid
n-gram: ‘leaving aside the [noun sg] of’. And that parent/child relation between them
is automatically detected and encoded in StringNet by virtue of the fact that question
in the first hybrid n-gram is a case of (or a child of) its counterpart gram in the second
one, the category [noun sg]. In turn, StringNet indexes this parent hybrid n-gram to
all of the other tokens that instantiate it in BNC. Further, StringNet indexes this spe-
cific [noun sg] slot in this exact hybrid n-gram to all of the particular nouns that are
attested in that specific slot in ‘leaving aside the [noun sg] of’. As it turns out, there are
thirteen different such nouns appearing in the 28 tokens of this pattern in BNC. Thus,
this set of thirteen nouns constitutes a paradigm. The noun question of course belongs
to this specific paradigm, but this is not a discrete or independent fact. Just as ‘leaving
11. See Wible & Tsao (2010) for a description of StringNet as a resource for investigating lin-
guistic constructions.
12. We prune search results shown to users to eliminate a large number of redundant patterns.
For example, any pattern that is attested by only one of its more specific sub-patterns (or ‘chil-
dren’) is pruned as redundant and the attested sub-pattern retained. For example, the pattern
‘from [dps] point [prp] view’ is automatically pruned by comparing it with the more specific
pattern ‘from [dps] point of view’, while the latter, more specific pattern is retained. This is because
in all instantiations of ‘from [dps] point [prp] view’, the [prp] category is realized as the
preposition ‘of’. In contrast, when this retained pattern ‘from [dps] point of view’ is compared
with its sub-patterns or children, for example with ‘from my point of view’, it will be found that
‘my’ is not the only instantiation of the [dps] slot in this pattern. There are attested cases of ‘from
their/our/his/her point of view’. Thus, the more general parent pattern here ‘from [dps] point of
view’ is not pruned and is shown in the results because the [dps] represents attested substitut-
ability in that slot. For details on this pruning, see Wible & Tsao (2010).
aside the [noun sg] of’ is related by parenthood to ‘leaving aside the question of’ because
question is a case of [noun sg], so it in turn has its own parent patterns. Thus,
navigating StringNet upward from the hybrid n-gram ‘leaving aside the [noun sg] of’,
we discover the parent pattern ‘[Vvg] aside the [noun sg] of’, with [Vvg] including the
verb ‘leaving’ as a subcase (or child).13 In this parent pattern we have co-occurring
POS slots in the same hybrid n-gram, that is, two paradigms marked as [Vvg] and
[noun sg] connected syntagmatically. Linking to its instances we find two things. First,
the [Vvg] slot in this pattern is instantiated by only two distinct verbs: setting (aside)
and leaving (aside). Second, the choice between these two verbs in this position cor-
responds to a change of the membership in the paradigm in the neighboring [noun sg]
slot. This can be seen by comparing the two (see Figure 3).
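The parent/child relation described above, together with the pruning rule of note 12, can be sketched in Python as follows. The small table of gram generalizations, the one-slot definition of parenthood, and the function names `is_parent` and `is_redundant` are simplifying assumptions for illustration; the actual StringNet procedure is described in Wible & Tsao (2010).

```python
# Hypothetical one-step generalizations for individual grams
# (word form -> POS tag, POS tag -> coarser POS tag).
GRAM_PARENT = {
    "question": "[noun sg]",
    "issue": "[noun sg]",
    "[noun sg]": "[noun]",
    "leaving": "[Vvg]",
    "setting": "[Vvg]",
    "[Vvg]": "[verb]",
}

def is_parent(parent, child):
    """True if `parent` equals `child` except for one slot generalized by one step."""
    if len(parent) != len(child):
        return False
    diffs = [(p, c) for p, c in zip(parent, child) if p != c]
    return len(diffs) == 1 and GRAM_PARENT.get(diffs[0][1]) == diffs[0][0]

def is_redundant(slot_fillers):
    """Note 12's rule as read here: a generalized slot with a single attested filler is pruned."""
    return len(set(slot_fillers)) <= 1

child = ("leaving", "aside", "the", "question", "of")
mid = ("leaving", "aside", "the", "[noun sg]", "of")
top = ("[Vvg]", "aside", "the", "[noun sg]", "of")

print(is_parent(mid, child), is_parent(top, mid), is_parent(top, child))
print(is_redundant(["of", "of", "of"]))                        # one filler only: prune
print(is_redundant(["my", "their", "our", "his", "her"]))      # attested substitutability: keep
```

Under this definition the most general pattern is a grandparent rather than a parent of the original 5-gram, so navigation proceeds one generalized slot at a time.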
The slight shift from leaving to setting in the [Vvg] gram here corresponds to an
accompanying change in the membership of the neighboring [noun sg] paradigm.
The noun question appears in this [noun sg] slot in the presence of both setting and
leaving, and it has only one cohort noun that shares this same distribution: issue. So
with each shift in the neighborhood (syntagmatically), this noun contracts a new set
of relations with a new set of neighbors (paradigmatically) with a slight overlap in the
two conditions.
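The dependency between the two slots can be checked mechanically once each instance is indexed by its slot fillers. The Python sketch below uses the nouns shown in Figure 3; the assignment of nouns to the two verbs follows one reading of that figure and should be treated as an assumption, as should the function name `slot_paradigms`.

```python
from collections import defaultdict

# (filler of [Vvg], filler of [noun sg]) pairs, following one reading of Figure 3.
ATTESTED = [
    ("leaving", "study"), ("leaving", "difficulty"), ("leaving", "case"),
    ("leaving", "position"), ("leaving", "impact"),
    ("leaving", "question"), ("leaving", "issue"),
    ("setting", "question"), ("setting", "issue"), ("setting", "mass"),
    ("setting", "forfeiture"), ("setting", "sum"),
    ("setting", "chance"), ("setting", "bundle"),
]

def slot_paradigms(instances):
    """Index the nouns attested in the [noun sg] slot by the verb in the [Vvg] slot."""
    paradigms = defaultdict(set)
    for verb, noun in instances:
        paradigms[verb].add(noun)
    return paradigms

paradigms = slot_paradigms(ATTESTED)
shared = paradigms["leaving"] & paradigms["setting"]
print(sorted(shared))
print(sorted(paradigms["leaving"] - shared))
print(sorted(paradigms["setting"] - shared))
```

The intersection of the two sets is the overlap noted above, and the two set differences are the nouns whose membership depends on the choice of verb.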
Thus, hybrid n-grams derived from a corpus and indexed to it can tell us more
than one-dimensional constructs can about the company that words keep. One-di-
mensional n-grams can indicate a specific pattern of company kept between leaving
and question and between leaving and issue but by two separate and independent
n-grams. The added dimension of hybrid n-grams, however, can tell us that, as a con-
sequence, question and issue keep company with each other, but only in another
13. The possibility of actually ‘navigating’ StringNet by linking to the parents or children of any
hybrid n-gram that appears in any search results is realized in a prototype research interface that
we have just completed at the time of writing (http://nav.stringnet.org). Navigation is afforded
in the prototype by a pair of links appearing beside each hybrid n-gram listed in a set of search
results alongside the current ‘Examples’ link that accompanies each hybrid n-gram in a search
result. These new links are labeled ‘Parents’ and ‘Children’. Clicking on the ‘Parents’ link, say, for
the hybrid n-gram ‘consider yourself lucky’ gives a list of its parent hybrid n-grams, for example,
‘consider [pnx] lucky’, ‘consider yourself [adj]’ and so on. These parents in turn show links to
each of their parents and children. This is invaluable for research into constructions to deter-
mine whether a specific string is a one-off frozen expression or a specific case of a more general
construction. Thus, the hybrid n-gram ‘it’s the thought that counts’ links to the parent ‘it’s the
[noun] that counts’ showing variation possible in that exact noun slot. (In fact, a list of the nouns
attested in that noun slot appears in a pop-up linked to the [noun] slot.) This hybrid n-gram
links in turn to its own parent ‘it’s the [noun] that [verb]’, showing substitutability in the verb
slots as well (and, by pop-up windows, can show the verbs that are attested in that slot), thus
leading to the proto-construction, the ‘It Cleft’. The research possibilities afforded by StringNet
proliferate with such a navigable network. For example, dependencies can be detected within
such discovered constructions, such as the non-canonical agreement holding in the ‘It Cleft’
(e.g. ‘It’s the voters that count’ vs. ‘It’s the thought that counts.’)
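One way to picture the parent relation is as the replacement of a single word slot by its POS label. The following is a minimal sketch under that assumption only; the function name `parents`, the tuple representation, and the tag label `verb` for consider are hypothetical, and the actual StringNet hierarchy may include further levels that this sketch ignores.

```python
def parents(gram, pos_tags):
    """Return the hybrid n-grams one step more general than `gram`.

    `gram` is a tuple of slots; a slot is either a word form or a POS label
    in square brackets. `pos_tags` gives the POS label for each position.
    """
    result = []
    for i, slot in enumerate(gram):
        if slot.startswith("["):
            # Already a POS slot: nothing to generalize at this position.
            continue
        result.append(gram[:i] + (f"[{pos_tags[i]}]",) + gram[i + 1:])
    return result


gram = ("consider", "yourself", "lucky")
tags = ("verb", "pnx", "adj")  # illustrative tag labels

for parent in parents(gram, tags):
    print(" ".join(parent))
```

Applying the same function to one of the parents it returns yields that parent's own parents, which mirrors the step-by-step upward navigation described in this note.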
 David Wible and Nai-Lung Tsao
[Vvg] aside the [noun sg] of

leaving:  study, difficulty, case, position, impact, question, issue
setting:  question, issue, mass, forfeiture, sum, chance, bundle

Figure 3. Co-occurring paradigms occupying two POS slots in a hybrid n-gram
dimension (the vertical, paradigmatic here) as members of the same paradigm. And
the cross-indexing of hybrid n-grams makes it possible to detect exactly all of the
intersections of the vertical and horizontal dimensions that bring two words into
company anywhere in this network (like drinking buddies, but also prayer partners
and sparring partners and so on) and what other words are implicated along the
way (who else frequents the same bar, church, and gym). And so, the possibilities for
‘word associations’ that we can detect under StringNet quickly proliferate.
This example shows a minute portion of the relations that a word contracts with
other words in the context of StringNet. We see, for example, two related paradigms
containing the noun question. StringNet contains hundreds of thousands of such para-
digms, and none of them are isolated. They are indexed to each other, directly or indi-
rectly, by virtue of the myriad syntagmatic and paradigmatic connections that take
hold in this web. A single word may occur in hundreds or thousands of highly specific
paradigms, implicating that word in a unique “infinitude of natural connections” that
extend horizontally and vertically. In StringNet, they are susceptible to discovery and
exploration. To take one of the simpler possibilities, information-theoretic or statistical
measures can be used to determine to what extent there is really a conditioning rela-
tionship between the membership of the [Vvg] slot and that of the [noun sg] in the
family of patterns above as opposed to differences in membership we would expect by
chance given the corpus size and the frequencies of the words involved. While we have
not yet run such measures, the point here is that, by virtue of its structure, StringNet
makes it possible to do so. That is, the rich and explicitly structured web of relations
encoded in StringNet makes it possible to run such measures and discover these and
other more complex and nuanced connections.
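As an illustration of what such a measure might look like, the sketch below computes the log-likelihood statistic (G², in the spirit of Dunning 1993) over a contingency table of verb by noun group. The counts are invented purely for illustration, since no such measure has been run on StringNet, and the function name `log_likelihood` is a hypothetical choice.

```python
import math


def log_likelihood(table):
    """G-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g2 += observed * math.log(observed / expected)
    return 2 * g2


# HYPOTHETICAL counts, invented for illustration only.
# Rows: verb in the [Vvg] slot. Columns: two disjoint groups of nouns
# in the [noun sg] slot.
counts = [
    [30, 2],   # leaving
    [3, 25],   # setting
]
print(round(log_likelihood(counts), 2))
```

With one degree of freedom, a value above 3.84 would conventionally be read as significant at the 0.05 level, that is, as evidence that the choice of verb conditions the membership of the noun slot beyond what chance would predict.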
Towards a new generation of corpus-derived lexical resources for language learning 
5. Knowledge representation and access for users
The fundamental issue motivating this chapter has been the gap between the sorts of
knowledge language learners need, on the one hand, and the sort of thing a corpus is,
on the other. We have described StringNet as a corpus-derived knowledgebase de-
signed to help bridge this gap, specifically with respect to lexical knowledge. But how
does StringNet constitute a contribution in this respect? How is StringNet closer to the
knowledge a language learner needs? As we pointed out earlier, this challenge of ‘get-
ting closer’ in the case of lexical knowledge resources involves the two aspects of lexi-
cal knowledge discovery and lexical knowledge representation. This chapter is devoted
mainly to knowledge discovery, specifically, showing how StringNet distills from cor-
pora the patterned uses of words and the relations of these patterns and these words to
each other. The main purpose has been to show how this design for a lexical knowl-
edgebase in itself comes uniquely close to the sorts of lexical knowledge a language
learner needs compared, for example, to the collection of tokens that constitutes a
corpus. With the size and complexity of StringNet, the question of representation re-
mains, however: How can learners access it? There sits this knowledge in the form of a
massive cross-indexed network of lexical patterns. How can it reach learners? String-
Net, due to its unique structure and content, in fact opens a wide range of new possi-
bilities for lexical knowledge representation.14 While the aspect of representation is
not the main concern of this chapter, we sketch briefly here one among the range of
ways that StringNet knowledge can be represented to learners. Specifically, we describe
how it can support and extend the browser-based approach laid out in Wible et al.
(2006b) and Wible (2008).
Wible et al. (2006b) and Wible (2008) describe and motivate an approach to help-
ing learners learn collocations through a browser-based agent that appears to users
as an icon on their Web browser’s toolbar. Clicking that icon triggers the agent to de-
tect, in real time, any collocations appearing in the text of the webpage the user is cur-
rently browsing. Detected collocations then appear in a dropdown menu, from which
the user can select specific collocations to highlight within the current webpage or to
show multiple example sentences from the BNC containing that collocation. This
browser-based approach is built on a pair of assumptions: that, first, input is the key to
acquiring collocations and, second, there is little within that input which differentiates
collocations as such from free combinations. And here we add that it is not only col-
locations that fly under the radar in this way but the whole range of multiword expres-
sions. This argument is laid out in detail in Wible (2008). An agent that can detect
14. For example, it has enabled us to rely on a single algorithm to automatically detect and cor-
rect a diverse array of learner errors. To give a sampling of the range of coverage for this one
algorithm, it detects the error in my point of view and corrects it to from my point of view, detects
the error enjoy to read, correcting it to enjoy reading, and detects pay attention on as an error and
suggests correcting it as either pay attention to or focus attention on (Tsao & Wible 2009).
these multiword expressions in real time in the texts that users freely browse on the
Web thus provides a crucial and heretofore unavailable support to learners in facing
the challenge of learning multiword expressions from input.
Until now we have implemented this browser-based approach in a tool called Col-
locator (http://toolbar.stringnet.org). As the name suggests, however, Collocator is
limited to detecting collocations or two-word expressions. StringNet opens the possi-
bility of detecting the much wider range of multiword expressions under the same
conditions that Collocator detects and shows collocations during web browsing.15
Whereas Collocator can detect make...mistake, StringNet makes it possible to detect
make the mistake of including or It would be a mistake to include.... A further enhance-
ment StringNet makes possible is the capacity for the tool not simply to detect the
string of words that constitutes the multiword expression in the webpage (make the
mistake of including) but to show the pattern that it betokens (make the mistake of Vvg)
and to illustrate with examples the range that this pattern encompasses (make the mis-
take of including/assuming/withholding...). This is due to the fact that StringNet consists
of hybrid n-grams and can match any of these hybrid n-grams to all tokens that instan-
tiate it. As a result, a browser-based tool that has StringNet as its knowledge source can
not only identify for a reader a multiword token as an instance of a type or pattern, but
can show that reader the particular type or pattern it instantiates and provide other
tokens of this same type, illustrating the coverage of the pattern and familiarizing the
reader with the members of the family of tokens that instantiate that pattern.
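The matching of a hybrid n-gram to the tokens that instantiate it can be sketched as follows. This is a simplified illustration, not the tool's implementation: it assumes the text has already been POS-tagged, it matches word slots on exact lowercase form (so it would miss inflected variants such as made the mistake of), and all tag labels other than Vvg are hypothetical.

```python
def matches(pattern, tagged_tokens):
    """Return the start indices at which the hybrid n-gram `pattern` matches.

    pattern: list of slots, each a lowercase word form or a POS label
             in square brackets such as "[Vvg]".
    tagged_tokens: list of (word, pos) pairs, assumed to be tagged already.
    """
    hits = []
    n = len(pattern)
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if all(
            (slot == f"[{pos}]") if slot.startswith("[") else (slot == word.lower())
            for slot, (word, pos) in zip(pattern, window)
        ):
            hits.append(start)
    return hits


pattern = ["make", "the", "mistake", "of", "[Vvg]"]
sentence = [
    ("Do", "Vvb"), ("not", "Xx"), ("make", "Vvb"), ("the", "At"),
    ("mistake", "Nn"), ("of", "Prf"), ("assuming", "Vvg"), ("that", "Cjt"),
]
print(matches(pattern, sentence))
```

Because the final slot is a POS label rather than a word, the same pattern matches make the mistake of including, assuming, withholding and so on, which is what allows a tool to present the pattern together with the family of tokens that instantiate it.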
The central contribution of this browser-based approach is that it opens up a rad-
ically different path of access to lexical resources. When it comes to modes of access,
corpora closely resemble traditional dictionaries in a key respect. They both require
the user to initiate a query (or look up a word), and this requires that the user has de-
cided what target word or target expression to search for. This posture that dictionaries
and corpora require of users is of little value in helping with gaps in the learners’
knowledge which they are unaware of in the first place, ones they would not therefore
deliberately seek to address with a query or search. On our browser-based approach,
in contrast, the agent on the toolbar actively discovers for the user those forms within
the texts they are reading that are worth attending to. Since access to StringNet can be
provided by such an agent piggybacking on the browser’s toolbar, it can show users
multiword expressions which they may never have thought to look up on their own
initiative and which are found in the texts they have freely chosen to read. Thus rather
than being overwhelmed or confused by needing to navigate such a huge lexical
knowledgebase, learners are shown only those exact patterns and examples the agent
brings to their attention, and these are only the ones relevant to the text they are cur-
rently reading. While this is by no means the sole application that StringNet is intended
15. Currently StringNet includes expressions (or hybrid n-grams) ranging from two to eight
grams in length.
to support, it is an illustrative one which attends to the fundamental issue of lexical
knowledge representation in a learner-centered way.
6. An emergent langue
Corpus and usage-based linguists have typically eschewed as far as possible any traffic
with abstractions such as langue. Perhaps this has been because attempts to approxi-
mate such abstractions have traditionally entailed theorizing beyond the warrants of
data, valuing elegance over coverage and too casual an acceptance of Sapir’s aphorism
that “all grammars leak.” While shunning abstractions may have its reasons, language
teachers still need something to sate classes of thirsty learners. Perhaps the problem is
not with abstraction per se but with how we arrive at it, whether by theorizing or rather
by distilling. Stereotypically it is approached by brilliant theorizing that sees leakage as
a small price to pay for beauty and simplicity. But abstraction can be arrived at instead
by a different sort of simplicity. By the simple but relentless work of a spider, making
every single simple connection afforded by the raw data at hand and then in turn what-
ever new connections become possible by exploiting the structure from these first ones,
then more from those, and so on. Such a simple process might require not brilliant
theorizing but loyalty to data in the extreme. Certainly, the sort of abstraction that
emerges will be less wieldy, more unruly, more organic than the sort of langue typically
envisioned as a grammar. But, perhaps for this very reason, it just may hold water.
References
Biber, D., Conrad, S. & Cortes, V. 2004. If you look at...: Lexical bundles in university teaching
and textbooks. Applied Linguistics 25(3): 371–405.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999. The Longman Grammar of
Spoken and Written English. London: Longman.
Biber, D. & Conrad, S. 1999. Lexical bundles in conversation and academic prose. In Out of
Corpora: Studies in Honor of Stig Johansson, H. Hasselgard & S. Oksefjell (eds), 181–189.
Amsterdam: Rodopi.
Biber, D., Conrad, S. & Cortes, V. 2003. Lexical bundles in speech and writing: An initial tax-
onomy. In Corpus Linguistics by the Lune, A. Wilson, P. Rayson & T. McEnery (eds), 71–93.
Frankfurt: Peter Lang.
Burnard, L. (ed.) 2007. Reference Guide for the British National Corpus (XML Edition). Pub-
lished for the British National Corpus Consortium by the Research Technologies Service at
Oxford University Computing Services, February.
Bolinger, D. 1977. Idioms have relations. Forum Linguisticum 2: 157–169.
Bolinger, D. 1985. Defining the indefinable. In Dictionaries, Lexicography and Language Learn-
ing, R. Ilson (ed.), 69–73. Oxford: Pergamon Press.
Cheng, W., Greaves, C., Sinclair, J.M. & Warren, M. 2009. Uncovering the extent of the
phraseological tendency: Towards a systematic analysis of concgrams. Applied Linguistics 30(2):
236–252.
Cheng, W., Greaves, C. & Warren, M. 2006. From n-gram to skipgram to concgram. Interna-
tional Journal of Corpus Linguistics 11(4): 411–433.
Church, K.W. & Hanks, P. 1990. Word association norms, mutual information, and lexicography.
Computational Linguistics 16(1): 22–29.
Church, K.W., Gale, W., Hanks, P. & Hindle, D. 1991. Using Statistics in Lexical Analysis. In
Lexical Acquisition: Using On-line Resources to Build a Lexicon, U. Zernik (ed.), 115–164.
Hillsdale NJ: Lawrence Erlbaum Associates.
Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Computa-
tional Linguistics 19(1): 61–74.
Emerson, R.W. 1982. Nature. In Nature and Selected Essays. New York NY: Penguin.
Evert, S. & Krenn, B. 2001. Methods for the qualitative evaluation of lexical association
measures. In Proceedings of the 39th Annual Meeting of the Association for Computational
Linguistics, B. L. Webber (ed.), 188–195. Toulouse, France.
Hoey, M. 2004. A theory for TaLC? The textual priming of lexis. In Corpora and Language Learn-
ers [Studies in Corpus Linguistics 17], G. Aston, S. Bernardini & D. Stewart (eds), 21–44.
Johns, T. 1994. From printout to handout: Grammar and vocabulary teaching in the context of
Data-driven Learning. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 293–313.
Cambridge: CUP.
Kilgarriff, A., Rychly, P., Smrz, P. & Tugwell, D. 2004. The sketch engine. In Proceedings of the
Eleventh EURALEX International Congress, G. Williams & S. Vessier (eds), 105–116.
Lorient: UBS.
Manning, C. & Schütze, H. 1999. Foundations of Statistical Natural Language Processing.
Cambridge MA: The MIT Press.
Mishan, F. 2004. Authenticating corpora for language learning: A problem and its resolution.
ELT Journal 58(3): 219–227.
Read, J. & Nation, P. 2004. Measurement of formulaic sequences. In Formulaic Sequences: Acqui-
sition, Processing, and Use [Language Learning & Language Teaching 9], N. Schmitt (ed.),
23–35. Amsterdam: John Benjamins.
Simpson-Vlach, R. & Ellis, N. 2010. An academic formulas list: New methods in phraseology
research. Applied Linguistics 31(4): 487–512.
Tsao, N.-L. & Wible, D. 2009. A method for unsupervised error detection and correction. Paper
presented at the North American Association of Computational Linguistics (NAACL)
Conference, Workshop on Innovative Uses of NLP for Building Educational Applications.
Boulder, Colorado, June, 2009.
Wible, D. 2008. Multiword expressions and the digital turn. In Phraseology in Foreign Lan-
guage Learning and Teaching, F. Meunier & S. Granger (eds), 163–181. Amsterdam: John
Benjamins.
Wible, D. & Tsao, N.-L. 2010. StringNet as a computational resource for discovering and repre-
senting linguistic constructions. Paper presented at the North American Association of
Computational Linguistics (NAACL) Conference, Workshop on Extracting and Using
Constructions in Computational Linguistics. June, 2010.
Wible, D., Kuo, C.-H., Chen, M.C., Tsao, N.-L. & Hung, T.F. 2006a. A computational approach
to the discovery and representation of lexical chunks. Paper presented at TALN 2006,
Leuven, Belgium, April, 2006.
Wible, D., Kuo, C.-H., Chen, M.C., Tsao, N.-L. & Hung, T.F. 2006b. A ubiquitous agent for un-
restricted vocabulary learning in noisy digital environments. Lecture Notes in Computer
Science 4053: 503–512.
Widdowson, H.G. 2000. On the limitations of linguistics applied. Applied Linguistics 21(1): 3–25.
Automating the creation of dictionaries
Where will it all end?
Michael Rundell and Adam Kilgarriff
The relationship between dictionaries and computers goes back around 50 years.
But for most of that period, technology’s main contributions were to facilitate
the capture and manipulation of dictionary text, and to provide lexicographers
with greatly improved linguistic evidence. Working with computers and
corpora had become routine by the mid-1990s, but there was no real sense
of lexicography being automated. In this article we review developments in
the period since 1997, showing how some of the key lexicographic tasks are
beginning to be transferred, to a significant degree, from humans to machines.
A recurrent theme is that automation not only saves effort but often leads to a
more reliable and systematic description of a language. We close by speculating
on how this process will develop in years to come.
1. Introduction
This paper describes the process by which – over a period of 50 years or so – several
important aspects of dictionary creation have been gradually transferred from human
editors to computers. We begin by looking at the early impact of computer technology,
up to and including the groundbreaking COBUILD project of the 1980s. The period
that immediately followed saw major advances in the areas of corpus building and
corpus software development, and the first dedicated dictionary writing systems began
to appear. These changes – important though they were – did not significantly advance
the process of automation. Our main focus is on the period from the late 1990s to the
present. We show how a number of lexicographic tasks, ranging from corpus creation
to example writing, have been automated to varying degrees. We then look at several
areas where further automation is achievable and indeed already being planned. Fi-
nally, we speculate on how much further this process might have to run, and on the
implications for dictionaries, dictionary-users, and dictionary-makers.
2. Computers meet lexicography: From the 1960s to the 1990s
The great dictionaries of the 18th and 19th centuries were created using basic tech-
nologies: pen, paper, and index cards for the lexicography, hot metal for the typeset-
ting and printing. In the English-speaking world, the principle that a dictionary should
be founded on objective language data was established by Samuel Johnson, and ap-
plied on a much larger scale by James Murray and his collaborators on the Oxford
English Dictionary (OED; Murray et al. 1928). The task of collecting source material
– citations extracted from texts – was immensely laborious. Johnson employed half a
dozen assistants to transcribe illustrative sentences which he had identified in the
course of his extensive reading, while the OED’s ‘corpus’ – running into several million
handwritten ‘slips’ – was collected over several decades by an army of volunteer read-
ers. And this was only the first stage in the dictionary-making process. In all of its
components, the job of compiling a dictionary was extraordinarily labour-intensive.
Johnson’s references to ‘drudgery’ are well-known, but Murray’s letters testify even
more eloquently to the stress, exasperation, exhaustion and despair which haunted his
life as the OED was painstakingly assembled (Murray 1979, esp. Ch. XI).
It was Laurence Urdang – as Editor of the Random House Dictionary of the English
Language (Stein & Urdang 1966) – who first saw the potential of computers to facilitate
and rationalize the capture, storage and manipulation of dictionary text.1 From this
point, the idea of the dictionary as a database, in which each of the components of an
entry has its own distinct field, became firmly established. An early benefit of this ap-
proach was that cross-references could be checked more systematically: the computer
generated an error report of any cross-references that did not match up, and errors
would then be dealt with manually. An extremely dull task was thus transferred from
humans to computers, but with the added benefit that the computers made a much
better job of it. And when learner’s dictionaries began to control the language of defi-
nitions by using a limited defining vocabulary (DV), similar methods could be used to
ensure that proscribed words were kept out. In a further development, the first edition
of the Longman Dictionary of Contemporary English (LDOCE1; Procter 1978) includ-
ed some categories of data (notably a complex system of semantic coding) which were
never intended to appear in the dictionary itself. In projects like these, the initial text-
compilation process remained largely unchanged, but subsequent editing was typi-
cally done on pages created by line printers, with the revisions keyed into the database
by technicians.
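The two automated checks described here, cross-reference validation and defining-vocabulary control, can be sketched over a toy database in which each entry is a record with distinct fields. Everything in the sketch is an assumption made for illustration: the entries, the field names `definition` and `see_also`, the tiny defining vocabulary, and the function names do not come from any actual dictionary project.

```python
import re

# Toy dictionary database: each entry is a record with its own fields.
# All data below is invented for illustration.
entries = {
    "drudge": {"definition": "a person who does hard dull work", "see_also": ["drudgery"]},
    "drudgery": {"definition": "hard dull work", "see_also": ["toil"]},
}
defining_vocabulary = {"a", "person", "who", "does", "hard", "work"}


def dangling_cross_references(entries):
    """List (headword, target) pairs whose target is not itself a headword."""
    return [
        (head, target)
        for head, record in entries.items()
        for target in record["see_also"]
        if target not in entries
    ]


def words_outside_dv(entries, dv):
    """List (headword, word) pairs where a definition uses a word not in the DV."""
    report = []
    for head, record in entries.items():
        for word in re.findall(r"[a-z]+", record["definition"].lower()):
            if word not in dv:
                report.append((head, word))
    return report


print(dangling_cross_references(entries))
print(words_outside_dv(entries, defining_vocabulary))
```

In both cases the program only produces an error report; as in the workflow described above, resolving each reported problem remains a manual editorial task.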
1. We are aware that our detailed knowledge relates mainly to developments in English lan-
guage lexicography. We apologise in advance for our Anglocentrism and any exaggerated claims
it has led to.
2.1 Year Zero: The COBUILD project
Some time around 1981 marks Year Zero for modern lexicography. The COBUILD
project brought many innovations in lexicographic practices and editorial styles
(as described in Sinclair 1987), but our focus here is on the impact of technology, and
its potential to take on some of the tasks traditionally performed by humans. Comput-
ers were central to the COBUILD approach from the start. Like the visible tip of an
iceberg, the eventual dictionary would be derived from a more extensive database, and
lexicographers created their entries using an array of coloured slips to record informa-
tion of different types (Krishnamurthy 1987). Every linguistic fact the lexicographers
identified would be supported by empirical evidence in the form of corpus extracts.
For the first time, a large-scale description of English was created from scratch to re-
flect actual usage as illustrated in (what was then) a large and varied corpus of texts.
The systematic application of this corpus-based methodology represents a paradigm
shift in lexicography. What was revolutionary in 1981 is now, a generation later, the
norm for any serious lexicographic enterprise. But from the point of view of the hu-
man-machine balance, COBUILD’s advances were relatively modest. Corpus creation
was still a laborious business. As the use of scanners supplemented keyboarding, data
capture was somewhat less arduous than the methods available to Henry Kučera two
decades earlier, when he used punched cards to turn a million words of text into the
Brown Corpus (Kučera & Francis 1967). But like their predecessors at Brown, the
COBUILD developers were testing available technology to its limits, and building the
corpus on which the dictionary would be based involved heroic efforts (Renouf 1987).
As for the lexicographic team, few ever got their hands on a computer. Concordances
were available in the form of microfiche printouts, and the fruits of their analysis were
written in longhand – the slips then being handed over to a separate team of computer
officers responsible for data-entry.
2.2 The 80s and 90s
The fifteen years or so that followed saw quite rapid technical advances. Computers
moved from being large and expensive machines available only to specialists, to
becoming everyday objects to be found on most desks in the developed world. This has
brought vast changes to many aspects of our lives. During this period, corpora became
larger by an order of magnitude, and improved corpus-query systems (CQS) enabled
lexicographers to search the data more efficiently. The constituent texts of a corpus
were now routinely annotated in various ways. Forms of annotation included tokeni-
zation, lemmatization, and part-of-speech tagging (see Grefenstette 1998: 28–34 and
Atkins & Rundell 2008: 84–92 for summaries), and this allowed more sophisticated,
better-targeted searches. From the beginning of the 1990s, it became normal for lexi-
cographers to work on their own computers rather than depending on technical staff
for data-entry, and the first generation of dedicated dictionary-writing systems (DWS)
was created.
By the late 1990s, the use of computers in data analysis and dictionary compila-
tion was standard practice (at least for English). But to what extent was lexicography
‘automated’ at this point? Corpus creation remained a resource-intensive business.
Corpus analysis was easier and faster, but lexicographers found themselves handling
far more data. From the point of view of producing more reliable dictionary entries,
access to higher volumes of data was a good thing. But scanning several thousand
concordance lines for a word of medium frequency (within the time constraints of a
typical dictionary project) is a demanding task – in a sense, a new form of drudgery
for the lexicographer.
On the entry-writing front, the new DWS made life somewhat easier. When we
use this kind of software, the overall shape of an entry is controlled by a ‘dictionary
grammar’. This in turn implements the decisions made in the dictionary’s style guide
about how the many varieties of lexical facts are to be classified and presented. Data
fields such as style labels, syntax codes, and part-of-speech markers have a closed set
of possible contents which can be presented to the compiler in drop-down lists. Lexi-
cographers no longer have to remember whether a particular feature should appear in
bold or italics, whether a colloquial usage is labelled ‘inf ’, ‘infml’ or ‘informal’, and so
on. In areas like these, human error is to a large extent engineered out of the writing
process. A good DWS also facilitates the job of editing. For example, an editor will
often want to restructure long entries, changing the ordering or nesting of senses and
other units. This is a hard intellectual task, but the DWS can at least make it a techni-
cally easy one.
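The idea of fields with a closed set of possible contents can be illustrated with a short sketch. The field names, the permitted values and the function `validate` are hypothetical; a real dictionary grammar would be defined by the project's style guide and enforced by the DWS itself, typically by offering only the permitted values in a drop-down list.

```python
# HYPOTHETICAL closed sets that a style guide might fix for two fields.
ALLOWED = {
    "style_label": {"informal", "formal", "literary"},
    "part_of_speech": {"noun", "verb", "adjective", "adverb"},
}


def validate(entry):
    """Return the (field, value) pairs in `entry` that fall outside the closed sets."""
    return [
        (field, entry[field])
        for field, allowed in ALLOWED.items()
        if field in entry and entry[field] not in allowed
    ]


draft = {"headword": "guy", "part_of_speech": "noun", "style_label": "infml"}
print(validate(draft))
```

Here the variant label infml is rejected because only informal belongs to the closed set, which is the kind of inconsistency that such a system removes from the writing process.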
Meanwhile, some essential but routine checks – cross-reference validation, defin-
ing vocabulary compliance, and so on – are now fully automated, taking place at the
point of compilation with little or no human intervention.
With more linguistic data at their disposal and better software to exploit it, and
with compilation programs which strangle some classes of error at birth, support the
editing process, and quietly handle a range of routine checks, lexicographers now had
the tools to produce better dictionaries: dictionaries which gave a more accurate ac-
count of how words are used, and presented it with a degree of consistency which was
hard to achieve in the pre-computer age.
Whether this makes life easier for lexicographers is another question. Delegating
low-level operations to computers is clearly a benefit for all concerned. The comput-
ers do the things they are good at (and do them more efficiently than humans), while
the lexicographers are relieved of the more tedious, undemanding tasks and thus free
to focus on the harder, more creative aspects of dictionary-writing. But the effect of
these advances is limited. The core tasks of producing a dictionary still depend almost
entirely on human effort, and there is no sense, at this point, of lexicography being
automated.
3. From 1997 to the present
What we describe above represents the state of the art in the late 1990s. For present
purposes, we will take as our baseline the year 1997, which is when planning began for
a new, from-scratch learner’s dictionary.
If the big change to the context of working life in the 80s and 90s was that most of
us (in lexicography and everywhere else) got a computer, the big change in the current
period is that the computer got connected to the Internet.
When work started on the Macmillan English Dictionary for Advanced Learners
(MEDAL; Rundell, ed., 2001), we had the advantage of entering the field at a point
when the corpus-based methodology was well-established, and the developments de-
scribed above were in place. But we faced the challenge of entering a mature market in
which several high-quality dictionaries were already competing for the attention of lan-
guage learners and their teachers. It was clear that any new contender could only make
a mark by doing the basic things well, and by doing new things which had not been
attempted before but which would meet known user needs. It was equally clear that
computational methods would play a key part in delivering the desired innovations.
The rest of this paper reviews developments in the period from 1997 to the pres-
ent, and discusses further advances that are still at the planning stage. The work we
describe represents a collaboration between a lexicographer and a computational lin-
guist (the authors), and shows how the job of dictionary-makers has been supported
by, and in some cases replaced by, computational techniques which originate from
research in the field of natural language processing (NLP). We will conclude with some
speculations on the direction of this trajectory: is the end point a fully-automated
dictionary? Does it even make sense to think in terms of an ‘end-point’?
First, it will be helpful to give a brief inventory of the main tasks involved in creat-
ing a dictionary, so that we can assess how far we have progressed along the road to
automation. They are:
– corpus creation
– headword list development
– analysis of the corpus:
    – to discover word senses and other lexical units (fixed phrases, phrasal verbs,
      compounds, etc.)
    – to identify the salient features of each of these lexical units
        1. their syntactic behaviour
        2. the collocations they participate in
        3. their colligational preferences
        4. any preferences they have for particular text-types or domains
– providing definitions (or translations) at relevant points
– exemplifying relevant features with material gleaned from the corpus
– editing compiled text in order to control quality and ensure consistent adherence
  to agreed style policies
We look at all of these, some in more detail than others.
3.1 Corpus creation
For people in the dictionary business, one of the most striking developments of the
21st century is the ‘web corpus’. Corpora are now routinely assembled using texts from
the Internet and this has had a number of consequences. First, the curse of data-
sparseness, which has dogged lexicography from Johnson’s time onwards, has become
a thing of the past.2 The COBUILD corpus of the 1980s – an order of magnitude larger
than Brown – sought to provide enough data for a reliable account of mainstream
English, but its creators were only too aware of its limitations.3 The British National
Corpus (BNC) – larger by another order of magnitude – was another attempt to ad-
dress the issue.
As new technologies have arisen to facilitate corpus creation from the web, it has
become possible to create register-diverse corpora running into billions of words. Soft-
ware tools such as WebBootCat (Baroni & Bernardini 2004; Baroni et al. 2006) provide
a one-stop operation in which texts are selected according to user-defined parameters,
‘cleaned up’, and linguistically annotated. The timescale for creating a large lexico-
graphic corpus has been reduced from years to weeks, and for a small corpus in a
specialized domain, from months to minutes. Texts on the web are, by definition, al-
ready in digital form. The overall effect is to drastically reduce both the human effort
involved in corpus creation and the ‘entry fee’ to corpus lexicography.4 Thus the pro-
cess of collecting the raw data that will form the basis of a dictionary has to a large
extent been automated.
Inevitably there are downsides. The granularity of smaller corpora (in terms of the
balance of texts, the level of detail in document headers, and the delicacy of annota-
tion) cannot be fully replicated in corpora of several billion words. While for some
types of user (e.g. grammarians or sociolinguists) this will sometimes limit the useful-
ness of the corpus, for lexicographers working on general-purpose dictionaries, the
benefits of abundant data outweigh most of the perceived disadvantages of web cor-
pora. There were good reasons why the million-word Brown Corpus of 1962 was de-
signed with such great care: a couple of ‘rogue’ texts could have had a disruptive effect
on the statistics. In a billion-word corpus the occasional outlier will not compromise
the overall picture. We now simply aim to ensure that the major text-types are all well
represented.
2. We should perhaps add this rider: “at least for the most widely-used languages, for which
many billions of words of text are now available”.
3. “Every time COBUILD doubles its corpus, we want to double it again” (Clear 1996: 266).
4. Hence, for example, there are now substantial corpora for ‘smaller’ languages such as Irish or
the Bantu languages of southern Africa: Kilgarriff et al. (2006); de Schryver & Prinsloo (2000).
Automating the creation of dictionaries 
Concerns about the diversity of text-types available on the web have proved large-
ly unfounded. Comparisons of web-derived corpora against benchmark collections
like the BNC have produced encouraging results, suggesting that a well-designed web
corpus can provide reliable language data (Sharoff 2006; Baroni et al. 2009).5
3.2 Headword lists
Building a headword list is the most obvious way to use a corpus for making a diction-
ary. Ceteris paribus, if a dictionary is to have N words in it, they should be the N words
from the top of the corpus frequency list.
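As a minimal sketch of this starting point, the snippet below counts word forms in a toy token list and keeps the N most frequent. The function name top_n_headwords and the toy sentence are illustrative assumptions; a real list would be built from lemmas in a large corpus and then filtered, for the reasons the following subsections give.

```python
from collections import Counter


def top_n_headwords(tokens, n):
    """Return the n most frequent alphabetic word forms, most frequent first."""
    counts = Counter(token.lower() for token in tokens if token.isalpha())
    return [word for word, _ in counts.most_common(n)]


# Toy 'corpus' (invented); a real one would run to millions of tokens.
toy_corpus = "the cat sat on the mat and the dog sat too".split()
print(top_n_headwords(toy_corpus, 2))
```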
3.2.1 In search of the ideal corpus

It is never as simple as this, mainly because the corpus is never good enough. It will con-
tain noise and biases. The noise is always evident within the first few thousand words of
all the corpus frequency lists that either of us has ever looked at. In the BNC, for exam-
ple, a large amount of data from a journal on gastro-uterine diseases presents noise in
the form of words like mucosa – a term much-discussed in these specific documents, but
otherwise rare and not known to most speakers of English.6 Bias in the spoken BNC is
illustrated by the very high frequencies for words like plan, elect, councillor, statutory and
occupational: the corpus contains a great deal of material from local government meet-
ings, so the vocabulary of this area is well represented. Thus keyword lists of the BNC in
contrast to other large, general corpora show these words as particularly BNC-flavoured.
And unlike many of today’s large corpora, the BNC contains, by design, a high propor-
tion of fiction. Finally, if our dictionary is to cover the varieties of English used through-
out the world, the BNC’s exclusive focus on British English is another limitation.
If we turn to UKWaC (the UK ‘Web as Corpus’; Baroni et al. 2009), a web-sourced
corpus of around 1.6 billion words, we find other forms of noise and bias. The corpus
contains a certain amount of web spam. In particular, we have discovered that people
advertising poker are skilled at producing vast quantities of ‘word salad’ which is not
easily distinguished – using automatic routines – from bona fide English. Internet-re-
lated bias also shows up in the high frequencies for words like browser and configure.
While noise is simply wrong, and its impact is progressively reduced as ongoing clean-
ups are implemented, biases are more subtle in that they force questions about the sort
of language to be covered in the dictionary, and in what proportions.7
5. See for example Keller & Lapata (2003); Fletcher (2004). For general background to web cor-
pora, see Kilgarriff & Grefenstette (2003); Atkins & Rundell (2008: 78–80); Baroni et al. (2009).
6. In the BNC mucosa is marginally more frequent than spontaneous and enjoyment, though
of course it appears in far fewer corpus documents.
7. As is now generally recognised, the notion of ‘representativeness’ is problematical with re-
gard to general-purpose corpora like BNC and UKWaC, and there is no ‘scientific’ way of
achieving it: see e.g. Atkins & Rundell (2008: 66).
3.2.2 Multiwords
English dictionaries have a range of entries for multiword items, typically including
noun compounds (credit crunch, disc jockey), phrasal and prepositional verbs (take
after, set out) and compound prepositions and conjunctions (according to, in order to).
While corpus methods can straightforwardly find high-frequency single-word items
and thereby provide a fair-quality first pass at a headword list for those items, they can-
not do the same for multiword items. Lists of high-frequency word-pairs in any Eng-
lish corpus are dominated by items which do not merit dictionary entries: the string of
the is usually top of any list of bigrams. We have several strategies here: one is to view
multiword headwords as collocations (see discussion below) and to find multiword
headwords when working through the alphabet looking at each headword in turn.
Another, currently underway in the Kelly project (Kilgarriff 2010), is to explore lists of
translations of single-word headwords for a number of other languages into English,
and to find out what multiwords occur repeatedly.
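The problem with raw word-pair frequency can be illustrated with a few lines of code. This is only a sketch on an invented sentence; the function name top_bigrams is an assumption, and no filtering or association statistic is applied.

```python
from collections import Counter


def top_bigrams(tokens, n):
    """Return the n most frequent adjacent word pairs with their counts."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# Invented example: even here the most frequent pair is a grammatical string,
# not a candidate dictionary entry.
tokens = "the end of the road is the start of the trail".split()
print(top_bigrams(tokens, 1))
```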
3.2.3 Lemmatization
The words we find in texts are inflected forms; the words we put in a headword list are
lemmas. So, to use a corpus list as a dictionary headword list, we need to map inflected
forms to lemmas: we need to lemmatize.
English is not a difficult language to lemmatize as no lemma has more than eight
inflectional variants (be, am, is, are, was, were, been, being), most nouns have just two
(apple, apples) and most verbs, just four (invade, invades, invading, invaded). Most
other languages, of course, present a substantially greater challenge. Yet even for Eng-
lish, automatic lemmatization procedures are not without their problems. Consider
the data in Table 1. To choose the correct rule we need an analysis of the orthography
corresponding to phonological constraints on vowel type and consonant type, for both
British and American English.8
Even with state-of-the-art lemmatization for English, an automatically extracted
lemma list will contain some errors.
These and other issues in relating corpus lists to dictionary headword lists are
described in detail in Kilgarriff (1997).
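The difficulty summarized in Table 1 can be sketched in code. The snippet below applies the three -ing rules from the table blindly and then filters the results against a list of known lemmas. The function names and the tiny lemma list are assumptions made for illustration; this is not a description of any particular lemmatizer.

```python
VOWELS = set("aeiou")


def ing_candidates(form):
    """List the lemmas that the three -ing rules of Table 1 could produce."""
    if not form.endswith("ing"):
        return [form]
    stem = form[:-3]
    candidates = [stem, stem + "e"]      # 'delete -ing'; 'delete -ing, add -e'
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        candidates.append(stem[:-1])     # 'delete -ing, undouble consonant'
    return candidates


def filter_by_lexicon(form, known_lemmas):
    """Keep only the candidates attested in a (hypothetical) lemma list."""
    return [c for c in ing_candidates(form) if c in known_lemmas]


known = {"car", "care", "hop", "hope", "fuse", "fuss"}
for form in ("caring", "hoping", "hopping", "fusing", "fussing"):
    print(form, filter_by_lexicon(form, known))
```

With this toy lexicon, hopping, fusing and fussing each resolve to a single lemma, but caring and hoping both keep two candidates. A lexicon lookup alone cannot choose between them, which is why the orthographic analysis mentioned above is needed.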
3.2.4 Practical solutions

Building a headword list for a new dictionary (or revising one for an existing title) has
never been an exact science, and little has been written about it. Headword lists are by
their nature provisional: they evolve during a project and are only complete at the end.
A good starting point is to have a clear idea of what your dictionary will be used for,
and this is where the ‘user profile’ comes in. A user-profile “seeks to characterize the
typical user of the dictionary, and the uses to which the dictionary is likely to be put”
8. The issue came to our attention when an early version of the BNC frequency list gave undue
prominence to verbal car.
Table 1. Complexity in verb lemmatization rules for English
lemma      -ed, -s forms      Rule                                -ing form   Rule
fix        fixed, fixes       delete -ed, -es                     fixing      delete -ing
care       cared, cares       delete -d, -s                       caring      delete -ing, add -e
hope       hoped, hopes       delete -d, -s                       hoping      delete -ing, add -e
hop        hopped             delete -ed, undouble consonant      hopping     delete -ing, undouble consonant
           hops               delete -s
fuse       fused              delete -d                           fusing      delete -ing, add -e
fuss       fussed             delete -ed                          fussing     delete -ing
bus (AmE)  bussed, busses??   delete -ed/-s, undouble consonant   bussing     delete -ing, undouble consonant
bus (BrE)  bused, bused       delete -ed                          busing      delete -ing
(Atkins & Rundell 2008: 28). This is a manual task, but it provides filters with which to
sift computer-generated wordlists.
An approach which has been used with some success is to generate a wordlist
which is (say) 20% larger than the list you want to end up with – thus, a list of 60,000
words for a dictionary of 50,000 – and then whittle it down to size taking account of
the user profile. For example, if the longer list contains obsolescent terms which are used in
19th century literature, but the user profile specifies that users are all engaged with the
contemporary language, these items could safely be deleted. If the user profile included
literary scholarship, they could not.
3.2.5 New words

As everyone involved in commercial lexicography knows, neologisms punch far above
their weight. They might not be very important for an objective description of the
language but they are loved by marketing teams and reviewers. New words and phras-
es often mark the only obvious change in a new edition of a dictionary, and dominate
the press releases.
Mapping language change has long been a central concern of corpus linguists, and
a longstanding vision is the ‘monitor corpus’, the moving corpus that lets the research-
er explore language change objectively (Clear 1988; Janicivic & Walker 1997). The core
method is to compare an older ‘reference’ corpus with an up-to-the-minute one to find
words which are not already in the dictionary, and which are in the recent corpus but
not in the older one. O’Donovan & O’Neill (2008) describe how this has been done at
Chambers Harrap Publishers, and Fairon et al. (2008) describe a generic system in
which users can specify the sources they wish to use and the terms they wish to trace.
The nature of the task is that the automatic process creates a list of candidates, and
a lexicographer then goes through them to sort the wheat from the chaff. There is al-
ways far more chaff than wheat. The computational challenge is to cut out as much
chaff as possible without losing the wheat – that is, the new words which the lexicog-
raphy team have not yet logged but which should be included in the dictionary.
For many aspects of corpus processing, we can use statistics to distinguish signal
from noise, on the basis that the phenomena we are interested in are common ones
and occur repeatedly. But new words are usually rare, and by definition are not already
known. Thus lemmatization is particularly challenging since the lemmatizer cannot
make use of a list of known words. So for example, in one list we found the ‘word’ au-
thore, an incorrect but understandable lemmatization of authored, past participle of
the unknown verb author.
For new-word finding we will want to include items in a candidate list even though
they occur just once or twice. Statistical filtering can therefore only be used minimally.
We are exploring methods which require that a word that occurred once or twice in
the old material occurs in at least three or four documents in the new material, to
make its way onto the candidate list. We use some statistical modulation to capture
new words which are taking off in the new period, as well as the items that simply have
occurred where they never did before. Many items that occur in the new words list are
simply typing errors. This is another reason why it is desirable to set a threshold high-
er than one in the new corpus.
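The candidate-generation step described in this passage might be sketched as follows. The thresholds, parameter names and toy documents are all assumptions chosen for illustration; the text above gives only approximate figures ("three or four documents") and does not specify the statistical modulation actually used.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df


def new_word_candidates(reference_docs, new_docs, known_words,
                        min_docs_unseen=2, max_old_count=2, min_docs_if_seen=3):
    """Propose words absent from the dictionary that look new in new_docs.

    A word unseen in the reference corpus needs min_docs_unseen documents
    (a threshold above one screens out many one-off typing errors); a word
    seen at most max_old_count times before needs min_docs_if_seen documents.
    """
    old_counts = Counter(w for doc in reference_docs for w in doc)
    candidates = []
    for word, n_docs in document_frequency(new_docs).items():
        if word in known_words:
            continue
        old = old_counts[word]
        if old == 0 and n_docs >= min_docs_unseen:
            candidates.append(word)
        elif 0 < old <= max_old_count and n_docs >= min_docs_if_seen:
            candidates.append(word)
    return sorted(candidates)


# Invented toy data: each document is a list of tokens.
reference = [["the", "blog", "post"], ["a", "vlog", "post"]]
recent = [["the", "vlog", "tweet"], ["a", "tweet", "vlog"], ["vlog", "tweet", "teh"]]
print(new_word_candidates(reference, recent, {"the", "a", "blog", "post"}))
```

Here the typing error teh is dropped because it occurs in only one new document, while the list that remains still has to be checked by a lexicographer.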
We have found that almost all hyphenated words are chaff, and often relate to
compounds which are already treated in the dictionary as ‘solid’ or as multiword items.
English hyphenation rules are not fixed: most word pairs that we find hyphenated
(sand-box) can also be found written as one word (sandbox), as two (sand box), or as
both. With this in mind, to minimize chaff, we take all hyphenated forms and two- and
three-word items in the dictionary and ‘squeeze’ them so that the one-word version is
included in the list of already-known items, and we subsequently ignore all the hy-
phenated forms in the corpus list.
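A minimal sketch of the ‘squeeze’ step, under the assumption that dictionary items and corpus candidates are plain strings; the function names are hypothetical.

```python
import re


def squeeze(item):
    """Collapse a hyphenated or multiword item into its one-word spelling."""
    return re.sub(r"[-\s]+", "", item.lower())


def filter_candidates(candidates, dictionary_items):
    """Drop hyphenated candidates and any whose solid form is already known."""
    known = {item.lower() for item in dictionary_items}
    known |= {squeeze(item) for item in dictionary_items}
    return [c for c in candidates if "-" not in c and c.lower() not in known]


# 'sand box' is already in the (toy) dictionary, so neither 'sandbox' nor
# 'sand-box' reaches the lexicographer; 'podcast' does.
print(filter_candidates(["sandbox", "sand-box", "podcast"], ["sand box", "disc jockey"]))
```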
Prefixes and suffixes present a further set of items. Derivational affixes include
both the more syntactic (-ly, -ness) and the more semantic (-ish, geo-, eco-).9 Most are
chaff: we do not want plumply or ecobuddy or gangsterish in the dictionary, because,
even though they all have Google counts in the thousands, they are not lexicalized and
there is nothing to say about them beyond what there is to say about the lemma, the
affix and the affixation rule. The ratio of wheat to chaff is low, but amongst the nonce
productions there are some which are becoming established and should be considered
for the dictionary. So we prefer to leave the nonce formations in place for the lexicog-
rapher to run their eye over.
9. Here we exclude inflectional morphemes, addressed under lemmatization above: in English
a distinction between inflectional and derivational morphology is easily made.
For the longer term, the biggest challenge is acquiring corpora for the two time
periods which are sufficiently large and sufficiently well-matched. If the new corpus is
not big enough, the new words will simply be missed, while if the reference corpus is
not big enough, the lists will be full of false positives. If the corpora are not well-
matched but, for example, the new corpus contains a document on vulcanology and
the reference corpus does not, the list will contain words which are specialist vocabu-
lary rather than new, like resistivity and tephrochronology.
While vast quantities of data are available on the web, most of it does not come
with reliable information on when the document was originally written. While we can
say with confidence that a corpus collected from the web in 2009 represents, overall, a
more recent phase of the language than one collected in 2008, when we move to words
with small numbers of occurrences, we cannot trust that words from the 2009 corpus
are from more recently-written documents than ones from the 2008 corpus.
Two text types where date-of-writing is usually available are newspapers and
blogs. Both of these have the added advantage that they tend to be about current topics
and are relatively likely to use new vocabulary. Our current efforts for new-word-de-
tection involve large-scale gathering of one million words of newspapers and blogs per
day. The collection started in early 2009 and we need to wait at least one year or pos-
sibly two before we can assess what it achieves. Over a shorter time span lists will be
dominated by short-term items and items related to the time of year. It will take a lon-
ger view to support the automatic detection of new words which have become estab-
lished and have earned their place in the dictionary.
3.3 Collocation and word sketches
As in most areas of life, new ways of doing things typically evolve in response to known
difficulties. What has tended to happen in the dictionary-development sphere is that
we first identify a lexicographic problem, and then consider whether NLP techniques
have anything to offer in the way of solutions. And when computational solutions are
devised, we find – as often as not – that they have unforeseen consequences which go
beyond the specific problem they were designed to address.
When planning a new dictionary, it is good to pay attention to what other diction-
aries are doing, and to consider whether you can do the same things but do them bet-
ter. But this is not enough. It is also important to look at emerging trends at the theo-
retical level and at their practical implications for language description. Collocation is
a good example. The arrival of large corpora provided the empirical underpinning for
a Firthian view of vocabulary, and – thanks to the work of John Sinclair and others –
collocation became a core concept within the language-teaching community. Books
such as Lewis (1993) and McCarthy & O’Dell (2005) helped to show the relevance of
collocation at the classroom level, but in 1997 learner’s dictionaries had not yet caught
up: they showed an awareness of the concept, but their coverage of collocation was
patchy and unsystematic. This represented an opportunity for MEDAL.
The first author described the problem to the second, who felt it should be possible
to find all common collocations for all common words automatically, by using a shal-
low grammar to identify all verb-object pairs, subject-verb pairs, modifier-modifiee
pairs and so on, and then to apply statistical filtering to give a fairly clean list, as pro-
posed by Tapanainen & Järvinen (1998; and for the statistics, Church & Hanks 1990).
The project would need a very large, part-of-speech-tagged corpus of general English:
this had recently become available in the form of the British National Corpus. First
experiments looked encouraging: the publisher contracted the researcher to proceed
with the research, and the first versions of word sketches were created. A word sketch
is a one-page, corpus-based summary of a word’s grammatical and collocational be-
haviour, as illustrated in Figure 1.
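The general idea can be sketched as follows, assuming that a shallow grammar has already produced (relation, headword, collocate) triples. The scoring here is pointwise mutual information in the spirit of Church & Hanks (1990), computed within each grammatical relation; the function name, the min_count parameter and the toy triples are assumptions, and this is not presented as the scoring used in the actual word sketch software.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword, min_count=2):
    """Group a headword's collocates by grammatical relation, ranked by PMI."""
    pair_counts = Counter(triples)                      # (relation, head, collocate)
    head_counts = Counter((rel, head) for rel, head, _ in triples)
    coll_counts = Counter((rel, coll) for rel, _, coll in triples)
    rel_counts = Counter(rel for rel, _, _ in triples)

    sketch = defaultdict(list)
    for (rel, head, coll), joint in pair_counts.items():
        if head != headword or joint < min_count:
            continue                                    # crude statistical filter
        # PMI within one relation: log2( P(head, coll) / (P(head) * P(coll)) )
        pmi = math.log2(joint * rel_counts[rel]
                        / (head_counts[(rel, head)] * coll_counts[(rel, coll)]))
        sketch[rel].append((coll, joint, round(pmi, 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda item: item[2], reverse=True)
    return dict(sketch)


# Invented triples standing in for the output of a shallow parser.
triples = (
    [("object_of", "return", "await")] * 3
    + [("object_of", "return", "make")] * 2
    + [("object_of", "journey", "make")] * 4
    + [("object_of", "journey", "await")] * 1
    + [("modifier", "return", "safe")] * 2
    + [("modifier", "journey", "long")] * 2
)
for relation, collocates in word_sketch(triples, "return").items():
    print(relation, collocates)
```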
Figure 1. Part of a word sketch for return (noun)

As the lexicographers became familiar with the software, it became apparent that word
sketches did the job they were designed to do. Each headword’s collocations could be
listed exhaustively, to a far greater degree than was possible before. That was the im-
mediate goal. But analysis of a word’s sketch also tended to show, through its colloca-
tions, a wide range of the patterns of meaning and usage that it entered into. In most
cases, each of a word’s different meanings is associated with particular collocations, so
the collocates listed in the word sketches provided valuable prompts in the key task of
identifying and accounting for all the word’s meanings in the entry. The word sketches
functioned not only as a tool for finding collocations, but also as a useful guide to the
distinct senses of a word – the analytical core of the lexicographer’s job (Kilgarriff &
Rundell 2002).
Prior to the advent of word sketches, the primary means of analysis in corpus
lexicography was the reading of concordances. Since the earliest days of the COBUILD
project, the lexicographers scanned concordance lines – often in their thousands – to
find all the patterns of meaning and use. The more lines were scanned, the more pat-
terns would tend to be found (though with diminishing returns). This was good and
objective, but also difficult and time-consuming. Dictionary publishers are always
looking to save time, and hence budgets. Earlier efforts to offer computational support
were based on finding frequently co-occurring words in a window surrounding the
headword (Church & Hanks 1990). While these approaches had generated plenty of
interest among university researchers, they were not taken up as routine processes by
lexicographers: the ratio of noise to signal was high, the first impression of a colloca-
tion list was of a basket of earth with occasional glints of possible gems needing further
exploration, and it took too long to use them for every word.
But early in the MEDAL project, it became clear that the word sketches were more
like a contents page than a basket of earth. They provided a neat summary of most of
what the lexicographer was likely to find by the traditional means of scanning concor-
dances. There was not too much noise. Using them saved time. It was more efficient to
start from the word sketch than from the concordance.
Thus the unexpected consequence was that the lexicographer’s methodology
changed, from one where the technology merely supported the corpus-analysis pro-
cess, to one where it pro-actively identified what was likely to be interesting and di-
rected the lexicographer’s attention to it. And whereas, for a human, the bigger the
corpus, the greater the problem of how to manage the data, for the computer, the big-
ger the corpus, the better the analyses: the more data there is, the better the prospects
for finding all salient patterns and for distinguishing signal from noise. Though origi-
nally seen as a useful supplementary tool, the sketches provide a compact and reveal-
ing snapshot of a word’s behaviour and uses and have, in most cases, become the
preferred starting point in the process of analyzing complex headwords.
3.4 Word sketches and the Sketch Engine since 2004
Since the first word sketches were used in the late 1990s in the development of the first
edition of MEDAL, word sketches have been integrated into a general-purpose corpus
query tool, the Sketch Engine (Kilgarriff et al. 2004) and have been developed for a
dozen languages (the list is steadily growing). They are now in use for commercial and
academic lexicography in the UK (where most of the main dictionary publishers use
them), China, the Czech Republic, Germany, Greece, Japan, the Netherlands, Slovakia,
Slovenia and the USA, and for language and linguistics teaching all round the world.
Word sketches have been complemented by an automatic thesaurus (which identifies
the words which are most similar, in terms of shared collocations, to a target word)
and a range of other tools including ‘sketch difference’, for comparing and contrasting
a word with a near-synonym or antonym in terms of collocates shared and not shared.
There are also options such as clustering a word’s collocates or its thesaurus entries.
The largest corpus for which word sketches have been created so far contains over five
billion words (Pomikálek et al. 2009). In a quantitative evaluation, two thirds of the
collocates in word sketches for five languages were found to be ‘publishable quality’: a
lexicographer would want to include them in a published collocations dictionary for
the language (Kilgarriff et al. 2010).
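The two tools just mentioned can be illustrated with set operations over collocates. In this sketch each word's sketch is reduced to a set of (relation, collocate) pairs, and similarity is plain Jaccard overlap; that measure, the function names and the toy data are assumptions standing in for whatever weighting the real thesaurus applies.

```python
def shared_collocate_similarity(sketch_a, sketch_b):
    """Jaccard overlap between two sets of (relation, collocate) pairs."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else 0.0


def thesaurus(target, sketches, top=3):
    """Rank the other words by how many collocates they share with target."""
    scores = [(other, shared_collocate_similarity(sketches[target], sketches[other]))
              for other in sketches if other != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]


def sketch_difference(sketch_a, sketch_b):
    """Split collocates into those shared and those specific to each word."""
    return {
        "shared": sketch_a & sketch_b,
        "only_first": sketch_a - sketch_b,
        "only_second": sketch_b - sketch_a,
    }


# Invented miniature sketches.
sketches = {
    "clever": {("modifies", "idea"), ("modifies", "trick"), ("and/or", "witty")},
    "intelligent": {("modifies", "idea"), ("modifies", "life"), ("and/or", "witty")},
    "tall": {("modifies", "building")},
}
print(thesaurus("clever", sketches, top=1))
print(sketch_difference(sketches["clever"], sketches["intelligent"]))
```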
3.5 Word sketches and the Sketch Engine in the NEID project
The New English-Irish Dictionary (NEID) is a project funded by Foras na Gaeilge, the
statutory language board for Ireland, and planned by the Lexicography MasterClass.10
It has provided a setting for a range of ambitious ideas about how we can efficiently
create ever more detailed and accurate descriptions of the lexis of a language. The
project makes a clear divide between the ‘source-language analysis’ phase of the proj-
ect, and the translation and final-editing phases. A consequence is that the analysis
phase is an analysis of English in which the target language (Irish) plays no part, and
the resulting ‘Database of ANalysed Texts of English’ (DANTE) is a database with po-
tential for a range of uses in lexicography and language technology. It could be used,
for example, as a launchpad for bilingual dictionaries with a different target language,
or as a resource for improving machine translation systems or text-remediation soft-
ware. The Lexicography MasterClass undertook the analysis phase, with a large team
of experienced lexicographers, over the period 2008–2010.11
The project has used the Sketch Engine with a corpus comprising UKWaC plus
the English-language part of the New Corpus for Ireland (Kilgarriff et al. 2006). In the
course of the project, three innovations were added to the standard word sketches.
10. http://www.lexmasterclass.com.
11. For an account see Atkins et al. (2010).
3.5.1 Customization of sketch grammar

Any dictionary uses a particular grammatical scheme in its choice of the repertoire and
meaning of the grammatical labels it attaches to words. The Sketch Engine also uses a
grammatical scheme in its ‘Sketch Grammar’, which defines the grammatical relations
according to which it will classify collocations in the word sketches: object_of, and/or
etc. in Figure 1. The Sketch Grammar also gives names to the grammatical relations.
This raises the prospect of mapping the grammatical scheme that is specified in a dic-
tionary’s Style Guide onto the scheme in the Sketch Grammar. In this way, there will be
an exact match between the inventory of grammatical relations in the dictionary, and
those presented to the lexicographer in the word sketch. A relation that is called NP_
PP, for a verb such as load (load the hay onto the cart) in the lexical database will be
called NP_PP, with exactly the same meaning, in the word sketch. Such an approach
will simplify and rationalize the analysis process for the lexicographer: for the most
part s/he will be copying a collocate of type X in the word sketch, into a collocate of type
X (under the relevant sense of the headword) in the dictionary entry s/he is writing.
The NEID was the first project where the Sketch Grammar and Dictionary Gram-
mar were fully harmonized: the Sketch Grammar was customized to express the same
grammatical constructions and collocation-types, with the same names, as the lexi-
cographers would use in their analysis. Another Macmillan project (the Macmillan
Collocations Dictionary; Rundell, ed., 2010) subsequently used the same approach.
3.5.2 ‘Constructions list’ as top-level summary of word sketch

The dictionary grammar for the NEID project is quite complex and fine-grained. In
the case of verbs, for example, any of 43 different structures may be recorded. Conse-
quently we soon found that word sketches were often rather large and hard to navigate
around. To address this, we introduced an ‘index’, which appears right at the top of the
word sketch and summarizes its contents by listing the constructions that are most
salient for that word (cf. Figure 2).
In other cases, we found that there were a large number of constructions involving
prepositions and particles, and that these could make the word sketch unwieldy. To
address this, we collected all the preposition/particle relations on a separate web page,
as in Figure 3.
3.5.3 ‘More data’ and ‘Less data’ buttons

The size of a word sketch is (inevitably) constrained by parameters which determine
how many collocates and constructions are shown. The Sketch Engine has always al-
lowed users to change the parameters, but most users are either unaware of the possi-
bility or are not sure which parameters they should change or by how much. A simple
but much-appreciated addition to the interface was ‘More data’ and ‘Less data’ buttons
so the user can, at a single click, see less data (if they are feeling overwhelmed) or more
data (if they have accounted for everything in the word sketch in front of them, but feel
they have missed something or not said enough).
Figure 2. Part of a word sketch for remember (verb), where the verb’s main syntactic
patterns appear in the box at top left
Figure 3. Word sketch for argue, showing part of the page devoted to prepositional phrases
3.6 Labels
Dictionaries use a range of labels (such as usu. pl., informal, Biology, AmE) to mark
words according to their grammatical, register, domain, and regional-variety character-
istics, whenever these deviate significantly from the (unmarked) norm. All of these are
facts about a word’s distribution, and all can, in principle, be gathered automatically
from a corpus. In each of these four cases, computationalists are currently able to pro-
pose some labels to the lexicographer, though there remains much work to be done.
In each case the methodology is to:
– specify a set of hypotheses
  – there will usually be one hypothesis per label, so grammatical hypotheses for
    the category ‘verb’ may include:
    – is it often/usually/always passive
    – is it often/usually/always progressive
    – is it often/usually/always in the imperative
– for each word
  – test all relevant hypotheses
  – for all hypotheses that are confirmed, add the information to the word
    sketch.
Where no hypotheses are confirmed – when, in other words, there is nothing interest-
ing to say, which will be the usual case – nothing is added to the word sketch.
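The methodology above amounts to a simple loop, sketched here with each hypothesis expressed as a test on a word's corpus statistics. The 50% thresholds, the field names and the counts are invented for illustration; how a threshold might actually be chosen is discussed in the next subsection.

```python
def propose_labels(word_stats, hypotheses):
    """Return {word: [labels]} for words with at least one confirmed hypothesis."""
    proposals = {}
    for word, stats in word_stats.items():
        labels = [label for label, test in hypotheses.items() if test(stats)]
        if labels:                  # nothing is added when nothing is confirmed
            proposals[word] = labels
    return proposals


# One hypothesis per label (thresholds are placeholder assumptions).
hypotheses = {
    "usually passive": lambda s: s["passive"] / s["total"] > 0.5,
    "usually progressive": lambda s: s["progressive"] / s["total"] > 0.5,
}

# Invented counts, not corpus figures.
word_stats = {
    "station": {"total": 100, "passive": 72, "progressive": 3},
    "walk": {"total": 1000, "passive": 5, "progressive": 200},
}
print(propose_labels(word_stats, hypotheses))
```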
3.6.1 Grammatical labels: usu. pl, usu. passive, etc.

To determine whether a noun should be marked as ‘usually plural’, it is possible simply
to count the number of times the lemma occurs in the plural, and the number of times
it occurs overall, and divide the first number by the second to find the proportion.
Similarly, to discover how often a verb is passivized, we can count how often it is a past
participle preceded by a form of the verb be (with possible intervening adverbs) and
determine what fraction of the verb’s overall frequency the passive forms represent.
Given a lemmatized, part-of-speech-tagged corpus, this is straightforward. A large
number of grammatical hypotheses can be handled in this way.
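Both counts can be sketched over a corpus represented as (form, lemma, tag) triples. The Penn-Treebank-style tags (NN, NNS, VBN, RB) and the toy sentences are assumptions; any lemmatized, part-of-speech-tagged corpus with equivalent distinctions would serve.

```python
def plural_proportion(tokens, lemma):
    """Share of a noun lemma's occurrences that are plural."""
    total = sum(1 for _, lem, tag in tokens if lem == lemma and tag in ("NN", "NNS"))
    plural = sum(1 for _, lem, tag in tokens if lem == lemma and tag == "NNS")
    return plural / total if total else 0.0


def passive_proportion(tokens, lemma):
    """Share of a verb lemma's occurrences that are 'be (+ adverbs) + past participle'."""
    total = passive = 0
    for i, (_, lem, tag) in enumerate(tokens):
        if lem != lemma or not tag.startswith("VB"):
            continue
        total += 1
        if tag == "VBN":
            j = i - 1
            while j >= 0 and tokens[j][2] == "RB":      # skip intervening adverbs
                j -= 1
            if j >= 0 and tokens[j][1] == "be":
                passive += 1
    return passive / total if total else 0.0


# Invented mini-corpus: "Troops were permanently stationed there. They station
# guards. One guard was stationed."
tokens = [
    ("Troops", "troop", "NNS"), ("were", "be", "VBD"),
    ("permanently", "permanently", "RB"), ("stationed", "station", "VBN"),
    ("there", "there", "RB"), (".", ".", "."),
    ("They", "they", "PRP"), ("station", "station", "VBP"),
    ("guards", "guard", "NNS"), (".", ".", "."),
    ("One", "one", "CD"), ("guard", "guard", "NN"),
    ("was", "be", "VBD"), ("stationed", "station", "VBN"),
]
print(plural_proportion(tokens, "guard"), passive_proportion(tokens, "station"))
```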
The next question is: when is the information interesting enough to merit a label
in a dictionary? Should we, for example, label all verbs which are over 50% passive as
often passive?
To assess this question, we want to know what the implications would be: we do
not want to bombard the dictionary user with too many labels (or the lexicographer
with too many candidate-labels). What percentage of English verbs occur in the pas-
sive over half of the time? Is it 20%, or 50%, or 80%? This question is also not in prin-
ciple hard to answer: for each verb, we work out its percentage passive, and sort
according to the percentage. We can then give a figure which is, for lexicographic pur-
poses, probably more informative than ‘the percentage passive’: the percentile. The
percentile indicates whether a verb is in the top 1%, or 2%, or 5%, or 10% of verbs from
the point of view of how passive they are. We can prepare lists as in Table 2. This uses
the methodology for finding the ‘most passive’ verbs (with frequency over 500) in the
BNC. It shows that the most passive verb is station: people and things are often sta-
tioned in places, but there are far fewer cases where someone actively stations things.
For station, 72.2% of its 557 occurrences are in the passive, and this puts it in the 0.2%
‘most passive’ verbs of English. At the other end of the table, levy is in the passive just
over half the time, which puts it in the 1.9% most passive verbs. The approach is simi-
lar to the collostructional analysis of Gries & Stefanowitsch (2004).
As can be seen from this sample, the information is lexicographically valid: all the
verbs in the table would benefit from an often passive or usually passive label.
Table 2. The ‘most passive’ verbs in the BNC, for which a usually passive label might be
proposed
Percentile  Ratio  Lemma      Frequency
0.2         72.2   station          557
0.2         71.8   base           19201
0.3         71.1   destine          771
0.3         68.7   doom             520
0.4         66.3   poise            640
0.4         65.0   situate         2025
0.5         64.7   schedule        1602
0.5         64.1   associate       8094
0.6         63.2   embed            688
0.7         62.0   entitle         2669
0.8         59.8   couple          1421
0.9         58.1   jail             960
1.1         57.8   deem            1626
1.1         55.5   confine         2663
1.2         55.4   arm             1195
1.2         54.9   design         11662
1.3         53.9   convict         1298
1.5         53.1   clothe           749
1.5         52.8   dedicate        1291
1.5         52.4   compose         2391
1.6         51.5   flank            551
1.7         50.8   gear             733
1.9         50.1   levy             603
A table like this can be used by editorial policy-makers to determine a cut-off
which is appropriate for a given project. For instance, what proportion of verbs should
attract an often passive label? Perhaps the decision will be that users benefit most if the
label is not overused, so just 4% of verbs would be thus labelled. The full version of
Table 2 tells us what these verbs are. And now that we know precisely the hypothesis to
use (“is the verb in the top 4% most-passive verbs?”) and where the hypothesis is true,
the label can be added into the word sketch. In this way, the element of chance – will
the lexicographer notice whether a particular verb is typically passivized? – is elimi-
nated, and the automation process not only reduces lexicographers’ effort but at the
same time ensures a more consistent account of word behaviour.
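To illustrate the cut-off decision, the sketch below applies a threshold to the percentile column of the Table 2 excerpt. The function name and the 1% threshold are illustrative assumptions; the 4% figure mentioned above would take in every verb in the excerpt, since the excerpt stops at 1.9%.

```python
# (percentile, lemma) pairs copied from the Table 2 excerpt.
TABLE_2 = [
    (0.2, "station"), (0.2, "base"), (0.3, "destine"), (0.3, "doom"),
    (0.4, "poise"), (0.4, "situate"), (0.5, "schedule"), (0.5, "associate"),
    (0.6, "embed"), (0.7, "entitle"), (0.8, "couple"), (0.9, "jail"),
    (1.1, "deem"), (1.1, "confine"), (1.2, "arm"), (1.2, "design"),
    (1.3, "convict"), (1.5, "clothe"), (1.5, "dedicate"), (1.5, "compose"),
    (1.6, "flank"), (1.7, "gear"), (1.9, "levy"),
]


def verbs_to_label(rows, cutoff_percent):
    """Return the lemmas whose percentile falls within the chosen cut-off."""
    return [lemma for percentile, lemma in rows if percentile <= cutoff_percent]


strict = verbs_to_label(TABLE_2, 1.0)    # hypothetical, stricter policy
generous = verbs_to_label(TABLE_2, 4.0)  # the 4% policy discussed in the text
print(len(strict), "verbs labelled at 1%:", ", ".join(strict))
print(len(generous), "verbs from the excerpt labelled at 4%")
```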
3.6.2 Register labels: formal, informal, etc.
Any corpus is a collection of texts. Register is in the first instance a classification that
applies to texts rather than words. A word is informal (or formal) if it shows a clear
tendency to occur in informal (or formal) texts. To label words according to register,
we need a corpus in which the constituent texts are themselves labelled for register in
the document header. Note that at this stage, we are not considering aspects of register
other than formality.
One way to come by such a corpus is to gather texts from sources known to be
formal or informal. In a corpus such as the BNC, each document is supplied with
various text type classifications, so we can, for example, infer from the fact that a doc-
ument is everyday conversation, that it is informal, or from the fact that it is an aca-
demic journal article, that it is formal.
The approach has potential, but also drawbacks. In particular, it is not possible to
apply it to any corpus which does not come with text-type information. Web corpora
do not. An alternative is to build a classifier which infers formality level on the basis of
the vocabulary and other features of the text. There are classifiers available for this task:
see for example Heylighen & Dewaele (1999) and Santini et al. (2009). Following this
route, we have recently labelled all documents in a five billion word web corpus accord-
ing to formality, so we are now in a position to order words from most to least formal.
The next tasks will be to assess the accuracy of the classification, and to consider – just
as was done for passives – the percentage of the lexicon we want to label for register.
The reasoning may seem circular: we use formal (or informal) vocabulary to find
formal (or informal) vocabulary. But it is a spiral rather than a circle: each cycle has
more information at its disposal than the previous one. We use our knowledge of the
words that are formal or informal to identify documents that are formal or informal.
That then gives us a richer dataset for identifying further words, phrases and construc-
tions which tend to be formal or informal, and allows us to quantify the tendencies.
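One turn of this spiral could be sketched as below. This is a deliberately naive illustration, not the classifier used for the web corpus: the scoring rule, the `min_count` threshold, the seed words and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def document_formality(tokens, formal, informal):
    """Score in [-1, 1]: positive if known-formal words outnumber known-informal ones."""
    f = sum(t in formal for t in tokens)
    i = sum(t in informal for t in tokens)
    return 0.0 if f + i == 0 else (f - i) / (f + i)


def one_cycle(docs, formal, informal, min_count=2):
    """Label documents with the current word lists, then extend the lists."""
    formal_counts, informal_counts = Counter(), Counter()
    for tokens in docs:
        score = document_formality(tokens, formal, informal)
        if score > 0:
            formal_counts.update(tokens)
        elif score < 0:
            informal_counts.update(tokens)
    new_formal, new_informal = set(formal), set(informal)
    for word in set(formal_counts) | set(informal_counts):
        if word in formal or word in informal:
            continue
        f, i = formal_counts[word], informal_counts[word]
        # Only adopt words seen repeatedly on one side and never on the other.
        if f >= min_count and i == 0:
            new_formal.add(word)
        elif i >= min_count and f == 0:
            new_informal.add(word)
    return new_formal, new_informal


# Hypothetical toy corpus and seed lists.
docs = [
    "we hereby notify the aforementioned tenant".split(),
    "the tenant shall hereby vacate the premises".split(),
    "gonna grab the pizza mate".split(),
    "yeah gonna get the pizza".split(),
]
formal, informal = one_cycle(docs, {"hereby"}, {"gonna"})
print(sorted(formal), sorted(informal))
```

Each further call to `one_cycle` starts from the enlarged lists, which is the sense in which every cycle has more information than the last. On toy data the new ‘formal’ and ‘informal’ words reflect the tiny sample rather than real register tendencies.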
3.6.3 Domain labels: Geol., Astron., etc.
The issues are, in principle, the same as for register. The practical difference is that
there are far more domains (and domain labels): even MEDAL, a general-purpose
learner’s dictionary, has 18 of these, while the NEID database has over 150 domain
labels. Collecting large corpora for each of these domains is a significant challenge.
It is tempting to gather a large quantity of, for example, geological texts from a
particular source, perhaps an online geology journal. But rather than being a ‘general
geology’ corpus, that subcorpus will be an ‘academic-geology-prose corpus’, and the
words which are particularly common in the subcorpus will include vocabulary typi-
cal of academic discourse in general as well as of the domain of geology. Ideally, each
subcorpus will have the same proportions of different text-types as the whole corpus.
None of this is technically or practically impossible, but the larger the number of sub-
corpora, the harder it is to achieve.
In current work, we are focusing on just three subcorpora (legal, medical and
business) to see if we can effectively propose labels for them.
Once we have the corpora and counts for each word in each subcorpus, we need
to use statistical measures for deciding which words are most distinctive of the subcor-
pus: which words are its ‘keywords’, the words for which there is the strongest case for
labelling. The maths we use is based on a simple ratio between relative frequencies, as
implemented in the Sketch Engine and presented in Kilgarriff (2009).
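A ‘simple ratio between relative frequencies’ could be sketched as below. Frequencies are normalised per million words; the smoothing constant added to both sides (so that a word absent from the reference corpus does not cause division by zero), its default value, the function names and the toy counts are all assumptions of this sketch, which is not the Sketch Engine's own code.

```python
def per_million(count, corpus_size):
    """Relative frequency expressed per million tokens."""
    return 1_000_000 * count / corpus_size


def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of smoothed relative frequencies: subcorpus over reference corpus."""
    focus = per_million(focus_count, focus_size) + smoothing
    reference = per_million(ref_count, ref_size) + smoothing
    return focus / reference


def keywords(focus_counts, focus_size, ref_counts, ref_size, smoothing=1.0):
    """Words of the subcorpus, most distinctive first."""
    scored = [
        (keyness(count, focus_size, ref_counts.get(word, 0), ref_size, smoothing), word)
        for word, count in focus_counts.items()
    ]
    return [word for score, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical counts: a 1M-word legal subcorpus against a 10M-word reference corpus.
legal = {"plaintiff": 200, "the": 60_000, "tort": 50}
reference = {"plaintiff": 100, "the": 600_000}
ranking = keywords(legal, 1_000_000, reference, 10_000_000)
print(ranking)
```

A word such as the, which is equally frequent in both corpora, scores 1.0 and sinks to the bottom of the list, while domain vocabulary rises to the top as a candidate for labelling.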
3.6.4 Region labels: AmE, AustrE, etc.
The issues concerning region labels are the same as for domains but in some ways a
little simpler. The taxonomy of regions, at least from the point of view of labelling
items used in different parts of the English-speaking world, is relatively limited, and a
good deal less open-ended than the taxonomy of domains. In MEDAL, for example, it
comprises just 12 varieties or dialects, including American, Australian, Irish, and
South African English.
3.7 Examples
Most dictionaries include example sentences. They are especially important in peda-
gogical dictionaries, where a carefully-selected set of examples can clarify meaning,
illustrate a word’s contextual and combinatorial behaviour, and serve as models for
language production. The benefits for users are clear, and the shift from paper to elec-
tronic media means that we can now offer users far more examples. But this comes at
a cost. Finding good examples in a mass of corpus data is labour-intensive. For all sorts
of reasons, a majority of corpus sentences will not be suitable as they stand, so the
lexicographer must either search out the best ones or modify corpus sentences which
are promising but in some way flawed.
3.7.1 GDEX
In 2007, the requirement arose – in a project for Macmillan – for the addition of new
examples for around 8,000 collocations. The options were to ask lexicographers to
select and edit these in the ‘traditional’ way, or to see whether the example-finding
process could be automated. Budgetary considerations favoured the latter approach, and
subsequent discussions led to the GDEX (‘good dictionary examples’) algorithm,
which is described in Kilgarriff et al. (2008).
Essentially, the software applies a number of filters designed to identify those sen-
tences in a corpus which most successfully fulfil our criteria for being a ‘good’ example.
A wide range of heuristics is used, including criteria like sentence length, the presence
(or absence) of rare words or proper names, and the number of pronouns in the sen-
tence. The system worked successfully on its first outing – not in the sense that every
example it ‘promoted’ was immediately usable, but in the sense that it significantly
streamlined the lexicographer’s task. GDEX continues to be refined, as more selection
criteria are added and the weightings of the different filters adjusted. For the DANTE
database, which includes several hundred thousand examples, GDEX sorts the sen-
tences for any of the combinations shown in the word sketches, in such a way that the
ones which GDEX thinks are ‘best’ are shown first. The lexicographer can scan a short
list until they find a suitable example for whatever feature is being illustrated, and
GDEX means they are likely to find what they are looking for in the top five examples,
rather than, on average, within the top 20 to 30.
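The general idea of scoring and sorting candidate sentences can be illustrated with a small sketch. It uses only the kinds of criteria named above (sentence length, rare words, proper names, pronouns); the penalty weights, the length limits, the capital-letter test for proper names and the word lists are invented for the example and are not GDEX's actual filters or weightings.

```python
import re

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "this", "that"}


def example_score(sentence, common_words, min_len=6, max_len=20):
    """Start from 1.0 and subtract hypothetical penalties; higher is better."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    if not tokens:
        return 0.0
    penalty = 0.0
    if not (min_len <= len(tokens) <= max_len):
        penalty += 0.4  # too short or too long to stand alone
    for position, token in enumerate(tokens):
        if position > 0 and token[0].isupper():
            penalty += 0.2  # crude proper-name test: capital after first word
        elif token.lower() in PRONOUNS:
            penalty += 0.1  # reference that may be unclear out of context
        elif token.lower() not in common_words:
            penalty += 0.1  # rare word
    return max(0.0, 1.0 - penalty)


def rank_examples(sentences, common_words):
    """Best candidates first, as in a GDEX-sorted concordance."""
    return sorted(sentences, key=lambda s: example_score(s, common_words), reverse=True)


# Hypothetical common-word list and candidate sentences.
common = {"the", "soldiers", "were", "stationed", "near", "border",
          "for", "two", "years", "said", "in"}
candidates = [
    "He said it.",
    "Brigadier Hargreaves stationed them in Ouagadougou.",
    "The soldiers were stationed near the border for two years.",
]
ranked = rank_examples(candidates, common)
print(ranked[0])
```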
3.7.2 One-click copying
DANTE is an example-rich database in which almost all word senses, constructions,
and multiword expressions are illustrated with at least one example. All examples are
from the corpus and are unedited (DANTE is a lexical database rather than a finished
dictionary). Lexicographers are thus required to copy many example sentences from
the corpus system into the dictionary editing system. We use standard copy-and-paste,
but in the past this was often fiddly: one click to display the whole sentence, then
careful manoeuvring of the mouse to select it all. So we have added a button for
‘one-click copying’: now, a single click on an icon at the right of any concordance line
copies not the visible concordance line, but the complete sentence (with headword
highlighted), and puts it on the clipboard ready for pasting into the dictionary.
3.8 Tickbox lexicography (TBL)
One-click copying is a good example of a simple software tweak that streamlines a
routine lexicographic task. This may look trivial, but in the course of a project such as
DANTE, the lexicographic team will be selecting and copying several hundred
thousand example sentences, so the time-savings this yields are significant.
Another development – currently in use on two lexicographic projects – takes this
process a step further, allowing lexicographers to select collocations for an entry, then
select corpus examples for each collocation, simply by ticking boxes (thus eliminating
the need to retype or cut-and-paste). We call this ‘tickbox lexicography’ (TBL), and in
this process, the lexicographer works with a modified version of the word sketches,
where each collocate listed under the various grammatical relations (‘gramrels’) has a
tickbox beside it. Then, for each word sense and each gramrel, the lexicographer:
– ticks the collocations s/he wants in the dictionary or database
– clicks a ‘Next’ button
– is then presented with a choice of six corpus examples for every collocation, each
with a tickbox beside it (six is the default, and assumes that – thanks to GDEX – a
suitable example will appear in this small set; but the defaults can of course be
changed)
– ticks the desired examples, then clicks a ‘Copy to clipboard’ button.
The system then builds an XML structure according to the Document Type Definition
(DTD) of the target dictionary (each target dictionary has its own TBL application).
The lexicographer can then paste this complex structure, in a single move, directly into
the appropriate fields in the dictionary writing system. In this way, TBL models and
streamlines the process of getting a corpus analysis out of the corpus system and into
the dictionary writing system, as the first stage in the compilation of a dictionary. Here
again, the incremental efficiency gains are substantial. The TBL process is especially
well-adapted to the emerging situation where online dictionaries give their users ac-
cess to multiple examples of a given linguistic feature (such as a collocation or syntax
pattern): with TBL, large numbers of relevant corpus examples can be selected and
copied into the database with minimum effort.
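The step in which ticked collocations and examples are turned into an XML structure might look roughly like the sketch below. The element and attribute names are hypothetical, since each target dictionary has its own DTD, and the example sentences are invented; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def build_entry_fragment(headword, gramrel, selections):
    """Serialise ticked items as XML.

    selections maps each ticked collocate to its list of ticked examples.
    All element and attribute names here are placeholders for whatever
    the target dictionary's DTD actually requires.
    """
    root = ET.Element("collocations", {"headword": headword, "gramrel": gramrel})
    for collocate, examples in selections.items():
        colloc = ET.SubElement(root, "colloc", {"collocate": collocate})
        for sentence in examples:
            ET.SubElement(colloc, "example").text = sentence
    return ET.tostring(root, encoding="unicode")


# Hypothetical selections for one word sense and one gramrel.
fragment = build_entry_fragment(
    "station",
    "object",
    {
        "troops": ["Troops were stationed along the border."],
        "guard": ["A guard was stationed at each entrance.",
                  "Guards are stationed outside the building."],
    },
)
print(fragment)
```

The resulting string is what would be placed on the clipboard, ready to be pasted into the dictionary writing system in a single move.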
4. Conclusions
If we look back at the list of lexicographic tasks (Section 3, above), we find that the
following have been – or soon will be – automated to a significant degree:
– corpus creation
– headword list building
– identification of key linguistic features or preferences (syntactic, collocational,
colligational, and text-type-related)
– example selection.
Further improvements are possible for each of these technologies (notably the GDEX
algorithm and the text-type classifiers), and many of these are already in development.
An especially interesting approach we are now looking at is one that takes the whole
automation process a step further. In this model, we envisage a change from the cur-
rent situation, where the corpus software (some version of the word sketches) presents
data to the lexicographer in (as we have seen) intelligently pre-digested form, to a new
paradigm where the software selects what it believes to be relevant data and actually
populates the appropriate fields in the dictionary database. In this way of working, the
lexicographer’s task changes from selecting and copying data from the software, to
validating – in the dictionary writing system – the choices made by the computer.
Having deleted or adjusted anything unwanted, the lexicographer then tidies up and
completes the entry. The principle here is that, assuming the software can be trained to
make the ‘right’ decisions in a majority of cases, it is more efficient to edit out the
computer’s errors than to go through the whole data-selection process from the beginning.
If this approach can be made to work effectively, we are likely to see a further change
in lexicographers’ working practices – and a further shift towards full automation.
Fully automated lexicography is still some way off. In particular, we have not yet reached
the point where definition writing and (hardest of all) word sense disambiguation
(WSD) are carried out by machines. In both cases, however, it may be possible to solve
the problem by redefining the goal. If, for example, we think less in terms of discrete,
numbered ‘dictionary senses’, and more of the contribution that a word makes to the
meaning of a given communicative event, then the task starts to look less intractable.
It has become increasingly clear that the meaning of a word in a particular context is
closely associated with the specific patterning in which it appears – where ‘patterning’
encompasses features such as syntax, collocation, and domain information. A good
deal of research is going on in this area, notably Patrick Hanks’ work on ‘Corpus
Pattern Analysis’ (e.g. Hanks 2004), and it is self-evident that computers can identify and
count clusters of patterns more readily than they can count something as unstable as
‘senses’. This offers one way forward. Equally, definitions could become less important
if the user who encounters an unknown word could immediately access half a dozen
very similar corpus examples (filtered by GDEX or the like), and then draw his or her
own conclusions. Whether this could be a viable alternative to the traditional
definition – especially when the user is a learner – remains to be seen.
We have described a long-running collaboration between a lexicographer and a
computational linguist, and its outcomes in terms of the way that dictionary text is
compiled in the early 21st century. There is plenty more to be done, but it should be
clear from this brief survey that the interaction between lexicography and language
engineering has already been fruitful and promises to deliver even greater benefits in
the future.
References
Atkins, S., Kilgarriff, A., & Rundell, M. 2010. The Database of Analysed Texts of English
(DANTE). In Proceedings of 14th EURALEX International Congress, A. Dykstra & T.
Schoonheim (eds), 549–556. Leeuwarden, The Netherlands.
Atkins, S. & Rundell, M. 2008. The Oxford Guide to Practical Lexicography. Oxford: OUP.
Baroni, M. & Bernardini, S. 2004. BootCaT: Bootstrapping corpora and terms from the web.
Proceedings of LREC 2004, 1313–1316. Lisbon: ELDA.
Baroni, M., Bernardini, S., Ferraresi, A. & Zanchetta, E. 2009. The WaCky wide web: A collec-
tion of very large linguistically processed web-crawled corpora. Language Resources and
Evaluation Journal 43(3): 209–226.
Baroni, M., Kilgarriff, A., Pomikálek, J. & Rychlý, P. 2006. WebBootCaT: A Web tool for instant
corpora. In Proceedings of 12th EURALEX International Congress, E. Corino, C. Marello, C.
Onesti (eds), 123–131. Alessandria: Edizioni Dell’Orso.
Church, K. & Hanks, P. 1990. Word association norms, mutual information and lexicography.
Computational Linguistics 16: 22–29.
Clear, J. 1988. The monitor corpus. In ZüriLEX ’86 Proceedings, M. Snell-Hornby (ed.), 383–389.
Tübingen: Francke.
Clear, J. 1996. Technical implications of multilingual corpus lexicography. International Journal
of Lexicography 9(3): 265–273.
de Schryver, G.-M. & Prinsloo, D.J. 2000. The compilation of electronic corpora, with special
reference to the African languages. Southern African Linguistics and Applied Language
Studies 18(1–4): 89–106.
Fairon, C., Macé, K., & Naets, H. 2008. GlossaNet2: a linguistic search engine for RSS-based
corpora. Proceedings, Web As Corpus Workshop (WAC4), S. Evert, A. Kilgarriff & S. Sharoff
(eds), 34–39. Marrakesh.
Fletcher, W.H. 2004. Making the Web more useful as a source for linguistic corpora. In Applied
Corpus Linguistics: A Multidimensional Perspective [Studies in Corpus Linguistics 16], U.
Connor & T. Upton (eds), 191–205. Amsterdam: Rodopi.
Grefenstette, G. 1998. The future of linguistics and lexicographers: Will there be lexicographers
in the year 3000? In Actes EURALEX 1998, T. Fontenelle, P. Hiligsmann, A. Michiels, A.
Moulin & S. Theissen (eds), 25–42. Liège: Université de Liège.
Gries, S.T. & Stefanowitsch, A. 2004. Extending collostructional analysis: A corpus-based
perspective on ‘alternations’. International Journal of Corpus Linguistics 9(1): 97–129.
Hanks, P.W. 2004. Corpus Pattern Analysis. In Proceedings of the Eleventh EURALEX Interna-
tional Congress, G. Williams & S. Vessier (eds), 87–98. Lorient: UBS.
Heylighen, F. & Dewaele, J.-M. 1999. Formality of language: Definition, measurement and
behavioural determinants. Internal Report, Free University Brussels,
<http://pespmc1.vub.ac.be/Papers/Formality.pdf>
Janicivic, T. & Walker, D. 1997. NeoloSearch: Automatic detection of neologisms in French Inter-
net documents. Proceedings of ACH/ALLC’97, 93–94. Ontario, Canada: Queen’s University.
Keller, F. & Lapata, M. 2003. Using the web to obtain frequencies for unseen bigrams. Compu-
tational Linguistics 29(3): 459–484.
Kilgarriff, A. 1997. Putting frequencies in the dictionary. International Journal of Lexicography
10(2): 135–155.
Kilgarriff, A. 2009. Simple maths for keywords. In Proceedings, Corpus Linguistics, M. Mahlberg,
V. González-Díaz & C. Smith (eds). Liverpool: University of Liverpool.
<http://ucrel.lancs.ac.uk/publications/cl2009/>.
Kilgarriff, A. 2010. Comparable corpora within and across languages, word frequency lists and
the Kelly project. In Proceedings, 3rd Workshop on Building and Using Comparable Corpora,
R. Rapp, P. Zweigenbaum & S. Sharoff (eds), 1–5. Malta: LREC.
Kilgarriff, A. & Grefenstette, G. 2003. Introduction to the special issue on the web as corpus.
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M. & Rychlý, P. 2008. GDEX: Automatically
finding good dictionary examples in a corpus. In Proceedings of the XIII Euralex Congress,
E. Bernal & J. DeCesaris (eds), 425–431. Barcelona: Universitat Pompeu Fabra.
Kilgarriff, A., Kovář, V., Krek, S., Srdanović, I. & Tiberius, C. 2010. A quantitative evaluation of
word sketches. In Proceedings of 14th EURALEX International Congress, A. Dykstra & T.
Schoonheim (eds), 372–379. Leeuwarden, The Netherlands.
Kilgarriff, A. & Rundell, M. 2002. Lexical profiling software and its lexicographic applications: A
case study. In Proceedings of the Tenth Euralex Congress, A. Braasch & C. Povlsen (eds),
807–818. Copenhagen: University of Copenhagen.
Kilgarriff, A., Rundell, M. & Uí Dhonnchadha, E. 2006. Efficient corpus development for lexi-
cography: Building the new corpus for Ireland. Language Resources and Evaluation Journal
40(2): 127–152.
Kilgarriff, A., Rychlý, P., Smrž, P. & Tugwell, D. 2004. The sketch engine. In Proceedings of the
Eleventh EURALEX International Congress, G. Williams & S. Vessier (eds), 105–116.
Lorient: UBS.
Krishnamurthy, R. 1987. The process of compilation. In Looking Up: An Account of the COBUILD
Project in Lexical Computing, J.M. Sinclair (ed.), 62–85. London: Collins.
Kučera, H. & Francis, W.N. 1967. Computational Analysis of Present-Day American English.
Providence RI: Brown University Press.
Lewis, M. 1993. The Lexical Approach. Hove: Language Teaching Publications.
McCarthy, M. & O’Dell, F. 2005. English Collocations in Use. Cambridge: CUP.
Murray, K.E.M. 1979. Caught in the Web of Words: James A.H. Murray and the Oxford English
Dictionary. Oxford: OUP.
Murray, J., Bradley, H., Craigie, W. & Onions, C.T. 1928. Oxford English Dictionary. Oxford: OUP.
O’Donovan, R. & O’Neill, M. 2008. A systematic approach to the selection of neologisms for
inclusion in a large monolingual dictionary. In Proceedings of the XIII Euralex Congress, E.
Bernal & J. DeCesaris (eds), 571–579. Barcelona: Universitat Pompeu Fabra.
Pomikálek, J., Rychlý, P. & Kilgarriff, A. 2009. Scaling to billion-plus word corpora. Advances in
computational linguistics. Special Issue of Research in Computing Science 41: 3–14.
Procter, P. (ed.). 1978. Longman Dictionary of Contemporary English. Harlow: Longman.
Renouf, A. 1987. Corpus development. In Looking Up: An Account of the COBUILD Project in
Lexical Computing, J.M. Sinclair (ed.), 10–40. London: Collins.
Rundell, M. (ed.). 2001. Macmillan English Dictionary for Advanced Learners. Oxford: Mac-
millan Education.
Rundell, M. (ed.). 2010. Macmillan Collocations Dictionary. Oxford: Macmillan Education.
Santini, M., Rehm, G., Sharoff, S. & Mehler, A. (eds). 2009. Introduction: In Special issue on
Automatic Genre Identification: Issues and Prospects. Journal for Language Technology and
Sinclair, J.M. (ed.). 1987. Looking Up: An Account of the COBUILD Project in Lexical Computing.
London: Collins.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In
Wacky! Working Papers on Web as Corpus, M. Baroni & S. Bernardini (eds), 63–98.
Bologna: Gedit.
Stein, J. & Urdang, L. (eds). 1966. Random House Dictionary of the English Language. New York
NY: Random House.
Tapanainen, P. & Järvinen, T. 1998. Dependency concordances. International Journal of Lexi-
cography 11(3): 187–203.
addendum
Select list of publications by Sylviane Granger*
1. Books
The Louvain International Database of Spoken English Interlanguage. Handbook and CD-ROM
(G. Gilquin, S. De Cock & S. Granger eds). Presses universitaires de Louvain: Louvain-la-
Neuve, 2010.
eLexicography in the 21st Century: New Challenges, New Applications. Proceedings of ELEX2009
(S. Granger & M. Paquot eds). Cahiers du CENTAL. Presses universitaires de Louvain:
Louvain-la-Neuve, 2010.
The International Corpus of Learner English. Handbook and CD-ROM. Version 2 (S. Granger, E.
Dagneaux, F. Meunier & M. Paquot eds). Presses universitaires de Louvain: Louvain-la-
Neuve, 2009.
Phraseology: An Interdisciplinary Perspective (S. Granger & F. Meunier eds). John Benjamins:
Amsterdam, 2008.
Phraseology in Foreign Language Learning and Teaching (F. Meunier & S. Granger eds). John
Benjamins: Amsterdam, 2008.
Eigo Gakushusha Kopasu Nyumon---SLA to Kopasu no Deai (Introduction to English Learner
Corpus – SLA Meets Corpus Linguistics) (S. Granger ed.). Japanese translation of S. Granger
(ed.) Learner English on Computer (Addison Wesley Longman). Kenkyusha: Tokyo, 2008.
Corpus-based Approaches to Contrastive Linguistics and Translation Studies (S. Granger, J. Lerot
& S. Petch-Tyson eds). Foreign Language Teaching and Research Press: Beijing, 2007.
Extending the Scope of Corpus-based Research: New Applications, New Challenges (S. Granger &
S. Petch-Tyson eds), Rodopi: Amsterdam, 2003.
Corpus-based Approaches to Contrastive Linguistics and Translation Studies (S. Granger, J. Lerot
& S. Petch-Tyson eds), Rodopi: Amsterdam, 2003.
Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching
(S. Granger, J. Hung & S. Petch-Tyson eds), Language Learning and Language Teaching 6.
John Benjamins: Amsterdam, 2002.
The International Corpus of Learner English. Handbook and CD-ROM (S. Granger, E. Dagneaux
& F. Meunier eds). Presses universitaires de Louvain: Louvain-la-Neuve, 2002.
Lexis in Contrast. Corpus-based Approaches (B. Altenberg & S. Granger eds), Studies in Corpus
Linguistics 7. John Benjamins: Amsterdam, 2002.
Contrastive Linguistics and Translation (S. Granger, L. Beheydt & J.-P. Colson eds), Special issue
of Le Langage et l’Homme XXXIV(1), 1999.
Learner English on Computer (S. Granger ed.). Addison Wesley Longman: London, 1998.
* The publications are sorted by publication type (books or articles) and in reverse chrono-
logical order.
Dictionnaire des Faux Amis/Dictionary of Faux Amis Français-Anglais English-French (J. Van
Roey, S. Granger & H. Swallow). Duculot: Gembloux. 3rd edition Duculot: Paris, 1998.
Perspectives on the English Lexicon (S. Granger ed.). Peeters: Louvain-la-Neuve, 1991.
Thèmes Grammaticaux Français-Anglais (S. Granger & J. Van Roey), Collection Pédasup 7. Aca-
demia: Louvain-la-Neuve, 1988.
The Be + Past Participle Construction in Spoken English with Special Emphasis on the Passive
(S. Granger), North Holland Linguistic Series. Elsevier Science Publishers: Amsterdam, 1983.
Tendencje interpretacyjne i generatywne w gramatyce transformacyjnej (S. Granger & B. Devlam-
minck), Katolicki Uniwersytet Lubelski: Lublin, 1981.
2. Articles
Error patterns and automatic L1 identification (Y. Bestgen, S. Granger & J. Thewissen). In S.
Jarvis & S. Crossley (eds) Approaching Transfer through Text Classification: Explorations in
the Detection-based Approach, forthcoming.
How to use foreign and second language learner corpora? (S. Granger) In A. Mackey & S.G.
Gass (eds) A Guide to Research Methods in Second Language Acquisition. Basil Blackwell,
forthcoming.
Learner corpora (S. Granger). In C.A. Chapelle (ed.) The Encyclopedia of Applied Linguistics.
Wiley-Blackwell: Oxford, forthcoming.
The comparative and combined contributions of n-grams, Coh-Metrix indices, and error types
in the L1 classification of learner texts (S. Jarvis, Y. Bestgen, S. Crossley, S. Granger, M.
Paquot, J. Thewissen & D. McNamara). In S. Jarvis & S. Crossley (eds) Approaching Transfer
through Text Classification: Explorations in the Detection-based Approach, forthcoming.
Language for specific purposes learner corpora (S. Granger & M. Paquot). In T.A. Upton & U.
Connor (eds) The Encyclopedia of Applied Linguistics. Language for Specific Purposes. Wiley-
Blackwell: Oxford, forthcoming.
Categorizing spelling errors to assess L2 writing (Y. Bestgen & S. Granger) International Journal
of Continuing Engineering Education and Life-Long Learning (IJCEELL), in press.
From phraseology to pedagogy: Challenges and prospects (S. Granger). In T. Herbst, P. Uhrig &
S. Schüller (eds) Chunks in the Description of Language. A Tribute to John Sinclair. Mouton
de Gruyter: Berlin, in press.
From EFL to ESL: Evidence from the International Corpus of Learner English (G. Gilquin & S.
Granger). In M. Hundt & J. Mukherjee (eds) Exploring Second-Language Varieties of English
and Learner Englishes: Bridging a Paradigm Gap. John Benjamins: Amsterdam, in press.
Comparable and translation corpora in cross-linguistic research. Design, analysis and applica-
tions (S. Granger). Contemporary Foreign Language Studies 2. Shanghai Jiao Tong Univer-
sity, 2010.
Vingt ans d’analyse de corpus d’apprenants: Leçons apprises et perspectives (S. Granger). In P.
Cappeau, H. Chuquet & F. Valetopoulos (eds) L’exemple et le corpus: quel statut? Travaux
linguistiques du CerLiCO. Presses universitaires de Rennes: Rennes, 2010, 29–42.
Customising a general EAP dictionary to meet learner needs (S. Granger & M. Paquot). In S.
Granger & M. Paquot (eds) eLexicography in the 21st Century: New Challenges, New Appli-
cations. Proceedings of ELEX2009. Cahiers du CENTAL. Presses universitaires de Louvain:
Louvain-la-Neuve, 2010, 87–96.
How can data-driven learning be used in language teaching? (G. Gilquin & S. Granger). In A.
O’Keeffe & M. McCarthy (eds) The Routledge Handbook of Corpus Linguistics. Routledge:
London, 2010, 359–370.
Learner corpora: A window onto the L2 phrasicon (S. Granger). In A. Barfield & H. Gyllstad
(eds) Collocating in another Language: Multiple Interpretations. Palgrave Macmillan:
London, 2009, 60–65.
Lexical verbs in academic discourse: A corpus-driven study of learner use (S. Granger & M.
Paquot). In M. Charles, S. Hunston & D. Pecorari (eds) At the Interface of Corpus and Dis-
course: Analysing Academic Discourses. Continuum: London, 2009, 193–214.
In search of General Academic English: A corpus-driven study (S. Granger & M. Paquot). In K.
Katsampoxaki-Hodgetts (ed.) Options and Practices of L.S.P Practitioners Conference Pro-
ceedings. University of Crete Publications, E-media, 2009, 94–108.
Integrated Digital Language Learning (G. Antoniadis, S. Granger, O. Kraif, J. Medori, C. Ponton
& V. Zampa). In N. Balacheff, S. Ludvigsen, T. de Jong, A. Lazonder & L. Montandon (eds)
Technology-Enhanced Learning. Principles and Products. Springer: Berlin, 2009, 89–103.
The contribution of learner corpora to second language acquisition and foreign language teach-
ing: A critical evaluation (S. Granger). In K. Aijmer (ed.) Corpora and Language Teaching.
John Benjamins: Amsterdam, 2009, 13–32.
Japanese translation of ‘Prefabricated patterns in advanced EFL writing: Collocations and for-
mulae’ (OUP, 1998) (S. Granger). In A. Cowie (ed.) Phraseology: Theory, Analysis and Ap-
plications. Kurosio Publishers: Tokyo, 2009, 185–204.
Learner corpora (S. Granger). In A. Lüdeling & M. Kytö (eds) Corpus Linguistics. An Interna-
tional Handbook. Volume 1. Walter de Gruyter: Berlin, 2008, 259–275.
Disentangling the phraseological web (S. Granger & M. Paquot). In S. Granger & F. Meunier (eds)
Phraseology: An Interdisciplinary Perspective. John Benjamins: Amsterdam, 2008, 27–49.
From dictionary to phrasebook? (S. Granger & M. Paquot). In E. Bernal & J. DeCesaris (eds) Pro-
ceedings of the XIII EURALEX International Congress, Barcelona, Spain, 2008, 1345–1355.
Phraseology in language learning and teaching. Where to from here? (S. Granger & F. Meunier).
In F. Meunier & S. Granger (eds) Phraseology in Foreign Language Learning and Teaching.
John Benjamins: Amsterdam, 2008, 247–252.
Learner corpora in foreign language education (S. Granger). In N. Van Deusen-Scholl & N.H.
Hornberger (eds) Encyclopedia of Language and Education. Volume 4. Second and Foreign
Language Education. Springer: Berlin, 2008, 337–351.
Improve your writing skills (S. De Cock, G. Gilquin, S. Granger, M.-A. Lefer, M. Paquot & S.
Ricketts). In M. Rundell (editor in chief) Macmillan English Dictionary for Advanced Learn-
ers (second edition). Macmillan Education: Oxford, 2007, IW1–IW50.
Learner corpora: The missing link in EAP pedagogy (G. Gilquin, S. Granger & M. Paquot). In P.
Thompson (ed.) Corpus-based EAP Pedagogy. Special issue of Journal of English for Aca-
demic Purposes 6(4), 2007, 319–335.
Reprint of ‘The computer learner corpus: A versatile new source of data for SLA research’ (1998)
(S. Granger). In W. Teubert & R. Krishnamurthy (eds) Corpus Linguistics: Critical Concepts
in Linguistics. Volume 2. Routledge: London, 2007, 166–182.
Reprint of ‘A bird’s-eye view of computer learner corpus research’ (2002) (S. Granger). In W.
Teubert & R. Krishnamurthy (eds) Corpus Linguistics: Critical Concepts in Linguistics.
Volume 2. Routledge: London, 2007, 44–72.
Corpus d’apprenants, annotation d’erreurs et ALAO: Une synergie prometteuse (S. Granger).
Cahiers de lexicologie 91(2), 2007, 117–132.
From corpora to confidence (M. Rundell & S. Granger). English Teaching Professional 50,
2007, 15–18.
Corpus linguistics, language learning & ELT: Interviewing Sylviane Granger (S. Granger & V.
Viana). Mindbite 1, 2007, 11–14.
Extraction of multi-word units from EFL and native English corpora. The phraseology of the
verb ‘make’ (S. Granger, M. Paquot & P. Rayson). In A.H. Buhofer & H. Burger (eds) Phrase-
ology in Motion I. Schneider Verlag: Baltmannsweiler, 2006, 57–68.
Quelles machines pour enseigner la langue? (G. Antoniadis, C. Fairon, S. Granger, J. Medori &
V. Zampa). In P. Mertens, C. Fairon, A. Dister & P. Watrin (eds) TALN 06: Verbum ex
Machina. Volume 2. Presses universitaires de Louvain: Louvain-la-Neuve, 2006, 795–805.
Computer learner corpora and monolingual learners’ dictionaries: The perfect match (S. De
Cock & S. Granger). In W. Teubert (ed.) Special issue of Lexicographica 20, 2005, 72–86.
Computer learner corpus research: Current status and future prospects (S. Granger). In U.
Connor & T. Upton (eds) Applied Corpus Linguistics: A Multidimensional Perspective. Ro-
dopi: Amsterdam, 2004, 123–145.
Practical applications of learner corpora (S. Granger). In B. Lewandowska-Tomaszczyk (ed.)
Language, Corpora, E-Learning. Peter Lang: Frankfurt, 2004, 291–301.
High frequency words: The bête noire of lexicographers and learners alike. A close look at the
verb ‘make’ in five advanced learners’ dictionaries of English (S. De Cock & S. Granger). In
G. Williams & S. Vessier (eds) Proceedings of the Eleventh EURALEX International
Congress. Université de Bretagne-Sud: Lorient, 2004, 233–243.
The International Corpus of Learner English: A new resource for foreign language learning and
teaching and second language acquisition research (S. Granger). TESOL Quarterly 37(3),
2003, 538–546.
Error-tagged learner corpora and CALL: A promising synergy (S. Granger). CALICO (Special
issue on Error Analysis and Error Correction in Computer-Assisted Language Learning)
20(3), 2003, 465–480.
The corpus approach: A common way forward for contrastive linguistics and translation studies
(S. Granger). In S. Granger, J. Lerot & S. Petch-Tyson (eds) Corpus-based Approaches to
Contrastive Linguistics and Translation Studies. Rodopi: Amsterdam, 2003, 17–29.
A bird’s eye view of learner corpus research (S. Granger). In S. Granger, J. Hung & S. Petch-Ty-
son (eds) Computer Learner Corpora, Second Language Acquisition and Foreign Language
Teaching. Language Learning and Language Teaching 6. John Benjamins: Amsterdam,
2002, 3–33.
Recent trends in cross-linguistic lexical studies (B. Altenberg & S. Granger). In B. Altenberg &
S. Granger (eds) Lexis in Contrast. Corpus-based Approaches. Studies in Corpus Linguistics
7. John Benjamins: Amsterdam, 2002, 3–48.
The grammatical and lexical patterning of make in native and non-native student writing
(B. Altenberg & S. Granger). Applied Linguistics 22(2), 2001, 173–194.
Didactique des langues étrangères, linguistique de corpus et traitement automatique des langues
(S. Granger). In M. Marquillo Larruy (ed.) Questions d’épistémologie en didactique du fran-
çais (langue maternelle, langue seconde, langue étrangère). Cahiers FORELL. Université de
Poitiers: Poitiers, 2001, 105–109.
Analyse des corpus d’apprenants pour l’ELAO basé sur le TAL (S. Granger, A. Vandeventer &
M.J. Hamel). Corpus Linguistics. Special issue of TAL (Traitement Automatique des Langues)
42(2), 2001, 609–621.
Optimising measures of lexical variation in EFL learner corpora (S. Granger & M. Wynne). In J.
Kirk (ed.) Corpora Galore. Rodopi: Amsterdam, 1999, 249–257.
Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus
(S. Granger). In H. Hasselgård & S. Oksefjell (eds) Out of Corpora – Studies in Honour of
Stig Johansson. Rodopi: Amsterdam, 1999, 191–202.
Prefabricated patterns in advanced EFL writing: Collocations and formulae (S. Granger). In A.
Cowie (ed.) Phraseology: Theory, Analysis and Applications. Oxford University Press:
Oxford, 1998, 145–160.
The computerized learner corpus: A versatile new source of data for SLA research (S. Granger).
In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman: London,
1998, 3–18.
Tag sequences in learner corpora: A key to interlanguage grammar and discourse (S. Granger &
J. Aarts). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman:
London, 1998, 132–141.
An automated approach to the phrasicon of EFL learners (S. De Cock, S. Granger, G. Leech & T.
McEnery). In S. Granger (ed.) Learner English on Computer. Addison Wesley Longman:
London, 1998, 67–79.
Learner corpus data in the foreign language classroom: Form-focused instruction and data-
driven learning (S. Granger & C. Tribble). In S. Granger (ed.) Learner English on Computer.
Addison Wesley Longman: London, 1998, 199–209.
Automatic lexical profiling of learner texts (S. Granger & P. Rayson). In S. Granger (ed.) Learner
English on Computer. Addison Wesley Longman: London, 1998, 119–131.
Computer-aided Error Analysis (E. Dagneaux, S. Denness & S. Granger). System. An Interna-
tional Journal of Educational Technology and Applied Linguistics 26(2), 1998, 163–174.
On identifying the syntactic and discourse features of participle clauses in academic English: Na-
tive and non-native writers compared (S. Granger). In J. Aarts, I. de Mönnink & H. Wekker
(eds) Studies in English Language and Teaching. Rodopi: Amsterdam, 1997, 185–198.
The computer learner corpus: A testbed for electronic EFL tools (S. Granger). In J. Nerbonne
(ed.) Linguistic Databases. CSLI Publications: Stanford CA, 1997, 175–188.
Automated retrieval of passives from native and learner corpora: Precision and recall (S. Granger).
Journal of English Linguistics 25(4), 1997, 365–374.
From CA to CIA and back: An integrated approach to computerized bilingual and learner cor-
pora (S. Granger). In K. Aijmer, B. Altenberg & M. Johansson (eds) Languages in Contrast.
Text-based Cross-linguistic Studies. Lund Studies in English 88. Lund University Press:
Lund, 1996, 37–51.
Learner English around the world (S. Granger). In S. Greenbaum (ed.) Comparing English
World-wide. Clarendon Press: Oxford, 1996, 13–24.
Romance words in English: From history to pedagogy (S. Granger). In J. Svartvik (ed.) Words.
Proceedings of an International Symposium. Almqvist and Wiksell International: Stockholm,
1996, 105–121.
Connector usage in the English essay writing of native and non-native EFL speakers of English
(S. Granger & S. Tyson). World Englishes 15, 1996, 9–29.
The learner corpus: A revolution in applied linguistics (S. Granger). English Today 39(10/3),
1994, 25–29.
Towards a grammar checker for learners of English (S. Granger & F. Meunier). In U. Fries & G.
Tottie (eds) Creating and Using English Language Corpora. Rodopi: Amsterdam, 1994, 79–91.
New insights into the learner lexicon: A preliminary report from the International Corpus of
Learner English (S. Granger, F. Meunier & S. Tyson). In L. Flowerdew & K.K. Tong (eds)
Entering Text. The Hong Kong University of Science and Technology: Hong Kong, 1994,
102–113.
La description de la compétence lexicale en langue étrangère: Perspectives méthodologiques
(S. Granger & G. Monfort). Acquisition et Interaction en Langue Etrangère (AILE) 3, 1994,
55–75.
Cognates: An aid or a barrier to successful L2 vocabulary development? (S. Granger). ITL Re-
view of Applied Linguistics 99–100, 1993, 43–56.
The International Corpus of Learner English (S. Granger). In J. Aarts, P. de Haan & N. Oostdijk
(eds) English Language Corpora: Design, Analysis and Exploitation. Rodopi: Amsterdam,
1993, 57–69.
The International Corpus of Learner English (S. Granger). The European English Messenger 2(1),
1993, 34.
False friends: A kaleidoscope of translation difficulties (S. Granger & H. Swallow). Le Langage et
l’Homme 23(2), 1988, 108–120.
The Longman Dictionary of Contemporary English and the Collins Cobuild English Language
Dictionary (S. Granger & J.P. Van Noppen). Revue Belge de Philologie et d’Histoire LXVI(3),
1988, 710–713.
Why the passive? (S. Granger). In J. Van Roey (ed.) English-French Contrastive Analyses. Con-
trastive Analysis Series. Acco: Leuven, 1976, 23–57.
A survey of transformational theories (Part 3) (S. Granger & B. Devlamminck). Le Langage et
l’Homme 30, 1976, 49–67.
l’Homme 29, 1975, 25–50.
Tendances interprétatives et génératives en grammaire transformationnelle (S. Granger & B.
Devlamminck). Cahiers de l’Institut de Linguistique de Louvain (Cours et Documents) 3,
1975–76, 25–115.
l’Homme 26, 1974, 41–55.
On some active and passive structures with infinitive and their French correspondents
(S. Granger). Cahiers de l’Institut de Linguistique de Louvain 1(5), 1972, 705–732.
